Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that will work across multiple languages. Elasticsearch has a lot of built-in support for different languages, but there are a number of configuration options to wade through, and a few plugins improve on the built-in support.
Below I’ll lay out the analyzers I am currently using. Some caveats before I start: I’ve done a lot of reading on multi-lingual search, but since I’m really only fluent in one language, there are lots of details about how fluent speakers of other languages use a search engine that I’m sure I don’t understand. This is almost certainly still a work in progress.
In total we have 30 analyzers configured and we’re using the elasticsearch-langdetect plugin to detect 53 languages. For WordPress blogs, users have sometimes set their language to the same language as their content, but very often they have left it as the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.
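As a rough illustration of that last step, here is a minimal sketch of mapping a detected language code onto one of the analyzers defined at the end of this post. The helper name `pick_analyzer` and the fallback behaviour are assumptions for illustration only; the real plumbing between the language detection plugin and the index is not shown in this post.

```python
# Language codes that have a dedicated analyzer in the settings at the end of the post.
ANALYZER_LANGS = {
    "ar", "bg", "ca", "cs", "da", "de", "el", "en", "es", "eu",
    "fa", "fi", "fr", "he", "hi", "hu", "hy", "id", "it", "ja",
    "ko", "nl", "no", "pt", "ro", "ru", "sv", "tr", "zh",
}


def pick_analyzer(detected_lang):
    """Hypothetical helper: map a detected language code to an analyzer name."""
    if not detected_lang:
        return "default"
    # zh-cn and zh-tw share one analyzer, so only the primary subtag matters.
    base = detected_lang.lower().split("-")[0]
    if base in ANALYZER_LANGS:
        return base + "_analyzer"
    # Detected languages without a dedicated analyzer fall back to "default".
    return "default"


if __name__ == "__main__":
    for code in ("en", "zh-tw", "zh-cn", "th", None):
        print(code, "->", pick_analyzer(code))
```

Anything the detector reports that has no dedicated analyzer ends up on the ICU-based default analyzer, which is why keeping terms consistent across analyzers (principle 3 below) matters.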
For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.
1) Use very light or minimal stemming to avoid losing semantic information.
Stemming removes the endings of words to make searches more general; however, it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball Stemmer will do the following:
computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput
international -> intern
internationals -> intern
intern -> intern
interns -> intern
A lot of information is lost in doing such a zealous transformation. There are some cases, though, where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem since the plural still refers to the same concept. This article on SearchWorkings gives further discussion of the pitfalls of the Snowball Stemmer, and leads to Jacques Savoy’s excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that doing minimal stemming of plurals and feminine/masculine forms of words performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.
So when there is a minimal stemmer available for a language we use it, otherwise we do not do any stemming at all.
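To make the contrast with the Snowball output above concrete, here is a small sketch of what plural-only stemming looks like for English. It is my own simplification in the spirit of an S-stemmer, not the code behind `minimal_english`; the function name and the exact rules are assumptions.

```python
def minimal_plural_stem(word):
    """Strip simple English plural endings and nothing else (illustrative sketch)."""
    if len(word) < 3 or not word.endswith("s"):
        return word
    if word.endswith(("us", "ss")):
        return word                       # "status", "class" are not plurals
    if word.endswith("ies") and len(word) > 4 and word[-4] not in "ae":
        return word[:-3] + "y"            # "queries" -> "query"
    if word.endswith("es") and word[-3] in "aeio":
        return word                       # leave forms like "goes" alone
    return word[:-1]                      # "computers" -> "computer"


if __name__ == "__main__":
    words = [
        "computation", "computers", "computing", "computer", "computes",
        "international", "internationals", "intern", "interns",
    ]
    for w in words:
        print(w, "->", minimal_plural_stem(w))
```

With rules this conservative, "international" and "intern" stay separate terms and only the plural forms collapse, which is the behaviour we want from a minimal stemmer.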
2) Use stop words for those languages where we have them.
This ensures that we reduce the size of the index and speed up searches by not trying to match on very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not get any results.
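The Shakespeare problem is easy to reproduce outside of Elasticsearch. The sketch below filters whitespace-separated tokens against a short English stop list; the list is my recollection of the default English stop set that ships with Lucene and should be treated as an assumption, as should the helper name.

```python
# Assumed stop list, modeled on Lucene's default English stop set.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words (illustrative only)."""
    return [tok for tok in text.lower().split() if tok not in ENGLISH_STOP_WORDS]


if __name__ == "__main__":
    print(remove_stop_words("the tragedy of hamlet"))
    print(remove_stop_words("to be or not to be"))
```

Every token in the famous quote is a stop word, so the query is empty by the time it reaches the index.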
The new (to 0.90) cutoff_frequency parameter on the match query may provide a way to allow indexing stop words, but I am still unsure whether there are implications for other types of queries, or how I would decide what cutoff frequency to use given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff frequency correctly when searching across all documents.
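For reference, a match query using cutoff_frequency would look roughly like the body built below. This is only a sketch: the field name `content` and the value 0.001 are placeholders I picked for illustration, not settings we actually run.

```python
import json


def match_with_cutoff(field, text, cutoff_frequency):
    """Build a match query body that sets cutoff_frequency (illustrative helper)."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # Terms more frequent than this are treated as "common" terms.
                    "cutoff_frequency": cutoff_frequency,
                }
            }
        }
    }


if __name__ == "__main__":
    # "content" and 0.001 are hypothetical values for the example.
    body = match_with_cutoff("content", "to be or not to be", 0.001)
    print(json.dumps(body, indent=2))
```

The open question from the paragraph above is exactly the last argument: a single threshold has to work for both the English-heavy bulk of the index and much smaller languages.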
For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.
3) Try to retain term consistency across all analyzers.
We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.
Folding converts a character to an equivalent standard form. The most common conversion that ICU Folding provides is converting characters to lower case as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing: there are symbols in many languages where multiple characters essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU Folding performs.
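To give a feel for what folding does to a term, here is a rough approximation using only Python’s standard library. It is not the ICU implementation and covers only a small part of what UTR30 describes (case folding plus dropping combining marks); the function name `fold` is my own.

```python
import unicodedata


def fold(text):
    """Approximate folding: strip combining marks, then case fold (not ICU)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks such as accents once they are separate code points.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() that also handles cases like the German sharp s.
    return stripped.casefold()


if __name__ == "__main__":
    for term in ("Café", "ÉCOLE", "Straße"):
        print(term, "->", fold(term))
```

The point for search is that a query typed without accents or in a different case ends up as the same term as the indexed text.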
Where Folding converts a single character to a standard form, Normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia, is “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of character normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character, but of course for search you’d rather have that converted to “IX”. The ICU Normalization sections have links to the many docs defining how normalization is handled.
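Both examples can be checked with Python’s standard `unicodedata` module. This sketch assumes the canonical composed form (NFC) for the “ñ” case and a compatibility form (NFKC) for the Roman numeral; it illustrates the Unicode behaviour rather than the ICU filter itself.

```python
import unicodedata

# "n" followed by a combining tilde composes to the single code point U+00F1.
decomposed = "n\u0303"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), composed == "\u00f1")

# U+2168 (ROMAN NUMERAL NINE) only expands under a compatibility form.
roman_nine = "\u2168"
print(unicodedata.normalize("NFC", roman_nine) == roman_nine)
print(unicodedata.normalize("NFKC", roman_nine))
```

Canonical normalization leaves the Roman numeral alone; it takes a compatibility form to turn it into the two letters a searcher would actually type.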
By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.
The Details (there are always exceptions to rules)
- Asian languages that do not use whitespace for word separation present a non-trivial problem when indexing content. ES comes with the built-in CJK analyzer that indexes every pair of symbols into a term, but there are plugins that are much smarter about how to tokenize the text.
- For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
- There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
- Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer which takes each bi-gram of symbols as a term. However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
- Elasticsearch doesn’t have any built-in stop words for Hebrew (he), so we define a custom list pulled from an online list. I had some co-workers cull the list a bit to remove a few of the terms that they deemed redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
- Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.
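The bi-gram approach of the CJK analyzer mentioned in the list above is simple enough to sketch in a few lines. This is a simplification for illustration: the real analyzer also deals with script detection and mixed text, which this hypothetical `cjk_bigrams` helper ignores.

```python
def cjk_bigrams(text):
    """Emit overlapping two-character terms for each whitespace-separated run."""
    terms = []
    for run in text.split():
        if len(run) == 1:
            terms.append(run)             # a lone character becomes its own term
        # Slide a window of two characters across the run.
        terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


if __name__ == "__main__":
    print(cjk_bigrams("東京都"))
    print(cjk_bigrams("한국어 분석기"))
```

Every adjacent pair becomes a term regardless of whether it is a real word, which is why a language-aware tokenizer is preferable whenever one exists.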
Finally we come to defining all of these analyzers. Hope this helps you in your multi-lingual endeavors.
"analysis": {
  "filter": {
    "ar_stop_filter": {
      "type": "stop",
      "stopwords": ["_arabic_"]
    },
    "bg_stop_filter": {
      "type": "stop",
      "stopwords": ["_bulgarian_"]
    },
    "ca_stop_filter": {
      "type": "stop",
      "stopwords": ["_catalan_"]
    },
    "cs_stop_filter": {
      "type": "stop",
      "stopwords": ["_czech_"]
    },
    "da_stop_filter": {
      "type": "stop",
      "stopwords": ["_danish_"]
    },
    "de_stop_filter": {
      "type": "stop",
      "stopwords": ["_german_"]
    },
    "de_stem_filter": {
      "type": "stemmer",
      "name": "minimal_german"
    },
    "el_stop_filter": {
      "type": "stop",
      "stopwords": ["_greek_"]
    },
    "en_stop_filter": {
      "type": "stop",
      "stopwords": ["_english_"]
    },
    "en_stem_filter": {
      "type": "stemmer",
      "name": "minimal_english"
    },
    "es_stop_filter": {
      "type": "stop",
      "stopwords": ["_spanish_"]
    },
    "es_stem_filter": {
      "type": "stemmer",
      "name": "light_spanish"
    },
    "eu_stop_filter": {
      "type": "stop",
      "stopwords": ["_basque_"]
    },
    "fa_stop_filter": {
      "type": "stop",
      "stopwords": ["_persian_"]
    },
    "fi_stop_filter": {
      "type": "stop",
      "stopwords": ["_finnish_"]
    },
    "fi_stem_filter": {
      "type": "stemmer",
      "name": "light_finnish"
    },
    "fr_stop_filter": {
      "type": "stop",
      "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
      "type": "stemmer",
      "name": "minimal_french"
    },
    "he_stop_filter": {
      "type": "stop",
      "stopwords": [/*excluded for brevity*/]
    },
    "hi_stop_filter": {
      "type": "stop",
      "stopwords": ["_hindi_"]
    },
    "hu_stop_filter": {
      "type": "stop",
      "stopwords": ["_hungarian_"]
    },
    "hu_stem_filter": {
      "type": "stemmer",
      "name": "light_hungarian"
    },
    "hy_stop_filter": {
      "type": "stop",
      "stopwords": ["_armenian_"]
    },
    "id_stop_filter": {
      "type": "stop",
      "stopwords": ["_indonesian_"]
    },
    "it_stop_filter": {
      "type": "stop",
      "stopwords": ["_italian_"]
    },
    "it_stem_filter": {
      "type": "stemmer",
      "name": "light_italian"
    },
    "nl_stop_filter": {
      "type": "stop",
      "stopwords": ["_dutch_"]
    },
    "no_stop_filter": {
      "type": "stop",
      "stopwords": ["_norwegian_"]
    },
    "pt_stop_filter": {
      "type": "stop",
      "stopwords": ["_portuguese_"]
    },
    "pt_stem_filter": {
      "type": "stemmer",
      "name": "minimal_portuguese"
    },
    "ro_stop_filter": {
      "type": "stop",
      "stopwords": ["_romanian_"]
    },
    "ru_stop_filter": {
      "type": "stop",
      "stopwords": ["_russian_"]
    },
    "ru_stem_filter": {
      "type": "stemmer",
      "name": "light_russian"
    },
    "sv_stop_filter": {
      "type": "stop",
      "stopwords": ["_swedish_"]
    },
    "sv_stem_filter": {
      "type": "stemmer",
      "name": "light_swedish"
    },
    "tr_stop_filter": {
      "type": "stop",
      "stopwords": ["_turkish_"]
    }
  },
  "analyzer": {
    "ar_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "ar_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "bg_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "bg_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "ca_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "ca_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "cs_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "cs_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "da_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "da_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "de_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "de_stop_filter", "de_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "el_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "el_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "en_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "en_stop_filter", "en_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "es_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "es_stop_filter", "es_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "eu_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "eu_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "fa_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "fa_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "fi_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "fi_stop_filter", "fi_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "fr_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "fr_stop_filter", "fr_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "he_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "he_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "hi_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "hi_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "hu_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "hu_stop_filter", "hu_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "hy_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "hy_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "id_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "id_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "it_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "it_stop_filter", "it_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "ja_analyzer": {
      "type": "custom",
      "tokenizer": "kuromoji_tokenizer",
      "filter": ["icu_folding", "icu_normalizer"],
      "char_filter": ["html_strip"]
    },
    "ko_analyzer": {
      "type": "cjk"
    },
    "nl_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "nl_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "no_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "no_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "pt_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "pt_stop_filter", "pt_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "ro_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "ro_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "ru_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "ru_stop_filter", "ru_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "sv_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "sv_stop_filter", "sv_stem_filter"],
      "char_filter": ["html_strip"]
    },
    "tr_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer", "tr_stop_filter"],
      "char_filter": ["html_strip"]
    },
    "zh_analyzer": {
      "type": "custom",
      "tokenizer": "smartcn_sentence",
      "filter": ["icu_folding", "icu_normalizer", "smartcn_word"],
      "char_filter": ["html_strip"]
    },
    "default": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["icu_folding", "icu_normalizer"],
      "char_filter": ["html_strip"]
    },
    "wp_raw_lowercase_analyzer": {
      "type": "custom",
      "tokenizer": "keyword",
      "filter": ["lowercase"]
    }
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}
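Most of the analyzer definitions above follow one pattern, so they can also be generated rather than written by hand. The sketch below is one way to do that; the helper name `build_analyzer` and the idea of generating the settings are my own illustration, and Japanese, Korean, and Chinese are left out because they use their own tokenizers.

```python
import json

# Languages that use the ICU tokenizer plus a stop filter in the settings above.
ICU_LANGS = [
    "ar", "bg", "ca", "cs", "da", "de", "el", "en", "es", "eu", "fa", "fi", "fr",
    "he", "hi", "hu", "hy", "id", "it", "nl", "no", "pt", "ro", "ru", "sv", "tr",
]
# Languages that also have a minimal or light stemmer filter defined.
STEMMED = {"de", "en", "es", "fi", "fr", "hu", "it", "pt", "ru", "sv"}


def build_analyzer(lang, has_stemmer=False):
    """Hypothetical helper: build one ICU-based analyzer definition."""
    filters = ["icu_folding", "icu_normalizer", lang + "_stop_filter"]
    if has_stemmer:
        filters.append(lang + "_stem_filter")
    return {
        "type": "custom",
        "tokenizer": "icu_tokenizer",
        "filter": filters,
        "char_filter": ["html_strip"],
    }


analyzers = {
    lang + "_analyzer": build_analyzer(lang, lang in STEMMED) for lang in ICU_LANGS
}

if __name__ == "__main__":
    print(json.dumps({"de_analyzer": analyzers["de_analyzer"]}, indent=2))
```

Generating the block keeps the filter order identical across languages, which is the whole point of the term consistency principle.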