Three Principles for Multilingual Indexing in Elasticsearch

Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that will work across multiple languages. Elasticsearch has a lot of built-in support for different languages, but there are a number of configuration options to wade through, and a few plugins that improve on the built-in support.

Below I’ll lay out the analyzers I am currently using. Some caveats before I start: I’ve done a lot of reading on multi-lingual search, but since I’m really only fluent in one language, there are plenty of details about how fluent speakers of other languages use a search engine that I’m sure I don’t understand. This is almost certainly still a work in progress.

In total we have 30 analyzers configured and we’re using the elasticsearch-langdetect plugin to detect 53 languages. For WordPress blogs, users have sometimes set their language to the same language as their content, but very often they have left it as the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.

Update: In the comments, Michael pointed out that since this post was written the langdetect plugin has gained a custom mapping that the example below does not use. I’d highly recommend checking it out for any new implementations.
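
For reference, here is a rough sketch of the two-step flow we use: first ask the plugin to detect the language of the content, then index the document with the chosen analyzer recorded on it. This assumes the plugin’s _langdetect REST endpoint (mentioned in the comments below); the index, type, and field names are just illustrative, and the exact request/response shapes depend on the plugin version.

# 1) Detect the language of the content (response abbreviated)
curl -XPOST 'localhost:9200/_langdetect' -d 'Dies ist ein kurzer deutscher Beispieltext.'
# => {"languages":[{"language":"de","probability":0.9999...}]}

# 2) Index the document, recording which analyzer to use
curl -XPUT 'localhost:9200/blogs/post/1' -d '{
  "lang_analyzer": "de_analyzer",
  "content": "Dies ist ein kurzer deutscher Beispieltext."
}'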

For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.

1) Use very light or minimal stemming to avoid losing semantic information.

Stemming removes the endings of words to make searches more general; however, it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball Stemmer will do the following:

computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput

international -> intern
internationals -> intern
intern -> intern
interns -> intern

A lot of information is lost in doing such a zealous transformation. There are some cases, though, where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem since the plural still refers to the same concept. This article on SearchWorkings gives further discussion of the pitfalls of the Snowball Stemmer, and leads to Jacques Savoy’s excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that doing minimal stemming of plurals and feminine/masculine forms of words performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.

So when there is a minimal stemmer available for a language we use it; otherwise we do no stemming at all.
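
To see the difference, you can run the examples above through the built-in snowball analyzer and through one of the minimal-stemming analyzers defined at the end of this post using the _analyze API. This is just a sketch: the output is trimmed to the tokens, and the second call assumes an index (my-index here) that already has the en_analyzer below configured.

curl -XGET 'localhost:9200/_analyze?analyzer=snowball' -d 'internationals interns'
# tokens: "intern", "intern"

curl -XGET 'localhost:9200/my-index/_analyze?analyzer=en_analyzer' -d 'internationals interns'
# tokens: "international", "intern"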

2) Use stop words for those languages that we have them for.

This reduces the size of the index and speeds up searches by not trying to match very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not get any results.

The cutoff_frequency parameter on the match query (new in 0.90) may provide a way to allow indexing stop words, but I am still unsure whether there are implications for other types of queries, or how I would decide what cutoff frequency to use given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff correctly when searching across all documents.
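
For what it’s worth, such a query would look roughly like the sketch below. The field name and the 1% cutoff are made up; picking a sensible cutoff is exactly the part I haven’t worked out.

{
  "query": {
    "match": {
      "content": {
        "query": "to be or not to be",
        "cutoff_frequency": 0.01
      }
    }
  }
}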

For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.

3) Try and retain term consistency across all analyzers.

We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.

Folding converts a character to an equivalent standard form. The most common conversion that ICU Folding provides is converting characters to lower case, as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing: many languages have multiple characters that essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU Folding performs.

Where Folding converts a single character to a standard form, Normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia, is “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of character normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character. But of course for search you’d rather have that converted to “IX”. The ICU Normalization sections have links to the many docs defining how normalization is handled.

By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.
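
A quick way to sanity-check this is to run a few of the examples above through the default analyzer with the _analyze API. A sketch only: it assumes an index (my-index) configured with the analyzers at the end of this post, and the output is trimmed to the tokens.

curl -XGET 'localhost:9200/my-index/_analyze?analyzer=default' -d 'Ⅸ mañana Café'
# tokens (roughly): "ix", "manana", "cafe"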

The Details (there are always exceptions to the rules)

  • Asian languages that do not use whitespace for word separation present a non-trivial problem when indexing content. ES comes with a built-in CJK analyzer that indexes every pair of symbols as a term, but there are plugins that are much smarter about how to tokenize the text.
    • For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
    • There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
    • Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer, which takes each bi-gram of symbols as a term (see the quick example after this list). However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
  • Elasticsearch doesn’t have any built-in stop words for Hebrew (he), so we define a custom list pulled from an online list (Update: this site doesn’t exist anymore; our list of stop words is located here). I had some co-workers cull the list a bit to remove a few terms that they deemed redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
  • Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.
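
As a quick illustration of the CJK bi-gram behavior mentioned above, here is what the built-in cjk analyzer does to a short Korean phrase (a sketch; output trimmed to the tokens):

curl -XGET 'localhost:9200/_analyze?analyzer=cjk' -d '한국어 검색'
# tokens: "한국", "국어", "검색"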

Finally we come to defining all of these analyzers. Hope this helps you in your multi-lingual endeavors.

Update [Feb 2014]: The PHP code we use for generating analyzers is now open sourced as a part of the wpes-lib project. See that code for the latest methods we are using.

Update [May 2014]: Based on the feedback in the comments and some issues we’ve come across running in production I’ve updated the mappings below. The changes we made are:

  • Perform ICU normalization before removing stopwords, and ICU folding after stopwords. Otherwise stopwords such as “même” in French will not be correctly removed.
  • Adjusted our Japanese language analysis based on a slightly adjusted use of GMO Media’s methodology. We were seeing a significantly lower click through rate on Japanese related posts than for other languages, and there was pretty good evidence that the morphological language analysis would help.
  • Added the Elision token filter to French (“l’avion” => “avion”); a sketch of a custom elision filter is below.
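
The mappings below just reference the built-in elision filter which, as far as I can tell, defaults to the common French articles. A custom version would look roughly like this (a sketch; the article list here is illustrative rather than exhaustive):

"fr_elision_filter": {
  "type": "elision",
  "articles": ["l", "m", "t", "qu", "n", "s", "j"]
}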

Potential improvements I haven’t gotten a chance to test yet because we need to run real performance tests to be sure they will actually be an improvement:

  • Duplicate tokens to handle different spellings (eg “recognize” vs “recognise”).
  • Morphological analysis of en and ru
  • Should we run spell checking or phonetic analysis
  • Include all stopwords and rely on cutoff_frequency to avoid the performance problems this will introduce
  • Index bigrams with the shingle analyzer
  • Duplicate terms, stem them, then unique the terms to try and index both stemmed and non-stemmed terms (see the sketch after this list)
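
For that last idea, the keyword_repeat and unique token filters look like the most promising route. A rough, untested sketch of what the English analyzer might become (assuming those filters behave as documented; the unique filter drops the stemmed copy only when stemming didn’t change the token):

"filter": {
  "unique_on_same_position": {
    "type": "unique",
    "only_on_same_position": true
  }
},
"analyzer": {
  "en_analyzer": {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_normalizer", "en_stop_filter", "keyword_repeat", "en_stem_filter", "unique_on_same_position", "icu_folding"]
  }
}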

Thanks to everyone in the comments who have helped make our multi-lingual indexing better.

{
  "filter": {
    "ar_stop_filter": {
      "type": "stop",
      "stopwords": ["_arabic_"]
    },
    "bg_stop_filter": {
      "type": "stop",
      "stopwords": ["_bulgarian_"]
    },
    "ca_stop_filter": {
      "type": "stop",
      "stopwords": ["_catalan_"]
    },
    "cs_stop_filter": {
      "type": "stop",
      "stopwords": ["_czech_"]
    },
    "da_stop_filter": {
      "type": "stop",
      "stopwords": ["_danish_"]
    },
    "de_stop_filter": {
      "type": "stop",
      "stopwords": ["_german_"]
    },
    "de_stem_filter": {
      "type": "stemmer",
      "name": "minimal_german"
    },
    "el_stop_filter": {
      "type": "stop",
      "stopwords": ["_greek_"]
    },
    "en_stop_filter": {
      "type": "stop",
      "stopwords": ["_english_"]
    },
    "en_stem_filter": {
      "type": "stemmer",
      "name": "minimal_english"
    },
    "es_stop_filter": {
      "type": "stop",
      "stopwords": ["_spanish_"]
    },
    "es_stem_filter": {
      "type": "stemmer",
      "name": "light_spanish"
    },
    "eu_stop_filter": {
      "type": "stop",
      "stopwords": ["_basque_"]
    },
    "fa_stop_filter": {
      "type": "stop",
      "stopwords": ["_persian_"]
    },
    "fi_stop_filter": {
      "type": "stop",
      "stopwords": ["_finnish_"]
    },
    "fi_stem_filter": {
      "type": "stemmer",
      "name": "light_finish"
    },
    "fr_stop_filter": {
      "type": "stop",
      "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
      "type": "stemmer",
      "name": "minimal_french"
    },
    "he_stop_filter": {
      "type": "stop",
      "stopwords": [/*excluded for brevity*/]
    },
    "hi_stop_filter": {
      "type": "stop",
      "stopwords": ["_hindi_"]
    },
    "hu_stop_filter": {
      "type": "stop",
      "stopwords": ["_hungarian_"]
    },
    "hu_stem_filter": {
      "type": "stemmer",
      "name": "light_hungarian"
    },
    "hy_stop_filter": {
      "type": "stop",
      "stopwords": ["_armenian_"]
    },
    "id_stop_filter": {
      "type": "stop",
      "stopwords": ["_indonesian_"]
    },
    "it_stop_filter": {
      "type": "stop",
      "stopwords": ["_italian_"]
    },
    "it_stem_filter": {
      "type": "stemmer",
      "name": "light_italian"
    },
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": ["\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c", "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
    },
    "nl_stop_filter": {
      "type": "stop",
      "stopwords": ["_dutch_"]
    },
    "no_stop_filter": {
      "type": "stop",
      "stopwords": ["_norwegian_"]
    },
    "pt_stop_filter": {
      "type": "stop",
      "stopwords": ["_portuguese_"]
    },
    "pt_stem_filter": {
      "type": "stemmer",
      "name": "minimal_portuguese"
    },
    "ro_stop_filter": {
      "type": "stop",
      "stopwords": ["_romanian_"]
    },
    "ru_stop_filter": {
      "type": "stop",
      "stopwords": ["_russian_"]
    },
    "ru_stem_filter": {
      "type": "stemmer",
      "name": "light_russian"
    },
    "sv_stop_filter": {
      "type": "stop",
      "stopwords": ["_swedish_"]
    },
    "sv_stem_filter": {
      "type": "stemmer",
      "name": "light_swedish"
    },
    "tr_stop_filter": {
      "type": "stop",
      "stopwords": ["_turkish_"]
    }
  },
  "analyzer": {
    "ar_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ar_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "bg_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "bg_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ca_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ca_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "cs_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "cs_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "da_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "da_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "de_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "de_stop_filter", "de_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "el_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "el_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "en_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "es_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "es_stop_filter", "es_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "eu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "eu_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fa_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fa_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fi_stop_filter", "fi_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "elision", "fr_stop_filter", "fr_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "he_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "he_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hi_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hu_stop_filter", "hu_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hy_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hy_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "id_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "id_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "it_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "it_stop_filter", "it_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    "ko_analyzer": {
      "type": "cjk",
      "filter": []
    },
    "nl_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "nl_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "no_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "no_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "pt_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "pt_stop_filter", "pt_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ro_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ro_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ru_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ru_stop_filter", "ru_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "sv_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "sv_stop_filter", "sv_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "tr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "tr_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "zh_analyzer": {
      "type": "custom",
      "filter": ["smartcn_word", "icu_normalizer", "icu_folding"],
      "tokenizer": "smartcn_sentence"
    },
    "lowercase_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "keyword"
    },
    "default": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    }
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}

 


41 Comments

  1. Gregor

     /  May 27, 2013

    So for indexing you use the language detection plugin to determine the language of the document and use the corresponding analyzer.
    And for searching you always rely on the default analyzer without attempting to “guess” the language?

    • Greg

       /  May 27, 2013

      For indexing, yes we do language detection to select the analyzer.

      When querying, it depends. If we have a good guess at the user’s language (ie they are on de.search.wordpress.com or the site they are on has a particular language selected) then we can use the appropriate language. But when we don’t have a good guess, then we can fall back to the default analyzer which should work pretty well across most languages.

      Ideally we try and use the appropriate language analyzer, but there are definitely cases where I know we won’t be able to so having a fallback is important. The biggest concern with the fallback is how stemming will truncate terms. Hopefully using only minimal stemming will minimize how much impact this has.

      I haven’t done any deep analysis of what impact this has on search relevancy yet though.
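
      As an example of the second case, the match query lets you pass the analyzer explicitly. A sketch (field name and search text are made up):

      {
        "query": {
          "match": {
            "content": {
              "query": "suchmaschine",
              "analyzer": "de_analyzer"
            }
          }
        }
      }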

      • Nate

         /  November 20, 2013

        Greg,

        So when you say you do “language detection”, are you doing this independently from Elasticsearch? Or is there a way to tie content.lang as set by the plugin to a particular analyzer automatically? I am very new to Elasticsearch and it would be helpful to know.

      • Greg

         /  November 20, 2013

        Hi Nate

        We run the elasticsearch-langdetect plugin on the same ES cluster and then when indexing first make a call to it to determine the language of the content of the doc. Then we make a separate call to index the document.

        I don’t believe there is a way to index the document and determine the language at the same time.

        It’s also possible to run the langdetect code independent of ES (potentially in your client), but for us using the ES plugin made it easier to deploy and it doesn’t add much load to the cluster.

      • Nate

         /  November 21, 2013

        Thanks for the prompt reply! Makes sense.

  2. Avi G

     /  October 24, 2013

    Amazing post! Helped me a lot. Thank you for all the information!

  3. Ale

     /  October 31, 2013

    Hi,
    It’s not clear to me how you decide which analyzer to use depending on the field’s content. Do you have a field for each language, or were you able to use different analyzers for the same field at indexing?

    Thanks !

    • Greg

       /  October 31, 2013

      You can specify the analyzer to use when indexing. In my case I have a field for each document called lang_analyzer which specifies which analyzer to use for the document.

      You configure which field is used for specifying the analysis in the _analyzer mapping field.

      For querying you either need to specify the analyzer or you just rely on the default. Using the ICU plugins for analysis ensures consistent tokenization across all languages so that the default should work pretty well.
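
      Roughly, that mapping looks like the sketch below (the type and field names are just illustrative; this is the _analyzer path mapping available in the ES versions current as of this writing):

      {
        "mappings": {
          "post": {
            "_analyzer": { "path": "lang_analyzer" },
            "properties": {
              "lang_analyzer": { "type": "string", "index": "not_analyzed" },
              "content": { "type": "string" }
            }
          }
        }
      }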

  4. Daniel

     /  December 17, 2013

    Hi Greg! First of all, great post, thanks for it!
    How would you go about it if your data were region names in various languages? For instance, I have more than 100k regions, and one document in ES contains the names Munich, München, Munique, etc. The same goes for the 100k+ regions. Having one document per language would make my index grow a lot.
    What I want is an autocomplete where people can search for regions, but I don’t really know which language they know a region by best, so they could be viewing the site in English but searching for the region in German. Making an educated guess at the language is therefore hard. Do you think a setup like the one you presented would be appropriate for data like this? Or would you do something different?

    Thanks a lot,
    Daniel

    • Greg

       /  December 18, 2013

      If I understand the use case, I think you could just use the ICU tokenizer, folding, and normalization on a single field without any stemming or stop words (the “default” analyzer in the code above). If you are only indexing place names across multiple languages you shouldn’t need stemming/stop words anyways. ICU should give you results that work pretty well across European languages at least. You wouldn’t have any fancy tokenization of Korean, Japanese, or Chinese. I don’t know enough about place names in those languages to know how big a problem that would be.

      If all of the place names you have are already separated, then be sure to index them as an array of strings, and consider indexing them as both an analyzed and a non-analyzed field (see the multi-field type mapping example; a rough sketch is below).

      That way you can retain the original text and sequence of words. Probably some other details to work out to get auto suggest working well also, but I haven’t yet played with the new suggest features.
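
      A sketch of what that multi-field mapping could look like (the field name is just illustrative; this uses the multi_field type from the ES versions current as of this writing):

      "name": {
        "type": "multi_field",
        "fields": {
          "name": { "type": "string", "analyzer": "default" },
          "raw": { "type": "string", "index": "not_analyzed" }
        }
      }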

      • Daniel

         /  January 8, 2014

        Thanks for the feedback Greg, I’m trying some stuff to see how it works, and what is faster, and your tips certainly helped.

        Thank you

  5. Michael

     /  January 22, 2014

    Any idea how to plug in the Polish (Stempel) analyzer? Have you tried it?

  6. Michael

     /  January 22, 2014

    Also, how does one use the elasticsearch-langdetect plugin to automatically apply the right analyzer based on the detected language?

    • Greg

       /  January 23, 2014

      You can’t auto apply the analyzer to a field unfortunately. You need to make one request to analyze a block of text and get the language and then a separate request to index the data with the appropriate analyzer specified.

      • Greg

         /  January 27, 2014

        Oh cool! The langdetect plugin has been updated since I originally wrote this post, and I hadn’t noticed that change.

        Yes, I think that should work. I’ll need to use this method in the future. Thanks!

      • Michael

         /  January 27, 2014

        unfortunately the _langdetect method is wayyy inaccurate, especially for short phrases..

      • Greg

         /  January 27, 2014

        Ya, I have some custom client code wrapping my call to langdetect so that if there is less than 300 chars of actual text then we don’t bother running it and use some fallbacks.

        I hacked together a quick (probably not working) gist of how we call langdetect: https://gist.github.com/gibrown/8652399

        Might be good to submit an issue against the plugin with specific examples. Short text is generally a harder problem, but there may be some simple changes that will make things better.

  7. Excellent article. I thought readers might be interested in Rosette Search Essentials for Elasticsearch, from Basis Technologies, which we launched last night at hack/reduce in Cambridge, MA. It’s a plugin that does three neat things that improve multilingual search quality:

    – Intelligent CJKT tokenization/segmentation
    – Lemmatization: performs morphological analysis to find the “lemma” or dictionary form of the words in your documents, which is far superior to stemming.
    – Decompounding: languages like German contain compound words that don’t always make great index terms. We break these up into their constituents so you can index them too.

    Handles Arabic, Chinese, Czech, Danish, Dutch, English, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Swedish, Thai and Turkish.

    Check it out here: http://basistech.com/elasticsearch

    Read a bit more about the recall and precision benefits that lemmatization and decompounding can offer in this paper: http://www.basistech.com/search-in-european-languages-whitepaper/

    I’m the Director of Product Management at Basis. I would love feedback on the product and to hear from anyone who has gnarly multilingual search problems.

    • Greg

       /  February 9, 2014

      Hi Gregor, thanks for pointing this out and for working to make multi-lingual search better.

      I pretty strongly recommend against using a closed source solution such as yours for something so fundamental as search. My reasoning got lengthy, so I turned it into a full post.

      Happy to discuss more, either publicly or privately.

      Cheers.

  8. slushi

     /  February 14, 2014

    The link to Hebrew stop words seems to be broken. Any ideas on where a good list can be found?

  9. slushi

     /  February 14, 2014

    I tried out the above settings. I suspected that the above definition could cause issues when language-specific stop words contain “special” characters that would be folded into ASCII characters. I built a gist that demonstrates the problem in French.

    Did you guys decide this is acceptable? I think if the folding filter is moved to the end of the filter chain, this issue would disappear, but I don’t know what other effects that would have.

    • Greg

       /  February 14, 2014

      Wow, you’re totally right. No, it’s not really acceptable; definitely a bug. Thanks!

      I think the folding filter should be last in the list, or we should use custom stopword lists that have the characters already folded. Probably this:

      "filter": ["icu_normalizer", "fr_stop_filter", "fr_stem_filter", "icu_folding"]
      

      This bug probably doesn’t affect search quality too much. It only applies to a few words in each language. However, including stop words in the index definitely makes the index bigger and could significantly slow down searches.

      We’ll have to do some experimentation to figure out what the right filtering is. Will be interesting to see how much of a performance improvement we get from this change.

      FYI, character folding is definitely very worthwhile. We did some work with one of our VIPs on a French site, and without character folding there were definitely complaints about the search.

      Thanks again!

  10. Thanks for the nice article. One of the links is dead. The article on SearchWorkings has moved to: http://blog.trifork.com/2011/12/07/analysing-european-languages-with-lucene/

    regards Jettro

  11. I love this post – come back here from time to time, because you’re regularly updating it – thanks for that! Learned a lot here! We’ve used that information for improving search results on our multilingual site Pixabay.com (20 languages).

    To give back something – as a German based company, we could fine tune some things for search in German:

    Instead of plain “icu_folding”, it is better to use a customized filter and exclude a few special characters:

    "filter": {
      "de_icu_folding": { "type": "icu_folding", "unicodeSetFilter": "[^ßÄäÖöÜü]" },
      "de_stem_filter": { "type": "stemmer", "name": "minimal_german" }
    }

    Then, add a char filter to transform the excluded characters:

    "char_filter": {
      "de_char_filter": {
        "type": "mapping",
        "mappings": ["ß=>ss", "Ä=>ae", "ä=>ae", "Ö=>oe", "ö=>oe", "Ü=>ue", "ü=>ue", "ph=>f"]
      }
    }

    Put it all together in the analyzer:

    "de_analyzer": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "filter": ["de_stop_filter", "de_icu_folding", "de_stem_filter", "icu_normalizer"],
      "char_filter": ["de_char_filter"]
    }

    Advantage: for example, there are words like “blut” and “blüte” in German, meaning “blood” and “blossom”. Using standard icu_folding, both terms are treated exactly the same way. With the custom char filter, results work as expected. The character “ü” may be written as “ue” in German, which is what the transformation basically does.

    • Greg Ichneumon Brown

       /  May 23, 2014

      This is very helpful, thanks.

      I’ve been testing these changes out today, and I’m looking at adding this with a few slight changes into wpes-lib:
      – I just used the default icu_folding because as far as I could tell the char_filter will have changed these characters anyways
      – I also changed the order of the filters to put the normalizer first since one of the reasons for this filter is to combine multi-character sequences into one character before folding.

      I think both of these changes matter more when you are dealing with multi-lingual content in a single document. Any problems you see with this? For your examples it seems to still work well.

      I’m also curious if you have looked at all at using a decompounder in German.

      • If the char_filter is applied before icu_folding takes place, it should work. In which order does ES go through those filters?

        I think putting the ICU normalizer first makes total sense – I’ll change that in our own code right away.

        Didn’t know about the decompounder so far – but it sounds great! Going to test this soon!

        Thanks, Simon

      • Greg Ichneumon Brown

         /  May 23, 2014

        ES always applies char filters first (even before tokenization), so ya that should work well.

        I’d be really interested to hear how the decompounder works for you. It feels like too big a change for me to universally change without doing some thorough testing of its performance. I’d also like to test it for multiple languages and just don’t have the time to devote to it right now.

        Thanks again for the help, I’m going to commit these changes and make them live when we rebuild our index in a few weeks.

      • Not sure if this is of interest to you, but we also use a word delimiter filter for all Latin-script languages (so not for ja, zh, ko): http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

        "filter": {
          "my_word_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": false,
            "catenate_words": true,
            "catenate_numbers": true,
            "split_on_case_change": false,
            "preserve_original": true
          }
        }

      • Greg Ichneumon Brown

         /  May 23, 2014

        Good to hear that works well for you.

        I have used a word delimiter on some smaller indices, but I (vaguely) remember running into problems in a few cases. I think I decided I didn’t have enough data to figure out how to configure it properly.

        I still feel like my analyzers don’t do a good job with product names and other words where punctuation or case is used as part of the word.

        I’m surprised you don’t use the same filter for ja, zh, and ko. I often see a lot of latin languages mixed in with Asian languages.

      • I guess it wouldn’t really hurt, but in our case the delimiter also wouldn’t make a (relevant) difference for ja, ko, zh. We’re not dealing with full texts/sentences, but with a lot of keywords that are strictly separated into the different languages. There are a few Latin names for cities, countries and the like, but they would not be affected by the delimiter. So the delimiter would only cost a bit of performance with no real benefit …

      • I’ve looked at the German decompounder – in theory it really looks good and I’d like to use it. However, it’s not well maintained. The update frequency appears to be rather low and there’s no working version for the current ES server 1.1.x or 1.2.

      • Greg Ichneumon Brown

         /  May 27, 2014

        Thanks for the update.

        jprante has been pretty responsive to Github issues I’ve submitted elsewhere in the past, so maybe either submit one or even better build it locally and submit a pull request. My guess is that very little has changed in 1.1.x that would affect this.

  12. Florian

     /  May 22, 2014

    Very helpful article – thanks a lot, Greg!

    I have two questions:
    – What would be a good way to deal with a non-detected/undefined language? I built a mapping along the lines of the gist Michael posted. Each language needs to be defined… content.en, content.ja, etc. How would I deal with a language that had not been defined there?

    – Is there a way to use the langdetect plugin to also add/populate a field in the mapping that would contain the language code – for example to use it as a filter?

    cheers
    _f

    • Greg Ichneumon Brown

       /  May 23, 2014

      Both of your questions would probably be good feature requests for the langdetect plugin. We still make a separate call to ES to do language detection and then set our lang_analyzer field to indicate which analyzer to apply. There are three reasons we do this:
      – langdetect does not support every language
      – We do not have a custom analyzer for every language, some need to fall back on our default analyzer (eg Latvian).
      – We have other potential fallbacks we can use if the language detection fails. For example: user settings, lang detection on other content, or predicting based on other user behavior.

      • Florian

         /  May 27, 2014

        Using the detection separately (or inferring the language from UI settings etc.) works fine for me too. I would have to send the name of the analyzer to use with the query though, and I ran into a small problem:

        If I use this query I do not get the highlights. If I remove the analyzer parameter I do get highlights, but it then uses the default analyzer…

        Am I doing something wrong with the parameter? Do you have an example query somewhere that you could post that shows how you send the language/analyzer parameter with the query?

        thanks.

  Pingback: Elasticsearch: Vyhledáváme hezky česky | IT mag - novinky z IT
