May 6, 2013

Now live: WordPress.com VIP Search

Reblogged from WordPress.com VIP:


WordPress's standard search features are capable and easy to use, but when you're developing search-driven web applications with WordPress, you need a tool ready-made for that purpose. That's why today we're introducing our new WordPress.com VIP Search add-on, and are excited to debut it as part of the relaunch of the Kaiser Family Foundation here on WordPress.com VIP.

WordPress.com VIP Search is a new premium service for our Cloud Hosting customers that delivers the features and flexibility of the powerful 

Read more… 128 more words

Really love the search interface that Alley Interactive built for the new KFF site. All powered by Elasticsearch (and WordPress of course) behind the scenes.
May 1, 2013

Three Principles for Multilingual Indexing in Elasticsearch

Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that will work across multiple languages. Elasticsearch has a lot of built-in support for different languages, but there are a number of configuration options to wade through, and there are a few plugins that improve on the built-in support.

Below I’ll lay out the analyzers I am currently using. Some caveats before I start. I’ve done a lot of reading on multi-lingual search, but since I’m really only fluent in one language there are many details about how fluent speakers of other languages use a search engine that I’m sure I don’t understand. This is almost certainly still a work in progress.

In total we have 30 analyzers configured and we’re using the elasticsearch-langdetect plugin to detect 53 languages. For WordPress blogs, users have sometimes set their language to the same language as their content, but very often they have left it as the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.
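As a rough sketch of that selection step, assuming the detected language arrives as a code such as "en" or "zh-tw": the helper below maps a code to one of the `xx_analyzer` names defined in the configuration at the end of this post and falls back to the default analyzer otherwise. The function name `analyzer_for` and the idea of doing this lookup in Python are illustrative assumptions, not a description of the actual indexing code.

```python
# Language codes that have a dedicated analyzer in the configuration
# at the end of this post (the names follow the "xx_analyzer" pattern).
CONFIGURED = {
    "ar", "bg", "ca", "cs", "da", "de", "el", "en", "es", "eu",
    "fa", "fi", "fr", "he", "hi", "hu", "hy", "id", "it", "ja",
    "ko", "nl", "no", "pt", "ro", "ru", "sv", "tr", "zh",
}


def analyzer_for(detected_lang: str) -> str:
    """Return an analyzer name for a detected language code (hypothetical helper)."""
    # "zh-tw" and "zh-cn" both collapse to "zh", matching the single zh_analyzer.
    base = detected_lang.lower().split("-")[0]
    if base in CONFIGURED:
        return f"{base}_analyzer"
    # Anything else is indexed with the language-neutral default analyzer.
    return "default"


print(analyzer_for("zh-tw"))
print(analyzer_for("th"))
```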

For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.

1) Use very light or minimal stemming to avoid losing semantic information.

Stemming removes the endings of words to make searches more general, however it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball Stemmer will do the following:

computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput

international -> intern
internationals -> intern
intern -> intern
interns -> intern

A lot of information is lost in such an overzealous transformation. There are some cases, though, where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem since the plural still refers to the same concept. This article on SearchWorkings gives further discussion of the pitfalls of the Snowball Stemmer, and leads to Jacques Savoy’s excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that doing minimal stemming of plurals and feminine/masculine forms of words performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.

So when there is a minimal stemmer available for a language we use it, otherwise we do not do any stemming at all.
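To make the difference concrete, here is a small sketch of plural-only stemming for English. The rules are a simplified approximation written for illustration (loosely in the spirit of an "S-stemmer"), and should not be read as the exact code behind Elasticsearch's minimal_english stemmer; the function name is hypothetical.

```python
def minimal_english_stem(word: str) -> str:
    """Strip simple plural endings only (illustrative approximation)."""
    if len(word) < 3 or not word.endswith("s"):
        return word
    before_s = word[-2]
    # Leave words such as "class" and "bus" alone.
    if before_s in ("u", "s"):
        return word
    if before_s == "e":
        # "ponies" -> "pony", but not after "a" or "e" before the "ies".
        if len(word) > 3 and word[-3] == "i" and word[-4] not in ("a", "e"):
            return word[:-3] + "y"
        # Leave other "-ies", "-aes", "-oes", "-ees" endings untouched.
        if word[-3] in ("i", "a", "o", "e"):
            return word
    return word[:-1]


for w in ["computers", "computing", "computation", "internationals", "interns"]:
    print(w, "->", minimal_english_stem(w))
```

Unlike the Snowball output above, "computing" and "computation" are left as distinct terms, and "internationals" no longer collapses into "intern".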

2) Use stop words for those languages that we have them for.

This ensures that we reduce the size of the index and speed up searches by not trying to match on very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not get any results.

The new (to 0.90) cutoff_frequency parameter on the match query may provide a way to allow indexing stop words, but I am still unsure whether there are implications for other types of queries, or how I would decide what cutoff frequency to use given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff frequencies correctly if searching across all documents.

For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.
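The Shakespeare problem is easy to reproduce outside of Elasticsearch. The sketch below filters a query against a small English stop word list; the list is an illustrative assumption and not necessarily what `_english_` expands to, and the whitespace tokenization is deliberately naive.

```python
# Illustrative English stop word list (an assumption, for demonstration only).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def remove_stop_words(text: str) -> list:
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


# Every term in the quote is a stop word, so nothing is left to match on.
print(remove_stop_words("to be or not to be"))
print(remove_stop_words("Hamlet asks to be or not to be"))
```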

3) Try and retain term consistency across all analyzers.

We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.

Folding converts a character to an equivalent standard form. The most common conversion that ICU Folding provides is converting characters to lower case as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing: there are symbols in many languages where multiple characters essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU Folding performs.

Where Folding converts a single character to a standard form, Normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia, is “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of character normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character. But of course for search you’d rather have that converted to “IX”. The ICU Normalization sections have links to the many docs defining how normalization is handled.

By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.
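Both examples can be checked with Python's standard unicodedata module. This is only a stand-in for the ICU filters: NFKC normalization plus a crude "strip combining marks and case fold" step approximates the idea, but it does not cover everything icu_folding and icu_normalizer do. The helper names are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Compatibility-compose a string (NFKC), similar in spirit to normalization."""
    return unicodedata.normalize("NFKC", text)


def fold(text: str) -> str:
    """Rough folding: decompose, drop combining marks, then case fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# "n" followed by a combining tilde becomes the single code point U+00F1.
print(normalize("n\u0303") == "\u00f1")
# The single-character Roman numeral nine (U+2168) becomes the two letters "IX".
print(normalize("\u2168"))
# Folding removes the accent and the case distinction.
print(fold("Caf\u00e9"))
```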

The Details (there are always exceptions to rules)

  • Asian languages that do not use whitespace for word separations present a non-trivial problem when indexing content. ES comes with the built in CJK analyzer that indexes every pair of symbols into a term, but there are plugins that are much smarter about how to tokenize the text.
    • For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
    • There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
    • Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer which takes each bi-gram of symbols as a term. However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
  • Elasticsearch doesn’t have any built in stop words for Hebrew (he) so we define a custom list pulled from an online list. I had some co-workers cull the list a bit to remove a few of the terms that they deemed a bit redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
  • Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.
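For readers unfamiliar with the bi-gram approach used for Korean above, here is a minimal sketch of the idea: every pair of adjacent symbols becomes a term. It is a simplification for illustration (it only splits on whitespace and ignores script detection), not the CJK analyzer's actual implementation.

```python
def cjk_bigrams(text: str) -> list:
    """Emit overlapping two-character terms for each whitespace-separated run."""
    terms = []
    for run in text.split():
        if len(run) == 1:
            # A lone symbol has no neighbor, so keep it as its own term.
            terms.append(run)
        terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


print(cjk_bigrams("한국어 검색"))
```

The appeal is that no dictionary is needed; the cost is a larger index and terms that do not correspond to real words.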

Finally we come to defining all of these analyzers. Hope this helps you in your multi-lingual endeavors.

"analysis": {
    "filter": {
        "ar_stop_filter": {
            "type": "stop",
            "stopwords": ["_arabic_"]
        },
        "bg_stop_filter": {
            "type": "stop",
            "stopwords": ["_bulgarian_"]
        },
        "ca_stop_filter": {
            "type": "stop",
            "stopwords": ["_catalan_"]
        },
        "cs_stop_filter": {
            "type": "stop",
            "stopwords": ["_czech_"]
        },
        "da_stop_filter": {
            "type": "stop",
            "stopwords": ["_danish_"]
        },
        "de_stop_filter": {
            "type": "stop",
            "stopwords": ["_german_"]
        },
        "de_stem_filter": {
            "type": "stemmer",
            "name": "minimal_german"
        },
        "el_stop_filter": {
            "type": "stop",
            "stopwords": ["_greek_"]
        },
        "en_stop_filter": {
            "type": "stop",
            "stopwords": ["_english_"]
        },
        "en_stem_filter": {
            "type": "stemmer",
            "name": "minimal_english"
        },
        "es_stop_filter": {
            "type": "stop",
            "stopwords": ["_spanish_"]
        },
        "es_stem_filter": {
            "type": "stemmer",
            "name": "light_spanish"
        },
        "eu_stop_filter": {
            "type": "stop",
            "stopwords": ["_basque_"]
        },
        "fa_stop_filter": {
            "type": "stop",
            "stopwords": ["_persian_"]
        },
        "fi_stop_filter": {
            "type": "stop",
            "stopwords": ["_finnish_"]
        },
        "fi_stem_filter": {
            "type": "stemmer",
            "name": "light_finnish"
        },
        "fr_stop_filter": {
            "type": "stop",
            "stopwords": ["_french_"]
        },
        "fr_stem_filter": {
            "type": "stemmer",
            "name": "minimal_french"
        },
        "he_stop_filter": {
            "type": "stop",
            "stopwords": [/*excluded for brevity*/]
        },
        "hi_stop_filter": {
            "type": "stop",
            "stopwords": ["_hindi_"]
        },
        "hu_stop_filter": {
            "type": "stop",
            "stopwords": ["_hungarian_"]
        },
        "hu_stem_filter": {
            "type": "stemmer",
            "name": "light_hungarian"
        },
        "hy_stop_filter": {
            "type": "stop",
            "stopwords": ["_armenian_"]
        },
        "id_stop_filter": {
            "type": "stop",
            "stopwords": ["_indonesian_"]
        },
        "it_stop_filter": {
            "type": "stop",
            "stopwords": ["_italian_"]
        },
        "it_stem_filter": {
            "type": "stemmer",
            "name": "light_italian"
        },
        "nl_stop_filter": {
            "type": "stop",
            "stopwords": ["_dutch_"]
        },
        "no_stop_filter": {
            "type": "stop",
            "stopwords": ["_norwegian_"]
        },
        "pt_stop_filter": {
            "type": "stop",
            "stopwords": ["_portuguese_"]
        },
        "pt_stem_filter": {
            "type": "stemmer",
            "name": "minimal_portuguese"
        },
        "ro_stop_filter": {
            "type": "stop",
            "stopwords": ["_romanian_"]
        },
        "ru_stop_filter": {
            "type": "stop",
            "stopwords": ["_russian_"]
        },
        "ru_stem_filter": {
            "type": "stemmer",
            "name": "light_russian"
        },
        "sv_stop_filter": {
            "type": "stop",
            "stopwords": ["_swedish_"]
        },
        "sv_stem_filter": {
            "type": "stemmer",
            "name": "light_swedish"
        },
        "tr_stop_filter": {
            "type": "stop",
            "stopwords": ["_turkish_"]
        }
    },
    "analyzer": {
        "ar_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "ar_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "bg_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "bg_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "ca_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "ca_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "cs_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "cs_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "da_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "da_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "de_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "de_stop_filter", "de_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "el_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "el_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "en_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "en_stop_filter", "en_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "es_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "es_stop_filter", "es_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "eu_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "eu_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "fa_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "fa_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "fi_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "fi_stop_filter", "fi_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "fr_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "fr_stop_filter", "fr_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "he_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "he_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "hi_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "hi_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "hu_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "hu_stop_filter", "hu_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "hy_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "hy_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "id_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "id_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "it_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "it_stop_filter", "it_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "ja_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_tokenizer",
            "filter": ["icu_folding", "icu_normalizer"],
            "char_filter": ["html_strip"]
        },
        "ko_analyzer": {
            "type": "cjk"
        },
        "nl_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "nl_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "no_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "no_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "pt_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "pt_stop_filter", "pt_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "ro_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "ro_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "ru_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "ru_stop_filter", "ru_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "sv_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "sv_stop_filter", "sv_stem_filter"],
            "char_filter": ["html_strip"]
        },
        "tr_analyzer": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer", "tr_stop_filter"],
            "char_filter": ["html_strip"]
        },
        "zh_analyzer": {
            "type": "custom",
            "tokenizer": "smartcn_sentence",
            "filter": ["icu_folding", "icu_normalizer", "smartcn_word"],
            "char_filter": ["html_strip"]
        },
        "default": {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": ["icu_folding", "icu_normalizer"],
            "char_filter": ["html_strip"]
        },
        "wp_raw_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"]
        }
    },
    "tokenizer": {
        "kuromoji": {
            "type": "kuromoji_tokenizer",
            "mode": "search"
        }
    }
}
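
Since most of the entries above differ only in the language code, stop word list, and optional stemmer, the same structure could also be generated rather than written by hand. The sketch below does that for a handful of languages copied from the configuration above; the helper name and the table layout are assumptions for illustration, and the special cases (ja, ko, zh, he, default) would still need to be added separately.

```python
import json

# Language code -> (built-in stop word list, stemmer name or None).
# Only a few entries from the configuration above; extend as needed.
LANGUAGES = {
    "de": ("_german_", "minimal_german"),
    "en": ("_english_", "minimal_english"),
    "es": ("_spanish_", "light_spanish"),
    "tr": ("_turkish_", None),
}


def build_analysis(languages: dict) -> dict:
    """Build the "filter" and "analyzer" sections for ICU-based analyzers."""
    filters, analyzers = {}, {}
    for code, (stopwords, stemmer) in sorted(languages.items()):
        chain = ["icu_folding", "icu_normalizer", f"{code}_stop_filter"]
        filters[f"{code}_stop_filter"] = {"type": "stop", "stopwords": [stopwords]}
        if stemmer is not None:
            # Only languages with a minimal or light stemmer get a stem filter.
            filters[f"{code}_stem_filter"] = {"type": "stemmer", "name": stemmer}
            chain.append(f"{code}_stem_filter")
        analyzers[f"{code}_analyzer"] = {
            "type": "custom",
            "tokenizer": "icu_tokenizer",
            "filter": chain,
            "char_filter": ["html_strip"],
        }
    return {"filter": filters, "analyzer": analyzers}


print(json.dumps(build_analysis(LANGUAGES), indent=4))
```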
April 19, 2013

A New Way to Visualize Decision Trees

Reblogged from The Official Blog of BigML.com:


If you’ve built decision trees with BigML or explored our gallery, then you should be familiar with our tree visualizations. They're a classic and intuitive way to view trees. The root is at the top, its children are the next level down, the grandchildren are deeper still, and so forth.

While intuitive, this sort of visualization does have some drawbacks.


Note to self... need to do more data analysis with decision trees. Besides this BigML article, I recently saw a great presentation at a meetup that reminded me of what a great job decision trees do for analyzing features.
April 17, 2013

Mapping WordPress Posts to Elasticsearch

I thought I’d share the Elasticsearch type mapping I am using for WordPress posts. We’ve refined it over a number of iterations and it combines dynamic templates and multi_field mappings along with a number of more standard mappings. So this is probably a good general example of how to index real data from a traditional SQL database into Elasticsearch.

If you aren’t familiar with the WordPress database schema, it looks like this:

These Elasticsearch mappings focus on the wp_posts, wp_term_relationships, wp_term_taxonomy, and wp_terms tables.

To simplify things I’ll just index using an English analyzer and leave discussing multi-lingual analyzers to a different post.

"analysis": {
    "filter": {
        "stop_filter": {
            "type": "stop",
            "stopwords": ["_english_"]
        },
        "stemmer_filter": {
            "type": "stemmer",
            "name": "minimal_english"
        }
    },
    "analyzer": {
        "wp_analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["lowercase", "stop_filter", "stemmer_filter"],
            "char_filter": ["html_strip"]
        },
        "wp_raw_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"]
        }
    }
}

A few notes on the analyzers:

  • The minimal_english stemmer only removes plurals rather than potentially butchering the difference between words like “computer”, “computes”, and “computing”.
  • The lowercase keyword analyzer makes case-insensitive exact matching possible.

Let’s take a look at the post mapping:

"post": {
    "dynamic_templates": [
        {
            "tax_template_name": {
                "path_match": "taxonomy.*.name",
                "mapping": {
                    "type": "multi_field",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer"
                        }
                    }
                }
            }
        }, {
            "tax_template_slug": {
                "path_match": "taxonomy.*.slug",
                "mapping": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }, {
            "tax_template_term_id": {
                "path_match": "taxonomy.*.term_id",
                "mapping": {
                    "type": "long"
                }
            }
        }
    ],
    "_all": {
        "enabled": false
    },
    "properties": {
        "post_id": {
            "type": "long"
        },
        "blog_id": {
            "type": "long"
        },
        "site_id": {
            "type": "long"
        },
        "post_type": {
            "type": "string",
            "index": "not_analyzed"
        },
        "lang": {
            "type": "string",
            "index": "not_analyzed"
        },
        "url": {
            "type": "string",
            "index": "not_analyzed"
        },
        "location": {
            "type": "geo_point",
            "lat_lon": true
        },
        "date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "date_gmt": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "author": {
            "type": "multi_field",
            "fields": {
                "author": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "wp_analyzer"
                },
                "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        },
        "author_login": {
            "type": "string",
            "index": "not_analyzed"
        },
        "title": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "content": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "tag": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "tag"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "tag.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "tag.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        },
        "category": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "category"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "category.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "category.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        }
    }
}

Most of the fields are pretty self-explanatory, so I’ll just outline the more complex ones:

  • date and date_gmt: We define the allowed formats because we are taking the dates out of MySQL. We also do some checking of the dates since MySQL will allow some things in a DATETIME field that ES will balk at and cause the indexing operation to fail. For instance MySQL accepts leap dates in non-leap years.
  • content: Content gets stripped of HTML and shortcodes, then converted to UTF-8 in cases where it isn’t already.
  • author and author.raw: The author field corresponds to the user’s display_name. Clearly we need to analyze the field so “Greg Ichneumon Brown” can be matched on a search for “Greg”, but what about when we facet on the field? If we use the analyzed field then the results would have the terms “greg”, “ichneumon”, and “brown”. Instead, by using ES’s multi_field mapping feature to auto-generate author.raw, the faceted results on that field will give us “Greg Ichneumon Brown”.
  • tag and category: Tags and Categories similarly need raw versions for faceting so we preserve the original tag. Additionally there are a number of ways users can filter the content. WordPress builds slugs from each category/tag to uniquely identify them in a human readable way and there is a unique integer (term_id) associated with each term. The tag.raw_lc is used for exact matching a term without worrying about the case. This may seem like a lot of duplication, but the overriding goal here is to avoid using MySQL for search so we index everything. Extracting data into multiple fields ensures that we will have flexibility when filtering the data in the future.
  • taxonomy.*: WordPress allows custom taxonomies (of which categories and tags are two built-in taxonomies) so we need a way to create a custom path in each document that allows access to each taxonomy. This is where Elasticsearch’s dynamic templates shine. For a custom taxonomy such as “company” the paths will become taxonomy.company.name, taxonomy.company.name.raw, taxonomy.company.slug, and taxonomy.company.term_id.
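
The date checking mentioned above can be sketched in a few lines. This is a minimal example, assuming the dates arrive as strings in one of the two formats from the mapping; the function name is hypothetical and the real indexing code may handle bad dates differently (for example by substituting a default rather than rejecting them).

```python
from datetime import datetime

# The two formats allowed by the "date" and "date_gmt" mappings:
# "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd".
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def is_indexable_date(value: str) -> bool:
    """Return True if the string is a real calendar date in an allowed format."""
    for fmt in FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_indexable_date("2013-04-17 10:30:00"))
# February 29 does not exist in 2013, even if the database stored it.
print(is_indexable_date("2013-02-29 00:00:00"))
# A zero date is another value a DATETIME column can hold.
print(is_indexable_date("0000-00-00 00:00:00"))
```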

The ES documentation is very complete, but it’s not always easy to see how to build complex mappings that fit the individual pieces together. I hope this helps in your own ES development efforts.

February 21, 2013

Building Word Clouds with Faceted Search

Elasticsearch’s faceted results are a great way to analyze the contents of a set of documents. For over a year now, Polldaddy has used Elasticsearch to create reports for the most popular answers and words given to free text survey responses. For more details take a look at the feature announcement.

However, running faceted search on such a wide array of user data can be difficult. Faceted Search in Elasticsearch can consume a lot of memory which leads to the suggestion in the ES documentation to “make sure the number of unique tokens a field can have is not large”. To make sure that we can accept any arbitrary user input we use a couple of tricks.

First let’s take a look at the mapping we use for documents in the polldaddy-survey index:

{
  "polldaddy-survey" : {
    "freetext" : {
      "_routing" : {
        "required" : true,
        "path" : "survey_id"
      },
      "properties" : {
        "resp_id" : {
          "type" : "long"
        },
        "survey_id" : {
          "type" : "long"
        },
        "text" : {
          "type" : "string",
          "analyzer" : "analyzer_polldaddy"
        },
        "text_string" : {
          "type" : "string",
          "index" : "not_analyzed"
        }
      }
    }
  }
}

Each user’s response is its own document where the response is stored in its analyzed form in the text field and unanalyzed in the text_string field. (I should have come up with better names.) By doing a terms facet query on these two fields we can get the most popular words and most popular answers respectively.
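
A request body for such a query might look roughly like the following. This is a sketch assuming the terms facet syntax of the 0.90-era API; the facet names, the term filter on survey_id, and the helper function are illustrative assumptions rather than the queries Polldaddy actually runs.

```python
import json


def facet_request(survey_id: int, size: int = 50) -> dict:
    """Build a search body with one terms facet per field (hypothetical helper)."""
    return {
        # Restrict the facets to the responses of a single survey.
        "query": {"term": {"survey_id": survey_id}},
        # No hits are needed, only the facet counts.
        "size": 0,
        "facets": {
            "popular_words": {"terms": {"field": "text", "size": size}},
            "popular_answers": {"terms": {"field": "text_string", "size": size}},
        },
    }


print(json.dumps(facet_request(1234), indent=2))
```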

However, blindly doing this across the many millions of responses in Polldaddy would run into some serious memory problems due to the overall size of the vocabularies in those fields. For that reason we are using _routing to make sure that all documents related to a single survey go to the same shard. We then allocate a very large number of shards (100 in our case) to limit the number of unique terms in each shard. By routing our query only to one shard the amount of memory that needs to be allocated is greatly reduced, and we can even handle surveys with decent-sized vocabularies.
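
The effect of routing can be illustrated with a toy model: the routing value is hashed and reduced modulo the number of shards, so every response for a given survey lands on the same shard. The CRC32 hash below is a stand-in chosen for the example and is not the hash function Elasticsearch uses internally.

```python
import zlib

NUM_SHARDS = 100  # the shard count mentioned above


def shard_for(survey_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing value to a shard number (toy model with a stand-in hash)."""
    routing = str(survey_id).encode("utf-8")
    return zlib.crc32(routing) % num_shards


# Every response for survey 1234 routes to one shard, so a facet query
# routed the same way only loads that shard's vocabulary.
print({shard_for(1234) for _ in range(1000)})
```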

So just how important is routing to a single shard? Well, a bug snuck into our code at one point and disabled the routing. Here’s what happened to the cache memory consumption from when it was broken to when it got fixed.

Boom! Fixed a bug.

A pretty dramatic change. Without the routing to a single shard the cache would occasionally try to load a very large vocabulary and allocate 10+ GB. This of course would slow down all queries on the server.

But it could have been worse. Commenting on a previous post on this site Bruno asked me why I suggested setting index.cache.field.type: soft given that it reduces the caching performance. This memory consumption activity is why. Before setting the field cache to soft (and before adding the routing) these queries would sometimes consume so much memory that we would run out of the 24GB we had allocated on the servers. In fact it seemed like no matter how much memory we gave the servers, they would use it and cause OutOfMemory errors that would bring the cluster to a painful halt. Setting the field cache to soft is really the only way I can ensure that we won’t hit those conditions regardless of what data gets entered into a poll.

I will relish the day when there’s a good bug fix in ES for the term facet memory consumption (Issue #1531). There are so many great applications that can be built on top of faceted searches, and I’d just love to not have to worry about running out of memory because of a stray query.

A big thanks to Shay Banon (lead ES developer) who originally suggested that I use routing and a large number of shards. And of course many thanks to my colleagues on the Polldaddy team.

January 26, 2013

UNIX, Bi-Grams, Tri-Grams, and Topic Modeling

I’ve built up a list of UNIX commands over the years for doing basic text analysis on written language. I’ve built this list from a number of sources (Jim Martin’s NLP class, StackOverflow, web searches), but haven’t seen them collected in one place. With these commands I can analyze everything from log files to user poll responses.

Mostly this just comes down to how cool UNIX commands are (which you probably already know). But the magic is how you mix them together. Hopefully you find these recipes useful. I’m always looking for more so please drop into the comments to tell me what I’m missing.

For all of these examples I assume that you are analyzing a series of user responses with one response per line in a single file: data.txt. With a few cut and paste commands I often apply the same methods to CSV files and log files.

Generating a Random Sample

Sometimes you get confronted with a set of results that are far larger than you want to analyze. If you want to cull out a few lines from a file, but you want to eliminate the biased ordering in the file, it’s very helpful to create a random sample of them.

awk 'BEGIN {srand()} {printf "%05.0f %s\n",rand()*99999, $0; }' data.txt | sort -n | head -100 | sed 's/^[0-9]* //'

This just adds a random number to the beginning of each line, sorts the list, takes the top 100 lines, and removes the random number. A quick and easy way to get a random sample. I also use this when testing new commands where I want to just try the command on 10 lines to verify I got the command right. I’m a big believer that randomized testing will find corner cases faster than you can think of them.
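If the file is too big to sort comfortably, the same kind of sample can be drawn in a single pass. The sketch below uses reservoir sampling, which is a different technique from the decorate-and-sort trick above; the function name and the generated sample lines are just for illustration.

```python
import random

def random_sample(lines, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = line     # replace with probability k / (i + 1)
    return sample

# Hypothetical stand-in for the lines of data.txt.
lines = ["response %d" % n for n in range(1000)]
picked = random_sample(lines, 100, random.Random(0))
print(len(picked))
```

Unlike the awk version, this never holds more than k lines in memory, so it also works on a stream.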

Most Frequent Response

cat data.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn

This is pretty straightforward: take each line (lowercased) and sort alphabetically. Then use the awesome uniq -c command to count the number of identical responses. Finally, sort by most frequent response.

This is why sort | uniq -c | sort -rn is easily my favorite UNIX command.
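For comparison, here is roughly the same count in Python using collections.Counter. It is a sketch rather than an exact port: unlike the pipeline above it also strips surrounding whitespace and skips blank lines, and the sample responses are invented.

```python
from collections import Counter

def most_frequent_responses(lines):
    """Count identical responses, ignoring case, most frequent first."""
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    return counts.most_common()

# Hypothetical responses, one per "line".
responses = ["Yes", "no", "YES", "maybe", "yes ", "No"]
for text, count in most_frequent_responses(responses):
    print(count, text)
```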

Most Frequent Words (Uni-Grams)

Along with most frequent response you often want to look at most frequent words. This is just a natural extension of our previous command, but we want to remove stop words (“the”, “of”, “and”, etc) since they provide no useful information.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | tr ' ' '\n' | grep -v -w -f stopwords_en.txt | sort | uniq -c | sort -rn

Pretty much the same command except we are replacing spaces with newlines to break the document into words rather than lines. It would be a good improvement to do better tokenization than just splitting on whitespace, but for most purposes this works well.

The stopwords_en.txt file is just one stop word per line. I usually pull my list of stop words from Ranks.nl which also has stopwords in many other languages besides English.
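The word count translates to Python in much the same way. In this sketch the stop word set is a tiny hard-coded list standing in for stopwords_en.txt, the responses are made up, and only ASCII punctuation is stripped.

```python
import string
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "is"}  # tiny hypothetical list

def word_counts(lines, stopwords=STOPWORDS):
    """Count words across all lines, skipping stop words."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for line in lines:
        # lowercase, drop punctuation, split on any whitespace
        words = line.lower().translate(strip_punct).split()
        counts.update(w for w in words if w not in stopwords)
    return counts

responses = ["The price is too high.", "Price and support", "Support, support!"]
print(word_counts(responses).most_common(2))
```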

Most Frequent Bi-Grams

Most frequent words are great, but they throw away a lot of context (and hence meaning). By examining pairs of words (bi-grams) you can retain a lot more of the context.

cat data.txt | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' | sed 's/,//' | sed G | tr ' ' '\n' > tmp.txt
tail -n+2 tmp.txt > tmp2.txt
paste -d ',' tmp.txt tmp2.txt | grep -v -e "^," | grep -v -e ",$" | sort | uniq -c | sort -rn

Here we take our list of words and concatenate sequential words together separated by a comma. To indicate the end/beginning of a response we use sed G to add an extra line between each response before we split the responses into words. Then we filter out those beginning and ending words (grep -v -e "^," | grep -v -e ",$") so that we are left with only the bi-grams.

I’m not doing any removal of stop words in this case. To do that you would want to remove all bi-grams where both words were stop words which would probably mean creating an exhaustive list of them. Not too hard to do, just haven’t found it necessary yet.
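If you do want the stop word filtering, it is easier to express in a few lines of Python than in the shell. This is a sketch that assumes the text has already had punctuation removed; the stop word set and sample responses are invented for the example.

```python
from collections import Counter

STOPWORDS = {"the", "of", "and", "to", "is"}  # tiny hypothetical list

def bigram_counts(lines, stopwords=STOPWORDS):
    """Count word pairs, dropping pairs made up of two stop words."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        # zip pairs each word with its successor, within one response only
        for first, second in zip(words, words[1:]):
            if first in stopwords and second in stopwords:
                continue  # both words are stop words
            counts[(first, second)] += 1
    return counts

responses = ["easy to use", "is the price fair", "easy to use and cheap"]
for pair, count in bigram_counts(responses).most_common(2):
    print(count, ",".join(pair))
```

Checking each half of the pair against the single-word list avoids having to build an exhaustive list of stop word pairs.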

Tri-Grams

Why stop at bi-grams?

tail -n+2 tmp2.txt > tmp3.txt
paste -d ',' tmp.txt tmp2.txt tmp3.txt | grep -v -e "^," | grep -v -e ",$" | grep -v -e ",," | sort | uniq -c | sort -rn

All you need to do is create a third file to concatenate together. Everything else is pretty much the same. Of course we could continue to expand this to 4-grams, 5-grams, etc, but if your documents are short then this won’t differ very much from your most frequent response results.
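The shifted-file trick generalizes to any n. Here is a hedged Python sketch of the same idea, where each shifted copy of the word list plays the role of tmp.txt, tmp2.txt, tmp3.txt; the function name and responses are made up.

```python
from collections import Counter

def ngram_counts(lines, n):
    """Count n-grams of words, never crossing a response boundary."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        # one shifted copy of the word list per position in the n-gram
        shifted = [words[i:] for i in range(n)]
        # zip stops at the shortest copy, so short responses yield nothing
        counts.update(zip(*shifted))
    return counts

responses = ["too expensive for me", "too expensive", "way too expensive for me"]
for gram, count in ngram_counts(responses, 3).most_common(2):
    print(count, ",".join(gram))
```

Responses shorter than n simply contribute nothing, which is why large n on short documents ends up looking like the most frequent response list.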

Topic Modeling

This is not a UNIX command, but is such a great, easy way to get better information about the ideas in a set of responses that I have to include it.

I’m not going to explain the math for how topic modeling works, but essentially it groups words that co-occur together in a document to create a list of topics across the entire document set. Each “topic” is a weighted list of words associated with the topic, and each topic has a weight that indicates how frequent that topic is across all documents. By looking at this weighted list of words you can easily pick out the most common themes across your responses.

The easiest way I’ve found to run topic modeling is to download and install Mallet. You can follow Mallet’s main topic modeling instructions, but I’ve reduced them down to a couple of command lines that almost always work for me:

#Import data that has one "document" per line:
bin/mallet import-file --input data.txt --output data.mallet --keep-sequence --remove-stopwords

#Import data that has one "document" per file:
bin/mallet import-dir --input data/* --output data.mallet --keep-sequence --remove-stopwords

lib/mallet-2.0.6/bin/mallet train-topics \
    --input data.mallet \
    --alpha 50.0 \
    --beta 0.01 \
    --num-topics 100 \
    --num-iterations 1000 \
    --optimize-interval 10 \
    --output-topic-keys data.topic-keys.out \
    --topic-word-weights-file data.topic-word-weights.out

#sort by most frequent topic, and remove the topic number
cat data.topic-keys.out | cut -f 2-20 | sort -rn > data.sorted-topics
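If you would rather post-process the topics in code, the file written by --output-topic-keys can be parsed directly. This sketch assumes each line is tab-separated as topic number, topic weight, then a space-separated word list; check that against your own output, since the sample lines below are invented and not real Mallet output.

```python
def parse_topic_keys(lines):
    """Parse 'topic<TAB>weight<TAB>words' lines, heaviest topic first."""
    topics = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        weight = float(parts[1])
        words = parts[2].split()
        topics.append((weight, words))
    return sorted(topics, key=lambda t: t[0], reverse=True)

# Made-up sample in the assumed format.
sample = [
    "0\t0.12\tprice cost expensive cheap\n",
    "1\t0.45\tsupport help team response\n",
    "2\t0.03\tcolor blue green\n",
]
for weight, words in parse_topic_keys(sample):
    print("%.2f %s" % (weight, " ".join(words[:3])))
```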

Depending on the size of your dataset, you almost certainly will need to play with the number of topics you generate. 50 or 100 is often fine, but if you were generating topics across something as diverse as Wikipedia you’d clearly need many more. If you don’t have enough topics then it is very easy for the topics to seem like a meaningless grouping of words. I usually look at the data results with 50, 100, and 300 topics to get a feel for the data.

Once you decide how many topics make sense with your dataset this technique is a powerful way to extract and rank the meaning from a large set of responses.

January 24, 2013

Elasticsearch: Five Things I was Doing Wrong

I’ve been working with Elasticsearch off and on for over a year, but recently I attended Elasticsearch.com’s training class (well worth the time and money) and discovered a few significant things that I was doing just plain wrong.

Before using Elasticsearch I used Lucene directly, and so a few of the errors I made were due to not understanding some of the things ES does for you behind the scenes.

As background, most of the data I’m indexing conforms to the WordPress database schema.

1. Use Arrays for Fields with Multiple Values

For some reason I had neglected to use arrays when creating fields such as a list of tags attached to a document. At some point I started concatenating the tags together into a long string separated by semicolons and I used a custom analyzer to break them apart like this:

"analysis" : {
  "tokenizer" : {
    "semicolon_token" : {
      "type" : "pattern",
      "pattern" : ";"
  } },
  "analyzer" : {
    "wp_tag_analyzer" : {
      "type" : "custom",
      "tokenizer" : "semicolon_token"
  } }
}

Or, for fields that were lists of URLs I just separated them by spaces and used the whitespace analyzer. Both methods worked fine for the initial applications, but have some obvious drawbacks. Explicitly inserting a character sequence as a delimiter almost always means you will hit an edge case somewhere down the road where it will break.

Using an array of items is a much easier way, but somehow, after initially reading about the array mapping, I completely forgot that it existed. I think I was thinking of ES too much as a text searching engine and not enough as a general JSON data store.
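A quick way to see the delimiter edge case is to round-trip a tag that happens to contain the delimiter. The tags and the document shape here are invented for the example.

```python
import json

# Hypothetical tags; one of them contains the delimiter character.
tags = ["elasticsearch", "rock;roll", "wordpress"]

# Delimited string: the tag containing ";" is split in two on the way back.
joined = ";".join(tags)
print(joined.split(";"))

# Array field: each tag survives a JSON round trip unchanged.
doc = {"post_id": 1, "tags": tags}
print(json.loads(json.dumps(doc))["tags"])
```

With the array, the document itself carries the structure, so there is no delimiter to escape or to collide with.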

2. Don’t Use store=true When Mapping Fields

If you are storing the full _source of the document, then there is very little reason to store individual fields separately. You just inflate your index size. I originally started storing the content and titles of documents because I thought it might speed up the highlighting. In practice, I don’t think it did anything for me, and many of our queries don’t do any highlighting at all.

In the end this was a case of premature optimization. Maybe at some point if I find that 90% of the time we are just returning the post_id and using that to lookup the original content in MySQL we’ll consider storing that separately to reduce network traffic and load caused by extracting the post_id field from _source, but that still feels premature at this point.

For debugging reasons I would never consider turning off storing _source. It is far too useful to know exactly what data was entered, and you never know when you might want to use a different field for a new application.

3. Don’t Manually Flush, Optimize, or Refresh

Elasticsearch takes care of these core Lucene operations for me. There was never any good reason for me to issue one of these commands when the default ES settings would accomplish it within a few minutes.

The optimize command in particular is dangerous since it merges all segments in the Lucene index (a very time consuming operation). The code I wrote, which at first was issuing innocuous optimize commands after doing some bulk indexing by hand, eventually started getting called repeatedly in automated jobs. Fortunately it never rose to a level of causing real problems, but it’s easy for code you write to get unintentionally called.

Again, this was a case of premature optimization.

4. Set the Appropriate Production Flags

This is another case that didn’t cause a real issue, but could have in the future. The default settings for ES are chosen so that you can get started with development quickly. This means that a few of the default settings are not what you want in production. In particular:

  • discovery.zen.minimum_master_nodes
    • Should be set to something like N/2 + 1 (rounded down) where N is the number of master-eligible nodes.
  • action.disable_delete_all_indices
    • Do you really want to allow a single command (that could be mistyped) to delete all of your indices? No, neither do I.
  • gateway.recover_after_nodes
    • How many nodes need to be up before the recovery process starts replicating data around the cluster.
  • index.cache.field.type: soft (in 0.90 this field name changed to index.fielddata.cache. Thanks Olivier for the heads up.)
    • I started setting my field cache to soft to ensure that it never created OutOfMemory errors. I think this was particularly helpful because we are doing a lot of faceting.
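The N/2 + 1 rule for discovery.zen.minimum_master_nodes is just a strict majority. A small sketch of the arithmetic (the function name is mine, not an ES API):

```python
def minimum_master_nodes(master_eligible):
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1  # integer division rounds down

for n in (1, 2, 3, 5, 6):
    print(n, "nodes ->", minimum_master_nodes(n))
```

Note that with only two master-eligible nodes the quorum is 2, so losing either node leaves the cluster without a master.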

5. Do Not Use _type as Another Field

The _type field can entice you to use it as another field to indicate a category for your document. Don’t let it.

Here’s where I went wrong. WordPress posts can have different types (post_type) which allow displaying the content of the post in different ways (e.g. image posts, video posts, quotes, a status message), even though the different post types all use the same schema. This seemed to align pretty well with the _type field, so I used an ES dynamic mapping to have post_type == _type.

The biggest problem with this is determining the document’s _type after a post has been deleted from the database and you also want to delete it from your index. A document is uniquely identified by both its _id and its _type.

  • If you delete from your RDBMS first (or NoSQL data store flavor of the month), then you may no longer have the _type available to delete the object.
  • If you delete from ES first then what if the RDBMS delete operation fails for some reason?

Making the _type independent of any data within the document ensures that all you will need is the document id. This was one of those “Oh, that was dumb of me” bugs that I completely missed when building my index.
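The difference shows up as soon as you try to build the delete request. In this sketch the index name ("blog"), the type names, and the helper functions are all hypothetical; the path follows the index/type/id layout of the REST API at the time.

```python
def delete_path(index, doc_type, doc_id):
    """Build the REST path for deleting one document."""
    return "/%s/%s/%s" % (index, doc_type, doc_id)

# _type mirrors post_type: the caller must still know it after the row is gone.
print(delete_path("blog", "video", 42))

# One fixed _type: the post id alone is enough.
FIXED_TYPE = "post"

def delete_post_path(index, post_id):
    return delete_path(index, FIXED_TYPE, post_id)

print(delete_post_path("blog", 42))
```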

October 26, 2012

Jetpack 1.9: Toolbar Notifications

Reblogged from Jetpack for WordPress:

Jetpack 1.9 is here. That's right, it's time for another big helping of Jetpack awesomeness. This release brings you Toolbar Notifications, Mobile Push Notifications, Custom CSS for mobile themes, a JSON API, and improvements to the Contact Form.

Notifications adds a menu to your toolbar that lets you read, moderate, reply to comments from any page on your blog. Plus, if find yourself on…

Read more… 147 more words

In addition to a notifications system that spans across WordPress.com and WordPress sites with Jetpack installed, the new Jetpack also has a great JSON API.
August 24, 2012

Solr vs. ElasticSearch: Part 1 - Overview

Reblogged from Sematext Blog:

A good Solr vs. ElasticSearch coverage is long overdue.  We make good use of our own Search Analytics and pay attention to what people search for.  Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch, and this SolrCloud vs. ElasticSearch question is something we regularly address in our search consulting engagements.

As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene - Apache Solr and ElasticSearch.

Read more… 2,448 more words

I've been wanting a good comparison between ES and Solr for a while, since the question often comes up, and I don't have enough Solr experience to address it well. The start of this series looks really promising. Some things from this post that make ES so great for our use cases:
  • Multiple types of documents with different structures (think about indexing posts and comments from a WordPress blog in the same index)
  • Pretty much all settings and document mappings can be changed on the fly without restarting the cluster (though it does take some forethought to ensure you can use them all)
  • Routing allows limiting a search to a single shard. Particularly useful for faceted search.
 
August 20, 2012

Quickly Build Faceted Search with ElasticSearch and Backbone.js

I’ve been working with ElasticSearch off and on for the past year, and recently I’ve done a lot of work using Backbone.js to build interactive elements for web pages. Time to combine them together into a modular library: es-backbone.js.

Faceted search is one of the more powerful aspects of ElasticSearch. For es-backbone I was inspired by Karel Minarik’s very cool data visualization example. Initially I started from his implementation, then veered away from Protovis to use jQuery Flot for the charting when I realized my Javascript graphics abilities were not up to making use of Protovis. Flot makes it very fast to build and customize charts.

But managing all the data that comes back with your faceted search results, displaying it for the user, and allowing them to interact with the data and filter it further is also a bit of a headache. Using Backbone to keep a model of the current query the user is doing and another model of the search results helps keep the data well-organized and easy to update. By creating highly modular Backbone Views I can quickly customize a search page depending on what fields the data contains and how we want to display it.

I don’t have any public ElasticSearch data for a good demo, so a screenshot will have to do:

Each part of the page is a separate Backbone View, which allows you to customize the page very quickly based on what data you have. Think that pie chart of authors is too busy? Replacing it with a list of the top sites (similar to the tags list) takes only a few lines of code. Currently the library has Views for displaying facets as:

  • A pie chart of terms or ranges
  • A list of terms (with counts and percentages)
  • A timeline of dates (which auto switches scales between months, weeks, and days)

All of these views will re-filter your results when you click on the pie/chart/list so you can drill down into your search results. And I used Select2.js in tagging mode to make it easy to add and remove filters. I think the interaction is pretty cool, and wish I had some live public data to show it off on.

The library definitely allows new sites to be built very fast. Once the data is indexed, you can build a site in less than an hour (my last one took 35 min). I’ve released it on GitHub in the hopes that others will find it useful (patches welcome). I’ve included a simple example which you can grep through for “TODO” to find the parts you need to edit to customize for your own application.
