Scaling Elasticsearch Part 1: Overview

We recently launched Related Posts across WordPress.com, so it’s time to pop the hood and take a look at what ended up in our engine.

There’s a lot of good information spread across the web on how to use Elasticsearch, but I haven’t seen too many detailed discussions of what it looks like to scale an ES cluster for a large application. Being an open source company means we get to talk about these details. Keep in mind, though, that scaling is very dependent on the end application, so not all of our methods will be applicable to you.

I’m going to spread our scaling story out over four posts, starting with this overview.

Scale

WordPress.com is in the top 20 sites on the Internet. We host everyone from very big clients all the way down the long tail, where this blog resides. The primary project goal was to provide related posts on all of those page views (14 billion a month and growing). We also needed to replace our aging search engine that was running search.wordpress.com for global searches, and we have plans for many more applications in the future.

I’m only going to focus on the related posts queries (searches within a single blog) and the global queries (searches across all blogs). They illustrate some really nice features of ES.

Currently every day we average:

  • 23m queries for related posts within a single shard
  • 2m global queries across all shards
  • 13m docs indexed
  • 10m docs updated
  • 2.5m docs deleted
  • 250k delete by query operations

Our index has about 600 million documents in it, with 1.5 million added every day. Including replication (three copies of each shard) there is about 9 TB of data, or roughly 5 KB per document per copy.

System Overview

We mostly rely on the default ES settings, but there are a few key places where we override them. I’m going to leave the details of how we partition data to the post on indexing.

  • The system is running 0.90.8.
  • We have a single cluster spread across 3 data centers. There are risks with doing this: you need low-latency links, and some longer timeouts to prevent communication problems within the cluster. Even with very good links between DCs, local nodes are still 10x faster to reach than remote nodes, so we increase the failure detection timeouts:
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

This also helps prevent nodes from being booted if they experience long GC cycles.

  • 10 data nodes and one dedicated master node in each DC (30 data nodes total)
  • Currently 175 shards with 3 copies of each shard.
  • Disable multicast, and only list the master nodes in the unicast list. Don’t waste time pinging servers that can’t be masters.
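A sketch of those discovery settings (the hostnames are illustrative; we run one dedicated master per data center):

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es-master-dc1:9300", "es-master-dc2:9300", "es-master-dc3:9300"]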
  • Dedicated hardware (96GB RAM, SSDs, fast CPUs) – CPUs definitely seem to be the bottleneck that drives us to add more nodes.
  • ES is configured to use 30GB of RAM with indices.fielddata.cache.size set to 20GB. This is still not ideal, but better than our previous setting of index.fielddata.cache: soft. 1.0 is going to have an interesting new feature that applies a “circuit breaker” to try to detect and squash out-of-memory problems. Elasticsearch has come a long way with preventing OutOfMemory exceptions, but I still have painful memories from when we were running into them fairly often in 0.18 and 0.19.
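For reference, here is the cache setting as a config line, plus a sketch of the 1.0 circuit breaker knob (the breaker setting name and value reflect my reading of the 1.0 docs, so treat that line as illustrative):

indices.fielddata.cache.size: 20gb
indices.fielddata.breaker.limit: 80%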
  • Use shard allocation awareness to spread replicas across data centers and within data centers.
cluster.routing.allocation.awareness.attributes: dc, parity

dc is the name of the data center; parity is set to 0 or 1 based on the ES host number: even-numbered hosts are on one router and odd-numbered hosts on another.
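Each node then declares its own values for those attributes in its config, along these lines (values illustrative):

node.dc: dc1
node.parity: 0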

  • We use fixed-size thread pools: very deep for indexing because we don’t want to lose index operations, and much shallower for search and get, since an operation queued that deeply will probably have timed out on the client by the time it gets a response.
threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  bulk:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 100
    queue_size: 200
  get:
    type: fixed
    size: 100
    queue_size: 200
  • Elastica PHP Client. I’ll add the disclaimer that I think this client is slightly over object-oriented, which makes it pretty big. We use it mainly because it has great error handling, and we mostly just set raw queries rather than building a sequence of objects (a sketch of that style follows the autoloader below). This limits how much of the code we have to load because we use some PHP autoloading magic:
//Autoloader for Elastica interface client to ElasticSearch
function elastica_autoload( $class_name ) {
	if ( 'Elastica' === substr( $class_name, 0, strlen( 'Elastica' ) ) ) {
		$path = str_replace( '\\', '/', $class_name );
		include_once dirname( __FILE__ ) . '/Elastica/lib/' . $path . '.php';
	}
}
spl_autoload_register( 'elastica_autoload' );
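
Here’s a sketch of the raw-query style mentioned above (the index name, field names, and host are hypothetical, and the real related posts query is more involved):

// Hypothetical example of sending a raw query array through Elastica
// rather than composing Query/Filter objects.
$client = new \Elastica\Client( array( 'host' => 'es.example.com', 'port' => 9200 ) );
$index  = $client->getIndex( 'posts' );

$raw_query = array(
	'query' => array(
		'filtered' => array(
			'query'  => array( 'match' => array( 'content' => 'elasticsearch scaling' ) ),
			// Scope to a single blog, as the related posts queries are.
			'filter' => array( 'term' => array( 'blog_id' => 12345 ) ),
		),
	),
	'size' => 5,
);

// Elastica accepts a raw array anywhere it accepts a Query object.
$results = $index->search( $raw_query );
foreach ( $results->getResults() as $result ) {
	echo $result->getId(), "\n";
}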
  • Custom round robin ES node selection built by extending the Elastica client. We track stats on errors and the number of operations, and we mark nodes as down for a few minutes if certain errors occur (e.g., connection errors). A simplified sketch of the idea follows this list.
  • Some customizations for node recovery. See my post on speeding up restart time.
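
A minimal sketch of the round robin node selection idea (class and method names are hypothetical; the real version extends the Elastica client and tracks more per-node stats):

// Hypothetical sketch: rotate through the ES nodes, skipping any that
// recently had a connection error.
class ES_Node_Picker {
	const RETRY_AFTER = 300; // seconds to keep a failed node out of rotation

	private $nodes;
	private $down_until = array(); // host => time when it may be retried
	private $next = 0;

	public function __construct( array $nodes ) {
		$this->nodes = $nodes;
	}

	public function pick() {
		$count = count( $this->nodes );
		for ( $i = 0; $i < $count; $i++ ) {
			$node = $this->nodes[ $this->next ];
			$this->next = ( $this->next + 1 ) % $count;
			if ( empty( $this->down_until[ $node ] ) || time() >= $this->down_until[ $node ] ) {
				return $node;
			}
		}
		return false; // every node is currently marked down
	}

	// Called when a connection error (not a query error) occurs.
	public function mark_down( $node ) {
		$this->down_until[ $node ] = time() + self::RETRY_AFTER;
	}
}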

Well, that’s not so hard…

If you’ve worked with Elasticsearch a bit then most of those settings probably seem fairly expected.

Elasticsearch scaling is deceptively easy when you first start out. Before building this cluster we ran a two node cluster for over a year and were indexing a few hundred very active sites. We threw more and more data at that small cluster, and more and more queries, and the cluster hardly blinked. It has about 50 million documents on it now, and gets 1-2 million queries a day.

It is still pretty shocking to me how much harder it was to scale a 10-100x larger cluster. There is a substantial and humbling difference between running a small cluster with tens of GB of data and running a cluster with 9 TB.

In the next part of this series I’ll talk about how we overcame the indexing problems we ran into: how our index structure scales over time, our optimizations for real time indexing, and how we handle indexing failures.


Comments

13 responses to “Scaling Elasticsearch Part 1: Overview”

  1. bitsofinfo

    What is your minimum master nodes set to?

    1. Greg

      We set it to two.
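
      In elasticsearch.yml terms, that corresponds to:

      discovery.zen.minimum_master_nodes: 2

      (a quorum of our three dedicated master nodes)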

      1. bitsofinfo

        Have you guys ever experienced any split brains like this one? https://github.com/elasticsearch/elasticsearch/issues/2488

      2. Greg

        The last time we ran into a split brain (maybe the only time) was probably a year and a half ago before 0.90 when running a two node cluster without any dedicated masters.

        My 2c on that bug:
        – Things have gotten a lot better since that bug (and the related one) was originally opened (I’ve been using ES since before either of them were opened).
        – There are still some legitimate corner cases, which is why it hasn’t been closed.
        – These cases are much less likely to happen when using dedicated master nodes and when you have an odd number of them.

        I really can’t stress enough how important dedicated master nodes are. Not having them is an easy way to end up with bad things happening. Particularly when you have any decent cluster load.

        We do still run a two node cluster in production and haven’t had any issues for over a year. But it is fairly beefy and not very heavily utilized. I’m constantly considering upgrading it to have dedicated masters, but haven’t quite been able to justify it to myself yet.

  2. Marco Schirrmeister

    That is an interesting post. Thanks for the information about a bigger setup.

    I saw your comment about the split brain and that things got better in the latest versions.
    If you have not experienced that situation I assume you have a very stable link between the data centers? Or maybe even multiple links with proper dynamic routing?

    I did some lab tests on a smaller scale where I simulated 2 DC and 3 DC setups with network link failures between the DCs.
    I would not consider network link failures corner cases, unless you set your own standards and decide to run only in one region with low-latency, stable links between the DCs, like you mentioned.

    With the 2 DC setup, the site with the minority of master nodes went offline, which was OK. No chance of inconsistent data.
    With the 3 DC setup and a link failure between 2 sites, I ended up with 2 clusters, and the master in the 3rd site was part of both clusters.

    What I took away from it is that I don’t recommend running one multi data center cluster, when the DCs are in different geographic regions or if you can’t guarantee stable links between the DCs.

    1. Greg

      If you are doing multiple data centers you need really good links and low link latency for ES to work OK. We don’t cross any oceans to get between our DCs.

      Part of your question becomes what are the consequences of a failure? For us, we’ll still keep serving queries. Some of our data will become out of sync or stale, but very little of it. For our currently deployed search applications (related posts and search.wordpress.com) this is not a good scenario, but neither is it catastrophic.

      We are able to bulk reindex our data now in less than 4 days, and our system logs those blogs that are very active for faster reindexing.

      Pushing ES to be better is extremely important to me, so I don’t want to dismiss the issues you bring up. They are very important. I also think it is worth looking at the larger picture of the consequences of such a failure compared to not launching an application at all.

      Thanks for describing your testing. Have you looked to see if the master branch of ES has any potential fixes that you can test out and verify? With 1.0 out the door I would bet this problem can become a higher priority.

      1. Marco Schirrmeister

        I have not tested the RC releases of 1.0 yet. If those bugs are really gone in 1.0, then I think fundamental changes have been made to how the clustering works under the hood.

        At the moment ES uses a full mesh: each node connects to every other node.
        With my 3 DC setup (A,B,C), if the connection between A and B is gone, nodes in C still tell nodes in B that nodes in A are up.
        Which means nodes in C try to connect to nodes in A, which of course fails and the cluster is in a weird state.
        I would love to have an environment like yours, where this situation maybe never occurs. 🙂 But unfortunately that’s not the case.

        But I hope I can try 1.0 relatively soon to see how it behaves with network link failures between nodes or DCs.

  3. subsonico517

    Hi, interesting work. I’m facing Elasticsearch for the first time now and I have one question:

    You added 600 million documents. How do you get 9 TB of data? That’s approximately 5 MB for each document, counting replicas. Isn’t it too much? What is it that occupies so much space? Is it a matter of Elasticsearch? Would the same amount of data (still not counting replicas) occupy more or less space in a SQL database?

    Nicola

    1. Greg Ichneumon Brown

      Hi Nicola,

      Hmmm, I’m coming up with 5kb per doc (9e12/6e8 ≈ 15kb including replication, so ~5kb per copy).

      The doc size does vary a fair bit based on application. Also it’s good to expect that you will need a fair bit of extra space, depending on how much real time indexing you are doing. On our current index we are seeing almost 30% of the index taken up by deleted docs. Partly this is due to our large shard size, but even with an improved sharding scheme we are working on, we see 18% deleted docs.

      Deleted docs have real performance and disk space implications that you should be sure to test out.

  4. subsonico517

    Thanks. Shame on me, it was really just a little arithmetic. I should use Occam’s razor more 😉

    Very, very interesting. I find Elasticsearch innovative and modern for storing data such as WordPress posts, since it could solve some limitations of the current WordPress search engine. I’m not an expert in data systems, and I’m reading docs here and there, looking at indexing experiences to figure out the most important features to consider before deciding how many data centers to use and how to equip them. I noted these points:

    – RAM allocated for Elasticsearch around 30-50%
    – Uniform shard size improves querying speed
    – Faceting queries need larger heaps
    – Multicore processing is better for multiple shards

    Is there something else that I should consider?

    PS: One small doubt. You wrote the dedicated hardware has 96 GB. Is that the total amount, or is it 96 GB for each data center?

    Nicola

    1. Greg Ichneumon Brown

      Each of our servers has 96GB of RAM. 30GB of that is allocated to ES; the rest is used by the OS for file system caching.
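
      For reference, on 0.90-era installs that heap size is typically set through an environment variable:

      ES_HEAP_SIZE=30g

      leaving the remaining RAM to the OS file system cache.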

      How powerful the servers are depends a lot on how much data and querying you need to do.

      If you’re building your own WordPress search you may want to take a look at https://github.com/alleyinteractive/searchpress

  5. subsonico517

    Thanks. Does this plugin have different features compared to WP-Elasticsearch?

    1. Greg Ichneumon Brown

      I haven’t looked in too much detail at WP-Elasticsearch; it is a separate project from SearchPress. There is also another project called Fantastic Elasticsearch.
