kentoh - Fotolia

Use Elasticsearch shard allocation to distribute workload demand

Learn how to use Elasticsearch shards to distribute data storage and retrieval across physical resources, as workload demands grow.

Elasticsearch is a NoSQL document database that can store any kind of JSON-formatted data, from log data for systems management and monitoring to customer data for business intelligence. Administrators must configure Elasticsearch for optimal performance, which requires an understanding of workloads, indices and shards.

Elasticsearch uses indices to organize data by shared characteristics. Users can create, join and split indices. Because an index could contain a large quantity of interrelated documents or data, Elasticsearch enables users to configure shards -- subdivisions of an index -- to direct documents across multiple servers. This practice spreads out a workload when an index has more data than one server can handle.

Use this tutorial to walk through the Elasticsearch shard allocation process.

Editor's note: Check out the author's companion article to further explore the use of indices, shards and mapping for document organization and management in Elasticsearch.

Elasticsearch shard allocation filtering

To follow the tutorial below, first set up an Elasticsearch cluster.

Then, create and define a custom attribute to route documents. Place the attribute in /etc/elasticsearch/elasticsearch.yml, and use any of the following options to route documents based upon whether they have, require or exclude the given attribute:

index.routing.allocation.include.{attribute}
index.routing.allocation.exclude.{attribute}
index.routing.allocation.require.{attribute}

You can assign shards to groupings of physical IT resources -- think racks, data centers or any other custom attribute. Additionally, there are built-in attributes, such as node name or IP address. To assign shards to nodes, add node.attr.{attribute} to /etc/elasticsearch/elasticsearch.yml.

For example, in this two-node cluster, with nodes paris and eurovps, store all docs in the inventory index on the paris server. Give it a description attribute, such as position, and then assign one of the servers the value position=left, in /etc/elasticsearch/elasticsearch.yml.

cluster.name: paris
node.name: paris
node.attr.position: left
node.master: true
node.data: true
node.ingest: true

Then, create the index settings. Use the index.routing.allocation.require.{attribute} option to tell Elasticsearch where to store the documents in that index. In the example below, the value for number_of_replicas is set to 0, for illustration purposes.

curl -XPUT --header 'Content-Type: application/json' http://parisx:9200/inventory -d '{
      "settings" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 0 ,
            "number_of_routing_shards" : 2

},
        "index.routing.allocation.require.position": "left"
    }
}'

To test this configuration, add three documents -- numbered 1, 2 and 3 -- to the inventory index. In this example, each document provides information on the example company's shoes in inventory:

curl -XPOST --header 'Content-Type: application/json' http://parisx:9200/inventory/_doc/1 -d '{
       "item":  "shoes",
       "color":  "blue",
       "size":  44    
}'

curl -XPOST --header 'Content-Type: application/json' http://parisx:9200/inventory/_doc/2 -d '{
       "item":  "shoes",
       "color":  "white",
       "size":  44    
}'

curl -XPOST --header 'Content-Type: application/json' http://parisx:9200/inventory/_doc/3 -d '{
       "item":  "shoes",
       "color":  "black",
       "size":  44    
}'

Now, call the _cat/shards API to see on which node Elasticsearch stored the documents in this index.

As shown below, p indicates whether an entry is a replica or primary; n is node; and d indicates the number of documents. Type curl -XGET --header 'Content-Type: application/json' http://parisx:9200/_cat/shards/?help to get assistance working with the arguments used for Elasticsearch shard allocation.

Below, the results list the paris node twice, which could prove confusing. But look closely, and you'll see that the sum is three documents on the paris node. The other node, eurovps, has no documents in the inventory index.

curl -XGET --header 'Content-Type: application/json' http://parisx:9200/_cat/shards/inventory?h=d,n,p
 
2 paris p
1 paris p

Split a shard

You can split Elasticsearch shards as the system grows. However, splitting a shard, which is almost like a reindexing operation, requires additional space and resources. As a result, if possible, it's best to plan the shard distribution before you run out of space or see a drop in performance.

First, set the index to read-only:

curl -XPUT --header 'Content-Type: application/json' http://parisx:9200/inventory/_settings -d '{
  "settings": {
    "index.blocks.write": true
  }
}'

Copy the docs to a new index, and specify two shards. You cannot do this operation on existing data and instead must copy it to a new place.

The new value for number_of_shards has to be a factor of the result for number_of_routing_shards, which, in this example, is set at 2. Because the shards must be a factor of routing shards, number_of_routing_shards = 3 is invalid for number_of_shards at 2, since 2 does not divide 3 equally, meaning into a whole number. Summarized simply, number_of_routing_shards is equal to the number of times you can split a shard.

curl -XPOST --header 'Content-Type: application/json' http://parisx:9200/inventory/_split/new_inventory?copy_settings=true -d '{
  "settings": {
    "index.number_of_shards": 2
  }
}'

Indices and shards enable users to grow data in Elasticsearch without affecting search performance, as the engine can perform parallel searches on more than one shard at once. Elasticsearch shard allocation also enables the distribution of data storage and retrieval across physical resources. This can be helpful, for example, as IT organizations collect hundreds of operations logs daily and wish to store the data to predict capacity demands on a quarterly or even yearly basis.

Dig Deeper on IT systems management and monitoring