
Migrating Elasticsearch Data Between Servers




As we got closer to the launch day of our new Site Explorer, an analytics and research app that helps you understand where you and your competition stand online, we needed to move an Elasticsearch index from a development server to our production server.

In theory, it was supposed to be a smooth process: since both servers run the same CentOS 7 OS and Elasticsearch version 6.1.0, we could just copy the "data" folder (as its name suggests, where everything is stored) from server A to server B, and we would be good to go...

Boy was I wrong...

In hopes of helping other devs out and saving them some precious time, I wanted to quickly write about the challenges we faced and the migration methods available out there; unfortunately, the internet lacks a practical and up-to-date guide on the topic.



There are actually quite a few official and unofficial ways of migrating data between servers. Each comes with its own advantages and trade-offs:


Simply copy-pasting the "data" folder


In our situation, dragging the data folder from server A to B resulted in a red cluster health status, which means either:
1. no data is available, or
2. some data is missing.
Since neither option is acceptable, we had to reverse the changes we made.
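
You can check the cluster status at any point with the cluster health API:

curl -XGET 'localhost:9200/_cluster/health?pretty'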

It turns out that having the same ES version isn't enough; both Elasticsearch installations must also have the same configuration.

Check the elasticsearch.yml files on both installations. If they share the same characteristics, such as cluster name and number of shards, give this method a try and see what happens, because it's the "go-to" method for many devs and should work for single-node, simple configurations.
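
A quick way to compare the two configurations is to diff the non-comment lines of each file. The file path below assumes the default RPM install location, and serverA/serverB are placeholders for your own hosts:

diff <(ssh serverA "grep -v '^#' /etc/elasticsearch/elasticsearch.yml") \
     <(ssh serverB "grep -v '^#' /etc/elasticsearch/elasticsearch.yml")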

However, after not being able to activate the shards, we moved on to other methods without investigating this one further.


Snapshot/Restore method


Long story short, we ended up using this method. Snapshot/restore works perfectly for this task as long as you have a shared file system or a single-node cluster.

The advantage of this approach is that the data and the indices are snapshotted entirely.

Pros

  • All data and the indices are saved/restored as they are
  • Only 2 calls to the Elasticsearch API are needed
  • Easiest way of migrating between different versions
  • It took us just a few hours to get the job done

Cons

  • If the cluster has more than one node, you need to set up a shared file system or use some cloud storage
  • Snapshots of indices created in 1.x cannot be restored to 5.x or 6.x, and snapshots of indices created in 2.x cannot be restored to 6.x

How we did it


First, we need to take a snapshot of the source data:

1. Open up elasticsearch.yml and insert the following line, which states the directory path where you want to store the snapshot:

path.repo: ["/etc/elasticsearch/backups"]
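
Make sure that directory actually exists and is writable by Elasticsearch; the elasticsearch user and group below are an assumption based on the default CentOS package install:

sudo mkdir -p /etc/elasticsearch/backups
sudo chown -R elasticsearch:elasticsearch /etc/elasticsearch/backups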

2. Restart Elasticsearch so that the new path.repo setting takes effect:

                                    

$ sudo service elasticsearch restart

                                    
                                

3. Register the shared file system repository with the name "my_backup" (or whatever you want) with the following command:

                                
curl -XPUT 'localhost:9200/_snapshot/my_backup?pretty' -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/etc/elasticsearch/backups",
    "compress": true
  }
}
'

4. Finally, create the snapshot (we call it snapshot_1 here):

                                    
curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true&pretty'
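
The wait_for_completion flag makes the call block until the snapshot is done; you can also inspect the snapshot's state and the indices it contains at any time with:

curl -XGET 'localhost:9200/_snapshot/my_backup/snapshot_1?pretty'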

                                    
                                

5. As the last step, move the snapshot directory to your target server, insert the backup path into the YML file there, restart ES, and register the same shared file system repository (steps 1-3 above, repeated on the target). Then restore the snapshot.
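
One way to move the snapshot directory, as a rough sketch (target-server and the paths are placeholders for your own setup):

rsync -avz /etc/elasticsearch/backups/ target-server:/etc/elasticsearch/backups/

Once the repository is registered on the target server, restore the snapshot with the following command: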

                                   
curl -XPOST 'localhost:9200/_snapshot/my_backup/snapshot_1/_restore?pretty'

That's it!

Restart Elasticsearch or reboot your server, and you're good to go.
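
If you want to double-check the result, list the indices and their health on the target server:

curl -XGET 'localhost:9200/_cat/indices?v'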


_Reindex API


It might be easier to use the _reindex API to transfer data from one ES cluster to another. There is a special "reindex from remote" mode that supports exactly this use case.

What reindex actually does is a scroll on the source index and a lot of bulk inserts to the target index (which can be remote).

However, there are a couple of issues you should take care of:

1. Setting up the target index is on you (no mappings or settings will be created by reindex).
2. If some fields on the source index are excluded from _source, their contents won't be copied to the target index.

A minimal example follows the pros and cons below.

Pros

  • Works for clusters of any size
  • Data is copied directly (no intermediate storage required)
  • 1 call to the ES API is needed

Cons

  • Data excluded from _source will be lost
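
For reference, here is a minimal reindex-from-remote sketch, run against the target cluster. The host and index names are placeholders for your own setup, and the source host must also be whitelisted via the reindex.remote.whitelist setting in elasticsearch.yml on the target:

curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://source-server:9200"
    },
    "index": "my_index"
  },
  "dest": {
    "index": "my_index"
  }
}
'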

Third party plugins


Since the snapshot/restore API solved our issue, we haven't investigated this method properly. But you may use one of the supported repository plugins for S3, HDFS and other cloud storage.
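
For example, the S3 repository plugin can be installed on each node like this (the path assumes the default package install; we didn't test this route ourselves), after which an S3 bucket can be registered as a snapshot repository:

sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install repository-s3
sudo service elasticsearch restart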

That was all. I hope it helps someone.


Cem Akbulut

Founded Seotify. Solving problems for a living. These are my opinions. Not necessarily shared by reasonable-minded people nor my benevolent corporate overlords