Elasticsearch with Python


Elasticsearch is a distributed data store and real-time search engine.
  • Distributed storage – you only need to set up Elasticsearch nodes and add them to the cluster; it keeps the data distributed across the cluster nodes. The distribution also makes the data durable and highly available.
  • Real-time search engine – you can query the data almost as soon as it has been written.
Because of these two attributes, Elasticsearch comes up wherever real-time data analysis is discussed. It would not be an overstatement to say that technologies like Elasticsearch lay the foundation for efficient and reliable search engines.
Elasticsearch itself is implemented in Java, but it provides a RESTful API, which makes it usable from any programming language.


1 – Install Elasticsearch

Here’s a step-by-step guide for installing Elasticsearch on CentOS 7.x – Install Elasticsearch 5 on CentOS 7.x

2 – Install elasticsearch-py

elasticsearch-py is the official low-level Python client for Elasticsearch.
Elasticsearch itself exposes a RESTful API and supports the CRUD operations (Create, Read, Update, Delete) over HTTP without any client, i.e. you can get the data using a command-line tool (e.g. curl) or simply via your Internet browser. For example:
 curl -XGET 'http://localhost:9200/dummydata-*/_search?pretty'
It returns, in JSON format, documents (the first 10 by default) from all the indexes whose names start with ‘dummydata-’.
Why elasticsearch-py:
  • Integrating Elasticsearch as a data storage and search component into your Python-dominant infrastructure.
  • Indexing data without worrying about translating basic Python data types to JSON.
  • Load balancing across all the Elasticsearch nodes.
  • Thread safety.
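The load-balancing point can be illustrated with a short sketch: when the client is given several nodes, it spreads requests across them (round-robin by default). The host names used in the usage note are hypothetical, and `node_list` and `connect` are helper names made up for this example.

```python
def node_list(hostnames, port=9200):
    """Build the list of host dictionaries that Elasticsearch() accepts."""
    return [{'host': name, 'port': port} for name in hostnames]


def connect(hostnames):
    """Create one client that talks to every node in `hostnames`."""
    # Imported here so node_list() can be used without the client installed.
    from elasticsearch import Elasticsearch
    return Elasticsearch(node_list(hostnames))
```

For example, `connect(['es-node-1', 'es-node-2'])` would return a single client object that can be shared between threads.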
Install the Python client for Elasticsearch using pip:
pip install elasticsearch==5.1.0
Note: For Elasticsearch 5.x, use major version 5 of elasticsearch-py.

3 – Adding data to Elasticsearch

Since Elasticsearch is used primarily for real-time searching (and distributed storage), the first thing to do is load some data. The ‘Create’ and ‘Update’ operations of CRUD are both called indexing in Elasticsearch: if you index a document with a type and ID that do not already exist, it is inserted (Create); if they already exist, the stored document is overwritten (Update). Let’s load our first set of data into Elasticsearch.

REST call

 curl -XPUT 'http://localhost:9200/<index>/<doc_type>/<doc_id>' -d '<data_to_load>'

Python Client call

es.index(index='<index>', doc_type='<doc_type>', id=<doc_id>, body=json.loads(<data_to_load>))
The index and type are required, while the ID is optional. If you don’t specify an ID, ES generates one for you (over REST, that means sending a POST to /<index>/<doc_type> instead of a PUT).
The index name is arbitrary. If an index with that name does not already exist on ES, one is created using the default configuration.
Similarly, the type name is also arbitrary. It serves several purposes, including:
  • Each type has its own ID space.
  • Different types within an index can each have a schema (mapping) of their own.
  • Documents can be searched using the type as a filter.
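The ID and type behavior described above can be sketched with the client's `index` and `search` methods. The index name ‘scratch’ and type name ‘notes’ are placeholders chosen for this example, and the function names are illustrative only; `es` is assumed to be a connected Elasticsearch client.

```python
def index_same_id_twice(es, index='scratch', doc_type='notes'):
    """Index two documents under one ID, then one without an ID.

    Returns (version after the first write, version after the second
    write, generated ID of the third document).
    """
    first = es.index(index=index, doc_type=doc_type, id=1, body={'text': 'draft'})
    # Same index, type and ID: the stored document is overwritten.
    second = es.index(index=index, doc_type=doc_type, id=1, body={'text': 'final'})
    # No ID given: Elasticsearch generates one.
    third = es.index(index=index, doc_type=doc_type, body={'text': 'other'})
    return first['_version'], second['_version'], third['_id']


def search_one_type(es, index='scratch', doc_type='notes'):
    """Search only the documents of the given type within the index."""
    return es.search(index=index, doc_type=doc_type,
                     body={'query': {'match_all': {}}})
```

On a fresh index the two versions should come back as 1 and 2, which is how an overwrite shows up in the response. Newly indexed documents become visible to search only after the next index refresh, so a search issued immediately afterwards may not include them yet.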

An Example

For this tutorial let’s use the publicly available Star Wars API, swapi. It currently has 6 different endpoints; I am choosing ‘planets’: http://swapi.co/api/planets/ .
The indexing script works as follows (line numbers refer to the original script):
  • Line 5 – gets the Elasticsearch connection object.
  • Line 7 – checks that ES is up and running, i.e. returns a 200 OK response.
  • Line 9 – the loop runs until it gets a non-200 response, e.g. 404 Not Found.
  • Line 10 – gets the planet data, one ID at a time.
  • Line 11 – indexes the data into Elasticsearch; ‘swapi’ is the name of the index and ‘planets’ the doc_type.
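A minimal sketch of a script following these steps might look like the one below. It is a reconstruction under stated assumptions, not the original script, so its line numbers do not match the list above. It assumes the `requests` library for the HTTP calls and uses `es.ping()` for the health check; `index_planets` and `main` are names chosen for this example.

```python
import json

PLANETS_URL = 'http://swapi.co/api/planets/'  # endpoint named in the text


def index_planets(es, http_get, index='swapi', doc_type='planets'):
    """Index planets 1, 2, 3, ... until the API answers with a non-200 status.

    `es` is an Elasticsearch client. `http_get` is any callable that takes a
    URL and returns an object with `status_code` and `text` attributes
    (for example requests.get). Returns the number of documents indexed.
    """
    planet_id = 1
    while True:
        response = http_get(PLANETS_URL + str(planet_id))
        if response.status_code != 200:  # e.g. 404 Not Found: stop here
            break
        es.index(index=index, doc_type=doc_type, id=planet_id,
                 body=json.loads(response.text))
        planet_id += 1
    return planet_id - 1


def main():
    # Third-party imports live here so index_planets() can be reused alone.
    import requests
    from elasticsearch import Elasticsearch

    es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if not es.ping():  # ping() returns True when the cluster answers
        raise SystemExit('Elasticsearch is not reachable on localhost:9200')
    print(index_planets(es, requests.get))
```

Calling `main()` runs the load against a local cluster. Passing the HTTP function in as a parameter is a design choice made here so the loop can be exercised without network access.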
Note: You can download the original Python script from GitHub.
Executing that script indexed 63 entries in ES.

4 – Get a document

To verify that the data was indexed properly, let’s get and search data on ES. In Elasticsearch, get and search are the equivalents of the ‘Read’ operation of CRUD. To fetch a specific document by ID, use the get method; the index, doc_type and id params are mandatory. Get the planet with ID 10:
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
es.get(index='swapi', doc_type='planets', id=10)
It returns the planet ‘Kamino’.
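The document itself sits under the `_source` key of the response. As a small sketch, the hypothetical helper below returns just the planet name and uses the client's `ignore=404` option so that a missing ID yields `None` instead of raising an exception.

```python
def planet_name(es, planet_id):
    """Return the planet's name, or None if no document has that ID."""
    doc = es.get(index='swapi', doc_type='planets', id=planet_id, ignore=404)
    if not doc.get('found'):
        return None
    # The fields that were indexed are stored under '_source'.
    return doc['_source']['name']
```

With the data loaded earlier, `planet_name(es, 10)` should give ‘Kamino’.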

5 – Search documents using Query DSL

If you don’t have an ID to look up the exact item, you can search an index using a document attribute. For example, let’s search for the planet named ‘Kamino’:
es.search(index="swapi", body={"query": {"match": {'name':'Kamino'}}})
It returns a single planet, with ID 10.
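The matching documents are listed under `hits.hits` in the search response. A minimal sketch of pulling out the ID and name of each hit follows; `search_names` is a helper name invented for this example.

```python
def search_names(es, field, text, index='swapi'):
    """Run a match query and return (id, name) pairs for the hits."""
    res = es.search(index=index, body={'query': {'match': {field: text}}})
    # Each hit carries metadata ('_id', '_score', ...) plus the document
    # itself under '_source'.
    return [(hit['_id'], hit['_source']['name']) for hit in res['hits']['hits']]
```

Note that the `_id` values in the response are strings, so the planet indexed with `id=10` comes back as `'10'`.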
If you don’t even have the exact term (planet name) to search for, query using ‘prefix’:
es.search(index="swapi", body={"query": {"prefix": {'name':'ka'}}})
It returns four planets: ‘Kashyyyk’ with ID 14, ‘Kamino’ with ID 10, ‘Haruun Kal’ with ID 42, and ‘Kalee’ with ID 59.
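‘Haruun Kal’ is in the result even though the name does not start with ‘ka’. Assuming the default dynamic mapping, the name field is analyzed into lowercase tokens, and a prefix query is matched against those tokens rather than against the whole string. The pure-Python sketch below imitates that behavior. `standard_tokens` and `prefix_matches` are illustrative helpers, not part of elasticsearch-py, and the regular expression is only a rough approximation of the standard analyzer.

```python
import re


def standard_tokens(text):
    """Rough stand-in for the standard analyzer: split into words, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def prefix_matches(names, prefix):
    """Return the names that have at least one token starting with `prefix`."""
    return [name for name in names
            if any(token.startswith(prefix) for token in standard_tokens(name))]


planets = ['Kashyyyk', 'Kamino', 'Haruun Kal', 'Kalee', 'Tatooine', 'Alderaan']
matches = prefix_matches(planets, 'ka')
# 'Haruun Kal' matches through its second token, 'kal'.
```

Because the prefix itself is not analyzed, a capitalized prefix such as ‘Ka’ would match nothing under these assumptions: the stored tokens are all lowercase.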
See the Elasticsearch Query DSL documentation for the complete list of query parameters.

6 – Delete a document

es.delete(index='swapi', doc_type='planets', id=10)
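To confirm that the document is gone, the delete can be followed by an existence check. The sketch below uses the client's `exists` method; `delete_and_verify` is a name chosen for this example. Deleting an ID that is not present makes the client raise a `NotFoundError`, so the function assumes the document exists beforehand.

```python
def delete_and_verify(es, planet_id, index='swapi', doc_type='planets'):
    """Delete one document and return True if it can no longer be fetched."""
    es.delete(index=index, doc_type=doc_type, id=planet_id)
    # exists() returns a boolean instead of the document itself.
    return not es.exists(index=index, doc_type=doc_type, id=planet_id)
```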
