Elasticsearch in Django

Written by Ferdy Galema on 11th March 2015

There are several options for adding full text search to a Python/Django project. At first you may rely entirely on the search capabilities of your database, but you will soon find that they are severely limited in functionality and that even basic text searches can become painfully slow. This article describes our experiences with using Elasticsearch to implement search in Django.

Elasticsearch in a nutshell

In short, Elasticsearch is a very popular search server based on Apache Lucene. It is open source, RESTful, schema-free and has a distributed design. It has clients for many programming languages and a long list of features. My favourite part is how easy it is to use out of the box: if you don’t need any fancy stuff, you can set up a server, index documents and search them in a matter of seconds. See the following example.

# 1) Download and untar
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz && tar xf elasticsearch-1.4.4.tar.gz
# 2) Start
elasticsearch-1.4.4/bin/elasticsearch
# 3) Index
curl -XPUT localhost:9200/blogs/blog/1 -d '{"title": "Something to say", "publish_date": "2015-04-01", "content": "This blog is about blah blah and did you know that yada yada yada.."}'
# 4) Search
curl localhost:9200/blogs/_search?q=something

Schema-free simply means you do not have to explicitly define the fields of your index model. Also note that the index doesn’t have to exist: Elasticsearch will create it automatically once the first document is indexed. This is a great way to get comfortable with Elasticsearch.

Because Elasticsearch is a standalone service (consisting of one or more servers), you need a way to sync your database with Elasticsearch, at least for the models you want in the index. Although some people use Elasticsearch directly as a Django database backend, this is highly experimental and you would lose many nice features of a relational database (such as transactions and consistency across multiple tables).

So, how do we get our data into Elasticsearch?

Why we didn’t choose Haystack

Haystack is a search app for Django that provides pluggable search backends, similar to an abstraction for the database layer. It supports multiple engines, such as Solr and Elasticsearch. We started out by using Haystack as a layer between Django and Elasticsearch.

It was not difficult to set up. After installing Haystack, you configure an engine and create an Index class for every model you wish to index. A nice feature of Haystack is its built-in support for real time model updates: it listens to Django signals, so the index stays in sync with the database.

The Index class for a model offers some basic options for how fields should be indexed, such as the field name and type. However, we noticed that we needed more fine-grained tuning than Haystack allowed. In particular, we wanted to specify index field analyzers. Analyzers define how fields are split into tokens, and so control how the terms in a document match up with the search terms entered by the user. Haystack did not offer a way to specify them, at least not in a clean way.
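To give an idea of what we wanted to control, here is a sketch of analyzer settings in the Elasticsearch 1.x style (the analyzer name and the blog fields are illustrative; this is not our exact configuration):

```python
# Index settings with a custom analyzer, Elasticsearch 1.x style.
# "blog_analyzer" and the field names are made up for illustration.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Split on word boundaries, lowercase, then stem
                # ("running" matches "run").
                "blog_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "snowball"],
                }
            }
        }
    },
    "mappings": {
        "blog": {
            "properties": {
                # Point the text fields at the custom analyzer.
                "title": {"type": "string", "analyzer": "blog_analyzer"},
                "content": {"type": "string", "analyzer": "blog_analyzer"},
                # Dates and exact values are not analyzed at all.
                "publish_date": {"type": "date"},
            }
        }
    },
}
```

These settings are sent once, when the index is created; changing analyzers afterwards means re-indexing.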

ElasticUtils

The next thing we tried was ElasticUtils, a Django search library specifically tailored to Elasticsearch. It provides some basic building blocks for your application, but leaves plenty of room for custom behaviour. In the end you can always retrieve the raw Elasticsearch client, which allows for all sorts of interaction with the search server. You could do something similar with Haystack, of course, but that defeats the point of having an abstraction in the first place.

As with Haystack, you provide a class for every model that needs to be in the index: a Mapping class. It describes how a model instance translates to the dictionary that becomes a document in the index. In our application, we extend this mapping with a definition of which analyzers the fields should use.
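A sketch of what such a Mapping class might look like, with plain dataclasses standing in for the Django models; the class and method names here are illustrative, not the exact ElasticUtils API:

```python
from dataclasses import dataclass
import datetime

# Plain stand-ins for the Django models (illustration only).
@dataclass
class Author:
    name: str

@dataclass
class Blog:
    title: str
    publish_date: datetime.date
    content: str
    author: Author

class BlogMapping:
    index_type = "blog"

    @classmethod
    def get_model(cls):
        return Blog

    @classmethod
    def extract_document(cls, obj):
        # Translate a model instance into the dict that becomes the
        # Elasticsearch document; note the denormalized author name,
        # pulled in through the foreign key.
        return {
            "title": obj.title,
            "publish_date": obj.publish_date.isoformat(),
            "content": obj.content,
            "author_name": obj.author.name,
        }
```

The denormalized `author_name` field is what makes the foreign key scenario discussed below interesting: the Blog document must be updated when its Author changes.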

Batch indexing is pretty easy to implement, usually as a management command. Unlike Haystack, ElasticUtils does not provide a management command of its own, but writing one yourself is straightforward: you iterate over all objects for which you have a Mapping and feed them into the index as documents (real time model updates are another story, which I’ll discuss in a bit). Before feeding the documents, the index is created with the appropriate analyzers as defined in the Mapping. If you want to re-index while continuing to serve documents, you should definitely look at Elasticsearch’s aliases functionality. Aliases are a great way to always have a live index available: you build the new index and then switch the live alias over to it in a single step.
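The batch pass with alias switching can be sketched as follows. This is a hand-rolled illustration, not the ElasticUtils or official client API: `FakeClient` is an in-memory stand-in for a real Elasticsearch client, and the mapping class mirrors the shape described above.

```python
from types import SimpleNamespace

class FakeClient:
    """In-memory stand-in for an Elasticsearch client (illustration only)."""
    def __init__(self):
        self.indexes = {}   # index name -> {(doc_type, id): document}
        self.aliases = {}   # alias name -> index name

    def create_index(self, name):
        self.indexes[name] = {}

    def index_document(self, index, doc_type, doc_id, body):
        self.indexes[index][(doc_type, doc_id)] = body

    def switch_alias(self, alias, index):
        # A real client would do this atomically via the aliases API.
        self.aliases[alias] = index

class BlogMapping:
    """Minimal mapping stand-in: which objects to index, and how."""
    index_type = "blog"

    def get_queryset(self):
        # In Django this would be Blog.objects.all() or similar.
        return [SimpleNamespace(pk=1, title="Something to say")]

    def extract_document(self, obj):
        return {"title": obj.title}

def rebuild_index(client, mappings, alias="blogs"):
    """Build a fresh index, fill it, then point the live alias at it."""
    new_index = alias + "_v2"  # in practice: a timestamped name
    client.create_index(new_index)
    for mapping in mappings:
        for obj in mapping.get_queryset():
            doc = mapping.extract_document(obj)
            client.index_document(new_index, mapping.index_type, obj.pk, doc)
    # Searches against the alias keep working during the rebuild and
    # only see the new index after this switch.
    client.switch_alias(alias, new_index)

client = FakeClient()
rebuild_index(client, [BlogMapping()])
```

The key design point is the last line of `rebuild_index`: because queries go through the alias rather than a concrete index name, the switch is invisible to searchers.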

To mimic Django’s QuerySet syntax, ElasticUtils provides an S class as the entry point for searching (I’m not really sure why they chose a single-letter class name). Like QuerySet, it is lazy and you can chain it to define your search:

s = S().query(title='something').order_by('-publish_date')

For basic searching this is fine. For advanced use cases there is the filter_raw method, which lets you add any filter Elasticsearch offers by specifying it in dictionary format:

s = s.filter_raw({'and': a_list_of_filters})

The list of filters is a list of dictionaries, each specifying an Elasticsearch filter. In this example they are all applied to the query by means of the ‘and’ filter.
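Such a filter list is plain Elasticsearch 1.x filter DSL. With illustrative field names, it might look like this:

```python
# Example filter list in the Elasticsearch 1.x filter DSL; combined
# through the 'and' filter, a document must match all of them.
# Field names ("author_name", "publish_date") are illustrative.
a_list_of_filters = [
    {"term": {"author_name": "ferdy"}},                   # exact match
    {"range": {"publish_date": {"gte": "2015-01-01"}}},   # date lower bound
]
combined = {"and": a_list_of_filters}
```

Passing `combined` through filter_raw restricts the search to documents matching every filter in the list.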

Real time model updates

Real time model updates are not supported out of the box, so we decided to create our own signal handler logic. The signal handler listens for every save of every model and checks the Mapping classes to see whether any of them covers the saved object. If one does, it creates a document for the object and writes it to the index.

Note that some index documents depend on multiple models. For example, a ‘blog’ index type may carry an author name by means of an Author foreign key on the Blog model. To support this, every Mapping class can specify related models that trigger an index document update for the mapped model. We introduced a method for this purpose named “get_related_models”. This is how it might look for the Blog Mapping:

@classmethod
def get_related_models(cls):
    """
    Map related model classes to functions that return the instances
    to re-index when an object of that class changes.
    """
    return {
        Author: lambda obj: obj.blog_set.all(),
    }

In other words: our signal handler watches for changes to the Author model in order to update Blog documents when necessary. If an Author’s name field changes, it runs the lambda above to get a list of every object that needs to be re-indexed. The signal handler method that performs this check looks like the following (it is called from the main signal handler method).

def check_related(sender, instance):
    # For every mapping, check if this signal is related.
    for mapping in get_model_mappings().values():
        related = mapping.get_related_models().get(type(instance))
        if related:
            # Retrieve the objects that need updating (call the lambda).
            for obj in related(instance):
                # Double check the model; needed for generic foreign keys.
                if type(obj) is mapping.get_model():
                    update_in_index(obj, mapping)

This provides an elegant way to specify how one model relates to another and how that affects “combined” documents in the index. It is by no means the only solution to the foreign key scenario: Elasticsearch has some support for nested documents and parent/child relations, which make it possible to model the index type more like the database (better normalization, fewer redundant fields). It does make querying more complicated, though, because you somehow need to join multiple queries and/or index documents in order to present a single document to the user.
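As a sketch of the nested alternative (field names illustrative, Elasticsearch 1.x types), the blog mapping would embed the author instead of flattening it into the document:

```python
# Nested-document alternative: instead of a denormalized author_name
# field, the mapping embeds author as a nested type. Nested objects
# are indexed as separate hidden documents, so author fields can be
# queried independently with a "nested" query wrapper.
NESTED_MAPPING = {
    "blog": {
        "properties": {
            "title": {"type": "string"},
            "content": {"type": "string"},
            "author": {
                "type": "nested",
                "properties": {
                    "name": {"type": "string"},
                },
            },
        }
    }
}
```

The trade-off is exactly the one described above: less redundancy in the index, but queries against author fields must be wrapped in nested queries rather than matched directly.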

You can also skip signal handling altogether by using Elasticsearch river functionality. A river periodically queries the database and keeps the index in sync with it (using a push or pull strategy). This bypasses Django completely, because it connects Elasticsearch directly to the database service. It may work equally well in certain cases, but at the moment the “combining objects at index time with signals” approach works best for us.

Final thoughts

Which search engine or framework to use for your Django project mainly depends on your needs and level of experience. If you have a lot of experience with an existing search engine such as Solr, you may choose that, with or without an intermediate layer like Haystack. The obvious advantages of Haystack are being able to switch search back-ends at will and getting basic functionality out of the box.

If you’re starting fresh in the search arena, I’d recommend Elasticsearch. If you plan to invest some time in it and want to use advanced features, skip Haystack and use a client directly on top of Elasticsearch. This pays off in the long run.

Unfortunately, ElasticUtils was deprecated a few months ago, but most pointers in this article apply more or less to the alternatives too. In any case, as you start using the more advanced features of Elasticsearch, you’ll notice there is little choice but to use the official client directly.

If you have any suggestions or feedback regarding this article, don’t hesitate to let me know!

Your thoughts

  • Written by Fernando on 10th November 2015

    Hi,

    I like your analysis, minutes ago I was trying to figure out how to populate Elastic search fields from custom methods in DjangoModels.
    Your article convinced me that I need to dig further in Elasticsearch in other words I am still stuck, but I started reading: https://www.elastic.co/guide/en/elasticsearch/guide/current/_preface.html and http://www.opencrowd.com/blog/post/elasticsearch-django-tutorial/ to have a better understanding on how integrate django-rest-framework and elasticsearch directly.

    Thanks for sharing!

  • Written by Dev1410 on 11th November 2015

    Regarding to your article, I agree to use it on top of the client API to use it in more advanced feature even I have to spend more time to implement. But this is a very fundamental to inprove the new world techno method, the problem is , I’m a new in python or django, hard to understand, within 3 days I have the basic and ready to implement it not only for search engine, the speed of elasticsearch is awesome to make some datastore for our frontend. Overall this article should have more reader to open their mind who stick to use haystack or deprecated elasticutils. Thank you for the article. Cheers.

  • Written by 7WebPages on 15th April 2016

    Here you can see the series of articles about how to use Elasticsearch with Python and Django: https://qbox.io/blog/author/alex-khliestov. It is explained how to make a Django app with Elasticsearch integrated, about a database with automatically generated data.

  • Written by rajitha on 21st February 2017

    nice. is there any documentation for this?

Devhouse Spindle