Found it! How to use results from Elasticsearch
Written by Hans Adema on 20th June 2017
In my last blog post, I created an overview of the various search engines available at the moment and concluded that Elasticsearch would be the best option for VoIPGRID. With Elasticsearch I will create a search functionality to optimize the search results on the VoIPGRID telephony platform.
There are tons of articles available online on how to search for stuff using Elasticsearch. All those “Elasticsearch 101” guides explain how to set up Elasticsearch, send a query and show you what the result data looks like. But what do you do with the result data? In this blog post, I’d like to compare a few different methods of turning an Elasticsearch response into data you can use to show the results to your users.
The first method which may come up is to use the Elasticsearch response data directly in your frontend application.
So you’ve got your fancy Angular/React/Ember/Vue/(insert JS framework of the month) frontend app set up and realize Elasticsearch just talks JSON. Now you think “hey, I can just interact directly with Elasticsearch without any backend in between, it’s so fast and easy!”. Never, ever, ever do this. Elasticsearch does not come with any access control by default, so exposing your Elasticsearch to the world wide web would allow literally anyone to view, edit and delete any data in your entire Elasticsearch cluster. You don’t want your cluster to be one of the many Elasticsearch clusters which were hijacked and held hostage.
Always put some software between your precious search cluster and the big bad internet. In most cases, this is going to be your backend application transforming and proxying the search queries to Elasticsearch. However, I’ve heard of people using rudimentary request filters in NGINX as well to enforce authorization. But whichever method you choose, make sure you tightly control what your users can do in your search engine.
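As a rough illustration of what "tightly control" can mean in a backend, the sketch below builds the Elasticsearch query on the server from a small set of allowed inputs instead of forwarding whatever the client sends. The field names, the MAX_RESULTS limit and the client_id filter are hypothetical and only serve the example.

```python
# Hypothetical whitelist: the only fields users are allowed to search in.
SEARCHABLE_FIELDS = {"name", "description"}
# Hypothetical upper bound on the number of hits per request.
MAX_RESULTS = 50


def build_search_body(term, fields, size, client_id):
    """Build an Elasticsearch query body from untrusted user input."""
    # Silently drop any requested field that is not on the whitelist.
    allowed = sorted(set(fields) & SEARCHABLE_FIELDS)
    if not allowed:
        raise ValueError("no searchable fields requested")
    return {
        # Clamp the page size so a user cannot request the whole index.
        "size": max(1, min(int(size), MAX_RESULTS)),
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": str(term), "fields": allowed}}
                ],
                # Always restrict results to data this user may see
                # (client_id is an assumed field on every document).
                "filter": [{"term": {"client_id": client_id}}],
            }
        },
    }
```

The point of the design is that the user only supplies values (a search term, a page size), never query structure, so there is nothing to inject.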
Using the Elasticsearch results directly from the backend app
This is the method implied by most guides: search your data in Elasticsearch and convert the responses to your own view format. By default, Elasticsearch returns the complete original document (the _source field) with every hit, so everything you indexed is available to show to the user as well.
The biggest advantage of this method is that it’s easy to implement initially. You can just build your search query, execute it and stitch the results to your views. It’s pretty fast because you’re just painting the JSON returned by Elasticsearch.
This method also allows you to use some of the fancy features Elasticsearch offers to stylize results. For example, Elasticsearch can automatically highlight search terms in results and group (collapse) search results.
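To make the highlighting feature a bit more concrete, here is a minimal sketch of a query body that asks for highlights and a helper that prefers the highlighted fragment over the raw field when rendering. The field name "name", the helper names and the sample response are made up for illustration; the response is hand-written in the shape Elasticsearch returns, not captured from a real cluster.

```python
def build_highlight_query(term, field="name"):
    """Query body that matches on one field and requests highlights for it."""
    return {
        "query": {"match": {field: term}},
        # An empty dict means "use the default highlighter settings",
        # which wrap matches in <em> tags.
        "highlight": {"fields": {field: {}}},
    }


def render_hits(response, field="name"):
    """Return one display string per hit, preferring highlighted fragments."""
    lines = []
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field)
        if fragments:
            lines.append(" ... ".join(fragments))
        else:
            # No highlight for this hit: fall back to the stored value.
            lines.append(str(hit["_source"].get(field, "")))
    return lines


# Hand-written sample in the shape of an Elasticsearch search response.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "1",
                "_source": {"name": "Alice Smith"},
                "highlight": {"name": ["<em>Alice</em> Smith"]},
            },
            {"_id": "2", "_source": {"name": "Bob"}},
        ]
    }
}
```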
When adding or changing documents in Elasticsearch, there is a slight delay between writing the data and the changes being reflected in the search data (called the refresh interval). When using Elasticsearch for just search, this isn’t such a big problem. But if you write data to Elasticsearch and then immediately try to fetch and show the updated data, the changes may not be there yet. To a user who just clicked the save button, that’s confusing.
This issue can be alleviated somewhat by issuing a “refresh” with updates, which forces updates to be available immediately. However, a refresh is a rather costly operation. If your users frequently update data in your index, the continuous refreshes could easily bring down your cluster.
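One way to limit that cost, sketched below, is to ask for a refresh only on writes where a user is actually waiting to see the result. The sketch assumes a client object with an index(index=..., id=..., body=..., refresh=...) method in the style of elasticsearch-py 7.x; other versions of that library name the arguments differently, so check the signature for yours. It also uses refresh="wait_for" (available since Elasticsearch 5.0), which waits for the next scheduled refresh instead of forcing one. The function and parameter names are invented for the example.

```python
def index_document(client, index, doc_id, document, user_is_waiting=False):
    """Index a document, optionally waiting until it is searchable.

    `client` is assumed to expose an elasticsearch-py 7.x style
    `index(index=..., id=..., body=..., refresh=...)` method.
    """
    kwargs = {"index": index, "id": doc_id, "body": document}
    if user_is_waiting:
        # Block until the next scheduled refresh makes the change visible,
        # rather than forcing an immediate (and more expensive) refresh.
        kwargs["refresh"] = "wait_for"
    return client.index(**kwargs)
```

Background jobs and bulk imports would call this with the default, so only interactive saves pay the extra latency.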
Another disadvantage is that you may want to show related data in the views as well. You want to show a link to the user’s latest message, display some details about the owner of a user or something like that. However, Elasticsearch is not a relational database, so modeling relations is complicated.
There are some methods to deal with related objects in Elasticsearch, but all have their fair share of problems. Application-side joins require additional queries to Elasticsearch, which takes time. Nested objects and parent-child relations let you fetch all the data in one query. However, they require you to index the related data yourself (inside the document for nested objects, as separate child documents for parent-child relations), and keeping that data up to date can be cumbersome and messy.
Finally, you sometimes need to denormalize data to be able to search it properly. However, in your views, you may want to show the normalized data instead. If that happens, you need to store the same data twice.
Using the Elasticsearch results to fetch your primary DB data
Another method to use results from Elasticsearch is to only use Elasticsearch for searching and get the actual data from the primary data store. That means you first send the search query to Elasticsearch and retrieve the results. Then, you use the data from the search results to find the appropriate data in your primary data store. This way, Elasticsearch can be used for what it does best (searching, filtering, sorting, etc.) while the data you show comes straight from the database.
The main advantage of this method is that you get to use your regular ORM objects to build your view data. That means you get all the ORM goodness, like fancy serializers, helper methods, related objects and so on.
You also only need to index the data you want to actually search or filter for, skipping everything else. This means there is less data you need to index and search, speeding up index and search times.
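A small sketch of what such a trimmed-down search document could look like, assuming a record is available as a plain dict. The field names and the SEARCH_FIELDS tuple are hypothetical; the only idea carried over from this post is that the document holds just what you search on, plus the primary key as document ID so you can find the database row again.

```python
# Hypothetical list of the only fields that need to be searchable.
SEARCH_FIELDS = ("name", "email")


def to_search_document(record):
    """Return (document_id, body) for indexing, dropping non-search fields."""
    # Reuse the primary key as the Elasticsearch document ID.
    doc_id = str(record["id"])
    body = {field: record[field] for field in SEARCH_FIELDS if field in record}
    return doc_id, body
```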
And finally, all data which is updated in the primary data store is immediately visible in the frontend. There is no need to wait for Elasticsearch to update the index. Updated results may not immediately show up in the search results, but the data which is shown is always up to date.
The big issue with this strategy is that you need to query Elasticsearch first and then map the results to your ORM, so every search hits both Elasticsearch and your primary data store. Because of that extra round trip, performing a search can become relatively slow and taxing on your servers.
Also, mapping Elasticsearch documents to ORM models can be cumbersome. Depending on how you indexed your data, the mapping could be complicated. That said, simply reusing the primary key from your main data store as the document ID and being able to derive the model from the index and document type of the result is pretty easy to do.
However, you do need to build it yourself. None of the Elasticsearch libraries available for Python properly handle this integration with Django.
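A minimal sketch of such a mapping, assuming the primary key was reused as the document ID and is an integer. The function name and the fetch_by_ids callable are made up for the example: fetch_by_ids stands for anything that takes a list of primary keys and returns a dict of primary key to object.

```python
def hits_to_objects(response, fetch_by_ids, parse_id=int):
    """Turn an Elasticsearch response into database objects, in ranking order.

    `fetch_by_ids` is a caller-supplied function: list of primary keys in,
    dict of {primary_key: object} out.
    """
    # Elasticsearch returns document IDs as strings; convert them back
    # to the primary key type used by the database.
    ids = [parse_id(hit["_id"]) for hit in response["hits"]["hits"]]
    # One bulk lookup instead of one database query per hit.
    objects_by_id = fetch_by_ids(ids)
    # Keep the search ranking, and skip hits whose row no longer exists
    # (the index can lag behind the database).
    return [objects_by_id[pk] for pk in ids if pk in objects_by_id]
```

With Django, fetch_by_ids could be the in_bulk method of a manager or queryset, which takes a list of primary keys and returns a dict keyed by primary key. Passing the in_bulk of a queryset that is already filtered by your permission rules would also drop any hit the current user is not allowed to see.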
Finally, the features Elasticsearch offers to preprocess results, such as keyword highlighting, are lost if you throw away most of the response and use the data from your primary data store.
What will you choose?
Both methods have their advantages and disadvantages, and each offers you different features. Which one is most useful depends on your application.
There are a number of factors to consider, like:
- Are you integrating Elasticsearch into existing pages or new pages?
- Do you care more about speed or consistency?
- How complex is your data model and how well can you map it to Elasticsearch?
For VoIPGRID, I’m leaning towards using the data from the primary DB. That way, Elasticsearch can be rolled out faster because existing views don’t need to be rewritten to improve existing search functions using Elasticsearch. It also allows me to use the existing permission systems to filter results from Elasticsearch.