The Search Engine Showdown
Written by Hans Adema on 22nd March 2017
In February I started my internship at Spindle. I’m currently in the final year of the Informatica (computer science) programme at the Hanze University of Applied Sciences. Besides studying, I run my own web hosting company, where I offer free hosting as well as affordable upgrades.
At Spindle, I will be investigating the options for an epic search functionality to optimize the search results on the VoIPGRID telephony platform. My main goal is to build a prototype for an all-in-one super searcher: one search box, accessible throughout the whole platform, that lets you search everything on it. Making this possible comes with some technical challenges, though. The search functionality cannot simply be bolted onto the existing database; a dedicated search database will be needed.
The 7 most popular search engines
In our CRM Lily, the search engine Elasticsearch was integrated to provide fast and effective search capabilities to end users. While Elasticsearch is the most popular search engine around, there are plenty of other options. According to DB-Engines.com, the most popular free(-ish) search engines are, in order of popularity, Elasticsearch, Solr, Sphinx, Google Search Appliance, Microsoft Azure Search, Amazon CloudSearch and Algolia. What are their distinguishing features, and what are their most important differences?
Elasticsearch
Elasticsearch is the most popular search engine around, and for good reason! It’s ridiculously easy to get started with: a single download and a single command, and you’re ready to start hacking. It has a convenient REST API, allowing you to submit data as well as your search queries as JSON. Overall, it takes very little time and knowledge to get going.
But Elasticsearch will happily grow along with your needs. While it needs next to no configuration to get started, it has tons of options and modules to tailor it to virtually any search need. Through its integrated, custom-designed Zen Discovery system, it’s also quite easy to scale Elasticsearch over a cluster of servers.
Finally, Elasticsearch isn’t just suitable for user-facing search; it’s also commonly used for offline data analysis. With Logstash, it’s easy to pull data into Elasticsearch from many different sources, and with Kibana you can visualize that data. Together, these packages are called the ELK stack, and they enable users to set up a log analysis system within minutes.
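To give an idea of what the REST API with JSON mentioned above looks like, here is a minimal Python sketch. It is an illustration under a few assumptions, not how Lily talks to Elasticsearch: the node address, the `contacts` index and the `name` field are made up, and the `_doc` part of the URL applies to recent Elasticsearch versions (older releases expect a type name in that position).

```python
import json
import urllib.request

# Assumption: a local single-node Elasticsearch on its default HTTP port.
BASE_URL = "http://localhost:9200"


def index_request(index, doc_id, document):
    """Build a PUT request that stores one JSON document under the given id."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def search_request(index, field, text):
    """Build a POST request that runs a full-text match query on one field."""
    query = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index}/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request to a running node and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a node running, `send(index_request("contacts", 1, {"name": "Hans"}))` would store a document and `send(search_request("contacts", "name", "hans"))` would search for it.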
Solr
Solr is the second biggest search engine. In the past (2010 or so, when Elasticsearch was just released), Solr was powerful, but hard to use. It had a Java API and an HTTP API (not RESTful) that spoke XML. Clustering was complicated and real-time search was next to impossible.
Then Elasticsearch came along and forced Solr to catch up. By now, getting started with Solr is easy and its API supports JSON. The Solr developers also came up with SolrCloud, a very resilient clustering system powered by ZooKeeper.
As Elasticsearch gained stability and features and Solr became easier to use, the gap between the two shrank. By now, both Solr and Elasticsearch are mature and feature-rich, meaning either of them is a good choice for most projects.
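As a rough illustration of Solr's HTTP API with JSON output, the sketch below builds the URL for a basic select query. The host, the `contacts` collection and the `name` field are assumptions made up for this example.

```python
from urllib.parse import urlencode

# Assumption: a local Solr node on its default port.
SOLR_URL = "http://localhost:8983/solr"


def solr_select_url(collection, query, rows=10):
    """Build the URL for a Solr select query that returns JSON."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_URL}/{collection}/select?{params}"
```

Fetching `solr_select_url("contacts", "name:hans")` from a running Solr node should return the matching documents as a JSON response.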
Sphinx
If you want a search engine without having to write code, then Sphinx could be an option. The biggest benefit of Sphinx is that it’s accessible through an SQL-style syntax, using any MySQL-compatible library. Indexing is set up by defining SQL queries in its configuration files. You then run a command that makes Sphinx pull the data from your SQL database (MySQL or PostgreSQL) and index it.
To conclude that Sphinx is easy to use would be a misconception, though. Many features which Elasticsearch and Solr users can take for granted, like (Near) Real Time Search, sharding and high availability, are not natively supported in Sphinx and require a lot of effort to make them sort-of work.
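The SQL-style access described above could look roughly like this in Python. This is a sketch under assumptions: `connection` is any DB-API connection from a MySQL-compatible library, and the index name you pass in is whatever you defined in your own Sphinx configuration.

```python
def sphinx_search(connection, index, text, limit=10):
    """Run a full-text query against one Sphinx index and return matching ids."""
    # Index names cannot be bound as parameters, so only accept plain identifiers.
    if not index.isidentifier():
        raise ValueError(f"unsafe index name: {index!r}")
    # MATCH() is the SphinxQL full-text operator; the search text is bound
    # as a parameter so the client library takes care of escaping it.
    sql = f"SELECT id FROM {index} WHERE MATCH(%s) LIMIT {int(limit)}"
    cursor = connection.cursor()
    try:
        cursor.execute(sql, (text,))
        return [row[0] for row in cursor.fetchall()]
    finally:
        cursor.close()
```

With a library such as PyMySQL, you would connect to the port on which Sphinx listens for the MySQL protocol (9306 is the conventional choice) and then call `sphinx_search(connection, "contacts", "hans")`.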
Google Search Appliance
Are you familiar with Google Custom Search, the widget from Google you can paste on your site to allow users to search the pages? Google Search Appliance is basically that, but for enterprises. Google Search Appliance is a physical box you install in your data center and it will index the web pages and text documents you tell it to, which you can then search. It’s pretty much a black box, except for the fact it’s mostly yellow.
While it’s (apparently) still quite popular, it has already been discontinued by Google. No new boxes can be bought, but existing licenses can be renewed for a few more years.
Microsoft Azure Search
Microsoft Azure Search is the hosted search engine offering of Microsoft’s Azure public cloud solution. Under the hood, it’s powered by Elasticsearch, so it’s similar in terms of features. However, Azure Search adds easier integration with other Azure services, as well as an alternative search API similar to SharePoint Search. So if you’re used to SharePoint Search or are already invested in the Azure cloud, check it out. If not, just use Elasticsearch.
Amazon CloudSearch
Amazon CloudSearch is a hosted search engine, part of Amazon Web Services. CloudSearch is powered by Solr, granting AWS users largely the same capabilities as other Solr users. However, CloudSearch has a custom API to integrate more smoothly with other AWS tools. If you’re already using AWS, check it out. However, note that AWS also provides hosted Elasticsearch.
Algolia
If you just want fast search features on your website or in your app and don’t want to be bothered by complicated infrastructure management or unneeded features related to data analysis, then Algolia could be a great option for you.
Algolia is a hosted search service. They provide easy-to-use modules for most popular languages and frameworks and even have some integrations for CMSs and hosted services. However, Algolia’s biggest selling point is that their service is fast. According to Algolia’s own measurements, searching in Algolia is between 13 and 200 times faster than in Elasticsearch. They built their own indexing structure and search systems, allowing them to achieve greater speeds than would be possible with existing software.
They even provide search-only API keys, so you can let your web applications or mobile apps get the search results directly from Algolia without having to route them through your own servers. All of this is meant to shave some valuable milliseconds off the search queries and improve your users’ experience.
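To make the idea of a search-only key concrete, here is a minimal sketch of a query against Algolia's REST API. In practice this request would come from the browser or mobile app; Python is used here only for illustration, and the application id, key and index name are placeholders.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def algolia_search_request(app_id, search_only_key, index, text, hits_per_page=5):
    """Build a search request that is authorized with a search-only API key."""
    # Algolia expects the search parameters as one URL-encoded string.
    params = urlencode({"query": text, "hitsPerPage": hits_per_page}, quote_via=quote)
    return urllib.request.Request(
        url=f"https://{app_id}-dsn.algolia.net/1/indexes/{quote(index, safe='')}/query",
        data=json.dumps({"params": params}).encode("utf-8"),
        headers={
            "X-Algolia-Application-Id": app_id,
            "X-Algolia-API-Key": search_only_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Because the key can only search, shipping it inside a client application does not give users a way to change or delete the indexed data.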
And the winner is…
This post should give you a basic overview of the search options available at the moment and help you determine which search engine might be best for your next project.
For VoIPGRID, I’m leaning towards using Elasticsearch. It’s powerful and we’ve had good experiences with it so far. It also has the benefit of being self-hosted, meaning we won’t have to trust external parties to keep the details of our customers safe.
However, I’m not done yet. Even though I’ve chosen a search engine, there are still plenty of things to research. Elasticsearch has tons of options to tune relevance scores, deal with typos and generally make sure the right search results are shown first. Which options are available, and what’s the best way to make certain attributes searchable? Epic search doesn’t come easy, so expect more articles soon!