When your application needs search functionality, you can follow the path of least resistance and implement simple search via your backend, such as MySQL, or you can plan for the future. Modern applications need much more than static search. They need facets, best bets, relevance scoring, advanced queries, fuzzy matching, and more: functionality that you would never want to write on your own. But no matter how valuable search engines are, the options are limited. When it came time to add search functionality to our application, it was a struggle, and we eventually ended up using Azure Search. This post is about that adventure.
This post is part of a series of posts discussing Fixate IO’s process in building our application called Re:each. Because our app runs almost exclusively on Azure (no frowns), many of our posts will refer to that technology. But the approaches apply to any cloud and tech.
Re:each is a platform for generating credible technical content. We are still pre-beta launch, and we’re learning a lot on our journey of App Dev and DevOps. Because Fixate IO is also a research and content marketing firm in the area of DevOps, we put a lot of emphasis on tooling, even when it’s not really applicable for such a small team.
In our application, we have a global search option. Results can be any of three types of content: News, Profiles, or Blog Posts. Each of these result types has a header, free text (a body of prose), and tags.
While adding search seemed trivial (and the integration really is), deciding how to approach it actually became quite complex.
Our approach options were:
- Tag-only search: Don't allow users to search the body of result types, just the static tags that have been assigned to them. It's super easy to implement, but very limiting. You have to know what tags are available, you have to spell them perfectly, and it can miss critical content where the tags do not represent the desired query. There is also the issue of performance: we have to query across tags for each result type, and then dedupe across the entire result set. At scale, that becomes a real performance hit.
- SQL-based tag search, with search-engine-based free text search: This gives us the best of both worlds. We get both tag search and free text search functionality. We can implement tag search faster, and limit our effort on free text search to just the bodies of result types. The problem? The need to federate the results, which is another performance hit. It also adds another point of failure, plus extra update and delete operations.
- Search-engine-based tag and free text search: Since we are making the effort anyway, we can implement all aspects of search in a proper search engine. No federation is required, performance is better, and we avoid maintaining separate reconciliation logic between the DB and the search engine. The trade-off is more overhead in the app: every create, update, or delete of a result type has to be done twice, once in the DB and once in the search engine.
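The double-write pattern behind option 3 is easy to sketch. The class and method names below are illustrative, not our actual code, and the DB and search engine are replaced with in-memory stand-ins so the shape of the pattern is clear:

```python
# Sketch of the "write twice" pattern from option 3: every create/update/delete
# of a result type is mirrored from the primary DB into the search index.
# SqlStore and SearchIndex are in-memory stand-ins for MySQL and the search engine.

class SqlStore:
    def __init__(self):
        self.rows = {}

    def upsert(self, doc_id, doc):
        self.rows[doc_id] = doc

    def delete(self, doc_id):
        self.rows.pop(doc_id, None)


class SearchIndex:
    def __init__(self):
        self.docs = {}

    def upload(self, doc_id, doc):
        self.docs[doc_id] = doc

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def search(self, term):
        # Naive full-text match across title, body, and tags.
        term = term.lower()
        return [
            doc_id for doc_id, doc in self.docs.items()
            if term in doc["title"].lower()
            or term in doc["body"].lower()
            or any(term == t.lower() for t in doc["tags"])
        ]


class ResultTypeRepository:
    """Keeps the DB and the search index in sync on every write."""

    def __init__(self, db, index):
        self.db = db
        self.index = index

    def save(self, doc_id, doc):
        self.db.upsert(doc_id, doc)      # write #1: primary store
        self.index.upload(doc_id, doc)   # write #2: search engine

    def remove(self, doc_id):
        self.db.delete(doc_id)
        self.index.delete(doc_id)


repo = ResultTypeRepository(SqlStore(), SearchIndex())
repo.save("news-1", {"title": "DevOps Trends", "body": "CI/CD adoption is up.", "tags": ["devops"]})
repo.save("blog-1", {"title": "Why We Chose Azure", "body": "Search was a journey.", "tags": ["azure", "search"]})
repo.remove("news-1")
```

The point of wrapping both writes in one repository method is that application code can never update one store and forget the other, which is exactly the consistency risk this option introduces.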
After review, option 3 is the best. (Reaching that decision required contemplating several architectures and setups with option 2, which we originally thought was the best plan.)
The next task — picking the best search tool. There is an argument to be made for the use of any NoSQL database that supports JSON documents (all of them), and libraries that have search functionality on top of that. But our dev egos are not that big, and it would be a waste of time, so we opted to find a proper search engine.
The choices are limited:
- Elastic.co: Super popular. It supports PHP (our app language), has a JDBC driver for MySQL integration, and is based on Lucene. But we do not like how Elastic.co pushes you toward support contracts. It is easy to install and get started, but blasted hard to get working with the entire ELK (Elasticsearch, Logstash, Kibana) stack to leverage data manipulation and visualizations. On top of that, maintenance is a pain.
- Solr: Also based on Lucene, and very popular, albeit dated. It has very limited out-of-the-box support for PHP applications, its UX is poor, and setup is difficult and tough to maintain.
Both also have the benefit of being "open source." We tend to be realistic about what that actually means, and we are not at all opposed to commercial solutions. Because we are in Azure, there are compute instances available for both Elastic and Solr. (But Azure has a search service of its own, too!)
We became a little hyped about leveraging Elastic. It was only when we started to wire everything up that we got annoyed with the maintenance and setup. We also discovered that the JDBC driver (which was a critical deciding factor) did a few things we did not like. Foremost, it was not real-time, which meant we were going to have to use the SDK directly anyway. And we did not like the document structure it created based on tables, which would force us to do weird federation of search results across tables, when all result types should live in a single index.
So we found another option. Microsoft, as usual, does not do a great job of making their services known. And there are so many that hunting is necessary. But when we found their search services, we were very excited:
- Azure Search: While it is not yet commonly used, in early tests it looked promising. It would keep us in the Azure ecosystem. And it is 100% Platform as a Service (PaaS), so there is no server maintenance or worrying about updates. The key benefits were fast, easy testing via the search explorer UI, and PHP-based libraries. It also supports indexing of our blob document storage, which is fantastic (and phase two of our search functionality).
SIDEBAR: One of the themes we have been pushing for in our application (which I still believe is the next step in modern development), is to keep it PaaS as much as possible. For a long time we were able to do this using Azure WebSites, WebJobs, and ClearDB MySQL service, until we decided that we would get better performance and bang for the buck if we moved our backend to MariaDB on a dedicated compute instance.
So Azure Search was our choice. Setup was easy, but we made a few mistakes along the way.
- We decided to give each environment its own index (reachdev, reachstage, and reachprod) in a single service. But an index is as much a part of the hierarchy of the search feature as the data stored in it. To be extensible, there should be a service per environment, with indexes within each service based on function. So we ended up creating a new search service for each environment. For now we are using the free tier, but we will have to upgrade to a paid tier soon. (It is pricey.)
- Get your fields right on day one. We did not, and currently Azure does not support field updates. When you make a mistake, you have two options: add a replacement field (which is not clean, and problematic), or delete the index and start over. We had to do the latter several times. (Mostly because I did the setup, and I don't do well with details.) Plan ahead for your document structures and fields, because you want all the result types to show up in a single query. We generalized our document fields to support all three types of results, plus conceivable future types.
*Our messed-up tags field: it needs the searchable attribute.*
*Well, thanks, Microsoft.*
Our document structure ended up looking like this:
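As a sketch of that generalized structure (the field names here are illustrative assumptions, not our exact schema), an index definition in the shape Azure Search's REST API expects would look something like this: one set of fields shared by all three result types, a `type` field to distinguish them, and a `tags` collection that carries the searchable attribute we initially forgot.

```python
import json

# Hypothetical Azure Search index definition (field names are illustrative).
# One index holds all three result types; "type" distinguishes them, and the
# "tags" collection is marked searchable so tag and free text search both work.
index_definition = {
    "name": "reach",
    "fields": [
        {"name": "id",    "type": "Edm.String", "key": True, "retrievable": True},
        {"name": "type",  "type": "Edm.String", "filterable": True, "facetable": True},
        {"name": "title", "type": "Edm.String", "searchable": True},
        {"name": "body",  "type": "Edm.String", "searchable": True},
        {"name": "tags",  "type": "Collection(Edm.String)",
         "searchable": True, "filterable": True, "facetable": True},
    ],
}

# This JSON body would be PUT to
#   https://<service>.search.windows.net/indexes/reach?api-version=<version>
# with an api-key header (not shown here).
body = json.dumps(index_definition, indent=2)
```

Because fields cannot be changed after creation, it is worth reviewing exactly this kind of definition before the first document ever goes in.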
Once it was set up, testing was easy. We used the Postman app for Chrome to quickly test the creation of documents, and the search explorer to test queries. We did this in a separate test index at first, since we knew that lots of fake/orphaned content, even in dev, would cause serious problems and plenty of cleanup work later.
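The requests we exercised in Postman are simple to reconstruct. A sketch of the document-upload call against the Azure Search REST API (the service name, index name, document fields, and api-version below are placeholder assumptions; the admin key comes from the Azure portal):

```python
import json

# Sketch of the Azure Search "index documents" REST call we tested in Postman.
# Service/index names and api-version are placeholders for your own values.
service = "my-search-service"
index = "reachdev"
api_version = "2016-09-01"

url = (
    f"https://{service}.search.windows.net"
    f"/indexes/{index}/docs/index?api-version={api_version}"
)
headers = {
    "Content-Type": "application/json",
    "api-key": "<admin-key-from-the-portal>",
}

# Each document carries an @search.action: upload, merge, mergeOrUpload, or delete.
payload = {
    "value": [
        {
            "@search.action": "upload",
            "id": "blog-42",
            "type": "blogpost",
            "title": "Testing Azure Search",
            "body": "Uploaded from Postman while testing.",
            "tags": ["azure", "testing"],
        }
    ]
}

body = json.dumps(payload)
```

POSTing that body to `url` with those headers creates the document, and the search explorer can query it immediately afterward.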
So after that, we had robust search functionality. BUT… there was one last thing.
Just as we finished implementing things, we discovered Algolia, and initially it looked awesome. Had we seriously considered Algolia, the benefits would have been:
- One-quarter the price for the same level of service
- Better libraries for PHP
- Better UX for analytics
But two drawbacks kept us on Azure Search:
- Algolia does not index Azure Blob storage, or any document store with files like .docx and .pdf, without a manual text-extraction step. Azure Search, by contrast, has a beta indexing option for Azure blobs: just add the blob container as a data source, and it indexes a wide range of file formats.
- It is yet another tool, external to Azure, and we are trying to minimize that in our application.
Such is the cost of living in the tech world. We wonder what magical tool will show up next.