Elasticsearch and Umbraco: Elasticsearch provider for Umbraco Examine

Elastic All The Things

What is Umbraco Examine?

Examine is an abstraction around Lucene.Net and is used by Umbraco to index and search Umbraco content. The latest version of Examine uses Lucene.Net 3.0.3, which was released in 2012. The latest version of Lucene.Net is 4.8.

When working with Examine on more demanding websites, there are a number of issues.

The .NET implementation of Lucene is also quite dated compared to the original Java version: Lucene.Net is on 4.8, while the Java version is on 8.2.0.

The latest version of Lucene.Net has many performance enhancements and better multilingual support, particularly for CJK languages (Chinese, Japanese and Korean), where it offers morphological analysis as opposed to standard tokenisation.
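To see why standard tokenisation struggles with CJK text, here is a minimal, self-contained sketch in Python. It is not Lucene or Elasticsearch code, and the function names are made up for illustration. It compares whitespace splitting with overlapping character bigrams, which is the usual fallback when no dictionary is available. Real morphological analysis needs a dictionary and is not shown.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace, as a naive tokenizer for space-delimited languages would."""
    return text.split()


def cjk_bigrams(text: str) -> list[str]:
    """Emit overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


sentence = "東京都に住む"  # "to live in Tokyo"; Japanese has no spaces between words

# Whitespace splitting yields a single token: the whole sentence.
print(whitespace_tokens(sentence))

# Bigrams make the text searchable, but also produce "京都" (Kyoto),
# a false match that a morphological analyser would avoid.
print(cjk_bigrams(sentence))
```

The bigram output contains 東京 (Tokyo) but also 京都 (Kyoto), which is exactly the kind of noise that dictionary-based analysis removes.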

Examine is also problematic when the website runs in a high-availability setup with load balancing: its indexes are stored as files, so you can run into file locking issues.

For Umbraco v8 and v7 there is an Azure Blob Storage option for Examine; however, this is still experimental.

UPDATE: The Examine.AzureDirectory package compatible with v7 was released on 11 February 2020. The v8 package, however, is still experimental.

One of the strong points of Examine is that it makes indexing and searching with Lucene a lot easier than working with raw Lucene. However, this ease comes with a strong binding between Examine and Lucene, so when Lucene is updated, picking up the new version is not straightforward: it requires a lot of changes in Examine.

We want the flexibility of Examine, but we also want all the goodness of the latest version of Lucene. Cue Elasticsearch!

What is Elasticsearch and why use it?

Elasticsearch is a high-performance search engine which supports replication and zero-downtime rebuilds. It handles CJK well and has support for many more languages. It was designed to search all types of data (text, numeric, geo, structured and unstructured). Elastic also supports many different types of queries, from Lucene syntax to SQL.
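As a rough illustration of that range, the sketch below builds two request bodies in Python: one wraps a Lucene-syntax string in Elasticsearch's `query_string` query, the other is the body shape accepted by the Elasticsearch SQL API. The helper names, the index name `content` and the field values are assumptions made up for the example.

```python
import json


def lucene_query_body(lucene_query: str) -> dict:
    """Wrap a Lucene-syntax string in a query_string query."""
    return {"query": {"query_string": {"query": lucene_query}}}


def sql_query_body(sql: str) -> dict:
    """Build the JSON body for the Elasticsearch SQL API."""
    return {"query": sql}


# Hypothetical field and index names, for illustration only.
print(json.dumps(lucene_query_body("nodeName:home AND __IndexType:content")))
print(json.dumps(sql_query_body("SELECT nodeName FROM content WHERE nodeName = 'home'")))
```

Both bodies describe a similar search; which one you send depends on the endpoint you call and on which syntax your team is more comfortable with.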

Additionally, in contrast to Examine, Elastic provides great developer tools, such as Kibana, which allows you to simulate queries, debug them and analyse indexes. Elastic is also designed for high availability, which means load balancing, replication, zero-downtime reindexing and more.
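One common way to get a zero-downtime reindex is to query through an alias, build a fresh index alongside the live one, and then move the alias in a single request to the `_aliases` endpoint. The sketch below only builds that request body in Python; the function name and the index names are hypothetical.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request body that moves an alias from one index to another."""
    return {
        "actions": [
            # Both actions are sent in one request, so searches against the
            # alias never see a moment where it points at no index.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the site keeps searching "content" while
# "content-v2" is being rebuilt in the background.
body = alias_swap_actions("content", "content-v1", "content-v2")
print(body)
```

Once the swap has been applied, the old index can be deleted at leisure.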

The history behind the project

The idea for building an Examine Elastic provider started when I saw a presentation about search in Umbraco v7 at Polish Festival 2018, where Ismail Mayat presented a POC of an indexer for v7 which used content crawling. After the presentation, I started looking for a better way of doing that, without using external processes to index content.

I found a few helpful sources. The first was a package called ‘Umbraco.Elasticsearch’, created by Phil Oyston, and the second was an article called ‘Elasticating Examine – an experimental Examine provider’, written by Tristan Thompson. Neither solution satisfied me: the first created unnecessary logic around Umbraco and reimplemented the Examine provider, and the second was just for custom indexes and required a few changes to work with Umbraco.

V7 package and running into a brick wall

For Umbraco v7, I reviewed all available Elastic packages and decided not to use any of them as a base for my project. In my opinion, theirs wasn’t a good approach, as they reimplemented indexing, management and other features. I wanted simply to replace those in Umbraco rather than keep indexes in two locations (Umbraco Examine files and an Elastic instance).

At that point, the solution proposed by Tristan Thompson looked closer to my idea, as it just created a translation layer between Examine and Elastic. I decided to continue my work based on what was already working in that experimental provider.

One of the first changes I made was to bump the version of Elastic to 6.5 and start working on indexing all types of content, such as Media, Content and Members. At that point, everything was working, and I decided to start replacing the internal index with Elastic. Here I hit a few small problems:

As we were indexing only published versions of the content, search results did not always reflect content that had not yet been published.
The index couldn’t show its health status in the Umbraco Backoffice.
All properties of the document were moved into a properties object and, because of that, every property had to use a prefix in the name of its index field.
If the implementation were based fully on NEST, it wouldn’t be compatible with Umbraco Lucene queries.
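The field prefix problem from the list above can be sketched as follows. This is a minimal illustration in Python, assuming the nested object is called `properties`; the helper names and sample values are hypothetical and are not taken from the package.

```python
def build_document(node_id: int, properties: dict) -> dict:
    """Nest all content properties under a single 'properties' object."""
    return {"id": node_id, "properties": dict(properties)}


def field_path(name: str, prefix: str = "properties") -> str:
    """Translate a plain property alias into the prefixed index field name."""
    return f"{prefix}.{name}"


def term_query(name: str, value: str) -> dict:
    """Build a term query that targets the prefixed field."""
    return {"query": {"term": {field_path(name): value}}}


doc = build_document(1234, {"title": "Home", "bodyText": "Welcome"})
print(doc)
print(term_query("title", "Home"))
```

Every query that previously used a plain alias such as `title` now has to go through something like `field_path`, which is why the nesting leaked into all search code.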

I started by solving the issue with the NEST/Lucene query implementation, where I decided to expose two ways to query:

Snippet 1 Search Methods
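As a rough illustration of the idea of two query paths (not the package's actual code), the Python sketch below exposes one method that accepts a raw Lucene string and one that accepts a native Elasticsearch query object, which is what a NEST query would serialise to. The class, its method names and the `send` callable are all hypothetical.

```python
from typing import Callable


class ElasticSearcherSketch:
    """Hypothetical searcher exposing two query entry points."""

    def __init__(self, send: Callable[[dict], dict]) -> None:
        # 'send' stands in for whatever client posts the body to Elasticsearch.
        self._send = send

    def search_lucene(self, lucene_query: str) -> dict:
        """Accept a raw Lucene string so existing Examine-style queries keep working."""
        return self._send({"query": {"query_string": {"query": lucene_query}}})

    def search_native(self, query: dict) -> dict:
        """Accept a native Elasticsearch query object."""
        return self._send({"query": query})


sent = []


def fake_send(body: dict) -> dict:
    """Record the request instead of calling a real cluster."""
    sent.append(body)
    return {"hits": []}


searcher = ElasticSearcherSketch(fake_send)
searcher.search_lucene("nodeName:home")
searcher.search_native({"match": {"nodeName": "home"}})
print(sent)
```

Keeping both entry points means existing Lucene query strings continue to work while new code can use the richer native queries.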

Why did I give up on that project?

Umbraco v7 wasn’t designed to support custom providers, and on that version you could only make them work in a hacky way, such as reimplementing the search logic in the Backoffice.
Any update of Umbraco could break compatibility of the package, as it relied on the core functionality of Umbraco, so any change there could stop the provider from working.
There was no abstraction in Examine which would make it easier to maintain the package with Umbraco v7.
Implementing ISearchableTree wasn’t something that I would use to replace the Backoffice search providers, as you need to reimplement all of the search logic, and I don’t think this is a good way of doing that. I think ISearchableTree should only be used for custom Backoffice search, not as a replacement for the basic one.

Umbraco V8

After abandoning the project, I spoke with a few people about how great it would be to have a better option for doing this in Umbraco, and how I was even looking through the Umbraco source code to see how the hardcoded parts could be made more abstract and extensible.

At that point, I was speaking with Shannon from Umbraco HQ, and he suggested that in a new version of Umbraco they would make changes which would allow me to continue my project. I decided to base this on the code for Azure Search, as it was an example of how to use the new abstraction layer in Umbraco.

Where is my config file?

Umbraco v8 changed the configuration of Examine from config files to code.

Instead of using the old-fashioned way (which I prefer) of config files, we have a new, shiny programmatic way of changing the Examine config.

There were a few discussions about the pros and cons of that solution, but at the time I started working on the provider, there was still no documented way of changing anything in the Examine config. I was following suggestions from people on GitHub, but a few times I had to spend time reverse engineering how Umbraco handled Dependency Injection, Composing and Examine setup.

How to switch indexes via components

As the new version changed how settings are switched in Umbraco, I had to find out how to switch from Examine to Elastic in source code. The first part was to find out how to disable the Examine component, which attaches the basic indexes. As Umbraco was using UmbracoIndex, which inherits from LuceneIndex, I also had to reimplement the Content Populators so that they populate content across all indexes based on ElasticSearchIndex.
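The reason the populators had to be reimplemented can be illustrated with a minimal Python sketch. It assumes that a populator only fills indexes of the type it was written for, which is an assumption about the general pattern and not a description of Umbraco's actual classes. All class names below are hypothetical stand-ins.

```python
class LuceneIndexSketch:
    """Stand-in for a Lucene-backed index."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.documents: list[dict] = []

    def index_items(self, items: list[dict]) -> None:
        self.documents.extend(items)


class ElasticIndexSketch:
    """Stand-in for an Elastic-backed index; deliberately NOT a Lucene subclass."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.documents: list[dict] = []

    def index_items(self, items: list[dict]) -> None:
        self.documents.extend(items)


class PopulatorSketch:
    """Populates only the indexes that match the type it targets."""

    def __init__(self, index_type: type) -> None:
        self._index_type = index_type

    def populate(self, indexes: list, items: list[dict]) -> list[str]:
        populated = []
        for index in indexes:
            if isinstance(index, self._index_type):
                index.index_items(items)
                populated.append(index.name)
        return populated


indexes = [ElasticIndexSketch("ExternalIndex"), ElasticIndexSketch("InternalIndex")]
content = [{"id": 1, "nodeName": "Home"}]

# A populator written for Lucene indexes skips the Elastic ones entirely.
print(PopulatorSketch(LuceneIndexSketch).populate(indexes, content))

# A populator targeting the Elastic index type fills both.
print(PopulatorSketch(ElasticIndexSketch).populate(indexes, content))
```

In this sketch the Lucene-targeted populator leaves the Elastic indexes empty, which mirrors why a new set of populators was needed once the index type changed.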

Stay close to Umbraco and emulate events

Umbraco provides a few default events on indexes, and to stay compatible with most of the code written for Umbraco, I have to emulate Lucene fields even if I am not using them at all. Lucene fields are not comfortable to work with because, unlike most objects, they don’t expose the real type of the data. Since you can only work with strings, you end up dynamically converting every type you need to a string when someone reads it from the document’s dictionary:

Snippet 3 Emulating two methods from the list of fields

As shown in Snippet 3, I am emulating two methods from the actual list of fields in Document from Lucene.Net.
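The general idea of converting typed values to strings at read time can be sketched in Python as follows. This is not the snippet from the package: the class, the two method names `get` and `get_values`, and the conversion rules (lowercase booleans, ISO 8601 dates) are assumptions chosen for the example.

```python
from datetime import datetime


def to_field_string(value: object) -> str:
    """Convert a typed value to a string (conversion rules are illustrative)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)


class StringFieldView:
    """Read-only view that presents typed document values as strings."""

    def __init__(self, source: dict) -> None:
        self._source = source

    def get(self, name: str):
        """Return the first value of a field as a string, or None if missing."""
        values = self.get_values(name)
        return values[0] if values else None

    def get_values(self, name: str) -> list[str]:
        """Return all values of a field as strings; empty list if missing."""
        if name not in self._source or self._source[name] is None:
            return []
        value = self._source[name]
        items = value if isinstance(value, list) else [value]
        return [to_field_string(item) for item in items]


doc = StringFieldView({
    "id": 1234,
    "published": True,
    "updateDate": datetime(2020, 2, 11),
    "tags": ["search", "elastic"],
})
print(doc.get("id"), doc.get("published"), doc.get("updateDate"))
print(doc.get_values("tags"))
```

The source document keeps its real types, and the string conversion only happens when code that expects Lucene-style fields asks for a value.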

Plans for the future

As I want to work with Elastic on every possible project, I think as part of future changes I will focus on delivering the best developer experience, and I will try to reimplement as much as possible to support NEST instead of Lucene queries inside my providers.

I am also planning to look into the Umbraco source code and propose changes which will allow developers to use a better abstraction in the core of Umbraco.

You can find the package file here.
