Speeding up Sitecore index rebuilds with a custom MediaItemContentExtractor

On a recent Sitecore project of mine we relied very heavily on Lucene indexes and the Sitecore Content Search API. For various reasons we had to rebuild the indexes in almost every production deployment, and the number of content items was approaching 200,000. It would take hours to rebuild the indexes, making deployments unnecessarily painful.

We identified a number of tweaks when digging through the Content Search API. The biggest improvement by far came from slightly changing the way media items are indexed.

A reasonably large portion of the content in this project was media items, the contents of which we also had to index (e.g. PDF, MS Word and RTF documents, including some very large ones). Sitecore indexes media items using the MediaItemContentExtractor computed field, which relies on COM calls and Windows IFilters to read and extract indexable content from media items. The process is as follows:

  1. Get item to index
  2. Retrieve the media item content blob from the database
  3. Write the media item to the file system
  4. Invoke the IFilter (using COM technology)
  5. Wait for the IFilter to read the document
  6. Delete the file from the file system
  7. Store the result of the IFilter on the index
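To make the cost of that round trip concrete, here is a minimal Python sketch of steps 3 to 6. It is not Sitecore code: `extract_text_from_file` is a hypothetical stand-in for the IFilter call (it just reads the file as text), and the function names are made up for illustration.

```python
import os
import tempfile


def extract_text_from_file(path):
    """Hypothetical stand-in for the IFilter call: reads the file as UTF-8 text."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def extract_media_content(blob, extension=".txt"):
    """Uncached flow: blob -> temp file -> extractor -> delete temp file."""
    # Step 3: write the blob to the file system.
    fd, path = tempfile.mkstemp(suffix=extension)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        # Steps 4-5: invoke the extractor and wait for its result.
        return extract_text_from_file(path)
    finally:
        # Step 6: always delete the temporary file, even if extraction fails.
        os.remove(path)
```

Every call pays for a disk write, an extraction and a disk delete, no matter how often the same document has been processed before.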

When using Lucene, the above steps are performed on each server (e.g. the CM and each CD) for each media item. This means every server pulls down the media item, writes it to disk, invokes the IFilter and so on. That is quite a heavy operation, and the outcome is the same every time. This is where our small but very effective tweak came into play: we added a simple SQL Server-based cache for the IFilter result, so the above process is only invoked once per media item. Every subsequent index operation looks like this:

  1. Get item to index
  2. Get IFilter result from cache
  3. Store the result of the IFilter on the index
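The shortened flow is essentially a cache-aside lookup. The sketch below illustrates the pattern in Python under some loud assumptions: a plain dictionary stands in for the shared SQL cache, `CachedExtractor` and its parameters are hypothetical names, and the extractor is any callable that turns a blob into text.

```python
class CachedExtractor:
    """Cache-aside wrapper around an expensive text extractor."""

    def __init__(self, extract, cache=None):
        self._extract = extract  # expensive call, e.g. the IFilter round trip
        # A dict stands in for the shared SQL-backed cache (assumption).
        self._cache = {} if cache is None else cache
        self.misses = 0

    def get_content(self, item_id, load_blob):
        # Step 2: try the cache first.
        if item_id in self._cache:
            return self._cache[item_id]
        # Cache miss: load the blob, extract once and remember the result.
        self.misses += 1
        content = self._extract(load_blob())
        self._cache[item_id] = content
        return content


# Two "servers" sharing one cache: only the first one pays for extraction.
shared = {}
cm = CachedExtractor(lambda blob: blob.decode("utf-8"), shared)
cd = CachedExtractor(lambda blob: blob.decode("utf-8"), shared)
cm.get_content("item-1", lambda: b"document text")
cd.get_content("item-1", lambda: b"document text")
```

Passing `load_blob` as a callable means the blob is not even fetched from the database on a cache hit, which matches the shortened list of steps above.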

With this cache in place we were able to bring a rebuild of the master index from hours (>3h) down to 20 minutes!

The code below shows a simplified version of the custom MediaItemContentExtractor, which inherits from the Sitecore out-of-the-box version. I have removed some null checks etc. for readability. This exact example is based on Sitecore 7.5.

It simply replaces the existing MediaItemContentExtractor which can be done through a Sitecore patch file:

The cache is backed by SQL so that it is shared across all servers, and we are using a simple data context with Entity Framework. We decided to store a cached version for each revision of an item (allowing users to, e.g., replace the content of an item, which results in a new revision ID and therefore a new cache entry). The MediaCacheRepository could look something like this (simplified):
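As a rough, language-neutral illustration of the same idea (this is a sketch, not the Entity Framework code from the project), the Python example below keys the cache on item ID plus revision. It assumes SQLite in place of SQL Server, and the table and column names are invented; only the class name is borrowed from the text above.

```python
import sqlite3


class MediaCacheRepository:
    """Stores extracted text per (item ID, revision) in a SQL table."""

    def __init__(self, connection):
        self._conn = connection
        # Hypothetical schema: one row per item revision.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS media_cache ("
            " item_id TEXT NOT NULL,"
            " revision TEXT NOT NULL,"
            " content TEXT NOT NULL,"
            " PRIMARY KEY (item_id, revision))"
        )

    def get(self, item_id, revision):
        """Return the cached text, or None if this revision is not cached."""
        row = self._conn.execute(
            "SELECT content FROM media_cache WHERE item_id = ? AND revision = ?",
            (item_id, revision),
        ).fetchone()
        return row[0] if row else None

    def put(self, item_id, revision, content):
        """Insert or overwrite the cached text for this revision."""
        # INSERT OR REPLACE (SQLite syntax) keeps repeated writes idempotent.
        self._conn.execute(
            "INSERT OR REPLACE INTO media_cache (item_id, revision, content)"
            " VALUES (?, ?, ?)",
            (item_id, revision, content),
        )
        self._conn.commit()
```

Because the revision is part of the key, replacing a document's content simply results in a cache miss for the new revision, while older revisions stay cached until they are cleaned up.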

 
