Monday, April 28, 2008

Improving internal search

I think it is critical that our people have tools to find information inside and outside our organization. The more high-quality information we can place in the hands of the practitioner, the better they can do their job. I believe intranet search is an area where we can improve on this capability. The first step towards improving is to measure where we are today; if we do not know where we are, we cannot know whether we have improved. We have taken that baseline, and we have started to "tweak" our environment to improve on it. As we prove out improvements in the lab, we move them to production as Beta implementations. As improvements are proved out in Beta, they are moved into production search.

How did we take that baseline? The standard for measuring search is based on a subjective, human understanding of relevance, and it was set by NIST. NIST holds an annual event (the Text REtrieval Conference, or TREC) where search engines are tested and tweaked against a standard, fairly small set of content. Experts evaluate that content and determine the ideal documents within the set, and complex queries supplied by those experts are then run against the search engine. Two things are measured: how many documents within the first N results come from the ideal set (precision), and how many of the ideal documents the search engine can find at all (recall). This seemed like a reasonable process, so that is how we measured our impact on the search engine.
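To make the two measures concrete, here is a minimal sketch in Python; the document identifiers are just placeholders, and any ranked result list plus an ideal set of IDs would do:

    def precision(results, ideal):
        """Fraction of the returned results that are in the ideal set."""
        if not results:
            return 0.0
        hits = sum(1 for doc in results if doc in ideal)
        return hits / float(len(results))

    def recall(results, ideal):
        """Fraction of the ideal set that shows up in the returned results."""
        if not ideal:
            return 0.0
        hits = sum(1 for doc in results if doc in ideal)
        return hits / float(len(ideal))

    # e.g. precision(["d1", "d2", "d9"], {"d1", "d2", "d3"}) -> 0.67
    #      recall(["d1", "d2", "d9"], {"d1", "d2", "d3"})    -> 0.67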

We established a test environment: a new instance of the search engine, indexing the production content. We asked for volunteers to act as our experts and had each of them choose an area of expertise. We then had them identify the "top" 25 documents within that area, given a query that they suggested. This became our ideal set, and we used it to measure precision and recall at 3, 10 and 25 results.
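As an illustration of how one expert query gets scored, here is a small, self-contained sketch; the ideal set and the engine's results are made up, but the arithmetic at each cutoff is exactly what we do:

    ideal_set = {"doc%02d" % i for i in range(1, 26)}        # the expert's 25 "top" documents
    engine_results = ["doc%02d" % i for i in range(10, 35)]  # ranked results the engine returned

    for k in (3, 10, 25):
        hits = sum(1 for doc in engine_results[:k] if doc in ideal_set)
        print("P@%-2d = %.2f   R@%-2d = %.2f"
              % (k, hits / float(k), k, hits / float(len(ideal_set))))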

Search engine vendors usually apply more than one approach to search and combine them to create a relevance ranking. Generally, the various approaches to calculating relevancy - Vector, Probability, Language Model, Inference Networks, Boolean Indexing, Latent Semantic Indexing, Neural Networks, Genetic Algorithms and Fuzzy Set Retrieval - are all capable of retrieving a good set of documents for a query. On a similar set of content, with a similar query, they will each retrieve the same documents in about the same order. The difference comes from the work done to "tweak" the engine for a particular content collection.
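As a toy illustration of that blending step, the sketch below combines two deliberately crude scorers (a vector-space cosine and a stand-in for a probabilistic score) with hand-picked weights. The documents, scorers and weights are all hypothetical, and real engines use far more sophisticated models, but the shape of the calculation is the same:

    import math
    from collections import Counter

    docs = {
        "d1": "intranet search relevance tuning guide",
        "d2": "quarterly travel expense policy",
        "d3": "search engine index maintenance schedule",
    }

    def cosine_score(query, doc):
        # Vector-space model: cosine similarity between term-frequency vectors.
        q, d = Counter(query.split()), Counter(doc.split())
        dot = sum(q[t] * d[t] for t in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
        return dot / norm if norm else 0.0

    def overlap_score(query, doc):
        # Crude stand-in for a probabilistic model: fraction of query terms present.
        terms = query.split()
        return sum(1.0 for t in terms if t in doc.split()) / len(terms)

    def combined_rank(query, w_vec=0.6, w_prob=0.4):
        # Weighted blend of the two scorers, highest combined score first.
        scored = [(w_vec * cosine_score(query, text) + w_prob * overlap_score(query, text), doc_id)
                  for doc_id, text in docs.items()]
        return sorted(scored, reverse=True)

    for score, doc_id in combined_rank("search relevance tuning"):
        print("%.3f  %s" % (score, doc_id))

In a real engine it is the blending itself, rather than the individual models, that gets "tweaked" for a given content collection.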

One thing that can improve search relevance is a set of known good documents for a particular query, presented either as "Best Bets" or used behind the scenes as a "golden sample" to refine relevancy. This presupposes that your queries follow a Zipf distribution, so that you can reach most of your users by tuning a relatively small set of queries. Unfortunately, our current search logs do not follow a Zipf distribution: our top 100 queries do not represent 10% of our total queries, and to reach an 80% penetration rate we would need to generate 20,000 separate "golden samples". This tuning method depends on the Zipf distribution to be effective; when your logs are flat, like ours are, it does not scale.
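This is roughly the check we ran against our own logs. The sketch below assumes a hypothetical log file with one query per line; it reports how much of the total query volume the top 100 queries cover, and how many distinct queries would have to be tuned to cover 80% of all searches:

    from collections import Counter

    def coverage_report(queries, top_n=100, target=0.80):
        counts = Counter(q.strip().lower() for q in queries if q.strip())
        total = float(sum(counts.values()))
        ranked = [c for _, c in counts.most_common()]   # query frequencies, most common first

        top_share = sum(ranked[:top_n]) / total

        running, needed = 0, 0
        for c in ranked:
            running += c
            needed += 1
            if running / total >= target:
                break

        print("Top %d queries cover %.1f%% of query volume" % (top_n, 100 * top_share))
        print("%d distinct queries needed for %.0f%% coverage" % (needed, 100 * target))

    with open("search_queries.log") as f:   # hypothetical log: one query per line
        coverage_report(f)

On a Zipf-shaped log the first number is large and the second is small; on a flat log like ours, it is the other way around.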

We are also investigating other search technologies, evaluating them on our current content to see how well they retrieve documents with the same queries. What we are seeing is that, by our measures of precision and recall, these other engines are not significantly better against our content than our existing implementation. What they do offer is significantly better tools for tweaking the search engine to our content collection.

I hope all of the above has been of interest to you. I have found it interesting to compare vendor claims to reality as we have worked through these projects; the difference is quite large. Nothing is ever simple or easy. I would rather have a "silver bullet", but recent evidence shows that one does not exist. The only silver bullet is to invest hard work, pay attention to the details and persevere, and I think improving search is worth that investment.