I am not going to claim this is the only way to evaluate an enterprise search engine. But early on in my search program, we started thinking about how we could measure our progress with search. How could we tell whether a search engine tweak caused the results to improve, get worse, or stay the same? It seems like a simple question, no? How do you evaluate a search engine?
Companies have been selling search engines for decades. IBM had a product called STAIRS as far back as the early 1970s. Given that long history, you would think there would be a simple answer: do x, look at y, and if it is larger than z, you have a good search engine. In fact, evaluating a search engine is a complex question. Search is a very context-sensitive behavior. In a knowledge environment, the documents that are of interest to you are not the same as the documents that are of interest to me. Search can really only be evaluated within a specific context for a specific user.
There is, however, a standard for search evaluation. The standard is based on a subjective, human understanding of relevance, and it was set by NIST, the National Institute of Standards and Technology. NIST holds an annual event, the Text REtrieval Conference (TREC), where search is tested and tweaked. They use a standard, fairly small set of content. They have experts evaluate the content and determine the ideal documents within that standard set. They then use queries supplied by those experts to bring back documents from the search engine. They measure how many of the documents within a given number of results come from the ideal set (precision), and how many of the ideal documents the search engine can find (recall). This seemed like a reasonable process, so that is how we measured our impact on the search engine.
We established a test environment – a new instance of the search engine, indexing the production content. We asked for volunteers to act as our experts. We asked them to establish an area of expertise for themselves. We then had them identify the “top” 25 documents within that area of expertise, given a query that they suggested. This became our ideal set. We used that ideal set to measure precision and recall at 3, 10 and 25 results.
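To make the two measures concrete, here is a minimal sketch of precision and recall at a cutoff, assuming each result and each ideal document is identified by a unique ID. The function names and the sample document IDs are illustrative assumptions, not taken from the actual test setup, and the sample ideal set is kept much smaller than 25 for readability.

```python
def precision_at_k(results, ideal, k):
    """Fraction of the first k result slots filled by ideal documents.

    Divides by k even if fewer than k results came back, so a short
    result list is penalized (a common convention, assumed here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in results[:k] if doc in ideal)
    return hits / k


def recall_at_k(results, ideal, k):
    """Fraction of the ideal set that appears in the first k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not ideal:
        return 0.0
    hits = len(set(results[:k]) & set(ideal))
    return hits / len(ideal)


def evaluate(results, ideal, cutoffs=(3, 10, 25)):
    """Return {cutoff: (precision, recall)} for each cutoff."""
    return {
        k: (precision_at_k(results, ideal, k), recall_at_k(results, ideal, k))
        for k in cutoffs
    }


# Hypothetical data: the expert's ideal set and the engine's ranked results.
ideal = {"doc-a", "doc-b", "doc-c", "doc-d"}
results = ["doc-a", "doc-x", "doc-b", "doc-y", "doc-z",
           "doc-c", "doc-q", "doc-r", "doc-s", "doc-t"]

scores = evaluate(results, ideal)
for k, (precision, recall) in scores.items():
    print(f"at {k}: precision={precision:.2f} recall={recall:.2f}")
```

Running the same queries and ideal sets before and after a tweak, and comparing the scores at each cutoff, gives one way to tell whether a change helped, hurt, or made no difference.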
We also established another way of measuring the impact of our changes. We created a Beta site and asked real users to tell us how we were doing: had we improved, declined, or stayed about the same?
From this testing we determined that we could improve relevancy, and that these improvements would be noticeable to the end user.