Filtering a Lucene search

Erik Hatcher, Otis Gospodnetić, and Michael McCandless

This article is taken from Chapter 5 of Lucene in Action, 2nd Edition, by Erik Hatcher, Otis Gospodnetić, and Michael McCandless. This article addresses using filters in Lucene. Some projects need more than the basic searching mechanisms. Filters constrain document search space, regardless of the query. For the book's table of contents, the Author Forum, and other resources, go to http://manning.com/hatcher3/.

Filtering is a mechanism for narrowing the search space, allowing only a subset of the documents to be considered as possible hits. Filters can be used to implement search-within-search features, which successively search within a previous set of results, or to constrain the document search space for security or external data reasons. A security filter is a powerful example: it lets users see only search results for documents they own, even if their query technically matches other documents that are off limits. We provide an example of a security filter in the section "Security filters".

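You can filter any Lucene search by using the overloaded search methods that accept a Filter parameter. There are several built-in Filter implementations; this article covers the following:

RangeFilter matches only documents containing terms within a specified range of terms.
QueryWrapperFilter uses the results of a query as the searchable document space for a new query.
PrefixFilter matches only documents containing terms that start with a specified prefix.
CachingWrapperFilter wraps another filter and caches its results so that they can be reused.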

Before you get concerned about mentions of caching results, rest assured that it's done with a tiny data structure (a BitSet) where each bit position represents a document.
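To make the size of such a structure concrete, here is a minimal sketch using only java.util.BitSet. It is not Lucene code: the class DocBitsSketch and its methods are hypothetical names introduced for illustration, and the size estimate assumes exactly one bit per document with no extra bookkeeping.

```java
import java.util.BitSet;

// Hypothetical sketch (not a Lucene class): one bit per document ID.
final class DocBitsSketch {

    // Marks the given document IDs as allowed in a bit set sized for maxDoc documents.
    static BitSet allow(int maxDoc, int... docIds) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // Approximate payload size in bytes, assuming one bit per document, rounded up.
    static long approxBytes(int maxDoc) {
        return (maxDoc + 7L) / 8L;
    }

    public static void main(String[] args) {
        BitSet allowed = allow(10, 2, 5, 7);
        System.out.println("doc 5 allowed: " + allowed.get(5));
        System.out.println("doc 3 allowed: " + allowed.get(3));
        System.out.println("allowed docs:  " + allowed.cardinality());
        System.out.println("bytes for 1,000,000 docs: " + approxBytes(1_000_000));
    }
}
```

Under these assumptions, a cached result for an index of one million documents occupies roughly 125,000 bytes, which is why caching filter results is usually affordable.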

Consider, also, the alternative to using a filter: aggregating required clauses in a BooleanQuery. Now let's discuss each of the built-in filters as well as the BooleanQuery alternative.

Using RangeFilter

RangeFilter filters on a range of terms in a specific field. What that range means depends on the kind of data the field holds. If the field is a date field, then you get a date range filter. If it's an integer field, you can filter by numeric range. If the field is simply textual, for example last names, then you can filter for all names within a certain alphabetic range such as M to Q.
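Because the range is defined over terms, the comparison is lexicographic. The following sketch illustrates that with plain Java collections rather than Lucene: a sorted set stands in for a field's terms, and the class TermRangeSketch and the sample names are hypothetical.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch (not a Lucene class): a term range over a sorted set of strings.
final class TermRangeSketch {

    // Returns the terms between lower and upper, honoring the inclusive flags.
    static NavigableSet<String> range(NavigableSet<String> terms,
                                      String lower, boolean includeLower,
                                      String upper, boolean includeUpper) {
        return terms.subSet(lower, includeLower, upper, includeUpper);
    }

    public static void main(String[] args) {
        NavigableSet<String> names = new TreeSet<>(
            Arrays.asList("Adams", "Miller", "Nolan", "Quinn", "Smith"));

        // "Quinn" sorts after "Q", so an inclusive upper bound of "Q" leaves it out.
        System.out.println(range(names, "M", true, "Q", true));

        // An exclusive upper bound of "R" covers every name starting with M through Q.
        System.out.println(range(names, "M", true, "R", false));
    }
}
```

The same ordering rule is why date and numeric values only behave as expected in a term range when their text form sorts in the same order as the values themselves.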

Let's start with date filtering. Given a date field, you can filter as shown in testDateFilter() in Listing 1. Our book data indexes the last modified date of each book data file as a modified field, indexed with Field.Index.NOT_ANALYZED and Field.Store.YES. We test the date RangeFilter by using an all-inclusive query, which by itself returns all documents.

Listing 1: Using RangeFilter to filter by date range

public class FilterTest extends LiaTestCase { 
  private Query allBooks; 
  private IndexSearcher searcher; 
  private int numAllBooks; 
 
  protected void setUp() throws Exception {    // #1 
    super.setUp(); 
 
    allBooks = new ConstantScoreRangeQuery( 
                      "pubmonth", 
                      "190001", 
                      "200512", 
                      true, true); 
    searcher = new IndexSearcher(bookDirectory); 
    TopDocs hits = searcher.search(allBooks, 20); 
    numAllBooks = hits.totalHits; 
  } 
 
  public void testDateFilter() throws Exception { 
    String jan1 = parseDate("2004-01-01"); 
    String jan31 = parseDate("2004-01-31"); 
    String dec31 = parseDate("2004-12-31"); 
 
    RangeFilter filter = new RangeFilter("modified", jan1, dec31, true, true); 
 
    TopDocs hits = searcher.search(allBooks, filter, 20); 
    assertEquals("all modified in 2004", numAllBooks, hits.scoreDocs.length); 
 
    filter = new RangeFilter("modified", jan1, jan31, true, true); 
    hits = searcher.search(allBooks, filter, 20); 
    assertEquals("none modified in January", 0, hits.scoreDocs.length); 
  }

  // The remaining test methods shown in this article belong to this class.
}

#1 setUp() establishes a baseline count of all the books in our index,
allowing for comparisons when we use an all-inclusive date filter.

The first parameter in both RangeFilter constructor calls is the name of a date field in the index. In our sample data this field is named modified; it holds the last modified date of the source data file. The final two boolean arguments to the RangeFilter constructor, includeLower and includeUpper, determine whether the lower and upper terms should be included in or excluded from the filter.

Open-ended range filtering

RangeFilter also supports open-ended ranges. To filter on ranges with one end of the range specified and the other end open, just pass null for whichever end should be open:

filter = new RangeFilter("modified", null, jan31, false, true); 

filter = new RangeFilter("modified", jan1, null, true, false); 

RangeFilter provides two static convenience methods to achieve the same thing:

filter = RangeFilter.Less("modified", jan31); 
 
filter = RangeFilter.More("modified", jan1); 
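The bound-handling rules described above (inclusive flags plus a null for an open end) can be sketched as a single predicate. This is an illustration in plain Java, not RangeFilter's implementation; the class OpenRangeSketch is hypothetical, and the yyyyMMdd strings are an assumed term format, since the output of parseDate() is not shown here.

```java
// Hypothetical sketch (not Lucene code): range membership with optional open ends.
final class OpenRangeSketch {

    // Returns true if term lies within the range; a null bound means that end is open.
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        if (lower != null) {
            int cmp = term.compareTo(lower);
            if (cmp < 0 || (cmp == 0 && !includeLower)) {
                return false;
            }
        }
        if (upper != null) {
            int cmp = term.compareTo(upper);
            if (cmp > 0 || (cmp == 0 && !includeUpper)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Assumed term format: yyyyMMdd, which sorts in date order.
        String jan1 = "20040101";
        String jan31 = "20040131";

        // Open lower end: everything up to and including January 31.
        System.out.println(inRange("20031215", null, jan31, false, true));
        System.out.println(inRange("20040201", null, jan31, false, true));

        // Open upper end: everything from January 1 onward.
        System.out.println(inRange("20040101", jan1, null, true, false));
        System.out.println(inRange("20031231", jan1, null, true, false));
    }
}
```

Note that the inclusive flag for an open end has no effect, because there is no bound term to include or exclude.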

Using QueryWrapperFilter

More generally useful than RangeFilter is QueryWrapperFilter. QueryWrapperFilter uses the hits of one query to constrain the documents available to a subsequent search. The result is a DocIdSet representing the documents matched by the filtering query. Using a QueryWrapperFilter, we restrict the documents searched to a specific category:

  public void testQueryWrapperFilter() throws Exception { 
    TermQuery categoryQuery = new TermQuery(new Term("category", "/philosophy/eastern")); 
 
    Filter categoryFilter = new QueryWrapperFilter(categoryQuery); 
 
    TopDocs hits = searcher.search(allBooks, categoryFilter, 20); 
    assertEquals("only tao te ching", 1, hits.scoreDocs.length); 
  } 

Here we're searching for all the books (see setUp() in Listing 1) but constraining the search using a filter for a category that contains a single book. The section "A QueryWrapperFilter alternative" compares this test with an equivalent BooleanQuery.

QueryWrapperFilter can even replace RangeFilter usage, although it requires a few more lines of code, isn't nearly as elegant, and likely performs worse. The following code demonstrates date filtering using a QueryWrapperFilter on a RangeQuery:

  public void testQueryWrapperFilterWithRangeQuery() throws Exception { 
    String jan1 = parseDate("2004-01-01"); 
    String dec31 = parseDate("2004-12-31"); 
 
    Query rangeQuery = new RangeQuery("modified", jan1, dec31, true, true); 
 
    Filter filter = new QueryWrapperFilter(rangeQuery); 
    TopDocs hits = searcher.search(allBooks, filter, 20); 
    assertEquals("all of 'em", numAllBooks, hits.scoreDocs.length); 
  } 

Security filters

Another example of document filtering constrains documents with security in mind. Our example assumes documents are associated with an owner, which is known at indexing time. We index two documents; both have the term info in their keywords field, but each document has a different owner:

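import junit.framework.TestCase;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class SecurityFilterTest extends TestCase {
  private RAMDirectory directory;

  protected void setUp() throws Exception {
    directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory,
                                         new WhitespaceAnalyzer(),
                                         IndexWriter.MaxFieldLength.LIMITED);

    // Elwood's document; owner is indexed as a single, unanalyzed term
    Document document = new Document();
    document.add(new Field("owner", "elwood",
                           Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("keywords", "elwood's sensitive info",
                           Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(document);

    // Jake's document; same keywords term "info", different owner
    document = new Document();
    document.add(new Field("owner", "jake",
                           Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.add(new Field("keywords", "jake's sensitive info",
                           Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(document);

    writer.close();
  }

  // testSecurityFilter() from Listing 2 belongs in this class.
}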

Using a TermQuery for info in the keywords field finds both documents, naturally. Suppose, though, that Jake is using the search feature in our application, and only documents he owns should be searchable by him. We can use a QueryWrapperFilter to constrain the search space to only the documents he owns, as shown in Listing 2.

Listing 2: Securing the search space with a filter

  public void testSecurityFilter() throws Exception { 
    TermQuery query = new TermQuery(new Term("keywords", "info"));     //#1 
 
    IndexSearcher searcher = new IndexSearcher(directory); 
    TopDocs hits = searcher.search(query, 10);                         //#2 
    assertEquals("Both documents match", 2, hits.totalHits);           //#2 
 
    Filter jakeFilter = new QueryWrapperFilter(                        //#3 
      new TermQuery(new Term("owner", "jake")));                       //#3 
 
    hits = searcher.search(query, jakeFilter, 10); 
    assertEquals(1, hits.totalHits);                                   //#4 
    assertEquals("elwood is safe",                                     //#4 
                 "jake's sensitive info",                              //#4 
        searcher.doc(hits.scoreDocs[0].doc).get("keywords"));          //#4 
  } 
   
#1 This is a general TermQuery for info.
#2 All documents containing info are returned.
#3 Here, the filter constrains the search to only documents owned by "jake".
#4 Only Jake's document is returned, using the same info TermQuery.

If your security requirements are this straightforward, where documents can be associated with users or roles during indexing, using a QueryWrapperFilter will work nicely. However, this scenario is oversimplified for most needs; the ways that documents are associated with roles may be quite a bit more dynamic. QueryWrapperFilter is useful only when the filtering constraints are present as field information within the index itself.
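Conceptually, the security filter intersects two sets of documents: those matching the user's query and those the user owns. The following sketch mimics that with two in-memory arrays and java.util.BitSet. It is an illustration only, not how QueryWrapperFilter is implemented; the class OwnerFilterSketch and its methods are hypothetical, and the sample data mirrors the two documents indexed above.

```java
import java.util.BitSet;

// Hypothetical sketch (not Lucene code): a query result restricted by an owner filter.
final class OwnerFilterSketch {

    // A stand-in for the index: position i holds document i's owner and keywords.
    static final String[] OWNERS = {"elwood", "jake"};
    static final String[] KEYWORDS = {"elwood's sensitive info", "jake's sensitive info"};

    // Documents whose whitespace-separated keywords contain the given term.
    static BitSet matchKeyword(String term) {
        BitSet bits = new BitSet(KEYWORDS.length);
        for (int doc = 0; doc < KEYWORDS.length; doc++) {
            for (String token : KEYWORDS[doc].split("\\s+")) {
                if (token.equals(term)) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }

    // Documents owned by the given user; this plays the role of the filter.
    static BitSet ownedBy(String owner) {
        BitSet bits = new BitSet(OWNERS.length);
        for (int doc = 0; doc < OWNERS.length; doc++) {
            if (OWNERS[doc].equals(owner)) {
                bits.set(doc);
            }
        }
        return bits;
    }

    // Query hits that survive the owner filter.
    static BitSet search(String term, String owner) {
        BitSet hits = matchKeyword(term);
        hits.and(ownedBy(owner));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("unfiltered hits: " + matchKeyword("info").cardinality());
        BitSet jakeHits = search("info", "jake");
        System.out.println("jake's hits:     " + jakeHits.cardinality());
        System.out.println("visible to jake: " + KEYWORDS[jakeHits.nextSetBit(0)]);
    }
}
```

The sketch also shows the limitation noted above: the owner must already be recorded alongside each document for the filter to be computed from the index alone.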

A QueryWrapperFilter alternative

You can constrain a query to a subset of documents another way, by combining the constraining query with the original query as a required clause of a BooleanQuery. There are a couple of important differences, despite the fact that the same documents are returned from both. If you use CachingWrapperFilter around your QueryWrapperFilter, you can cache the set of documents allowed, probably speeding up successive searches using the same filter. In addition, document scores are unlikely to be the same. The score difference makes sense when you look at the scoring formula. With BooleanQuery aggregation, the constraining clause is itself scored, so its term weight (including its IDF factor) contributes to each hit's score and to query normalization. A filter, by contrast, only restricts which documents are considered and contributes nothing to the score.

This test case demonstrates how to "filter" using BooleanQuery aggregation. It returns the same document as testQueryWrapperFilter, although the score may differ:

  public void testFilterAlternative() throws Exception { 
    TermQuery categoryQuery = new TermQuery(new Term("category", "/philosophy/eastern")); 
 
    BooleanQuery constrainedQuery = new BooleanQuery();
    constrainedQuery.add(allBooks, BooleanClause.Occur.MUST);
    constrainedQuery.add(categoryQuery, BooleanClause.Occur.MUST);
 
    TopDocs hits = searcher.search(constrainedQuery, 20); 
    assertEquals("only tao te ching", 1, hits.scoreDocs.length);
  } 

This technique of aggregating queries works well with queries parsed by QueryParser, allowing users to enter free-form queries while an API-controlled query restricts the set of documents searched.

PrefixFilter

PrefixFilter matches documents containing terms that start with a specified prefix. We can use this to search for all books published in any year in the 1900s:

  public void testPrefixFilter() throws Exception {
    Filter prefixFilter = new PrefixFilter(new Term("pubmonth", "19")); 
 
    TopDocs hits = searcher.search(allBooks, prefixFilter, 20); 
    assertEquals("only 19XX books", 7, hits.totalHits); 
  } 
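Prefix matching works well over sorted terms because all terms sharing a prefix are adjacent. The following sketch shows the idea with a TreeSet standing in for the pubmonth terms. It is an assumption-laden illustration, not PrefixFilter's actual implementation; the class PrefixScanSketch and the sample pubmonth values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch (not Lucene code): collecting sorted terms that share a prefix.
final class PrefixScanSketch {

    // Seeks to the first term >= prefix, then reads until a term no longer matches.
    static List<String> termsWithPrefix(NavigableSet<String> sortedTerms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : sortedTerms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up pubmonth values in yyyyMM form.
        NavigableSet<String> pubmonths = new TreeSet<>(
            Arrays.asList("198810", "199905", "199910", "200107", "200403"));
        System.out.println(termsWithPrefix(pubmonths, "19"));
    }
}
```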

Caching filter results

The biggest benefit from filters comes when they are cached and reused, using CachingWrapperFilter, which takes care of caching automatically (internally using a WeakHashMap, so that dereferenced entries get garbage collected). You can cache any Filter using CachingWrapperFilter. Filters cache by using the IndexReader as the key, which means searching should also be done with the same instance of IndexReader to benefit from the cache. If you aren't constructing IndexReader yourself, but rather are creating an IndexSearcher from a directory, you must use the same instance of IndexSearcher to benefit from the caching. When index changes need to be reflected in searches, discard IndexSearcher and IndexReader and reinstantiate.

To demonstrate its usage, we return to the date-range filtering example. We want to use RangeFilter, but we'd like to benefit from caching to improve performance:

  public void testCachingWrapper() throws Exception { 
    String jan1 = parseDate("2004-01-01"); 
    String dec31 = parseDate("2004-12-31"); 
 
    RangeFilter dateFilter = new RangeFilter("modified", jan1, dec31, true, true); 
 
    CachingWrapperFilter cachingFilter = new CachingWrapperFilter(dateFilter); 
    TopDocs hits = searcher.search(allBooks, cachingFilter, 20); 
    assertEquals("all of 'em", numAllBooks, hits.totalHits); 
  } 

Successive uses of the same CachingWrapperFilter instance with the same IndexSearcher instance will bypass using the wrapped filter, instead using the cached results.
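The caching behavior described in this section (results keyed by reader, held in a WeakHashMap) can be sketched in a few lines of plain Java. This is not CachingWrapperFilter's source: the class CachingFilterSketch, its type parameter R standing in for a reader, and the computation counter are all hypothetical and exist only to make the cache hits observable.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical sketch (not Lucene code): caching a filter's bits per reader instance.
final class CachingFilterSketch<R> {
    private final Function<R, BitSet> wrapped;                  // the expensive filter
    private final Map<R, BitSet> cache = new WeakHashMap<>();   // entries vanish with their reader
    private int computations = 0;

    CachingFilterSketch(Function<R, BitSet> wrapped) {
        this.wrapped = wrapped;
    }

    // Returns cached bits for this reader, computing them only on the first request.
    synchronized BitSet bits(R reader) {
        BitSet cached = cache.get(reader);
        if (cached == null) {
            cached = wrapped.apply(reader);
            cache.put(reader, cached);
            computations++;
        }
        return cached;
    }

    // Number of times the wrapped filter was actually evaluated.
    synchronized int computations() {
        return computations;
    }

    public static void main(String[] args) {
        CachingFilterSketch<Object> filter = new CachingFilterSketch<>(reader -> {
            BitSet bits = new BitSet();
            bits.set(1);
            return bits;
        });
        Object readerA = new Object();
        Object readerB = new Object();

        filter.bits(readerA);
        filter.bits(readerA); // served from the cache
        filter.bits(readerB); // a new reader forces a fresh computation
        System.out.println("computations: " + filter.computations());
    }
}
```

Keying by reader is what ties the cache to a particular view of the index: a newly opened reader misses the cache, which matches the advice to keep reusing the same searcher until index changes need to become visible.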

Beyond the built-in filters

Lucene isn't restricted to using the built-in filters.

An additional filter found in the Lucene Sandbox, ChainedFilter, allows for complex chaining of filters. Writing custom filters allows external data to factor into search constraints; however, a bit of detailed Lucene API know-how may be required to be highly efficient.

And if these filtering options aren't enough, Lucene adds another interesting use of a filter. FilteredQuery filters a query, as IndexSearcher's search(Query, Filter, int) does, except that it is itself a query, so it can be used as a single clause within a BooleanQuery. Using FilteredQuery seems to make sense only when using custom filters, but that is for another day.

Resources