Posted to java-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/12/30 12:25:07 UTC

[Lucene-java Wiki] Update of "SearchNumericalFields" by UweSchindler

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by UweSchindler:
http://wiki.apache.org/lucene-java/SearchNumericalFields

The comment on the change is:
Add TrieRangeQuery. After release of 2.9, we should remove the really old parts

------------------------------------------------------------------------------
  = Searching Numerical Fields =
+ 
+ == TrieRangeQuery (in contrib/search since version 2.9-dev, which is not yet released) ==
+ 
+ Because Apache Lucene is a full-text search engine and not a conventional database, it cannot handle numerical ranges efficiently (e.g., matching field values that lie inside user-defined bounds; note that dates are numerical values too). A contrib extension was developed that stores numerical values in a special string-encoded format with variable precision: all numerical values such as doubles, longs, and timestamps are converted to lexicographically sortable string representations and stored with different precisions, from one byte up to the full 8 bytes, depending on the variant used. A range is then divided recursively into multiple intervals for searching: the center of the range is searched only with the lowest possible precision in the trie, while the boundaries are matched more exactly. This reduces the number of terms dramatically. See: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/search/trie/package-summary.html
+ 
+ This dramatically improves the performance of range queries in Apache Lucene. Query cost no longer depends on the index size or on the number of distinct values, because the number of terms to match has an upper limit that is unrelated to either property.
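
The recursive splitting described above can be illustrated with a small stand-alone sketch. It is not the contrib implementation: the class and method names are hypothetical, a precision step of 8 bits (one byte per trie level) is assumed, and only non-negative long values are handled. Each sub-range is matched after dropping its lowest `shift` bits, so the middle of the range needs only a few coarse terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split a range of non-negative longs into trie sub-ranges. */
final class TrieSplitSketch {
    static final int PRECISION_STEP = 8; // assumed: one byte per trie level

    /** One sub-range; both bounds are compared after dropping the lowest `shift` bits. */
    static final class SubRange {
        final long min, max;
        final int shift;
        SubRange(long min, long max, int shift) { this.min = min; this.max = max; this.shift = shift; }
        /** Number of distinct prefix terms needed at this precision. */
        long termCount() { return (max >>> shift) - (min >>> shift) + 1; }
    }

    static List<SubRange> split(long min, long max) {
        List<SubRange> out = new ArrayList<>();
        if (min < 0 || min > max) return out; // this sketch covers non-negative ranges only
        for (int shift = 0; ; shift += PRECISION_STEP) {
            long diff = 1L << (shift + PRECISION_STEP);
            long mask = ((1L << PRECISION_STEP) - 1L) << shift;
            boolean hasLower = (min & mask) != 0L;   // lower edge is not aligned at this level
            boolean hasUpper = (max & mask) != mask; // upper edge is not aligned at this level
            long nextMin = (hasLower ? min + diff : min) & ~mask;
            long nextMax = (hasUpper ? max - diff : max) & ~mask;
            boolean lastLevel = shift + PRECISION_STEP >= 64;
            if (lastLevel || nextMin > nextMax || nextMin < min || nextMax > max) {
                out.add(new SubRange(min, max, shift)); // remaining center, coarsest usable precision
                return out;
            }
            if (hasLower) out.add(new SubRange(min, min | mask, shift)); // exact lower boundary
            if (hasUpper) out.add(new SubRange(max & ~mask, max, shift)); // exact upper boundary
            min = nextMin;
            max = nextMax;
        }
    }

    /** Total number of terms that all sub-ranges together have to match. */
    static long totalTerms(List<SubRange> ranges) {
        long sum = 0;
        for (SubRange r : ranges) sum += r.termCount();
        return sum;
    }

    /** Number of full-precision values covered by all sub-ranges together. */
    static long coveredValues(List<SubRange> ranges) {
        long sum = 0;
        for (SubRange r : ranges) sum += r.termCount() << r.shift;
        return sum;
    }
}
```

For example, splitting the range 0x1234 to 0x56789 this way covers all of its roughly 350,000 values with well under a thousand terms, because only the two boundaries are matched at full precision.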
+ 
+ Trie''''''Range''''''Query can be used for date/time searches (if you need variable precision of date and time down to milliseconds), double searches (e.g. spatial search for latitudes or longitudes), and prices (encoded as long cent values; doubles are not good for price values because of rounding problems). The document fields containing the trie-encoded values are generated by the Trie''''''Utils class. The values can also be stored in the index using the trie encoding; for display they can be converted back to the primitive types. Trie''''''Utils also supplies a factory for Sort''''''Field instances on trie-encoded fields that automatically uses an Extended''''''Field''''''Cache.Long''''''Parser for efficient sorting of the primitive types.
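
To make the idea of a lexicographically sortable encoding concrete, here is a minimal stand-alone sketch. It does not reproduce the encoding used by the contrib class; the class name, method names, and the 16-digit hex format are assumptions chosen for illustration. It shows one well-known way to map doubles to longs and longs to strings so that string order matches numeric order, with a round trip back to the primitive type.

```java
import java.util.Locale;

/** Hypothetical sketch of order-preserving encodings; not the contrib API. */
final class SortableEncodingSketch {
    /** Maps a double to a long whose signed order matches the numeric order (NaN aside). */
    static long doubleToSortableLong(double value) {
        long bits = Double.doubleToLongBits(value);
        // Negative doubles sort in reverse as raw bits, so flip everything except the sign bit.
        return bits < 0 ? bits ^ 0x7fffffffffffffffL : bits;
    }

    /** Inverse of doubleToSortableLong. */
    static double sortableLongToDouble(long sortable) {
        long bits = sortable < 0 ? sortable ^ 0x7fffffffffffffffL : sortable;
        return Double.longBitsToDouble(bits);
    }

    /** Encodes a long as 16 hex digits whose string order matches the signed numeric order. */
    static String longToSortableString(long value) {
        // Flipping the sign bit places negative values below positive ones in unsigned order.
        return String.format(Locale.ROOT, "%016x", value ^ Long.MIN_VALUE);
    }

    /** Inverse of longToSortableString. */
    static long sortableStringToLong(String encoded) {
        return Long.parseUnsignedLong(encoded, 16) ^ Long.MIN_VALUE;
    }
}
```

A price such as 19.99 would be passed in as the long value 1999 (cents), while a latitude would first go through doubleToSortableLong.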
+ 
+ Currently Trie''''''Range''''''Query is only available for 64-bit values (long, double, Date); 32-bit support (int, float) is in preparation. Because of the trie encoding, the additional unused bits are no problem for search performance, but the index is larger (more terms per numerical document field).
+ 
+ == Other possibilities: storing numerical values in a more readable form in the index ==
  
   Original post: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg07107.html
   Goal: We want to do a search for something like this: [[BR]]
@@ -10, +22 @@

  
   Answer (from: http://issues.apache.org/eyebrowse/ReadMsg?listId=30&msgNo=7103, Erik Hatcher)
  
- == Utility to pad the numbers ==
+ === Utility to pad the numbers ===
   public class Number''''''Utils {
     private static final Decimal''''''Format formatter =
         new Decimal''''''Format("00000"); // make this as wide as you need
@@ -20, +32 @@

     }
   }
  
- == Index the relevant fields using the pad function ==
+ === Index the relevant fields using the pad function ===
  
         doc.add(Field.Keyword("id", Number''''''Utils.pad(i)));
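
Why padding matters for term ordering can be seen in a small stand-alone sketch. The class name is hypothetical and the five-digit width simply mirrors the "00000" pattern above; it assumes non-negative numbers that fit into that width.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

/** Hypothetical sketch showing that zero-padded numbers sort correctly as strings. */
final class PaddingSketch {
    /** Pads a non-negative number to five digits; widen the pattern as needed. */
    static String pad(int n) {
        // DecimalFormat is not thread-safe, so this sketch creates one per call.
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        DecimalFormat formatter =
            new DecimalFormat("00000", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return formatter.format(n);
    }

    public static void main(String[] args) {
        // Unpadded, "9" sorts after "10" because terms compare character by character.
        System.out.println("9".compareTo("10") > 0);
        // Padded, the string order agrees with the numeric order.
        System.out.println(pad(9).compareTo(pad(10)) < 0);
    }
}
```

Both lines print true. The same padding has to be applied to the bounds of a range at query time, otherwise the query terms will not line up with the indexed terms.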
  
  
- == Use a Custom RangeFilter ==
+ === Use a Custom RangeFilter ===
  
  If you have a size field indexed using NumberTools, build a chained RangeFilter to include a subset such as 1-1500.
  {{{
@@ -41, +53 @@

     return rf; 
     } 
  }}}
- == Consider Using a Filter ==
+ === Consider Using a Filter ===
  
   Building a Query for a number (or a range of numbers) is just like building a Query for a word: it involves scoring based on the frequency of that word (or number) in the index, which usually isn't what people want.  So you may want to consider "Filtering" using the RangeFilter class instead.  It can be a lot more efficient than using the RangeQuery class because it can skip all of the score-related work.
  
   http://nagoya.apache.org/eyebrowse/BrowseList?listName=lucene-user@jakarta.apache.org&by=thread&from=943115
   FelixSchwarz: The link above does not work for me. Do you mean http://mail-archives.apache.org/mod_mbox/lucene-java-user/200411.mbox/%3cPine.LNX.4.58.0411221818360.19461@hal.rescomp.berkeley.edu%3e
  
- == Create a custom QueryParser subclass: ==
+ === Create a custom QueryParser subclass: ===
  
    public class Custom''''''Query''''''Parser extends Query''''''Parser {
     public Custom''''''Query''''''Parser(String field, Analyzer analyzer) {
@@ -89, +101 @@

                              query.toString("field"));
  
  
- == For decimals ==
+ === For decimals ===
  
    You can use a multiplier to make sure you don't have decimals if they cause problems. (comment by sv)
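
As a minimal sketch of the multiplier idea, a decimal value can be scaled to a whole number before it is padded and indexed. The class and method names are hypothetical, and the number of fraction digits (two in the usage note) is an assumption that depends on your data.

```java
import java.math.BigDecimal;

/** Hypothetical sketch: remove decimals by scaling with a power of ten. */
final class DecimalScalingSketch {
    /** Multiplies the decimal by 10^fractionDigits and returns the exact whole number. */
    static long toScaledLong(String decimal, int fractionDigits) {
        // BigDecimal avoids the rounding problems of double arithmetic.
        // longValueExact throws ArithmeticException if fraction digits would be lost.
        return new BigDecimal(decimal).movePointRight(fractionDigits).longValueExact();
    }
}
```

With two fraction digits, toScaledLong("19.99", 2) yields 1999, which can then be padded like any other whole number.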
  
- == Handling positive and negative numbers. ==
+ === Handling positive and negative numbers. ===
   
   If you want a numerical field that may contain positive and negative numbers, you still need to format them as strings. What you must ensure is that for any numbers a and b, if a<b then format(a)<format(b). The problem cases are
     * when one number is negative and the other is positive
@@ -147, +159 @@

   }
   }}}
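
One common way to satisfy the ordering requirement for int values, which is not necessarily the approach taken by the code on this page, is to shift every value by a fixed offset so that all values become non-negative and then zero-pad. The sketch below is stand-alone and its names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch: order-preserving string format for signed int values. */
final class SignedPaddingSketch {
    /** Formats any int as 10 digits so that string order equals numeric order. */
    static String format(int value) {
        // Subtracting Integer.MIN_VALUE maps the int range onto 0..4294967295.
        long shifted = (long) value - Integer.MIN_VALUE;
        return String.format(Locale.ROOT, "%010d", shifted);
    }

    /** Inverse of format. */
    static int parse(String encoded) {
        return (int) (Long.parseLong(encoded) + Integer.MIN_VALUE);
    }
}
```

The trade-off is that the indexed terms are no longer human-readable, so a stored value has to go through parse before it is displayed.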
   
- == Handling larger numbers ==
+ === Handling larger numbers ===
   
   A class that handles all possible long values is available here: http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg04790.html