Posted to dev@lucene.apache.org by kaleemxii <ka...@gmail.com> on 2012/03/26 19:41:50 UTC
Proximity search with dates and number ranges

        While working with Solr recently I ran into the following problem. My
use case is to search for dates and numbers that occur near a given word in
the text content of a document, using a complex proximity query (see
https://issues.apache.org/jira/browse/SOLR-1604) like

        Query->  *“word1  date:[ 1999-01-01T00:00:00Z TO
2011-01-01T00:00:00Z]”~5 *
                       i.e.  word1 within 5 words of a date that falls in the
range 1999 to 2011
                                      
        Proximity search is possible only if all the terms (words, numbers
and dates) are indexed in the same field, whereas range search on dates and
numbers is possible only if the dates and numbers are indexed in separate
fields (one for each), so there is a conflict here.

If everything goes into one field, the problem is the sort order of terms in
the index, which depends on the field type (lexicographic for string types).
In my case the main content of the document is indexed as a string type, so
numbers passed as strings are sorted lexicographically. For example, 12
appears before 2, which gives wrong results in range queries.
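
The ordering problem is easy to reproduce outside Solr. Below is a minimal sketch in Python (chosen only for illustration; the term values are made up) that compares string ordering with numeric ordering and shows a string range admitting the wrong terms:

```python
# Terms as a string-typed field would compare them: character by character.
terms = ["2", "12", "100", "7"]

lexicographic = sorted(terms)        # plain string comparison
numeric = sorted(terms, key=int)     # the order a numeric range query needs

print(lexicographic)
print(numeric)

# A string range [1 TO 2] wrongly admits "12" and "100" as well as "2".
in_range = [t for t in terms if "1" <= t <= "2"]
print(in_range)
```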

For dates the problem is different, because each part of the date is
tokenized and treated as an independent term.
Ex: if a doc has text like

        *“The tax for the period 2009 Jan 01 to 2010 Jan 01 is $200”*
the analyzer generates terms and stores them as
      *  01,200,2009,2010,Jan,period,tax *
Performing a date search on this kind of index is not possible because each
date is split into three separate terms.

To overcome these problems I thought of the following solutions, which I
think might be the best options:
        For Numbers:
               Problem  :  lexicographic sorting
               Solution :  Pad all the numbers that appear in the document
content with zeros to a fixed width, like
                               2 -> *#nm000000002*  (#nm is the prefix for all
numbers)
                               12-> *#nm000000012*
               Now even when they are sorted lexicographically 2 appears
before 12, hence problem solved.
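
The padding scheme can be sketched as follows. This is an illustration in Python rather than a Solr filter; the helper name `encode_number` is hypothetical, and the width of 9 digits is read off the examples in this post. It assumes non-negative integers only; negative numbers and decimals would need a different encoding.

```python
NUM_PREFIX = "#nm"   # prefix used in the post
NUM_WIDTH = 9        # digit count in the post's examples, e.g. #nm000000002

def encode_number(value: int) -> str:
    """Return the fixed-width term for a non-negative integer."""
    if value < 0 or value >= 10 ** NUM_WIDTH:
        raise ValueError("value does not fit the fixed width")
    return NUM_PREFIX + str(value).zfill(NUM_WIDTH)

values = [12, 2, 200, 7]
# String sorting of the padded terms now agrees with numeric order.
encoded = sorted(encode_number(v) for v in values)
print(encoded)
```

Any number with more digits than the chosen width breaks the ordering, so the width has to cover the largest value expected in the content.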
        
        For Dates:
               Problem  :  a date is divided into multiple tokens.
               Solution :  Recognize all the dates in the document content and
convert them to a standard format, like
                               1st Jan 2001 -> *#dt20010101*     (#dt is the
prefix for all dates and the format is yyyymmdd)
                               12-31-2010  -> *#dt20101231*
                               1999/02/21  -> *#dt19990221 *  using the
pattern replace filter
(http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceFilterFactory.html)


        This replacement is done at analysis time, before the text is passed
to the index.
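
As a rough illustration of the date rewriting, here is a sketch in Python using regular expressions. It is a stand-in for the pattern replace step, not the Solr configuration itself. The rule set and the names `RULES` and `normalize_dates` are hypothetical; it covers only the formats mentioned in this post, assumes 12-31-2010 means mm-dd-yyyy, and does not validate month or day ranges.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}
MON = "|".join(MONTHS)

def _term(year, month, day):
    # Build the #dt term in yyyymmdd form.
    return "#dt%04d%02d%02d" % (int(year), int(month), int(day))

# Each rule is (pattern, function turning a match into a #dt term).
RULES = [
    # 2009 Jan 01
    (re.compile(r"\b(\d{4}) (%s) (\d{1,2})\b" % MON, re.I),
     lambda m: _term(m.group(1), MONTHS[m.group(2).lower()], m.group(3))),
    # 1st Jan 2001
    (re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)? (%s) (\d{4})\b" % MON, re.I),
     lambda m: _term(m.group(3), MONTHS[m.group(2).lower()], m.group(1))),
    # 12-31-2010 (assumed mm-dd-yyyy)
    (re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b"),
     lambda m: _term(m.group(3), m.group(1), m.group(2))),
    # 1999/02/21
    (re.compile(r"\b(\d{4})/(\d{2})/(\d{2})\b"),
     lambda m: _term(m.group(1), m.group(2), m.group(3))),
]

def normalize_dates(text):
    """Replace every recognized date in text with its #dt term."""
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

print(normalize_dates("The tax for the period 2009 Jan 01 to 2010 Jan 01"))
```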

               Now if a doc has text as
                     *  “The tax for the period 2009 Jan 01 to 2010 Jan 01
is $200”  *
               The new text will be                           
                     *  “The tax for the period #dt20090101 to #dt20100101
is $#nm000000200” *
                And this will be indexed to produce the tokens
                       #dt20090101,#dt20100101,$#nm000000200,period,tax
               Now this index is capable of supporting complex range searches
                       like *“period [#dt2008 TO #dt2011]”~5*
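
To make the intended query semantics concrete, here is a toy in-memory check in Python. It is only an approximation: the function `near_range` is hypothetical, it splits on whitespace without removing stop words, and it treats the slop as a simple position distance, which is not exactly how Lucene computes it.

```python
def near_range(tokens, word, low, high, slop):
    """Return (word_pos, term_pos) pairs where a term with
    low <= term <= high lies within `slop` positions of `word`."""
    word_pos = [i for i, t in enumerate(tokens) if t == word]
    range_pos = [i for i, t in enumerate(tokens) if low <= t <= high]
    return [(w, r) for w in word_pos for r in range_pos
            if w != r and abs(w - r) <= slop]

# The rewritten text from the example above.
text = "The tax for the period #dt20090101 to #dt20100101 is $#nm000000200"
tokens = text.lower().split()

hits = near_range(tokens, "period", "#dt2008", "#dt2011", 5)
print(hits)
```

One thing to watch with prefix-style bounds: as a string, any full date in 2011 (for example #dt20110615) sorts after #dt2011, so the upper bound #dt2011 excludes all of 2011. A bound such as #dt20111231 would be needed to include that year.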
But here again I have a big problem with memory management. The complex
proximity queries are handled through span near queries in Lucene, which are
implemented in such a way that all the qualifying terms are loaded into
memory (for example all the dates in the index, if I want any date, as in the
query *“period [#dt1600 TO #dt2025]”~5*). That will blow up the JVM heap if I
have a big index (and I do: 2 TB).
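
The growth in qualifying terms can be illustrated with a toy model in Python. This is not Lucene's implementation, only a sketch of the idea that a range inside a span query expands to every distinct term in the range. The term dictionary here is hypothetical: one #dt term per day from 1990 to 2020.

```python
import bisect
from datetime import date, timedelta

def date_term(d):
    return "#dt" + d.strftime("%Y%m%d")

# Hypothetical term dictionary: one distinct term per day, 1990 to 2020.
start = date(1990, 1, 1)
days = (date(2020, 12, 31) - start).days + 1
term_dict = sorted(date_term(start + timedelta(n)) for n in range(days))

def expand_range(terms, low, high):
    """Return every term in the sorted list with low <= term <= high."""
    lo = bisect.bisect_left(terms, low)
    hi = bisect.bisect_right(terms, high)
    return terms[lo:hi]

narrow = expand_range(term_dict, "#dt2008", "#dt2011")
wide = expand_range(term_dict, "#dt1600", "#dt2025")
print(len(narrow), len(wide), len(term_dict))
```

In this model the wide range matches the whole dictionary, so the cost is driven by the number of distinct date terms in the index rather than by the number of documents that contain the word "period".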


So can someone help me arrive at the best implementation to solve this
problem?




