You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Paul Tyson <ph...@sbcglobal.net> on 2014/06/05 04:30:04 UTC

Indexing integer ranges for point search

I'm new to Lucene and search technology. I've read just enough to be
confused and dangerous, so please bear with me.

My documents have sets of integer ranges, like 1-3,
12-20,....13290-16509, ....

The total enumeration of ranges across all documents will be tens of
thousands. Most documents will have only a few ranges, but some will
have dozens or hundreds.

I need to search for documents with a range that includes a given
number. So in the above example, the document would match "2" or
"15001", but not "4" or "11".

Is this an appropriate use case for Lucene, and if so can someone sketch
out a solution so I can connect the dots? Or is there example code, or
documentation for this sort of thing? I've studied dynamic range facets,
but those don't seem right because I have all the ranges at index time.

Thanks,
--Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Indexing integer ranges for point search

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

It all depends on the statistics: how the ranges are correlated.  If the 
integer range is small: from 1-20000, for example, you might consider 
indexing every integer in each range as a separate value, especially if 
most documents will only have a small number of small ranges.

If there are too many integers, but the ranges have a large degree of 
overlap, you could consider taking a partition of all the integers that 
you see in the documents: a division into non-overlapping sets that can 
be union-ed to form the original intervals.  Then index ids of the 
partition-sets.  This is complicated because the partition needs to be 
updated as you add documents (since it depends on the distribution of 
the numbers among your documents), but you can also take a hybrid 
approach where you allow some degree of overlap among these sets, which 
allows for updating documents while keeping a fixed partition.

As an example, suppose you had documents A and B:

A: 1-3, 12-20, 1000-1200
B: 10-100, 900-1100

Then you could create a partition:
[ a{1-3}, b{10-11}, c{12-20}, d{21-100}, e{900-999}, f{1000-1100}, 
g{1101-1200} ]

and index each of the documents with the tokens a ... g

then when you query for an integer, you compare it against the partition 
to see which range it falls in (you could do this as a Lucene index 
lookup e.g.), and then search for that range token

The complication is that when a new document comes in,

C: 50-60

you can either update the partition and then all of the affected 
documents (B in the example above) have to be reindexed as well, or you 
can delay that and in the meantime index C with a new range [50-60] (ie 
allowing some overlap, and increasing the number of tokens you index and 
query for) until you have a chance to reduce the partition down to a 
minimal one in some periodic rebalancing.

I worked up this idea when looking at indexing access-control lists; 
users can get permission to arbitrary sets of documents - it's a 
many-many relation, basically, and we want a way of filtering by user 
access.  We found we were getting queries with a large number 
(thousands) of OR-ed terms and in some cases performance was suffering.  
The idea is that all these users' permissions, although not correlated 
by rule or by design, do in fact exhibit a high degree of correlation, 
and we can use that to reduce the size of the term lists.

In the end this seemed too complex a solution for our ACL problem, which 
we were able to solve by simply caching the slow queries as filters and 
prewarming them in the background at log in time, so I don't have any 
working code to share (just some numerical experiments), but perhaps 
this idea might help here?

-Mike

On 6/4/2014 10:30 PM, Paul Tyson wrote:
> I'm new to Lucene and search technology. I've read just enough to be
> confused and dangerous, so please bear with me.
>
> My documents have sets of integer ranges, like 1-3,
> 12-20,....13290-16509, ....
>
> The total enumeration of ranges across all documents will be tens of
> thousands. Most documents will have only a few ranges, but some will
> have dozens or hundreds.
>
> I need to search for documents with a range that includes a given
> number. So in the above example, the document would match "2" or
> "15001", but not "4" or "11".
>
> Is this an appropriate use case for Lucene, and if so can someone sketch
> out a solution so I can connect the dots? Or is there example code, or
> documentation for this sort of thing? I've studied dynamic range facets,
> but those don't seem right because I have all the ranges at index time.
>
> Thanks,
> --Paul
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Indexing integer ranges for point search

Posted by Mindaugas Žakšauskas <mi...@gmail.com>.

Hi,

Continuing your example, you could do the following:

Document:

range1_from:1
range1_to:3
range2_from:12
range2_to:20
range3_from:13290
range3_to:16509
... other fields...


Query (for "2"):

(+range1_from:[* TO 2] +range1_to:[2 TO *]) OR
 (+range2_from:[* TO 2] +range2_to:[2 TO *]) OR
 (+range3_from:[* TO 2] +range3_to:[2 TO *])

I hope I got the query right, haven't tested it. But you should get
the general idea.

m.

On Thu, Jun 5, 2014 at 3:30 AM, Paul Tyson <ph...@sbcglobal.net> wrote:
> I'm new to Lucene and search technology. I've read just enough to be
> confused and dangerous, so please bear with me.
>
> My documents have sets of integer ranges, like 1-3,
> 12-20,....13290-16509, ....
>
> The total enumeration of ranges across all documents will be tens of
> thousands. Most documents will have only a few ranges, but some will
> have dozens or hundreds.
>
> I need to search for documents with a range that includes a given
> number. So in the above example, the document would match "2" or
> "15001", but not "4" or "11".
>
> Is this an appropriate use case for Lucene, and if so can someone sketch
> out a solution so I can connect the dots? Or is there example code, or
> documentation for this sort of thing? I've studied dynamic range facets,
> but those don't seem right because I have all the ranges at index time.
>
> Thanks,
> --Paul
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org