Posted to java-user@lucene.apache.org by Tim Sturge <ts...@hi5.com> on 2008/11/07 21:26:22 UTC

Term numbering and range filtering

Hi,

I'm wondering if there is any easy technique to number the terms in an index.
(By "number" I mean: map the sequence of terms to a contiguous range of integers,
and map terms to these numbers efficiently.)

Looking at the Term class and the .tis/.tii index format it appears that the
terms are stored in an ordered and prefix-compressed format, but while there
are pointers from a term to the .frq and .prx files, neither is really
suitable as a sequence number.

The reason I have this question is that I am writing a multi-filter for
single term fields. My index contains many fields for which each document
contains a single term (e.g. date, zipcode, country) and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents)

A cached RangeFilter works well when there are a small number of potential
options (e.g. for countries) but when there are many options (consider a
date range or a set of zipcodes) there are too many potential choices to
cache each possibility and it is too inefficient to build a filter on the
fly for each query (as you have to visit 10% of documents to build the
filter despite the query itself matching 0.1%)

Therefore I was considering building an int[reader.maxDoc()] array for each
field and putting into it the term number for each document. This relies on
the fact that each document contains only a single term for this field, but
with it I should be able to quickly construct a "multi-filter" (that is,
something that iterates the array and checks that the term is in the range
or set).
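
(For illustration, a minimal sketch of the idea -- the method and names here
are hypothetical, not an existing Lucene API; it assumes the per-field ord
array has already been built:)

// Hypothetical sketch: with ords[doc] holding the term number of the
// document's single term, a range test is two int comparisons per doc.
static void collectRange(int[] ords, int lowerOrd, int upperOrd,
                         java.util.BitSet acc) {
  for (int doc = 0; doc < ords.length; doc++) {
    int ord = ords[doc];
    if (ord >= lowerOrd && ord <= upperOrd) {
      acc.set(doc);  // this doc's term falls inside the range
    }
  }
}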

Right now it looks like I can do some very ugly surgery and perhaps use the
offset to the prx file even though it is not contiguous. But I'm hoping
there is a better technique that I'm just not seeing right now.

Thanks,

Tim

Re: Term numbering and range filtering

Posted by Paul Elschot <pa...@xs4all.nl>.
Tim,

On Wednesday 19 November 2008 02:32:40, Tim Sturge wrote:
...
> >>
> >> This is less than 2x slower than the dedicated bitset and more
> >> than 50x faster than the range boolean query.
> >>
> >> Mike, Paul, I'm happy to contribute this (ugly but working) code
> >> if there is interest. Let me know and I'll open a JIRA issue for
> >> it.
> >
> > In case you think more performance improvements based on this
> > are possible...
>
> I think this is generally useful for range and set queries on
> non-text based fields (dates, location data, prices, general
> enumerations). These all have the required property that there is
> only one value (term) per document.
>
> I've opened LUCENE-1461.

I finally got the point, see my comments there.

Thanks a lot, if only to show an unexpected tradeoff possibility
opened by the new Filter api.
I don't know whether you followed LUCENE-584 (Decouple Filter
from BitSet), but a contribution like this multi range filter makes
it all worthwhile.

Regards,
Paul Elschot




Re: Term numbering and range filtering

Posted by Tim Sturge <ts...@hi5.com>.

> With "Allow Filter as clause to BooleanQuery":
> https://issues.apache.org/jira/browse/LUCENE-1345
> one could even skip the ConstantScoreQuery with this.
> Unfortunately 1345 is unfinished for now.
> 

That would be interesting; I'd like to see how much performance improves.

>> startup: 2811
>> Hits: 11156
>> first query: 1395
>> 100 queries: 441 (back down to 4.41msec per query)
>> 
>> This is less than 2x slower than the dedicated bitset and more than
>> 50x faster than the range boolean query.
>> 
>> Mike, Paul, I'm happy to contribute this (ugly but working) code if
>> there is interest. Let me know and I'll open a JIRA issue for it.
> 
> In case you think more performance improvements based on this
> are possible...

I think this is generally useful for range and set queries on non-text based
fields (dates, location data, prices, general enumerations). These all have
the required property that there is only one value (term) per document.

I've opened LUCENE-1461.

Tim

> 
> Regards,
> Paul Elschot.
> 
> 




Re: Term numbering and range filtering

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 19 November 2008 00:43:56, Tim Sturge wrote:
> I've finished a query time implementation of a column stride filter,
> which implements DocIdSetIterator. This just builds the filter at
> process start and uses it for each subsequent query. The index itself
> is unchanged.
>
> The results are very impressive. Here are the results on a 45M
> document index:
>
> Firstly without an age constraint as a baseline:
>
> Query "+name:tim"
> startup: 0
> Hits: 15089
> first query: 1004
> 100 queries: 132 (1.32 msec per query)
>
> Now with a cached filter. This is ideal from a speed standpoint but
> there are too many possible start/end combinations to cache all the
> filters.
>
> Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on cached
> RangeFilter) startup: 3
> Hits: 11156
> first query: 1830
> 100 queries: 287 (2.87 msec per query)
>
> Now with an uncached filter. This is awful.
>
> Query "+name:tim age:[18 TO 35]" (uncached ConstantScoreRangeQuery)
> startup: 3
> Hits: 11156
> first query: 1665
> 100 queries: 51862 (yes, 518 msec per query, 200x slower)
>
> A RangeQuery is slightly better but still bad (and has a different
> result set)
>
> Query "+name:tim age:[18 TO 35]" (uncached RangeQuery)
> startup: 0
> Hits: 10147
> first query: 1517
> 100 queries: 27157 (271 msec is 100x slower than the filter)
>
> Now with the prebuilt column stride filter:
>
> Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on prebuilt
> column stride filter)

With "Allow Filter as clause to BooleanQuery":
https://issues.apache.org/jira/browse/LUCENE-1345
one could even skip the ConstantScoreQuery with this.
Unfortunately 1345 is unfinished for now.

> startup: 2811
> Hits: 11156
> first query: 1395
> 100 queries: 441 (back down to 4.41msec per query)
>
> This is less than 2x slower than the dedicated bitset and more than
> 50x faster than the range boolean query.
>
> Mike, Paul, I'm happy to contribute this (ugly but working) code if
> there is interest. Let me know and I'll open a JIRA issue for it.

In case you think more performance improvements based on this
are possible...

Regards,
Paul Elschot.



Re: Term numbering and range filtering

Posted by Tim Sturge <ts...@hi5.com>.
I've finished a query time implementation of a column stride filter, which
implements DocIdSetIterator. This just builds the filter at process start
and uses it for each subsequent query. The index itself is unchanged.
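
(A rough sketch of the shape of such a filter -- not the actual contributed
code, and the class name is made up; it assumes the per-doc ord array is
filled at process start, against Lucene 2.4's DocIdSetIterator contract of
doc(), next() and skipTo():)

// Sketch only: iterate docs whose term number falls in [lower, upper].
class OrdRangeIterator extends org.apache.lucene.search.DocIdSetIterator {
  private final int[] ords;        // per-doc term number, built at startup
  private final int lower, upper;  // inclusive ord range
  private int doc = -1;

  OrdRangeIterator(int[] ords, int lower, int upper) {
    this.ords = ords; this.lower = lower; this.upper = upper;
  }

  public int doc() { return doc; }

  public boolean next() {
    while (++doc < ords.length) {
      int ord = ords[doc];
      if (ord >= lower && ord <= upper) return true;  // in range: a match
    }
    return false;
  }

  public boolean skipTo(int target) {
    doc = target - 1;  // resume the linear scan from target
    return next();
  }
}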

The results are very impressive. Here are the results on a 45M document
index:

Firstly without an age constraint as a baseline:

Query "+name:tim" 
startup: 0 
Hits: 15089
first query: 1004
100 queries: 132 (1.32 msec per query)

Now with a cached filter. This is ideal from a speed standpoint but there
are too many possible start/end combinations to cache all the filters.

Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on cached RangeFilter)
startup: 3
Hits: 11156
first query: 1830
100 queries: 287 (2.87 msec per query)

Now with an uncached filter. This is awful.

Query "+name:tim age:[18 TO 35]" (uncached ConstantScoreRangeQuery)
startup: 3
Hits: 11156
first query: 1665
100 queries: 51862 (yes, 518 msec per query, 200x slower)

A RangeQuery is slightly better but still bad (and has a different result
set)

Query "+name:tim age:[18 TO 35]" (uncached RangeQuery)
startup: 0
Hits: 10147
first query: 1517
100 queries: 27157 (271 msec is 100x slower than the filter)

Now with the prebuilt column stride filter:

Query "+name:tim age:[18 TO 35]" (ConstantScoreQuery on prebuilt column
stride filter)
startup: 2811
Hits: 11156
first query: 1395
100 queries: 441 (back down to 4.41msec per query)

This is less than 2x slower than the dedicated bitset and more than 50x
faster than the range boolean query.

Mike, Paul, I'm happy to contribute this (ugly but working) code if there is
interest. Let me know and I'll open a JIRA issue for it.

Tim


On 11/11/08 1:27 PM, "Michael McCandless" <lu...@mikemccandless.com> wrote:

> 
> Paul Elschot wrote:
> 
>> On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote:
>>> Also, one nice optimization we could do with the "term number column-
>>> stride array" is do bit packing (borrowing from the PFOR code)
>>> dynamically.
>>> 
>>> Ie since we know there are X unique terms in this segment, when
>>> populating the array that maps docID to term number we could use
>>> exactly the right number of bits.  Enumerated fields with not many
>>> unique values (eg, country, state) would take relatively little RAM.
>>> With LUCENE-1231, where the fields are stored column stride on disk,
>>> we could do this packing during index such that loading at search
>>> time is very fast.
>> 
>> Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410.
>> I think what you're referring to is PDICT, which has frame exceptions
>> for values that occur infrequently.
> 
> Yes let's move the discussion to Jira.
> 
> Actually I was referring to simple bit-packing.
> 
> For encoding an array of compact enum terms (eg city, state, color, zip)
> I'm guessing the exceptions logic won't buy us much and would hurt
> seeking needed for column-stride fields.  But we should certainly test
> it.
> 
> Mike
> 
> 




Re: Term numbering and range filtering

Posted by Michael McCandless <lu...@mikemccandless.com>.
Paul Elschot wrote:

> On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote:
>> Also, one nice optimization we could do with the "term number column-
>> stride array" is do bit packing (borrowing from the PFOR code)
>> dynamically.
>>
>> Ie since we know there are X unique terms in this segment, when
>> populating the array that maps docID to term number we could use
>> exactly the right number of bits.  Enumerated fields with not many
>> unique values (eg, country, state) would take relatively little RAM.
>> With LUCENE-1231, where the fields are stored column stride on disk,
>> we could do this packing during index such that loading at search
>> time is very fast.
>
> Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410.
> I think what you're referring to is PDICT, which has frame exceptions
> for values that occur infrequently.

Yes let's move the discussion to Jira.

Actually I was referring to simple bit-packing.

For encoding an array of compact enum terms (eg city, state, color, zip)  
I'm guessing the exceptions logic won't buy us much and would hurt  
seeking needed for column-stride fields.  But we should certainly test  
it.

Mike



Re: Term numbering and range filtering

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 11 November 2008 21:55:45, Michael McCandless wrote:
> Also, one nice optimization we could do with the "term number column-
> stride array" is do bit packing (borrowing from the PFOR code)
> dynamically.
>
> Ie since we know there are X unique terms in this segment, when
> populating the array that maps docID to term number we could use
> exactly the right number of bits.  Enumerated fields with not many
> unique values (eg, country, state) would take relatively little RAM.
> With LUCENE-1231, where the fields are stored column stride on disk,
> we could do this packing during index such that loading at search
> time is very fast.

Perhaps we'd better continue this at LUCENE-1231 or LUCENE-1410.
I think what you're referring to is PDICT, which has frame exceptions
for values that occur infrequently.

Regards,
Paul Elschot


>
> Mike
>
> Paul Elschot wrote:
> > On Tuesday 11 November 2008 11:29:27, Michael McCandless wrote:
> >> The other part of your proposal was to somehow "number" term text
>> such that term range comparisons can be implemented as fast int
>> comparisons.
> >
> > ...
> >
> >>   http://fontoura.org/papers/paramsearch.pdf
> >>
> >> However that'd be quite a bit deeper change to Lucene.
> >
> > The cheap version is hierarchical prefixing here:
> >
> > http://wiki.apache.org/jakarta-lucene/DateRangeQueries
> >
> > Regards,
> > Paul Elschot
> >
>





Re: Term numbering and range filtering

Posted by Michael McCandless <lu...@mikemccandless.com>.
Also, one nice optimization we could do with the "term number
column-stride array" is to do bit packing (borrowing from the PFOR
code) dynamically.

Ie since we know there are X unique terms in this segment, when  
populating the array that maps docID to term number we could use  
exactly the right number of bits.  Enumerated fields with not many  
unique values (eg, country, state) would take relatively little RAM.   
With LUCENE-1231, where the fields are stored column stride on disk,  
we could do this packing during index such that loading at search time  
is very fast.
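
(A small sketch of the packing step being described -- plain Java, with
invented names; reading the values back out is omitted:)

// Pack each ord into exactly ceil(log2(numUniqueTerms)) bits of a long[].
static long[] pack(int[] ords, int numUniqueTerms) {
  int bits = Math.max(1, 32 - Integer.numberOfLeadingZeros(numUniqueTerms - 1));
  long[] packed = new long[(int) (((long) ords.length * bits + 63) / 64)];
  for (int doc = 0; doc < ords.length; doc++) {
    long bitPos = (long) doc * bits;
    int idx = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    packed[idx] |= ((long) ords[doc]) << shift;
    if (shift + bits > 64) {  // value straddles two longs
      packed[idx + 1] |= ((long) ords[doc]) >>> (64 - shift);
    }
  }
  return packed;
}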

Mike

Paul Elschot wrote:

> On Tuesday 11 November 2008 11:29:27, Michael McCandless wrote:
>>
>> The other part of your proposal was to somehow "number" term text
>> such that term range comparisons can be implemented as fast int
>> comparisons.
> ...
>>
>>   http://fontoura.org/papers/paramsearch.pdf
>>
>> However that'd be quite a bit deeper change to Lucene.
>
> The cheap version is hierarchical prefixing here:
>
> http://wiki.apache.org/jakarta-lucene/DateRangeQueries
>
> Regards,
> Paul Elschot
>
>




Re: Term numbering and range filtering

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 11 November 2008 11:29:27, Michael McCandless wrote:
>
> The other part of your proposal was to somehow "number" term text
> such that term range comparisons can be implemented as fast int
> comparisons.
...
>
>    http://fontoura.org/papers/paramsearch.pdf
>
> However that'd be quite a bit deeper change to Lucene.

The cheap version is hierarchical prefixing here:

http://wiki.apache.org/jakarta-lucene/DateRangeQueries

Regards,
Paul Elschot



Re: Term numbering and range filtering

Posted by Michael McCandless <lu...@mikemccandless.com>.
The other part of your proposal was to somehow "number" term text such
that term range comparisons can be implemented as fast int comparisons.

I like the idea of building dynamic filters on top of a
"column-stride" array of field values.  You could extend it to be a
real Scorer, too.  EG I could imagine holding a "last changed"
column-stride array, and then building a Scorer on top of that which
returns a "recency score", and then using function query to multiply
that with the actual relevance score.
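
(To make that concrete -- one hypothetical shape for such a per-doc
recency value, independent of any Lucene API; the half-life knob is
invented for the example:)

// Sketch: recency score from a column-stride array of timestamps.
static float recencyScore(long[] lastChanged, int doc,
                          long nowMillis, double halfLifeMillis) {
  long age = Math.max(0L, nowMillis - lastChanged[doc]);
  // Exponential decay: 1.0 for just-changed docs, 0.5 after one half-life.
  return (float) Math.pow(0.5, age / halfLifeMillis);
}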

Maybe the right place for this to land is function queries, since they
are already creating queries based on DocValues (= a column-stride
store)?  But your change would extend function queries to also be able
to do filtering based on the DocValues (today function queries can
only alter the score, I think).

However, this approach is still a linear O(maxDocID) search.  It
should have a very low constant in front, since 1) it's all in RAM and
2) if you number term-text then you're using much faster int
comparison, not String comparison.  Also, once LUCENE-1231
(column-stride fields) is in, loading your column-stride array should
be much faster than it is today with FieldCache.

Yet another option, which gives faster-than-linear performance for
term range searching, is the set of ideas in this paper for enabling
efficient range searching:

   http://fontoura.org/papers/paramsearch.pdf

However that'd be quite a bit deeper change to Lucene.

Mike

Tim Sturge wrote:

> Reading this I realize how unclear it is, so let me give a concrete  
> example:
>
> I want to do a search restricting users by age range. So someone can  
> ask for
> the users 18-35, 40-60 etc.
>
> Here are the options I considered:
>
> 1) construct a RangeQuery. This is a 20-40 clause boolean subquery  
> in an
> otherwise 1 or 2 clause query and I'd like to avoid that. It also has
> scoring artifacts I wish to avoid (I don't want users to rank higher  
> just
> because we have fewer users of that particular age).
>
> 2) construct a ConstantScoreRangeQuery. Then I'm forced to iterate  
> all the
> users in the age range for each query. This was cost-prohibitive.
>
> 3) cache filters for each age range. Problem is there are 50  
> starting points
> and 50 ending points and caching 2500 filters is unrealistic.
>
>> 4) cache filters for each age, and OR them together for each query  
> (there's
> stuff in contrib that does this.) This is best so far, but does  
> require
> caching 50 filters and doing 20 bitset ors per query on an 8MB bitset.
>
> So what I thought would be interesting is
>
> 5) build an array of bytes where the n-th byte contains the age of  
> user n.
> Given the range it's fairly trivial to make this behave like a  
> filter (ie
> it's relatively easy to implement next() and skipTo() efficiently and
> trivial to decide if a document is in the range or not.)
>
> Then I realized this approach would make sense not just for ages,  
> but also
> for countries, date ranges and zip code sets, so I thought I'd ask  
> if anyone
> had tried it before.
>
> Part of me assumes that someone must have done this already; so either
> there's an implementation out there already or there's a good reason  
> I don't
> see that this is entirely impractical. So I'm interested to get  
> feedback.
>
> Tim
>
>
>
>
> On 11/10/08 2:26 PM, "Tim Sturge" <ts...@hi5.com> wrote:
>
>> I think we've gone around in a loop here. It's exactly due to the  
>> inadequacy
>> of cached filters that I'm considering what I'm doing.
>>
>> Here's the section from my first email that is most illuminating:
>> "
>> The reason I have this question is that I am writing a multi-filter  
>> for single
>> term fields. My index contains many fields for which each document  
>> contains a
>> single term (e.g. date, zipcode, country) and I need to perform  
>> range queries
>> or set matches over these fields, many of which are very inclusive  
>> (they match
>> >10% of the total documents)
>>
>> A cached RangeFilter works well when there are a small number of  
>> potential
>> options (e.g. for countries) but when there are many options  
>> (consider a date
>> range or a set of zipcodes) there are too many potential choices to  
>> cache each
>> possibility and it is too inefficient to build a filter on the fly  
>> for each
>> query (as you have to visit 10% of documents to build the filter  
>> despite the
>> query itself matching 0.1%)
>>
>> Therefore I was considering building an int[reader.maxDoc()] array  
>> for each
>> field and putting into it the term number for each document. This  
>> relies on
>> the fact that each document contains only a single term for this  
>> field, but
>> with it I should be able to quickly construct a “multi- 
>> filter” (that is,
>> something that iterates the array and checks that the term is in  
>> the range or
>> set).
>> "
>>
>> Does this help explain my rationale? The reason I'm posting here is  
>> that I
>> imagine there are lots of people with this issue. In particular  
>> date ranges
>> seem to be something that lots of people use but Lucene implements  
>> fairly
>> poorly.
>>
>> Tim
>>
>> On 11/10/08 1:58 PM, "Paul Elschot" <pa...@xs4all.nl> wrote:
>>
>>> Op Monday 10 November 2008 22:21:20 schreef Tim Sturge:
>>>> Hmmm -- I hadn't thought about that so I took a quick look at the
>>>> term vector support.
>>>>
>>>> What I'm really looking for is a compact but performant
>>>> representation of a set of filters on the same (one term field).
>>>> Using term vectors would mean an algorithm similar to:
>>>>
>>>> String myfield;
>>>> String myterm;
>>>> TermFreqVector tv;
>>>> for (int i = 0; i < reader.maxDoc(); i++) {
>>>>    tv = reader.getTermFreqVector(i, myfield);
>>>>    if (tv.indexOf(myterm) != -1) {
>>>>          // include this doc...
>>>>        }
>>>> }
>>>>
>>>> The key thing I am looking to achieve here is performance  
>>>> comparable
>>>> to filters. I suspect getTermFreqVector() is not efficient enough  
>>>> but
>>>> I'll give it a try.
>>>>
>>>
>>> Better use a TermDocs on myterm for this, have a look at the code of
>>> RangeFilter.
>>>
>>> Filters are normally created from a slower query by setting a bit  
>>> in an
>>> OpenBitSet at "include this doc". Then they are reused for their  
>>> speed.
>>>
>>> Filter caching could help. In case memory becomes a problem
>>> and the filters are sparse enough, try and use SortedVIntList
>>> as the underlying data structure in the cache. (Sparse enough means
>>> less than 1 in 8 of all docs available in the index reader.)
>>> See also LUCENE-1296 for caching another data structure than the
>>> one used to collect the filtered docs.
>>>
>>> Regards,
>>> Paul Elschot
>>>
>>>
>
>
>




Re: Term numbering and range filtering

Posted by Tim Sturge <ts...@hi5.com>.
Reading this I realize how unclear it is, so let me give a concrete example:

I want to do a search restricting users by age range. So someone can ask for
the users 18-35, 40-60 etc.

Here are the options I considered:

1) construct a RangeQuery. This is a 20-40 clause boolean subquery in an
otherwise 1 or 2 clause query and I'd like to avoid that. It also has
scoring artifacts I wish to avoid (I don't want users to rank higher just
because we have fewer users of that particular age).

2) construct a ConstantScoreRangeQuery. Then I'm forced to iterate all the
users in the age range for each query. This was cost-prohibitive.

3) cache filters for each age range. Problem is there are 50 starting points
and 50 ending points and caching 2500 filters is unrealistic.

4) cache filters for each age, and OR them together for each query (there's
stuff in contrib that does this.) This is best so far, but does require
caching 50 filters and doing 20 bitset ors per query on an 8MB bitset.

So what I thought would be interesting is

5) build an array of bytes where the n-th byte contains the age of user n.
Given the range it's fairly trivial to make this behave like a filter (ie
it's relatively easy to implement next() and skipTo() efficiently and
trivial to decide if a document is in the range or not.)

Then I realized this approach would make sense not just for ages, but also
for countries, date ranges and zip code sets, so I thought I'd ask if anyone
had tried it before.

Part of me assumes that someone must have done this already; so either
there's an implementation out there already or there's a good reason I don't
see that this is entirely impractical. So I'm interested to get feedback.

Tim




On 11/10/08 2:26 PM, "Tim Sturge" <ts...@hi5.com> wrote:

> I think we've gone around in a loop here. It's exactly due to the inadequacy
> of cached filters that I'm considering what I'm doing.
> 
> Here's the section from my first email that is most illuminating:
> "
> The reason I have this question is that I am writing a multi-filter for single
> term fields. My index contains many fields for which each document contains a
> single term (e.g. date, zipcode, country) and I need to perform range queries
> or set matches over these fields, many of which are very inclusive (they match
> >10% of the total documents)
> 
> A cached RangeFilter works well when there are a small number of potential
> options (e.g. for countries) but when there are many options (consider a date
> range or a set of zipcodes) there are too many potential choices to cache each
> possibility and it is too inefficient to build a filter on the fly for each
> query (as you have to visit 10% of documents to build the filter despite the
> query itself matching 0.1%)
> 
> Therefore I was considering building an int[reader.maxDoc()] array for each
> field and putting into it the term number for each document. This relies on
> the fact that each document contains only a single term for this field, but
> with it I should be able to quickly construct a "multi-filter" (that is,
> something that iterates the array and checks that the term is in the range or
> set).
> "
> 
> Does this help explain my rationale? The reason I'm posting here is that I
> imagine there are lots of people with this issue. In particular date ranges
> seem to be something that lots of people use but Lucene implements fairly
> poorly.
> 
> Tim
> 
> On 11/10/08 1:58 PM, "Paul Elschot" <pa...@xs4all.nl> wrote:
> 
>> On Monday 10 November 2008 22:21:20, Tim Sturge wrote:
>>> Hmmm -- I hadn't thought about that so I took a quick look at the
>>> term vector support.
>>> 
>>> What I'm really looking for is a compact but performant
>>> representation of a set of filters on the same (one term field).
>>> Using term vectors would mean an algorithm similar to:
>>> 
>>> String myfield;
>>> String myterm;
>>> TermFreqVector tv;
>>> for (int i = 0; i < reader.maxDoc(); i++) {
>>>     tv = reader.getTermFreqVector(i, myfield);
>>>     if (tv.indexOf(myterm) != -1) {
>>>           // include this doc...
>>>         }
>>> }
>>> 
>>> The key thing I am looking to achieve here is performance comparable
>>> to filters. I suspect getTermFreqVector() is not efficient enough but
>>> I'll give it a try.
>>> 
>> 
>> Better use a TermDocs on myterm for this, have a look at the code of
>> RangeFilter.
>> 
>> Filters are normally created from a slower query by setting a bit in an
>> OpenBitSet at "include this doc". Then they are reused for their speed.
>> 
>> Filter caching could help. In case memory becomes a problem
>> and the filters are sparse enough, try and use SortedVIntList
>> as the underlying data structure in the cache. (Sparse enough means
>> less than 1 in 8 of all docs available in the index reader.)
>> See also LUCENE-1296 for caching another data structure than the
>> one used to collect the filtered docs.
>> 
>> Regards,
>> Paul Elschot
>> 
>> 




Re: Term numbering and range filtering

Posted by Tim Sturge <ts...@hi5.com>.
I think we've gone around in a loop here. It's exactly due to the inadequacy
of cached filters that I'm considering what I'm doing.

Here's the section from my first email that is most illuminating:
"
The reason I have this question is that I am writing a multi-filter for
single term fields. My index contains many fields for which each document
contains a single term (e.g. date, zipcode, country) and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match >10% of the total documents)

A cached RangeFilter works well when there are a small number of potential
options (e.g. for countries) but when there are many options (consider a
date range or a set of zipcodes) there are too many potential choices to
cache each possibility and it is too inefficient to build a filter on the
fly for each query (as you have to visit 10% of documents to build the
filter despite the query itself matching 0.1%)

Therefore I was considering building an int[reader.maxDoc()] array for each
field and putting into it the term number for each document. This relies on
the fact that each document contains only a single term for this field, but
with it I should be able to quickly construct a "multi-filter" (that is,
something that iterates the array and checks that the term is in the range
or set).
"

Does this help explain my rationale? The reason I'm posting here is that I
imagine there are lots of people with this issue. In particular date ranges
seem to be something that lots of people use but Lucene implements fairly
poorly.

Tim

On 11/10/08 1:58 PM, "Paul Elschot" <pa...@xs4all.nl> wrote:

> On Monday 10 November 2008 22:21:20, Tim Sturge wrote:
>> Hmmm -- I hadn't thought about that so I took a quick look at the
>> term vector support.
>> 
>> What I'm really looking for is a compact but performant
>> representation of a set of filters on the same (one term field).
>> Using term vectors would mean an algorithm similar to:
>> 
>> String myfield;
>> String myterm;
>> TermFreqVector tv;
>> for (int i = 0; i < reader.maxDoc(); i++) {
>>     tv = reader.getTermFreqVector(i, myfield);
>>     if (tv.indexOf(myterm) != -1) {
>>           // include this doc...
>>         }
>> }
>> 
>> The key thing I am looking to achieve here is performance comparable
>> to filters. I suspect getTermFreqVector() is not efficient enough but
>> I'll give it a try.
>> 
> 
> Better use a TermDocs on myterm for this, have a look at the code of
> RangeFilter.
> 
> Filters are normally created from a slower query by setting a bit in an
> OpenBitSet at "include this doc". Then they are reused for their speed.
> 
> Filter caching could help. In case memory becomes a problem
> and the filters are sparse enough, try and use SortedVIntList
> as the underlying data structure in the cache. (Sparse enough means
> less than 1 in 8 of all docs available in the index reader.)
> See also LUCENE-1296 for caching another data structure than the
> one used to collect the filtered docs.
> 
> Regards,
> Paul Elschot
> 
> 




Re: Term numbering and range filtering

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 10 November 2008 22:21:20, Tim Sturge wrote:
> Hmmm -- I hadn't thought about that so I took a quick look at the
> term vector support.
>
> What I'm really looking for is a compact but performant
> representation of a set of filters on the same (one term field).
> Using term vectors would mean an algorithm similar to:
>
> String myfield;
> String myterm;
> TermFreqVector tv;
> for (int i = 0; i < reader.maxDoc(); i++) {
>     tv = reader.getTermFreqVector(i, myfield);
>     if (tv.indexOf(myterm) != -1) {
>           // include this doc...
>         }
> }
>
> The key thing I am looking to achieve here is performance comparable
> to filters. I suspect getTermFreqVector() is not efficient enough but
> I'll give it a try.
>

Better use a TermDocs on myterm for this; have a look at the code of
RangeFilter.

Filters are normally created from a slower query by setting a bit in an 
OpenBitSet at "include this doc". Then they are reused for their speed.

Filter caching could help. In case memory becomes a problem
and the filters are sparse enough, try and use SortedVIntList
as the underlying data structure in the cache. (Sparse enough means
less than 1 in 8 of all docs available in the index reader.)
See also LUCENE-1296 for caching another data structure than the
one used to collect the filtered docs.
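
(Roughly what that looks like with a TermDocs and an OpenBitSet -- a
sketch along the lines of RangeFilter's code, using the 2.4-era APIs:)

// Collect all docs containing one term into a reusable OpenBitSet.
static OpenBitSet termBits(IndexReader reader, String field, String text)
    throws IOException {
  OpenBitSet bits = new OpenBitSet(reader.maxDoc());
  TermDocs termDocs = reader.termDocs(new Term(field, text));
  try {
    while (termDocs.next()) {
      bits.set(termDocs.doc());  // "include this doc"
    }
  } finally {
    termDocs.close();
  }
  return bits;  // cache this and reuse it for its speed
}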

Regards,
Paul Elschot



Re: Term numbering and range filtering

Posted by Tim Sturge <ts...@hi5.com>.
Hmmm -- I hadn't thought about that so I took a quick look at the term
vector support. 

What I'm really looking for is a compact but performant representation of
a set of filters on the same (one term field). Using term vectors would
mean an algorithm similar to:

String myfield;
String myterm;
TermFreqVector tv;
for (int i = 0; i < reader.maxDoc(); i++) {
    tv = reader.getTermFreqVector(i, myfield);
    if (tv != null && tv.indexOf(myterm) != -1) {
        // include this doc...
    }
}

The key thing I am looking to achieve here is performance comparable to
filters. I suspect getTermFreqVector() is not efficient enough but I'll give
it a try.

Tim


On 11/10/08 10:54 AM, "Paul Elschot" <pa...@xs4all.nl> wrote:

> Tim,
> 
> I didn't follow all the details, so this may be somewhat off,
> but did you consider using TermVectors?
> 
> Regards,
> Paul Elschot
> 
> 
> On Monday 10 November 2008 19:18:38, Tim Sturge wrote:
>> Yes, that is a significant issue. What I'm coming to realize is that
>> either I will end up with something like
>> 
>> class MultiFilter {
>>    String field;
>>    private int[] termInDoc;
>>    Map<Term,Integer> termToInt;
>>    ...
>> }
>> 
>> which can be entirely built on the current lucene APIs but has
>> significantly more overhead (the termToInt mapping in particular and
>> the need to construct the mapping and array on startup)
>> 
>> 
>> Or I can go deep into the guts and add a data file per-segment with a
>> format something like
>> 
>> int version
>> int numFields
>> (int fieldNum, long offset) ^ numFields
>> (int termForDoc) ^ (maxDocs * numFields)
>> 
>> and add something to FieldInfo  like  boolean storeMultiFilter;
>> and FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to
>> add an int termNum to the .tis file as well.
>> 
>> This is clearly a lot more work than the first solution, but it is a
>> lot nicer to deal with as well. Is this interesting to anyone other
>> than me?
>> 
>> Tim
>> 
>> On 11/9/08 12:23 PM, "Michael McCandless" <lu...@mikemccandless.com>
> wrote:
>>> Conceivably, TermInfosReader could track the sequence number of
>>> each term.
>>> 
>>> A seek/skipTo would know which sequence number it just jumped to,
>>> because the index is regular (every 128 terms by default), and then
>>> each next() call could increment that.  Then retrieving this number
>>> would be as costly as calling eg IndexReader.docFreq(Term) is now.
>>> 
>>> But I'm not sure how a multi-segment index  would work, ie how
>>> would MultiSegmentReader compute this for its terms?  Or maybe
>>> you'd just do this per-segment?
>>> 
>>> Mike
>>> 
>>> Tim Sturge wrote:
>>>> Hi,
>>>> 
>>>> I'm wondering if there is any easy technique to number the terms
>>>> in an index
>>>> (By number I mean map a sequence of terms to a contiguous range of
>>>> integers
>>>> and map terms to these numbers efficiently)
>>>> 
>>>> Looking at the Term class and the .tis/.tii index format it
>>>> appears that the
>>>> terms are stored in an ordered and prefix-compressed format, but
>>>> while there
>>>> are pointers from a term to the .frq and .prx files, neither is
>>>> really suitable as a sequence number.
>>>> 
>>>> The reason I have this question is that I am writing a
>>>> multi-filter for
>>>> single term fields. My index contains many fields for which each
>>>> document
>>>> contains a single term (e.g. date, zipcode, country) and I need to
>>>> perform
>>>> range queries or set matches over these fields, many of which are
>>>> very inclusive (they match >10% of the total documents)
>>>> 
>>>> A cached RangeFilter works well when there are a small number of
>>>> potential
>>>> options (e.g. for countries) but when there are many options
>>>> (consider a
>>>> date range or a set of zipcodes) there are too many potential
>>>> choices to
>>>> cache each possibility and it is too inefficient to build a filter
>>>> on the
>>>> fly for each query (as you have to visit 10% of documents to build
>>>> the filter despite the query itself matching 0.1%)
>>>> 
>>>> Therefore I was considering building an int[reader.maxDoc()] array
>>>> for each
>>>> field and putting into it the term number for each document. This
>>>> relies on
>>>> the fact that each document contains only a single term for this
>>>> field, but
>>>> with it I should be able to quickly construct a "multi-filter"
>>>> (that is,
>>>> something that iterates the array and checks that the term is in
>>>> the range
>>>> or set).
>>>> 
>>>> Right now it looks like I can do some very ugly surgery and
>>>> perhaps use the
>>>> offset to the prx file even though it is not contiguous. But I'm
>>>> hoping
>>>> there is a better technique that I'm just not seeing right now.
>>>> 
>>>> Thanks,
>>>> 
>>>> Tim
>>> 
>> 
> 
> 
> 
> 




Re: Term numbering and range filtering

Posted by Paul Elschot <pa...@xs4all.nl>.
Tim,

I didn't follow all the details, so this may be somewhat off,
but did you consider using TermVectors?

Regards,
Paul Elschot


On Monday 10 November 2008 19:18:38, Tim Sturge wrote:
> Yes, that is a significant issue. What I'm coming to realize is that
> either I will end up with something like
>
> class MultiFilter {
>    String field;
>    private int[] termInDoc;
>    Map<Term,Integer> termToInt;
>    ...
> }
>
> which can be entirely built on the current lucene APIs but has
> significantly more overhead (the termToInt mapping in particular and
> the need to construct the mapping and array on startup)
>
>
> Or I can go deep into the guts and add a data file per-segment with a
> format something like
>
> int version
> int numFields
> (int fieldNum, long offset) ^ numFields
> (int termForDoc) ^ (maxDocs * numFields)
>
> and add something to FieldInfo  like  boolean storeMultiFilter;
> and FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to
> add an int termNum to the .tis file as well.
>
> This is clearly a lot more work than the first solution, but it is a
> lot nicer to deal with as well. Is this interesting to anyone other
> than me?
>
> Tim
>
> On 11/9/08 12:23 PM, "Michael McCandless" <lu...@mikemccandless.com> 
wrote:
> > Conceivably, TermInfosReader could track the sequence number of
> > each term.
> >
> > A seek/skipTo would know which sequence number it just jumped to,
> > because the index is regular (every 128 terms by default), and then
> > each next() call could increment that.  Then retrieving this number
> > would be as costly as calling eg IndexReader.docFreq(Term) is now.
> >
> > But I'm not sure how a multi-segment index  would work, ie how
> > would MultiSegmentReader compute this for its terms?  Or maybe
> > you'd just do this per-segment?
> >
> > Mike
> >
> > Tim Sturge wrote:
> >> Hi,
> >>
> >> I'm wondering if there is any easy technique to number the terms
> >> in an index
> >> (By number I mean map a sequence of terms to a contiguous range of
> >> integers
> >> and map terms to these numbers efficiently)
> >>
> >> Looking at the Term class and the .tis/.tii index format it
> >> appears that the
> >> terms are stored in an ordered and prefix-compressed format, but
> >> while there
> >> are pointers from a term to the .frq and .prx files, neither is
> >> really suitable as a sequence number.
> >>
> >> The reason I have this question is that I am writing a
> >> multi-filter for
> >> single term fields. My index contains many fields for which each
> >> document
> >> contains a single term (e.g. date, zipcode, country) and I need to
> >> perform
> >> range queries or set matches over these fields, many of which are
> >> very inclusive (they match >10% of the total documents)
> >>
> >> A cached RangeFilter works well when there are a small number of
> >> potential
> >> options (e.g. for countries) but when there are many options
> >> (consider a
> >> date range or a set of zipcodes) there are too many potential
> >> choices to
> >> cache each possibility and it is too inefficient to build a filter
> >> on the
> >> fly for each query (as you have to visit 10% of documents to build
> >> the filter despite the query itself matching 0.1%)
> >>
> >> Therefore I was considering building an int[reader.maxDoc()] array
> >> for each
> >> field and putting into it the term number for each document. This
> >> relies on
> >> the fact that each document contains only a single term for this
> >> field, but
> >> with it I should be able to quickly construct a "multi-filter"
> >> (that is,
> >> something that iterates the array and checks that the term is in
> >> the range
> >> or set).
> >>
> >> Right now it looks like I can do some very ugly surgery and
> >> perhaps use the
> >> offset to the prx file even though it is not contiguous. But I'm
> >> hoping
> >> there is a better technique that I'm just not seeing right now.
> >>
> >> Thanks,
> >>
> >> Tim
> >
>





Re: Term numbering and range filtering

Posted by Michael McCandless <lu...@mikemccandless.com>.
It seems like for many of your examples (age, zip code, country),
simply computing & storing the mapping yourself (your first option
below) would actually be viable?

Also: I think in fact you never need to merge the term numbering for
many segments during searching?  Ie, the search runs one IndexReader
at a time, so your term numbering only needs to run in the context of
a single IndexReader?

If merging-for-searching is not necessary then I think this ("get term
number" for a single segment) is a straightforward extension to
TermInfosReader and then you can accomplish forward (Term -> number)
mapping via TermInfosReader without holding an all-in-RAM map (it
will cost you a seek to look up the number of a term though, but, seek
will soonish be much lower cost, with SSDs, so...).

Also, if the reverse mapping is ever needed (I think it may not be?)
TermInfosReader could also be extended to seek by term number.

And... the way FieldCache API works now is actually good for your use
case, because it enums the Terms in order, and then for each Term
enums all docs having that Term.  You could extend that to increment
the term counter as it's loading.
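
(That loading pattern, sketched against the 2.4 APIs -- ords are assigned
in term order, one value per doc, assuming a single-valued field:)

// Walk terms in sorted order, assigning increasing ords and recording
// each term's ord for every doc that contains it (like FieldCache does).
static int[] loadOrds(IndexReader reader, String field) throws IOException {
  field = field.intern();                 // enables the fast != test below
  int[] ords = new int[reader.maxDoc()];  // 0 means "no value"
  TermEnum termEnum = reader.terms(new Term(field, ""));
  TermDocs termDocs = reader.termDocs();
  int ord = 0;
  try {
    do {
      Term term = termEnum.term();
      if (term == null || term.field() != field) break;
      ord++;                              // terms arrive in sorted order
      termDocs.seek(termEnum);
      while (termDocs.next()) {
        ords[termDocs.doc()] = ord;       // single term per doc assumed
      }
    } while (termEnum.next());
  } finally {
    termDocs.close();
    termEnum.close();
  }
  return ords;
}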

Whereas LUCENE-1231 (column stride fields) wouldn't initially make
this ("translate term text to term number while loading") possible,
though maybe we should in fact make that an option (which is I think
the same as your 2nd option below).

So I think the separable idea of using term numbers, both in the index
format when storing column-stride fields, and then in RAM for fast
RAM-based range filtering/scoring, makes a lot of sense.

It will get tricky when you need a different Collator than simple Java
String comparisons, but LUCENE-1435 is already working towards letting
you use a different Collator per-field.

Mike

Tim Sturge wrote:

> Yes, that is a significant issue. What I'm coming to realize is that  
> either
> I will end up with something like
>
> class MultiFilter {
>   String field;
>   private int[] termInDoc;
>   Map<Term,Integer> termToInt;
>   ...
> }
>
> which can be entirely built on the current lucene APIs but has  
> significantly
> more overhead (the termToInt mapping in particular and the need to  
> construct
> the mapping and array on startup)
>
>
> Or I can go deep into the guts and add a data file per-segment with  
> a format
> something like
>
> int version
> int numFields
> (int fieldNum, long offset) ^ numFields
> (int termForDoc) ^ (maxDocs * numFields)
>
> and add something to FieldInfo  like  boolean storeMultiFilter;
> and FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to  
> add
> an int termNum to the .tis file as well.
>
> This is clearly a lot more work than the first solution, but it is a  
> lot
> nicer to deal with as well. Is this interesting to anyone other than  
> me?
>
> Tim
>
>
>
>
> On 11/9/08 12:23 PM, "Michael McCandless"  
> <lu...@mikemccandless.com> wrote:
>
>>
>> Conceivably, TermInfosReader could track the sequence number of each
>> term.
>>
>> A seek/skipTo would know which sequence number it just jumped to,
>> because the index is regular (every 128 terms by default), and then
>> each next() call could increment that.  Then retrieving this number
>> would be as costly as calling eg IndexReader.docFreq(Term) is now.
>>
>> But I'm not sure how a multi-segment index  would work, ie how would
>> MultiSegmentReader compute this for its terms?  Or maybe you'd just  
>> do
>> this per-segment?
>>
>> Mike
>>
>> Tim Sturge wrote:
>>
>>> Hi,
>>>
>>> I’m wondering if there is any easy technique to number the terms in
>>> an index
>>> (By number I mean map a sequence of terms to a contiguous range of
>>> integers
>>> and map terms to these numbers efficiently)
>>>
>>> Looking at the Term class and the .tis/.tii index format it appears
>>> that the
>>> terms are stored in an ordered and prefix-compressed format, but
>>> while there
>>> are pointers from a term to the .frq and .prx files, neither is  
>>> really
>>> suitable as a sequence number.
>>>
>>> The reason I have this question is that I am writing a multi-filter
>>> for
>>> single term fields. My index contains many fields for which each
>>> document
>>> contains a single term (e.g. date, zipcode, country) and I need to
>>> perform
>>> range queries or set matches over these fields, many of which are  
>>> very
>>> inclusive (they match >10% of the total documents)
>>>
>>> A cached RangeFilter works well when there are a small number of
>>> potential
>>> options (e.g. for countries) but when there are many options
>>> (consider a
>>> date range or a set of zipcodes) there are too many potential
>>> choices to
>>> cache each possibility and it is too inefficient to build a filter
>>> on the
>>> fly for each query (as you have to visit 10% of documents to build  
>>> the
>>> filter despite the query itself matching 0.1%)
>>>
>>> Therefore I was considering building an int[reader.maxDoc()] array
>>> for each
>>> field and putting into it the term number for each document. This
>>> relies on
>>> the fact that each document contains only a single term for this
>>> field, but
>>> with it I should be able to quickly construct a “multi-filter” (that
>>> is,
>>> something that iterates the array and checks that the term is in the
>>> range
>>> or set).
>>>
>>> Right now it looks like I can do some very ugly surgery and perhaps
>>> use the
>>> offset to the prx file even though it is not contiguous. But I’m
>>> hoping
>>> there is a better technique that I’m just not seeing right now.
>>>
>>> Thanks,
>>>
>>> Tim
>>
>>
>>
>
>
>




Re: Term numbering and range filtering

Posted by Tim Sturge <ts...@hi5.com>.
Yes, that is a significant issue. What I'm coming to realize is that either
I will end up with something like

class MultiFilter {
   String field;
   private int[] termInDoc;
   Map<Term,Integer> termToInt;
   ...
}

which can be entirely built on the current lucene APIs but has significantly
more overhead (the termToInt mapping in particular and the need to construct
the mapping and array on startup)


Or I can go deep into the guts and add a data file per-segment with a format
something like

int version
int numFields
(int fieldNum, long offset) ^ numFields
(int termForDoc) ^ (maxDocs * numFields)

and add something to FieldInfo  like  boolean storeMultiFilter;
and FieldInfos something like STORE_MULTIFILTER = 0x40; I'd need to add
an int termNum to the .tis file as well.
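
(For concreteness, a sketch of writing that layout with IndexOutput --
the ".mfd" extension and all the surrounding names are invented:)

// Hypothetical writer for the per-segment layout sketched above.
static void writeMultiFilterData(Directory dir, String segment,
                                 int[] fieldNums, long[] offsets,
                                 int[][] ords, int maxDoc) throws IOException {
  IndexOutput out = dir.createOutput(segment + ".mfd");
  try {
    out.writeInt(1);                 // int version
    out.writeInt(fieldNums.length);  // int numFields
    for (int f = 0; f < fieldNums.length; f++) {
      out.writeInt(fieldNums[f]);    // (int fieldNum, long offset) ^ numFields
      out.writeLong(offsets[f]);
    }
    for (int f = 0; f < fieldNums.length; f++) {
      for (int doc = 0; doc < maxDoc; doc++) {
        out.writeInt(ords[f][doc]);  // (int termForDoc) per doc, per field
      }
    }
  } finally {
    out.close();
  }
}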

This is clearly a lot more work than the first solution, but it is a lot
nicer to deal with as well. Is this interesting to anyone other than me?

Tim




On 11/9/08 12:23 PM, "Michael McCandless" <lu...@mikemccandless.com> wrote:

> 
> Conceivably, TermInfosReader could track the sequence number of each
> term.
> 
> A seek/skipTo would know which sequence number it just jumped to,
> because the index is regular (every 128 terms by default), and then
> each next() call could increment that.  Then retrieving this number
> would be as costly as calling eg IndexReader.docFreq(Term) is now.
> 
> But I'm not sure how a multi-segment index  would work, ie how would
> MultiSegmentReader compute this for its terms?  Or maybe you'd just do
> this per-segment?
> 
> Mike
> 
> Tim Sturge wrote:
> 
>> Hi,
>> 
>> I'm wondering if there is any easy technique to number the terms in
>> an index
>> (By number I mean map a sequence of terms to a contiguous range of
>> integers
>> and map terms to these numbers efficiently)
>> 
>> Looking at the Term class and the .tis/.tii index format it appears
>> that the
>> terms are stored in an ordered and prefix-compressed format, but
>> while there
>> are pointers from a term to the .frq and .prx files, neither is really
>> suitable as a sequence number.
>> 
>> The reason I have this question is that I am writing a multi-filter
>> for
>> single term fields. My index contains many fields for which each
>> document
>> contains a single term (e.g. date, zipcode, country) and I need to
>> perform
>> range queries or set matches over these fields, many of which are very
>> inclusive (they match >10% of the total documents)
>> 
>> A cached RangeFilter works well when there are a small number of
>> potential
>> options (e.g. for countries) but when there are many options
>> (consider a
>> date range or a set of zipcodes) there are too many potential
>> choices to
>> cache each possibility and it is too inefficient to build a filter
>> on the
>> fly for each query (as you have to visit 10% of documents to build the
>> filter despite the query itself matching 0.1%)
>> 
>> Therefore I was considering building an int[reader.maxDoc()] array
>> for each
>> field and putting into it the term number for each document. This
>> relies on
>> the fact that each document contains only a single term for this
>> field, but
>> with it I should be able to quickly construct a "multi-filter" (that
>> is,
>> something that iterates the array and checks that the term is in the
>> range
>> or set).
>> 
>> Right now it looks like I can do some very ugly surgery and perhaps
>> use the
>> offset to the prx file even though it is not contiguous. But I'm
>> hoping
>> there is a better technique that I'm just not seeing right now.
>> 
>> Thanks,
>> 
>> Tim
> 
> 
> 




Re: Term numbering and range filtering

Posted by Michael McCandless <lu...@mikemccandless.com>.
Conceivably, TermInfosReader could track the sequence number of each  
term.

A seek/skipTo would know which sequence number it just jumped to,  
because the index is regular (every 128 terms by default), and then  
each next() call could increment that.  Then retrieving this number  
would be as costly as calling eg IndexReader.docFreq(Term) is now.

But I'm not sure how a multi-segment index  would work, ie how would  
MultiSegmentReader compute this for its terms?  Or maybe you'd just do  
this per-segment?
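
(In pseudo-Java, the bookkeeping being described -- indexPos and the
scan are hypothetical stand-ins for TermInfosReader internals, with the
default index interval of 128:)

// After seeking via the .tii index, the enumerator sits on index term
// number indexPos, whose ord is indexPos * 128; each next() adds one.
static int ordOf(int indexPos, TermEnum enumerator, Term target)
    throws IOException {
  int ord = indexPos * 128;
  while (enumerator.term() != null && enumerator.term().compareTo(target) < 0) {
    if (!enumerator.next()) break;
    ord++;
  }
  return ord;  // sequence number of the current term in this segment
}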

Mike

Tim Sturge wrote:

> Hi,
>
> I’m wondering if there is any easy technique to number the terms in  
> an index
> (By number I mean map a sequence of terms to a contiguous range of  
> integers
> and map terms to these numbers efficiently)
>
> Looking at the Term class and the .tis/.tii index format it appears  
> that the
> terms are stored in an ordered and prefix-compressed format, but  
> while there
> are pointers from a term to the .frq and .prx files, neither is really
> suitable as a sequence number.
>
> The reason I have this question is that I am writing a multi-filter  
> for
> single term fields. My index contains many fields for which each  
> document
> contains a single term (e.g. date, zipcode, country) and I need to  
> perform
> range queries or set matches over these fields, many of which are very
> inclusive (they match >10% of the total documents)
>
> A cached RangeFilter works well when there are a small number of  
> potential
> options (e.g. for countries) but when there are many options  
> (consider a
> date range or a set of zipcodes) there are too many potential  
> choices to
> cache each possibility and it is too inefficient to build a filter  
> on the
> fly for each query (as you have to visit 10% of documents to build the
> filter despite the query itself matching 0.1%)
>
> Therefore I was considering building an int[reader.maxDoc()] array  
> for each
> field and putting into it the term number for each document. This  
> relies on
> the fact that each document contains only a single term for this  
> field, but
> with it I should be able to quickly construct a “multi-filter” (that  
> is,
> something that iterates the array and checks that the term is in the  
> range
> or set).
>
> Right now it looks like I can do some very ugly surgery and perhaps  
> use the
> offset to the prx file even though it is not contiguous. But I’m  
> hoping
> there is a better technique that I’m just not seeing right now.
>
> Thanks,
>
> Tim

