Posted to dev@lucene.apache.org by "Andy Liu (JIRA)" <ji...@apache.org> on 2007/04/02 23:53:32 UTC

[jira] Created: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

MemoryCachedRangeFilter to boost performance of Range queries
-------------------------------------------------------------

                 Key: LUCENE-855
                 URL: https://issues.apache.org/jira/browse/LUCENE-855
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 2.1
            Reporter: Andy Liu


Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range.  This requires iterating through every single term in the index and can get rather slow for large document sets.

MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, sorts them by value, and stores them in a SortedFieldCache.  During bits(), binary searches are used to find the start and end indices of the lower and upper bound values.  The BitSet is then populated with all the docIds that fall between the start and end indices.
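
A minimal sketch of that lookup, assuming a pair of parallel arrays (sorted values plus the docIds in matching order); the class and method names here are illustrative, not the patch's actual API:

    import java.util.Arrays;
    import java.util.BitSet;

    public class SortedFieldCacheSketch {
        private final long[] sortedValues; // field values, ascending
        private final int[] sortedDocIds;  // docIds, in the same order as the values

        public SortedFieldCacheSketch(long[] sortedValues, int[] sortedDocIds) {
            this.sortedValues = sortedValues;
            this.sortedDocIds = sortedDocIds;
        }

        // Equivalent of bits(): two binary searches bound the slice of values
        // inside [lower, upper], then the docIds in that slice are set.
        public BitSet bits(int maxDoc, long lower, long upper) {
            BitSet bits = new BitSet(maxDoc);
            int start = firstIndexAtLeast(lower);
            int end = firstIndexAtLeast(upper + 1); // assumes upper < Long.MAX_VALUE
            for (int i = start; i < end; i++) {
                bits.set(sortedDocIds[i]);
            }
            return bits;
        }

        private int firstIndexAtLeast(long value) {
            int idx = Arrays.binarySearch(sortedValues, value);
            if (idx < 0) return -idx - 1; // insertion point for a missing value
            while (idx > 0 && sortedValues[idx - 1] == value) idx--; // leftmost duplicate
            return idx;
        }
    }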

TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range.  Executing bits() 1000 times with the standard RangeFilter using random date intervals took 63904ms.  Using MemoryCachedRangeFilter, it took 876ms.  The performance increase is less dramatic when a field has fewer unique terms or the index has fewer documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings.  A side "benefit" of storing the values as longs is that there's no longer a need to make the values lexicographically comparable, i.e. by padding numeric values with zeros (unpadded, "9" sorts after "10" as a string).

The downside of using MemoryCachedRangeFilter is that there's a fairly significant memory requirement.  So it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue.  The memory requirement is (sizeof(int) + sizeof(long)) * numDocs, i.e. about 12 bytes per document, or roughly 36MB for a 3M document index.

MemoryCachedRangeFilter also requires a warmup step, which can take a while to run on large datasets (it took 40s to run on a 3M document corpus).  Warmup can be called explicitly, or it is automatically called the first time MemoryCachedRangeFilter is applied using a given field.

So in summary, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- Field contains many unique numeric values
- Index contains a large number of documents




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486763 ] 

Yonik Seeley commented on LUCENE-855:
-------------------------------------

There is also something from Mark Harwood:
https://issues.apache.org/jira/browse/LUCENE-798



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter.patch

Fixed a bug with the BitSet's nextSetBit(i) and nextClearBit(i).  I wrote a test to verify that they return the same values as a normal BitSet.  I don't use these functions myself, so if someone wants to verify my fix, that would be great.

Added the ASF license header to the top of each file and fixed all of Otis' bugs.
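
A sketch of the kind of equivalence test described above; the "proxy" here is stood in by a plain copy so the walk is self-contained (in the real test it would be the patch's field-cache-backed bit set):

    import java.util.BitSet;
    import java.util.Random;

    public class NextSetBitCheckSketch {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            BitSet expected = new BitSet(1000);
            for (int i = 0; i < 1000; i++) {
                if (rnd.nextBoolean()) expected.set(i);
            }
            BitSet proxy = (BitSet) expected.clone(); // stand-in for the patched bit set

            // Walk both via nextSetBit and verify they agree bit for bit.
            int e = expected.nextSetBit(0);
            int p = proxy.nextSetBit(0);
            while (e != -1 || p != -1) {
                if (e != p) throw new AssertionError("mismatch: " + e + " vs " + p);
                e = expected.nextSetBit(e + 1);
                p = proxy.nextSetBit(p + 1);
            }
            System.out.println("nextSetBit walks agree");
        }
    }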




[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter.patch

Here is my version of a FieldCacheRangeFilter.

I used different class names so both patches can be applied to Lucene, but they do almost the same thing, so I do not think both should be committed.

This filter uses the FieldCache to get values out of the index and then creates BitSets that are proxies to the field cache.  So when you do a BitSet.get(int bitIndex), it will check the field cache.

One thing to note: since this is a proxy to the field cache, it will not work with the current version of ChainedFilter.java (but I have a fix for that also), since ChainedFilter makes a copy of the bit set and flips the bits, and this bit set will not work for that.

This version will use less memory, since there is only one copy of the data and the BitSet is just a proxy with no data of its own.
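
A minimal sketch of that proxy idea, assuming the cached field values live in a long[] from the field cache (the class and field names are illustrative, not the patch's actual API):

    import java.util.BitSet;

    // A BitSet whose get() is answered from cached field values rather than
    // stored bits: a doc is "set" iff its value falls inside the range.
    public class RangeProxyBitSetSketch extends BitSet {
        private final long[] values; // field value per docId, from the field cache
        private final long lower, upper;

        public RangeProxyBitSetSketch(long[] values, long lower, long upper) {
            this.values = values;
            this.lower = lower;
            this.upper = upper;
        }

        public boolean get(int bitIndex) {
            long v = values[bitIndex];
            return v >= lower && v <= upper;
        }

        public int cardinality() {
            int count = 0;
            for (int i = 0; i < values.length; i++) {
                if (get(i)) count++;
            }
            return count;
        }

        // Note: methods like nextSetBit()/nextClearBit() would also need
        // overriding; inheriting them from BitSet is exactly the kind of
        // incompleteness discussed later in this thread.
    }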

I want to thank Andy, as I have used/stolen all of your tests and modified them just a bit so they work with my version.  And since we have the same performance tests, here are the numbers:

Using Andy's code 

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Sun Apr 07 13:49:15 PDT 2002
    [junit] End interval: Fri Apr 06 13:49:15 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...
    [junit] Standard RangeFilter finished in 58585ms
    [junit] MemoryCachedRangeFilter finished in 825ms
    [junit] ------------- ---------------- ---------------

Using My code 

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Sun Apr 07 13:40:52 PDT 2002
    [junit] End interval: Fri Apr 06 13:40:52 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...
    [junit] Standard RangeFilter finished in 58528ms
    [junit] FieldCacheRangeFilter finished in 30ms
    [junit] ------------- ---------------- ---------------




Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by robert engels <re...@ix.netcom.com>.
On Dec 4, 2008, at 4:10 PM, Paul Elschot wrote:

> On Thursday 04 December 2008 23:03:40, robert engels wrote:
>> The biggest benefit I see of using the field cache to do filter
>> caching, is that the same cache can be used for sorting - thereby
>> improving the performance and memory usage.
>
> Would it be possible to build such Filter caching into
> CachingWrapperFilter instead of into QueryFilter?
>
> Both filter caching and the field value caching will need
> access to the underlying (segment?) readers.
>

I don't see why not. The QueryFilter extends from that... We are just  
on a much older code base.

Not really sure why this hierarchy exists though, as the only  
extenders are QueryFilter and CachingWrapperFilterHelper.

I would prefer QueryFilter, and then extend that as CachingQueryFilter.

I've always been taught that if you see the words Wrapper or Helper,  
there is probably a design problem, or at least a naming problem.


>>
>> The downside I see is that if you have a common filter that is built
>> from many fields, you are going to use a lot more memory, as every
>> field used needs to be cached. With my code you would only have a
>> single "bitset" for the filter.
>
> But with many ranges that would mean many bitsets, and
> MemoryCachedRangeFilter only needs to cache the field values once
> for any number of ranges. It's a tradeoff.
>

That was my point. To a degree, I don't see the field-based caching  
and the filter-based caching as solving the same problem. It is going  
to depend on the actual usage - that is why I would like to support both.

> Regards,
> Paul Elschot


Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by Paul Elschot <pa...@xs4all.nl>.
On Thursday 04 December 2008 23:03:40, robert engels wrote:
> The biggest benefit I see of using the field cache to do filter
> caching, is that the same cache can be used for sorting - thereby
> improving the performance and memory usage.

Would it be possible to build such Filter caching into 
CachingWrapperFilter instead of into QueryFilter?

Both filter caching and the field value caching will need
access to the underlying (segment?) readers.

>
> The downside I see is that if you have a common filter that is built
> from many fields, you are going to use a lot more memory, as every
> field used needs to be cached. With my code you would only have a
> single "bitset" for the filter.

But with many ranges that would mean many bitsets, and
MemoryCachedRangeFilter only needs to cache the field values once
for any number of ranges. It's a tradeoff.

Regards,
Paul Elschot




Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by Earwin Burrfoot <ea...@gmail.com>.
It would be cool to be able to explicitly list subreaders that were
added/removed as a result of reopen(), or have some kind of
notification mechanism.
We have filter caches and custom field/sort caches here, and they are all
reader-bound. Currently the warm-up delay is hidden by reopening and
warming up in the background before switching to the new reader/caches,
but it still limits our minimum delay between reopens.
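
A sketch of that reopen-warm-swap pattern, under the assumption that reader-bound caches are rebuilt in warm() before the swap (names are illustrative, and real code would also need to refcount the old reader before closing it):

    import java.util.concurrent.atomic.AtomicReference;
    import org.apache.lucene.index.IndexReader;

    public class WarmSwapSketch {
        private final AtomicReference<IndexReader> current =
            new AtomicReference<IndexReader>();

        // Reopen and warm in the background; searches keep using the old
        // reader (and its reader-bound caches) until the new one is ready.
        public void refresh() throws Exception {
            IndexReader old = current.get();
            IndexReader fresh = old.reopen();
            if (fresh != old) {
                warm(fresh);        // rebuild caches before anyone sees it
                current.set(fresh); // atomic swap: new searches use fresh
                old.close();        // real code must wait for in-flight
            }                       // searches to finish before closing
        }

        private void warm(IndexReader reader) throws Exception {
            // e.g. populate field/sort caches and run representative queries
        }
    }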




-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by robert engels <re...@ix.netcom.com>.
The biggest benefit I see of using the field cache to do filter  
caching is that the same cache can be used for sorting - thereby  
improving performance and memory usage (see the sketch below).

The downside I see is that if you have a common filter that is built  
from many fields, you are going to use a lot more memory, as every  
field used needs to be cached. With my code you would only have a  
single "bitset" for the filter.



Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by robert engels <re...@ix.netcom.com>.
Lucene-831 is far more comprehensive.

I also think that by exposing access to the sub-readers it can be far  
simpler (closer to what I have provided).

In the meantime, you should be able to use the provided class with a  
few modifications.

The "reload the entire cache" was a deal breaker for us, so I came up  
the attached. Works very well.



RE: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by Uwe Schindler <uw...@thetaphi.de>.
I keep looking at LUCENE-831, which is a new version of FieldCache that is
compatible with IndexReader.reopen() and invalidates only reloaded segments.
With each release of Lucene I am very unhappy because it is still not in. The
same problem as yours arises if you have a one million document index that is
updated by adding a few documents every half hour. If you sort by a field,
then whenever the index is reopened - even though really only a very small
segment was added - the complete FieldCache is rebuilt (sketched below). Very
bad :(.


So I think the ultimate fix would be to apply LUCENE-831 soon and also use
LUCENE-1461 as the RangeFilter cache.
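
A sketch of the pathology being described, assuming Lucene 2.x-era APIs (the field name is illustrative):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class ReopenCacheSketch {
        public static IndexReader refresh(IndexReader reader) throws Exception {
            IndexReader reopened = reader.reopen(); // shares unchanged segments
            if (reopened != reader) {
                reader.close();
                // FieldCache.DEFAULT keys its cache on the reader instance, so
                // this reloads values for ALL documents, not just the one
                // small segment that was actually added.
                FieldCache.DEFAULT.getInts(reopened, "price");
            }
            return reopened;
        }
    }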

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by robert engels <re...@ix.netcom.com>.
I can't seem to post to Jira, so I am attaching here...

I attached QueryFilter.java.

In reading this patch, and other similar ones, the problem seems to  
be that if the index is modified, the cache is invalidated, causing a  
complete reload of the cache. Do I have this correct?

The attached patch works really well in a highly interactive  
environment, as the cache is only invalidated at the segment level.

The MyMultiReader is a subclass that allows access to the underlying  
SegmentReaders.
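
A sketch of segment-level invalidation as described, assuming a hypothetical per-segment cache (the cache shape and names are illustrative; only segments absent from the map recompute their bits):

    import java.util.BitSet;
    import java.util.Map;
    import java.util.WeakHashMap;
    import org.apache.lucene.index.IndexReader;

    public class SegmentCacheSketch {
        // Cached bits per segment reader; entries vanish when a segment
        // reader is garbage collected after a reopen replaces it.
        private final Map<IndexReader, BitSet> cache =
            new WeakHashMap<IndexReader, BitSet>();

        public synchronized BitSet bitsFor(IndexReader segment) throws Exception {
            BitSet bits = cache.get(segment);
            if (bits == null) {
                bits = computeBits(segment); // only runs for new/changed segments
                cache.put(segment, bits);
            }
            return bits;
        }

        private BitSet computeBits(IndexReader segment) {
            return new BitSet(segment.maxDoc()); // placeholder: real filter logic here
        }
    }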

The patch cannot be applied as-is, but I think the implementation  
works far better in many cases - it is also far less memory  
intensive. Scanning the bitset could also be optimized very easily  
using internal skip values.

Maybe this is completely off-base, but the solution has worked very  
well for us. Or maybe this is a completely different issue and a  
separate incident should be opened?

Is there any interest in this?





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450 ] 

Andy Liu commented on LUCENE-855:
---------------------------------

Yes, it looks the same.  Glad this will finally make it to the source!



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487378 ] 

Andy Liu commented on LUCENE-855:
---------------------------------

Hey Matt,

The way you implemented FieldCacheRangeFilter is very simple and clever!  Here are a couple of comments:

1. My performance test that we both used is no longer valid, since FieldCacheRangeFilter.bits() only returns a wrapper around a BitSet, and the test only calls bits().  Since you're wrapping BitSet, there's some overhead incurred when applying it to an actual search.  I reran the performance test applying the Filter to a search, and your implementation is still faster, although only slightly.

2. Your filter currently doesn't work with ConstantScoreRangeQuery.  CRQ calls bits.nextSetBit(), which fails in your wrapped BitSet implementation.  Your incomplete implementation of BitSet may cause problems elsewhere.

If you can fix #2, I'd vote for your implementation, since it's cleaner and faster, although I might take another stab at improving my implementation.



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487595 ] 

Andy Liu commented on LUCENE-855:
---------------------------------

In your updated benchmark, you're combining the range filter with a term query that matches one document.  I don't believe that's the typical use case for a range filter.  Usually the user employs a range to filter a large document set.  

I created a different benchmark to compare standard range filter, MemoryCachedRangeFilter, and Matt's FieldCacheRangeFilter using MatchAllDocsQuery, ConstantScoreQuery, and TermQuery (matching one doc like the last benchmark).  Here are the results:

Reader opened with 100000 documents.  Creating RangeFilters...
RangeFilter w/MatchAllDocsQuery:
========================
  * Bits: 4421
  * Search: 5285

RangeFilter w/ConstantScoreQuery:
========================
  * Bits: 4200
  * Search: 8694

RangeFilter w/TermQuery:
========================
  * Bits: 4088
  * Search: 4133

MemoryCachedRangeFilter w/MatchAllDocsQuery:
========================
  * Bits: 80
  * Search: 1142

MemoryCachedRangeFilter w/ConstantScoreQuery:
========================
  * Bits: 79
  * Search: 482

MemoryCachedRangeFilter w/TermQuery:
========================
  * Bits: 73
  * Search: 95

FieldCacheRangeFilter w/MatchAllDocsQuery:
========================
  * Bits: 0
  * Search: 1146

FieldCacheRangeFilter w/ConstantScoreQuery:
========================
  * Bits: 1
  * Search: 356

FieldCacheRangeFilter w/TermQuery:
========================
  * Bits: 0
  * Search: 19

Here are some points:

1. When searching with a filter, bits() is called, so the search time includes the bits() time (see the sketch after this list).
2. Matt's FieldCacheRangeFilter is faster for ConstantScoreQuery, although not by much.  Using MatchAllDocsQuery, they run neck-and-neck.  FCRF is much faster for TermQuery, since MCRF has to create the BitSet for the range before the search is executed.
3. I get fewer document hits when running FieldCacheRangeFilter with ConstantScoreQuery.  Matt, there may be a bug in getNextSetBit().  Not sure if this would affect the benchmark.
4. I'd be interested to see performance numbers when FieldCacheRangeFilter is used with ChainedFilter.  I suspect that MCRF would be faster in this case, since I'm assuming that FCRF has to reconstruct a standard BitSet during clone().
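
For reference, a sketch of how timings like the above can be taken, assuming Lucene 2.x-era APIs (the result-count parameter is illustrative):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.*;

    public class FilterBenchSketch {
        public static void time(IndexReader reader, IndexSearcher searcher,
                                Query query, Filter filter) throws Exception {
            long t0 = System.currentTimeMillis();
            filter.bits(reader);            // "Bits": cost of bits() alone
            long t1 = System.currentTimeMillis();
            // "Search": the search itself calls bits() internally, so this
            // includes the bits() cost (point 1 above).
            searcher.search(query, filter, 100);
            long t2 = System.currentTimeMillis();
            System.out.println("  * Bits: " + (t1 - t0));
            System.out.println("  * Search: " + (t2 - t1));
        }
    }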



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486788 ] 

Yonik Seeley commented on LUCENE-855:
-------------------------------------

> LUCENE-798 caches RangeFilters so that if the same exact range is executed again [...]

It's not just the exact same range though... it can reuse parts of ranges AFAIK.





[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Liu updated LUCENE-855:
----------------------------

    Attachment: contrib-filters.tar.gz

I made a few changes to MemoryCachedRangeFilter:

- SortedFieldCache's values[] now contains only sorted unique values, while docId[] has been changed to a ragged 2D array with an array of docIds corresponding to each unique value.  Since there are no longer repeated values in values[], forward() and rewind() are no longer required.  This also addresses the O(n) special case that Hoss brought up where every value is identical.  (See the sketch after this list.)
- bits() now returns OpenBitSetWrapper, a subclass of BitSet that uses Solr's OpenBitSet as a delegate.  Wrapping OpenBitSet presents some challenges: since the internal bits store of BitSet is private, it's difficult to perform operations between BitSet and OpenBitSet (like or, and, etc.).
- An in-memory OpenBitSet cache is kept.  During warmup, the global range is partitioned and OpenBitSet instances are created for each partition.  During bits(), the cached OpenBitSet instances that fall between the lower and upper ranges are used.
- Moved MCRF to contrib/ due to the Solr dependency.
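
To make the new layout concrete, here's a rough sketch of the structure and the bits() logic from the first point (field and method names are illustrative, not the patch's actual code; bounds are treated as inclusive):

    // values[] holds sorted unique field values; docIds[i] lists the
    // documents whose value equals values[i].
    long[] values;
    int[][] docIds;

    public BitSet bits(long lower, long upper, int maxDoc) {
      BitSet result = new BitSet(maxDoc);
      int start = java.util.Arrays.binarySearch(values, lower);
      if (start < 0) start = -start - 1;    // first index with value >= lower
      int end = java.util.Arrays.binarySearch(values, upper);
      if (end < 0) end = -end - 2;          // last index with value <= upper
      for (int i = start; i <= end; i++) {
        for (int j = 0; j < docIds[i].length; j++) {
          result.set(docIds[i][j]);
        }
      }
      return result;
    }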

Using the current (and incomplete) benchmark, MemoryCachedRangeFilter is slightly faster than FCRF when used in conjunction with ConstantScoreQuery and MatchAllDocsQuery:

Reader opened with 100000 documents.  Creating RangeFilters...

TermQuery

FieldCacheRangeFilter
  * Total: 88ms
  * Bits: 0ms
  * Search: 14ms

MemoryCachedRangeFilter
  * Total: 89ms
  * Bits: 17ms
  * Search: 31ms

RangeFilter
  * Total: 9034ms
  * Bits: 4483ms
  * Search: 4521ms

Chained FieldCacheRangeFilter
  * Total: 33ms
  * Bits: 3ms
  * Search: 9ms

Chained MemoryCachedRangeFilter
  * Total: 77ms
  * Bits: 19ms
  * Search: 30ms


ConstantScoreQuery

FieldCacheRangeFilter
  * Total: 541ms
  * Bits: 2ms
  * Search: 485ms

MemoryCachedRangeFilter
  * Total: 473ms
  * Bits: 23ms
  * Search: 390ms

RangeFilter
  * Total: 13777ms
  * Bits: 4451ms
  * Search: 9298ms

Chained FieldCacheRangeFilter
  * Total: 12ms
  * Bits: 2ms
  * Search: 5ms

Chained MemoryCachedRangeFilter
  * Total: 80ms
  * Bits: 16ms
  * Search: 44ms


MatchAllDocsQuery

FieldCacheRangeFilter
  * Total: 1231ms
  * Bits: 3ms
  * Search: 1115ms

MemoryCachedRangeFilter
  * Total: 1222ms
  * Bits: 53ms
  * Search: 1149ms

RangeFilter
  * Total: 10689ms
  * Bits: 4954ms
  * Search: 5583ms

Chained FieldCacheRangeFilter
  * Total: 937ms
  * Bits: 1ms
  * Search: 862ms

Chained MemoryCachedRangeFilter
  * Total: 921ms
  * Bits: 19ms
  * Search: 894ms

Hoss, those were great comments you made.  I'd be happy to continue on and make those changes, although if the feeling around town is that Matt's range filter is the preferred implementation, I'll stop here.



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962 ] 

Hoss Man commented on LUCENE-855:
---------------------------------

On Mon, 9 Apr 2007, Otis Gospodnetic (JIRA) wrote:

: I'd love to know what Hoss and other big Filter users think about this.
: Solr makes a lot of use of (Range?)Filters, I believe.

This is one of those Jira issues that I didn't really have time to follow when it was first opened, and so the Jira emails have just been piling up waiting for me to read.

Here are the raw notes I took as I read through the patches...

----------------
FieldCacheRangeFilter.patch  from 10/Apr/07 01:52 PM

 * javadoc cut/paste errors (FieldCache)
 * FieldCacheRangeFilter should work with simple strings
   (using FieldCache.getStrings or FieldCache.getStringIndex)
   just like regular RangeFilter
 * it feels like the various parser versions should be in
   separate subclasses (common abstract base class?)
 * why does clone need to construct a raw BitSet?  what exactly didn't
   work about ChainedFilter without this?
   (could cause other BitSet usage problems)
 * or/and/andNot/xor can all be implemented using convertToBitSet
 * need FieldCacheBitSet methods: cardinality, get(int,int)
 * need equals and hashCode methods in all new classes
 * FieldCacheBitSet.clear should be UnsuppOp
 * convertToBitSet can be cached.
 * FieldCacheBitSet should be abstract, requiring get(int) be implemented


MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM

 * "tuples" should be initialized to fieldCache.length ... serious
   ArrayList resizing going on there
   (why is it an ArrayList, why not just Tules[] ?)
 * doesn't "cache" need synchronization? ... seems like the same
   CreationPlaceholder pattern used in FieldCache might make sense here.
 * this looks wrong...
     } else if ( (!includeLower) && (lowerIndex >= 0) ) {
   ...consider the case where lower==5, includeLower==false, and all values
   in the index are 5; binary search could leave us in the middle of the
   index, so we still need to move forward to the end?
 * ditto the above concern for finding upperIndex
 * what is the pathological worst case for rewind/forward when *lots* of
   duplicate values are in the index?  should another binarySearch be used?
   (see the sketch after these notes)
 * a lot of code in MemoryCachedRangeFilter.bits for finding
   lowerIndex/upperIndex would probably make more sense as methods in
   SortedFieldCache
 * only seems to handle longs, at a minimum should deal with arbitrary
   strings, with optional add ons for longs/ints/etc...
 * I can't help but wonder how MemoryCachedRangeFilter would compare if it
   used Solr's OpenBitSet (facaded to implement the BitSet API)

TestRangeFilterPerformanceComparison.java   from 10/Apr/07

 * I can't help but wonder how RangeFilter would compare if it used Solr's
   OpenBitSet (facaded to implement the BitSet API)
 * no test of includeLower==false or includeUpper==false
 * i don't think the ranges being compared are the same for RangeFilter as they 
   are for the other Filters ... note the use of DateTools when building the index, 
   vs straight string usage in RangeFilter, vs Long.parseLong in 
   MemoryCachedRangeFilter and FieldCacheRangeFilter
 * is it really a fair comparison to call MemoryCachedRangeFilter.warmup
   or FieldCacheRangeFilter.bits outside of the timing code?
   for indexes where the IndexReader is reopened periodically this may
   be a significant number to be aware of.
----------------
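
One way to avoid that duplicate-value worst case, sketched assuming the cache exposes a sorted long[] values array: replace the linear rewind/forward with a second binary search over the run of equal values.

    int lowerIndex = java.util.Arrays.binarySearch(values, lower);
    if (lowerIndex >= 0 && !includeLower) {
      // find the first index whose value is strictly greater than `lower`,
      // in O(log n) even if every value in the index equals `lower`
      int lo = lowerIndex, hi = values.length;
      while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (values[mid] <= lower) lo = mid + 1;
        else hi = mid;
      }
      lowerIndex = lo;   // may be values.length if nothing qualifies
    }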

Questions about the legitimacy of the testing aside...

In general, I like the approach of FieldCacheBitSet -- but it should be generalized into an "AbstractReadOnlyBitSet" where all methods are implemented via get(int) in subclasses -- we should make sure that every method in the BitSet API works as advertised in Java 1.4.
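
Something like this minimal sketch (the class name comes from the suggestion above, not from any patch; only a subset of the methods is shown):

    // Subclasses implement only get(int); derived methods come from it,
    // and mutators are rejected to make the read-only contract explicit.
    public abstract class AbstractReadOnlyBitSet extends java.util.BitSet {
      private final int size;
      protected AbstractReadOnlyBitSet(int size) { this.size = size; }

      public abstract boolean get(int index);

      public int cardinality() {
        int count = 0;
        for (int i = 0; i < size; i++) if (get(i)) count++;
        return count;
      }

      public int nextSetBit(int fromIndex) {
        for (int i = fromIndex; i < size; i++) if (get(i)) return i;
        return -1;
      }

      public void set(int index)   { throw new UnsupportedOperationException(); }
      public void clear(int index) { throw new UnsupportedOperationException(); }
      // and(), or(), xor(), flip(), etc. would be overridden the same way
    }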

I don't really like the various hoops FieldCacheRangeFilter has to jump through to support int/float/long ... I think at its core it should support simple Strings, with alternate/sub classes for dealing with other FieldCache formats ... I just really dislike all the crazy nested ifs to deal with the different Parser types; if there's going to be separate constructors for longs/floats/ints, they might as well be separate subclasses.

the really nice thing this has over RangeFilter is that people can index raw numeric values without needing to massage them into lexicographically ordered Strings (since the FieldCache will take care of parsing them appropriately) 

My gut tells me that the MemoryCachedRangeFilter approach will never be able to compete with the FieldCacheRangeFilter facading-BitSet approach, since it needs to build the FieldCache, then the SortedFieldCache, then a BitSet ... it seems like any optimization in that pipeline can always be beaten by using the same logic but then facading the BitSet.






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486767 ] 

Andy Liu commented on LUCENE-855:
---------------------------------

Otis, looking forward to your colleague's patch.

LUCENE-798 caches RangeFilters so that if the same exact range is executed again, the cached RangeFilter is used.  However, the first time a range is encountered, you'll still have to calculate the RangeFilter, which can be slow.  I haven't looked at the patch, but I'm sure LUCENE-798 can be used in conjunction with MemoryCachedRangeFilter to further boost performance for repeated range queries.
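
For illustration, the combination could look roughly like this hypothetical per-range cache (not LUCENE-798's actual code; the cache and cachedBits names are made up):

    // The first request for a range pays the full filter cost; repeats
    // of the same key become a map lookup.
    java.util.Map cache = new java.util.HashMap();   // range key -> BitSet

    public BitSet cachedBits(IndexReader reader, String key, Filter filter)
        throws java.io.IOException {
      BitSet bits = (BitSet) cache.get(key);
      if (bits == null) {
        bits = filter.bits(reader);
        cache.put(key, bits);
      }
      return bits;
    }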



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Yiqing Jin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488297 ] 

Yiqing Jin commented on LUCENE-855:
-----------------------------------

After I changed the code in ChainedFilter#doChain to

case AND:
    BitSet bit = (BitSet) filter.bits(reader).clone();
    result.and(bit);
    break;

the result is fine, but I know that's a bad way to do it.
Since FieldCacheBitSet is not a real BitSet and its get() method just reads values from the FieldCache, I think the current implementation is still not a good fit for ChainedFilter, because FieldCacheBitSet does not have a real implementation of the logical operations such as and().
Maybe we could make FieldCacheBitSet public and implement all the methods in its own way instead of having a convertToBitSet() that makes things messy.
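
For reference, the behavior being described is roughly this (a sketch with assumed field names, not the patch's code):

    // get(doc) answers the range test straight from the FieldCache
    // values, so no bits are ever materialized.
    class FieldCacheBitSet extends java.util.BitSet {
      private final long[] values;            // FieldCache values, indexed by doc id
      private final long lower, upper;
      FieldCacheBitSet(long[] values, long lower, long upper) {
        this.values = values; this.lower = lower; this.upper = upper;
      }
      public boolean get(int doc) {
        return values[doc] >= lower && values[doc] <= upper;
      }
    }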



[jira] Assigned: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic reassigned LUCENE-855:
---------------------------------------

    Assignee:     (was: Otis Gospodnetic)

Can somebody have a look and commit?  I *believe* this is a good patch - it was good when I looked at it when Eric first contributed it.  Thanks (behind on too many fronts)



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Yiqing Jin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488547 ] 

Yiqing Jin commented on LUCENE-855:
-----------------------------------

That's true, you can't do the 'and' or 'or' as usual, but I'm thinking the FieldCacheBitSet could hold some private variables to store the range and field information, and we could do the 'and', 'or', and 'xor' in a tricky way by setting those variables, implementing get() to use them as the test.

Changing the ChainedFilter is a good way too; maybe we could have a special FieldCacheChainedFilter ^_^.

I'm having a busy day, but I'll try to do some experiments on it if I have time.



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651837#action_12651837 ] 

Paul Elschot commented on LUCENE-855:
-------------------------------------

On the face of it, this has some overlap with the recent FieldCacheRangeFilter of LUCENE-1461.
Any comments?



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter.patch

Let's try this again.

I am very sorry to everyone for the last patch.  I had some trouble with my environment not rebuilding correctly.

I have run ant clean before testing.
Andy, take a look at this patch and tell me what you think.





[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Liu updated LUCENE-855:
----------------------------

    Attachment: MemoryCachedRangeFilter.patch

Patch produced against the latest from SVN



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter.patch

This version of the FieldCacheRangeFilter has a new AbstractGetOnlyBitSet, a base BitSet that implements all of its functions, like nextClearBit(), using get().

I have also added a RuntimeChainedFilter.  It works just like the normal ChainedFilter, but it does not do the AND, OR, or XOR until you call the get() function, which allows the BitSets that the FieldCacheRangeFilter creates to be chained correctly.  (A rough sketch follows.)
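
In other words, the chained bit set amounts to something like this (class and field names assumed, not the patch's actual classes):

    // Defer the logical operation to get() time instead of materializing
    // a combined BitSet up front.
    class LazyAndBitSet extends java.util.BitSet {
      private final java.util.BitSet a, b;    // e.g. two get()-only bit sets
      LazyAndBitSet(java.util.BitSet a, java.util.BitSet b) {
        this.a = a; this.b = b;
      }
      public boolean get(int index) {
        return a.get(index) && b.get(index);  // evaluated per document, on demand
      }
    }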

There are also tests for all of my new code.  The FieldCacheRangeFilter still has the nested if statements to allow for longs, ints, and floats.  I think these are not that complicated, and they allow users to pick the type of filter they want while saving space.  In my application we use ints for dates, even though we know that will only support dates going up to 2038, because right now we need the memory.
This code gives the flexibility to the users creating the filters so they can tune their app just the way they want it.

I hope you all like it. Please let me know what you think 



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter_Lucene_2.3.0.patch

I have changed the FieldCacheRangeFilter so that it now works with Lucene 2.3.0.
It does not have to change the FieldCache, since it can just use ExtendedFieldCache.
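
In Lucene 2.3 the per-document long values can be pulled like this (assuming the stock ExtendedFieldCache entry point; reader, doc, lower, and upper are set up elsewhere):

    // One long per document, parsed and cached by Lucene itself.
    long[] values = ExtendedFieldCache.EXT_DEFAULT.getLongs(reader, "date");
    // The range test for a document is then a pair of comparisons:
    boolean inRange = values[doc] >= lower && values[doc] <= upper;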

Here are the performance numbers again

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Tue Feb 04 22:34:22 PST 2003
    [junit] End interval: Sun Feb 03 22:34:22 PST 2008
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...
    [junit] Standard RangeFilter finished in 41111ms
    [junit] FieldCacheRangeFilter finished in 112ms
    [junit] ------------- ---------------- ---------------


I hope this helps and I hope this gets added to Lucene



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "vivek (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560912#action_12560912 ] 

vivek commented on LUCENE-855:
------------------------------

Any plans to make this part of Lucene 2.3?



[jira] Assigned: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic reassigned LUCENE-855:
---------------------------------------

    Assignee: Otis Gospodnetic



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487380 ] 

Matt Ericson commented on LUCENE-855:
-------------------------------------

I will be happy to fix #2, or at least to try to fix #2.

The test had the real work done outside the timed section.

The other thing I like about it is that there is less data saved in the cache.  Some of our indexes are 10 GB, so every byte counts, at least in my application.






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487108 ] 

Matt Ericson commented on LUCENE-855:
-------------------------------------

I am almost done with my patch, and I wanted to test it against this patch to see who has the faster version, but MemoryCachedRangeFilter is written using Java 1.5,

and as far as I know Lucene is still on Java 1.4.

Lines like
private static WeakHashMap<IndexReader, Map<String, SortedFieldCache>> cache = new WeakHashMap<IndexReader, Map<String, SortedFieldCache>>();


will not compile in Java 1.4.  Andy, I would love to see who has the faster patch; if you convert your patch to Java 1.4, I would be happy to put them side by side.
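
For what it's worth, the 1.4-compatible version of that declaration just drops the generics in favor of raw types, with the type information moving into comments and a cast at the call site.  A hypothetical sketch (not taken from either patch; the holder class and method names are made up):

    import java.util.Map;
    import java.util.WeakHashMap;
    import org.apache.lucene.index.IndexReader;

    class FieldCacheHolder {
        // Maps IndexReader -> (Map of field name -> SortedFieldCache); raw types for Java 1.4
        private static WeakHashMap cache = new WeakHashMap();

        static Map getFieldCaches(IndexReader reader) {
            // Callers cast on retrieval, since raw types carry no type information
            return (Map) cache.get(reader);
        }
    }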



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653900#action_12653900 ] 

Otis Gospodnetic commented on LUCENE-855:
-----------------------------------------

Hi Matt! :)

Tim, want to benchmark the two? (since you already benchmarked 1461, you should be able to plug in Matt's thing and see how it compares)




[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Liu updated LUCENE-855:
----------------------------

    Attachment: MemoryCachedRangeFilter_1.4.patch

Here's a patch that should compile in Java 1.4.  It includes:

src/java/org/apache/lucene/search/MemoryCachedRangeFilter.java
src/test/org/apache/lucene/search/TestMemoryCachedRangeFilter.java
src/test/org/apache/lucene/search/TestMemoryCachedRangeFilterPerformance.java

You can try using TestMemoryCachedRangeFilterPerformance to compare runtime speed numbers.  Let me know if you have any problems running these.



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487590 ] 

Otis Gospodnetic commented on LUCENE-855:
-----------------------------------------

Comments about the patch so far:
Cosmetics:
- You don't want to refer to Andy's class in javadocs, as that class won't go in unless Andy makes it faster.
- I see some incorrect (copy/paste error) javadocs and javadocs/comments with typos in both the test classes and non-test classes.
- Please configure your Lucene project in Eclipse to use 2 spaces instead of 4.  In general, once you get the code formatting settings right, it's good practice to format your code with those settings before submitting a patch.

Testing:
- You can put the testPerformance() code from  TestFieldCacheRangeFilterPerformance  in the other unit test class you have there.
- Your testPerformance() doesn't actually assert...() anything, it just prints numbers to stdout.  You can keep the printing, but it would be better to also do some asserts, so we can always verify that the FieldCacheRangeFilter beats the vanilla RangeFilter without looking at stdout.
- You may want to close that searcher in testPerformance() before opening a new one.  Probably won't make any difference, but still.
- You may also want to just close the searcher at the end of the method.


Impl:
- In the inner FieldCacheBitSet class, I see:
+        public boolean intersects(BitSet set)  {
+            for (int i =0; i < length; i++) {
+                if (get(i) && set.get(i)) {
+                    return true;
+                }
+            }
+            return false;
+        }

Is there room for a small optimization?  What if the BitSets are not of equal size?  Wouldn't it make sense to loop only up to the size of the smaller BitSet?  Sorry if I'm off, I hardly ever work with BitSets.  (A sketch of what I mean follows.)
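
For illustration, that size-aware loop might look like this (a sketch only; it assumes length is the virtual size of this FieldCacheBitSet, and uses BitSet.length(), which is the index of the highest set bit plus one):

    public boolean intersects(BitSet set) {
        // No bit at or beyond either set's logical length can be set in both,
        // so scanning past the smaller bound is wasted work.
        int limit = Math.min(length, set.length());
        for (int i = 0; i < limit; i++) {
            if (get(i) && set.get(i)) {
                return true;
            }
        }
        return false;
    }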

- I see you made the *_PARSERs in FCImpl public (they were private).  Is that really needed?  Would package-protected be enough?

- Make sure ASL is in all test and non-test classes, I don't see it there now.


Overall, I like it - slick and elegant usage of FC!

I'd love to know what Hoss and other big Filter users think about this.  Solr makes a lot of use of (Range?)Filters, I believe.




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Yiqing Jin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488075 ] 

Yiqing Jin commented on LUCENE-855:
-----------------------------------

This seems very useful.  Just one thing I would like to know: does this Filter work properly with ChainedFilter?  Sometimes we have to filter the result with more than one range on different fields, say searching an area by lat/lon.
I made a simple test filtering on two fields with ChainedFilter, and it seems I can't find anything even when there are docs in that range.
Maybe there are some bugs in my code; I'll check it tomorrow.
BTW, the value type I used is Float.
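
For reference, the two-range case would be wired up roughly like this (a sketch; the field names and variables are made up, and the FieldCacheRangeFilter constructor signature is the one used in the tests later in this thread):

    // AND two single-field range filters to get a lat/lon bounding box
    Filter latFilter = new FieldCacheRangeFilter("lat", minLat, maxLat, true, true);
    Filter lonFilter = new FieldCacheRangeFilter("lon", minLon, maxLon, true, true);
    Filter box = new ChainedFilter(new Filter[] { latFilter, lonFilter }, ChainedFilter.AND);
    Hits hits = searcher.search(query, box);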



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488125 ] 

Hoss Man commented on LUCENE-855:
---------------------------------

Another thing that occurred to me this morning is that the comparison test doesn't consider the performance of the various Filters when cached and reused (with something like CachingWrapperFilter) ... you may actually see the stock RangeFilter be faster than either implementation when you can reuse the same exact Filter over and over on the same IndexReader -- a fairly common use case; a sketch of that reuse pattern follows the list below.

In general the numbers that really need to be compared are...

  1) the time overhead of an implementation when opening a new IndexReader (and whether that overhead is per field)
  2) the time overhead of an implementation the first time a specific Filter is used on an IndexReader
  3) the time on average that it takes to use a Filter
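
For concreteness, the reuse case looks roughly like this (a sketch; the field name, bounds, and query variables are made up, and CachingWrapperFilter caches the computed BitSet per IndexReader, so only the first search pays the full cost):

    // Build the filter once and reuse it across searches on the same IndexReader
    Filter dateRange = new CachingWrapperFilter(
        new RangeFilter("date", "20020101", "20070101", true, true));
    Hits first  = searcher.search(query, dateRange);  // pays the term scan
    Hits second = searcher.search(query, dateRange);  // served from the cached BitSet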



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487897 ] 

Andy Liu commented on LUCENE-855:
---------------------------------

Hey Matt, I get this exception when running your newest FCRF with the performance test.  Can you check to see if you get this also?

java.lang.ArrayIndexOutOfBoundsException: 100000
	at org.apache.lucene.search.FieldCacheRangeFilter$5.get(FieldCacheRangeFilter.java:231)
	at org.apache.lucene.search.IndexSearcher$1.collect(IndexSearcher.java:136)
	at org.apache.lucene.search.Scorer.score(Scorer.java:49)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
	at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
	at org.apache.lucene.search.Hits.<init>(Hits.java:53)
	at org.apache.lucene.search.Searcher.search(Searcher.java:46)
	at org.apache.lucene.misc.TestRangeFilterPerformanceComparison$Benchmark.go(TestRangeFilterPerformanceComparison.java:312)
	at org.apache.lucene.misc.TestRangeFilterPerformanceComparison.testPerformance(TestRangeFilterPerformanceComparison.java:201)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:585)
	at junit.framework.TestCase.runTest(TestCase.java:154)
	at junit.framework.TestCase.runBare(TestCase.java:127)
	at junit.framework.TestResult$1.protect(TestResult.java:106)
	at junit.framework.TestResult.runProtected(TestResult.java:124)
	at junit.framework.TestResult.run(TestResult.java:109)
	at junit.framework.TestCase.run(TestCase.java:118)
	at junit.framework.TestSuite.runTest(TestSuite.java:208)
	at junit.framework.TestSuite.run(TestSuite.java:203)
	at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Yiqing Jin (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488291 ] 

Yiqing Jin commented on LUCENE-855:
-----------------------------------

Hi Matt,
When I tried FieldCacheRangeFilter, I ran into a problem.

I added a test block at the end of TestFieldCacheRangeFilter:

    FieldCacheRangeFilter f1 = new FieldCacheRangeFilter("id", (float) minIP, (float) maxIP, T, F);
    FieldCacheRangeFilter f2 = new FieldCacheRangeFilter("id", (float) minIP, (float) maxIP, F, T);

    ChainedFilter f = new ChainedFilter(new Filter[] { f1, f2 }, ChainedFilter.AND);
    result = search.search(q, f);
    assertEquals("all but ends", numDocs - 2, result.length());

This does not pass; in fact result.length() is 0, and nothing can be found.

I checked my code and traced the execution but still can't get the expected result.  It seems the Filter won't work with ChainedFilter: after doChain the BitSet seems to be empty (with either the 'and' or the 'or' operation).
CODE:

    case AND:
        BitSet bit = filter.bits(reader);
        result.and(bit);

The bit set is already empty before it is ANDed into the result.
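
A plausible cause (my assumption; the patch internals aren't quoted here): if the filter's BitSet subclass overrides get() but never writes to the underlying bit storage, java.util.BitSet.and() operates on the empty internal words rather than calling get(), so any intersection comes out empty.  Materializing a real BitSet, roughly as below, avoids that; Matt's follow-up patch takes this route when cloning:

    // Hypothetical helper on the virtual BitSet: copy the get() view into a real BitSet
    public BitSet toRealBitSet() {
        BitSet real = new BitSet(length);
        for (int i = 0; i < length; i++) {
            if (get(i)) {
                real.set(i);
            }
        }
        return real;
    }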




[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter.patch

Andy was correct: the two performance tests were bogus, as they did not call get() on the bit sets, and my code does all of the work in the get() call.  I guess I should have looked a little closer at the tests before using them.

I changed his tests and mine to call IndexSearcher.search(q, filter) and actually do the search.
Here are the results:

Using the MemoryCachedRangeFilter

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Tue Apr 09 14:32:14 PDT 2002
    [junit] End interval: Sun Apr 08 14:32:14 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...
    [junit] Standard RangeFilter finished in 57533ms
    [junit] MemoryCachedRangeFilter finished in 905ms
    [junit] ------------- ---------------- ---------------

Using FieldCacheRangeFilter

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Tue Apr 09 14:30:29 PDT 2002
    [junit] End interval: Sun Apr 08 14:30:29 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...
    [junit] Standard RangeFilter finished in 58822ms
    [junit] FieldCacheRangeFilter finished in 102ms
    [junit] ------------- ---------------- ---------------

They are much closer this time.

I have fixed my BitSets to allow a user to call nextClearBit() or nextSetBit(); a sketch of the latter is below.
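
For a virtual, get()-backed BitSet, that override would look roughly like this (a hypothetical sketch following java.util.BitSet's nextSetBit contract; length is assumed to be the virtual size):

    // Index of the first set bit at or after fromIndex, or -1 if there is none
    public int nextSetBit(int fromIndex) {
        if (fromIndex < 0) {
            throw new IndexOutOfBoundsException("fromIndex < 0: " + fromIndex);
        }
        for (int i = fromIndex; i < length; i++) {
            if (get(i)) {
                return i;
            }
        }
        return -1;
    }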



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486791 ] 

Andy Liu commented on LUCENE-855:
---------------------------------

Ah, you're right.  I didn't read closely enough!



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter.patch

This version creates a real BitSet when cloned, which allows ChainedFilter to work correctly.





[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Andy Liu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Liu updated LUCENE-855:
----------------------------

    Attachment: TestRangeFilterPerformanceComparison.java

Here's my new benchmark.



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486758 ] 

Otis Gospodnetic commented on LUCENE-855:
-----------------------------------------

A colleague of mine is working on something similar, but possibly more efficient (less sorting and binary searching).  He'll probably attach his patch to this issue.


[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Tim Sturge (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653414#action_12653414 ] 

Tim Sturge commented on LUCENE-855:
-----------------------------------

Matt, Andy,

Please take a look at LUCENE-1461. As far as I can tell it is identical in purpose and design to this patch.

Matt,

I would like to add you to the CHANGES.txt credits for LUCENE-1461. Are you OK with that?



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: TestRangeFilterPerformanceComparison.java

Andy, thank you for that test.

I took it, moved it to contrib/miscellaneous, and added a few more tests, including the ChainedFilter test. Here is my version. I also fixed a few bugs in my code, which I will attach next.

I also reformatted my results; I think they are a little easier to read.
Here is what I get, and you're right: with a MatchAllDocsQuery, our two versions of the code perform about the same.

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Thu Apr 11 10:55:02 PDT 2002
    [junit] End interval: Tue Apr 10 10:55:02 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...

    [junit] TermQuery

    [junit] FieldCacheRangeFilter
    [junit]   * Total: 13ms
    [junit]   * Bits: 0ms
    [junit]   * Search: 9ms
    [junit] MemoryCachedRangeFilter
    [junit]   * Total: 209ms
    [junit]   * Bits: 90ms
    [junit]   * Search: 115ms
    [junit] RangeFilter
    [junit]   * Total: 12068ms
    [junit]   * Bits: 6009ms
    [junit]   * Search: 6051ms
    [junit] Chained FieldCacheRangeFilter
    [junit]   * Total: 15ms
    [junit]   * Bits: 1ms
    [junit]   * Search: 10ms
    [junit] Chained MemoryCachedRangeFilter
    [junit]   * Total: 177ms
    [junit]   * Bits: 83ms
    [junit]   * Search: 90ms

    [junit] ConstantScoreQuery

    [junit] FieldCacheRangeFilter
    [junit]   * Total: 480ms
    [junit]   * Bits: 1ms
    [junit]   * Search: 474ms
    [junit] MemoryCachedRangeFilter
    [junit]   * Total: 757ms
    [junit]   * Bits: 90ms
    [junit]   * Search: 663ms
    [junit] RangeFilter
    [junit]   * Total: 18749ms
    [junit]   * Bits: 6083ms
    [junit]   * Search: 12655ms
    [junit] Chained FieldCacheRangeFilter
    [junit]   * Total: 11ms
    [junit]   * Bits: 0ms
    [junit]   * Search: 8ms
    [junit] Chained MemoryCachedRangeFilter
    [junit]   * Total: 776ms
    [junit]   * Bits: 87ms
    [junit]   * Search: 682ms

    [junit] MatchAllDocsQuery

    [junit] FieldCacheRangeFilter
    [junit]   * Total: 1344ms
    [junit]   * Bits: 5ms
    [junit]   * Search: 1334ms
    [junit] MemoryCachedRangeFilter
    [junit]   * Total: 1468ms
    [junit]   * Bits: 81ms
    [junit]   * Search: 1381ms
    [junit] RangeFilter
    [junit]   * Total: 13360ms
    [junit]   * Bits: 6091ms
    [junit]   * Search: 7254ms
    [junit] Chained FieldCacheRangeFilter
    [junit]   * Total: 924ms
    [junit]   * Bits: 4ms
    [junit]   * Search: 916ms
    [junit] Chained MemoryCachedRangeFilter
    [junit]   * Total: 1507ms
    [junit]   * Bits: 84ms
    [junit]   * Search: 1415ms
    [junit] ------------- ---------------- ---------------
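
For readers of the numbers above: "Bits" times Filter.bits() alone, while "Search" times the filtered search end to end. A hypothetical helper in that spirit (FilterTimer and its signature are illustrative names, not the benchmark's actual code) might look like this:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Hypothetical helper mirroring the "Bits" / "Search" split above:
    // first time Filter.bits() alone, then the filtered search as a whole.
    public class FilterTimer {
        public static void time(String label, IndexReader reader,
                                IndexSearcher searcher, Query query,
                                Filter filter, int iterations) throws IOException {
            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                filter.bits(reader);            // cost of producing the filter bits
            }
            long bitsMs = System.currentTimeMillis() - start;

            start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++) {
                searcher.search(query, filter); // bits plus query execution
            }
            long searchMs = System.currentTimeMillis() - start;

            System.out.println(label);
            System.out.println("  * Bits: " + bitsMs + "ms");
            System.out.println("  * Search: " + searchMs + "ms");
        }
    }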


[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653635#action_12653635 ] 

Matt Ericson commented on LUCENE-855:
-------------------------------------

It looks similar to what I wrote, but it uses more data structures. I like what I built because it works directly against the FieldCache with no extra structures on top: once the data is loaded into the FieldCache, any other search on that field can reuse it without rebuilding anything.
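
As a rough illustration of that design: this is a minimal sketch, not the actual FieldCacheRangeFilter patch. It assumes a FieldCache.getLongs() accessor, which only appeared in later Lucene releases; 2.1-era code would use getInts() or a custom cache.

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;

    // Sketch: scan the per-document value array held by the FieldCache and
    // set a bit for every document whose value falls inside the range.
    public class SketchFieldCacheRangeFilter extends Filter {
        private final String field;
        private final long lower, upper;

        public SketchFieldCacheRangeFilter(String field, long lower, long upper) {
            this.field = field;
            this.lower = lower;
            this.upper = upper;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            // The cache keys on (reader, field), so every later filter or sort
            // on this field reuses the same array; nothing is rebuilt.
            long[] values = FieldCache.DEFAULT.getLongs(reader, field);
            BitSet result = new BitSet(reader.maxDoc());
            for (int doc = 0; doc < values.length; doc++) {
                if (values[doc] >= lower && values[doc] <= upper) {
                    result.set(doc);
                }
            }
            return result;
        }
    }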

But I think all three are improvements on what's there now. I am prejudiced, of course: I really like the one I wrote, and I think it will stack up faster than LUCENE-1461 if you run load tests on it.

Just my $0.02

Matt

[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488412 ] 

Matt Ericson commented on LUCENE-855:
-------------------------------------

I have done a little research, and I do not think I can get my BitSet to act
the same as a normal BitSet, so this will not work with ChainedFilter, which
calls BitSet.and() or BitSet.or().

I looked at those methods: they access private variables inside the BitSet
and perform the 'and', 'or', and 'xor' directly on the bits in memory. Since
my BitSet is just a proxy for the field cache, ChainedFilter will not work
unless we also change ChainedFilter.
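
To see why, here is a minimal, hypothetical sketch of such a proxy (RangeBitSetProxy is an illustrative name, not the patch's class). java.util.BitSet.and(), or(), and xor() work directly on the receiver's private word array and never call get(), so the override below is silently bypassed when ChainedFilter combines filters.

    import java.util.BitSet;

    // Hypothetical proxy: answers membership straight from cached per-document
    // values instead of materializing any bits.
    class RangeBitSetProxy extends BitSet {
        private final long[] values;     // per-document values from the field cache
        private final long lower, upper; // inclusive range bounds

        RangeBitSetProxy(long[] values, long lower, long upper) {
            this.values = values;
            this.lower = lower;
            this.upper = upper;
        }

        // Works for callers that test bits one at a time...
        public boolean get(int docId) {
            long v = values[docId];
            return v >= lower && v <= upper;
        }
        // ...but and()/or()/xor() read the private long[] words of both
        // operands directly, so this override never runs during chaining.
    }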

Matt


[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487587 ] 

Otis Gospodnetic commented on LUCENE-855:
-----------------------------------------

OK.  I'll wait for the new performance numbers before committing.  Andy, if you see anything funky in Matt's patch or if you managed to make your version faster, let us know, please.


[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Posted by "Matt Ericson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491677 ] 

Matt Ericson commented on LUCENE-855:
-------------------------------------

Can someone take a look at the code I attached and let me know if there is anything we need to change? Or did it get added to Lucene?

I don't really know how long this should take.

Matt
