Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@thorntothehorn.org> on 2011/04/05 08:24:29 UTC

DocIdSet to represent small number of hits in large Document set

I'm converting a Lucene 2.3.2 deployment to 2.4.1 (with a view to going to 2.9.4).

Many of our indexes hold 5M+ Documents, but only a small subset of these is 
relevant to any one user.  Since a DocIdSet backed by a BitSet or OpenBitSet is 
rather inefficient in terms of memory use, what is the recommended DocIdSet 
implementation to use in this scenario?
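
To put rough numbers on the memory concern, here is a minimal sketch in Java.
It is not Lucene code: the class and method names are hypothetical, the hit
count of 10,000 is an assumed figure for illustration only, and JVM object
headers are ignored.

```java
// Hypothetical helper, not part of Lucene: rough payload sizes only.
final class DocIdSetMemoryEstimate {

    /** Bytes for a bit set with one bit per document, rounded up to 64-bit words. */
    static long bitSetBytes(int maxDoc) {
        long words = ((long) maxDoc + 63) / 64;
        return words * 8;
    }

    /** Bytes for a plain sorted int[] with one 4-byte entry per hit. */
    static long sortedIntArrayBytes(int hits) {
        return 4L * hits;
    }
}
```

Under these assumptions, bitSetBytes(5000000) is 625,000 bytes no matter how
few documents match, while sortedIntArrayBytes(10000) is 40,000 bytes. The bit
set cost depends on the index size; the sparse cost depends on the hit count.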

Seems like SortedVIntList can be used to store the info, but it has no methods 
to build the list in the first place, requiring an array or bitset in the 
constructor.

I had used Nutch's DocSet and HashDocSet implementations in my 2.3.2 deployment, 
but want to move away from that Nutch dependency, so wondered if Lucene had a 
way to do this?

Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: DocIdSet to represent small number of hits in large Document set

Posted by Michael McCandless <lu...@mikemccandless.com>.
This (HashDocSet, and any other impls that handle the sparse case
well) could be useful to have in Lucene's core.

For example, for certain MultiTermQuerys we have this
CONSTANT_SCORE_AUTO_REWRITE, which has iffy smelling heuristics to try
to determine the best cutover point from
ConstantScoreQuery(BooleanQuery(<OR of Terms>)) to FILTER_REWRITE,
because FILTER_REWRITE is costly in the sparse case.

Mike

http://blog.mikemccandless.com

On Tue, Apr 5, 2011 at 10:53 AM, Jason Rutherglen
<ja...@gmail.com> wrote:
> I think Solr has a HashDocSet implementation?


Re: DocIdSet to represent small number of hits in large Document set

Posted by Jason Rutherglen <ja...@gmail.com>.
I think Solr has a HashDocSet implementation?

On Tue, Apr 5, 2011 at 3:19 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> Can we simply factor out (poach!) those useful-sounding classes from
> Nutch into Lucene?


Re: DocIdSet to represent small number of hits in large Document set

Posted by Michael McCandless <lu...@mikemccandless.com>.
Can we simply factor out (poach!) those useful-sounding classes from
Nutch into Lucene?

Mike

http://blog.mikemccandless.com



Re: DocIdSet to represent small numberr of hits in large Document set

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Tue, Apr 5, 2011 at 2:24 AM, Antony Bowesman <ad...@thorntothehorn.org> wrote:
> Seems like SortedVIntList can be used to store the info, but it has no
> methods to build the list in the first place, requiring an array or bitset
> in the constructor.

It has a constructor that takes DocIdSetIterator - so you can pass an
iterator obtained from anywhere else (a Scorer actually is a
DocIdSetIterator, and you can get a DocIdSet from a Filter), or
implement your own.  It's a simple iterator interface.
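
For the "implement your own" route, the following is a minimal standalone
sketch of a sparse doc id list that can be built incrementally. It is not
Lucene's SortedVIntList and does not use any Lucene types, so every name in it
(SparseDocIdList, Cursor, nextDoc, NO_MORE_DOCS) is hypothetical. It assumes
doc ids are added in strictly increasing order and stores the gaps between
consecutive ids as variable-length ints, which is the general idea behind a
compressed sorted int list.

```java
// Standalone sketch; NOT Lucene's SortedVIntList. All names are hypothetical.
final class SparseDocIdList {
    /** Sentinel returned by Cursor.nextDoc() when the list is exhausted. */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private byte[] bytes = new byte[16]; // VInt-encoded gaps between doc ids
    private int length = 0;              // number of bytes in use
    private int size = 0;                // number of doc ids stored
    private int lastDoc = 0;             // last doc id added (valid when size > 0)

    /** Appends a doc id; ids must be non-negative and strictly increasing. */
    void add(int docId) {
        if (docId < 0 || docId == NO_MORE_DOCS || (size > 0 && docId <= lastDoc)) {
            throw new IllegalArgumentException("bad or out-of-order doc id: " + docId);
        }
        int gap = (size == 0) ? docId : docId - lastDoc;
        // Emit 7 bits per byte, low bits first; the high bit marks "more bytes follow".
        while ((gap & ~0x7F) != 0) {
            append((byte) ((gap & 0x7F) | 0x80));
            gap >>>= 7;
        }
        append((byte) gap);
        lastDoc = docId;
        size++;
    }

    int size() { return size; }

    /** Bytes used by the encoded gaps (excludes object overhead). */
    int bytesUsed() { return length; }

    /** Returns a forward-only cursor over the stored doc ids. */
    Cursor cursor() { return new Cursor(); }

    private void append(byte b) {
        if (length == bytes.length) {
            bytes = java.util.Arrays.copyOf(bytes, bytes.length * 2);
        }
        bytes[length++] = b;
    }

    final class Cursor {
        private int pos = 0; // read position in bytes
        private int doc = 0; // running sum of decoded gaps

        /** Returns the next doc id, or NO_MORE_DOCS when exhausted. */
        int nextDoc() {
            if (pos >= length) {
                return NO_MORE_DOCS;
            }
            int gap = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            doc += gap;
            return doc;
        }
    }
}
```

To use it, call add(docId) for each hit in ascending order, then walk the
result with cursor().nextDoc() until it returns NO_MORE_DOCS. To plug something
like this into Lucene, the cursor would need to be adapted to the
DocIdSetIterator methods of the Lucene version in use. That adapter is left
out here.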


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco
