Posted to java-user@lucene.apache.org by Jean Claude van Johnson <va...@gmail.com> on 2017/09/05 00:09:28 UTC

What is the fastest way to loop over all documents in an index?

Hi there,

I have a use case where I need to iterate over all documents in an index from time to time.
It seems that MatchAllDocsQuery is what I should use for this; however, it creates a bunch of objects (scores, etc.) that I don’t really need.

My question to you is: 

What is the fastest way to loop over all documents in an index?
Is it looping over all possible doc IDs (and filtering out deleted documents)?

Thank you very much.

Best regards
Claude

Re: Re: What is the fastest way to loop over all documents in an index?

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.

I believe that's the case. Leave the deleted docs out, though (they can be
identified by intersecting with the live-docs bitset).

On Tue, Sep 5, 2017 at 2:04 PM, Ahmet Arslan <io...@yahoo.com> wrote:

>
> Hi Ishan,
>
> I saw the following loop suggested for this task on Stack Overflow.
>
> for (int i=0; i<reader.maxDoc(); i++)
>
> How can we confirm that internal Lucene IDs are consecutive numbers from 0
> to maxDoc()-1?
>
> I thought they were arbitrary integers.
>
> Ahmet

Re: Re: What is the fastest way to loop over all documents in an index?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Ishan,

I saw the following loop suggested for this task on Stack Overflow.

for (int i=0; i<reader.maxDoc(); i++)

How can we confirm that internal Lucene IDs are consecutive numbers from 0 to maxDoc()-1?

I thought they were arbitrary integers.

Ahmet



On Tuesday, September 5, 2017, 7:54:31 AM GMT+3, Ishan Chattopadhyaya <ic...@gmail.com> wrote: 





Maybe looping over doc IDs and calling IndexReader#document() is the best option here?
http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/index/IndexReader.html#document-int-


Re: What is the fastest way to loop over all documents in an index?

Posted by Ishan Chattopadhyaya <ic...@gmail.com>.

Maybe looping over doc IDs and calling IndexReader#document() is the best option here?
http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/index/IndexReader.html#document-int-

On Tue, Sep 5, 2017 at 7:57 AM, Ahmet Arslan <io...@yahoo.com.invalid>
wrote:

> Hi Jean,
>
> I am also interested in answers to this question; I need this feature too.
> Currently I am using a hack:
> I create an artificial field (with an artificial token) attached to every
> document.
>
> I traverse all documents using the code snippet given in my previous
> related question (no one answered it):
>
> http://lucene.472066.n3.nabble.com/PostingsEnum-for-documents-that-does-not-contain-a-term-td4349482.html
>
> I found the EverythingEnum class in Lucene50PostingsReader.java, but I
> couldn't figure out how to use it.
> So I do not know whether this class is meant for the task, but its name looks
> promising.
>
> Thanks,
> Ahmet

Re: What is the fastest way to loop over all documents in an index?

Posted by Ahmet Arslan <io...@yahoo.com.INVALID>.

Hi Jean,

I am also interested in answers to this question; I need this feature too. Currently I am using a hack:
I create an artificial field (with an artificial token) attached to every document.

I traverse all documents using the code snippet given in my previous related question (no one answered it):

http://lucene.472066.n3.nabble.com/PostingsEnum-for-documents-that-does-not-contain-a-term-td4349482.html

I found the EverythingEnum class in Lucene50PostingsReader.java, but I couldn't figure out how to use it.
So I do not know whether this class is meant for the task, but its name looks promising.

Thanks,
Ahmet




Re: What is the fastest way to loop over all documents in an index?

Posted by Jean Claude van Johnson <va...@gmail.com>.

Many thanks for your answers!

Cheers
Claude

> On 5 Sep 2017, at 21:54, Michael McCandless <lu...@mikemccandless.com> wrote:
> 
> You can call MultiFields.getLiveDocs(IndexReader) to get the bitset
> identifying which documents are not deleted.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com



Re: What is the fastest way to loop over all documents in an index?

Posted by Michael McCandless <lu...@mikemccandless.com>.

You can call MultiFields.getLiveDocs(IndexReader) to get the bitset
identifying which documents are not deleted.
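
A minimal sketch of the loop this points to, written against the JDK only so that it runs without Lucene on the classpath. `LiveDocsLoopSketch`, `liveDocIds`, and the use of `java.util.BitSet` are stand-ins invented for illustration, not Lucene code. In Lucene the matching pieces named in this thread would be `reader.maxDoc()` for the upper bound and the bitset returned by `MultiFields.getLiveDocs(reader)`; the sketch assumes the Lucene convention that a `null` bitset means the index has no deletions.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in, not Lucene code: models doc IDs 0..maxDoc-1
// together with a live-docs bitset.
final class LiveDocsLoopSketch {

    // Returns every doc ID in [0, maxDoc) whose bit is set in liveDocs.
    // A null liveDocs is treated as "no deletions".
    static List<Integer> liveDocIds(int maxDoc, BitSet liveDocs) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (liveDocs != null && !liveDocs.get(docId)) {
                continue; // deleted document: skip it
            }
            // With a real reader, reader.document(docId) would be called here.
            ids.add(docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(5);
        live.set(0, 5);  // docs 0..4 start out live
        live.clear(2);   // doc 2 has been deleted
        System.out.println(liveDocIds(5, live)); // prints [0, 1, 3, 4]
    }
}
```

The null check matters: skipping it would throw on an index that has never had a document deleted, under the convention assumed above.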

Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 5, 2017 at 2:54 PM, Mikhail Khludnev <mk...@apache.org> wrote:

> You can call searcher.search() with a MatchAllDocsQuery, passing your own
> Collector implementation, which will be notified about every non-deleted doc
> via collect(docId).

Re: What is the fastest way to loop over all documents in an index?

Posted by Mikhail Khludnev <mk...@apache.org>.

You can call searcher.search() with a MatchAllDocsQuery, passing your own
Collector implementation, which will be notified about every non-deleted doc
via collect(docId).
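
A rough, JDK-only model of that callback flow, so it runs as written without Lucene. `CollectorSketch`, `Segment`, `DocCollector`, `searchAll`, and `collectGlobalIds` are hypothetical names, not Lucene classes. The sketch assumes the Lucene convention that a collector is handed segment-relative doc IDs and has to add the segment's doc base to get index-wide IDs; with the real API you would instead pass your own collector to `searcher.search()` together with a `MatchAllDocsQuery`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-ins, not Lucene classes: they only model the callback flow.
final class CollectorSketch {

    // Callback target: told the doc base of each segment, then each live doc in it.
    interface DocCollector {
        void setDocBase(int docBase);
        void collect(int segmentDocId);
    }

    // One segment: its offset in the index-wide ID space and its live-docs bits.
    static final class Segment {
        final int docBase;
        final int maxDoc;
        final BitSet liveDocs; // null means no deletions in this segment

        Segment(int docBase, int maxDoc, BitSet liveDocs) {
            this.docBase = docBase;
            this.maxDoc = maxDoc;
            this.liveDocs = liveDocs;
        }
    }

    // Plays the role of a match-all search: visits each segment, reports live docs.
    static void searchAll(List<Segment> segments, DocCollector collector) {
        for (Segment segment : segments) {
            collector.setDocBase(segment.docBase);
            for (int doc = 0; doc < segment.maxDoc; doc++) {
                if (segment.liveDocs == null || segment.liveDocs.get(doc)) {
                    collector.collect(doc);
                }
            }
        }
    }

    // Gathers index-wide doc IDs by adding the current segment's doc base.
    static List<Integer> collectGlobalIds(List<Segment> segments) {
        final List<Integer> ids = new ArrayList<>();
        searchAll(segments, new DocCollector() {
            private int docBase;

            @Override
            public void setDocBase(int docBase) {
                this.docBase = docBase;
            }

            @Override
            public void collect(int segmentDocId) {
                ids.add(docBase + segmentDocId);
            }
        });
        return ids;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(3);
        live.set(0, 3);
        live.clear(1); // second doc of the second segment is deleted
        List<Segment> segments = new ArrayList<>();
        segments.add(new Segment(0, 2, null)); // index-wide IDs 0..1, nothing deleted
        segments.add(new Segment(2, 3, live)); // index-wide IDs 2..4, ID 3 deleted
        System.out.println(collectGlobalIds(segments)); // prints [0, 1, 2, 4]
    }
}
```

Compared with a plain loop over doc IDs, this shape leaves the deleted-document filtering to the search side, so the collector only ever sees live documents.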

-- 
Sincerely yours
Mikhail Khludnev