Posted to java-user@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2010/07/13 23:17:50 UTC

Continuously iterate over documents in index

Hi,
I would like to continuously iterate over the documents in my lucene index
as the index is updated.  Kind of like a "stream" of documents.  Is there a
way I can achieve this?

Would something like this be sufficient (untested):

 int currentDocId = 0;
 while(true) {

     for(; currentDocId < reader.maxDoc(); currentDocId++) {

          if(!reader.isDeleted(currentDocId)) {
               Document d = reader.document(currentDocId);
          }
     }

     // Maybe sleep here or something

     IndexReader newReader = reader.reopen();
     if(newReader != reader) {
          reader.close();
          reader = newReader;
     }
}

Right now, I do some NLP on the index that would slow down my indexing if
done at the same time, which is why I'm looking for a solution that works
in the background like this.  Another concern I have is that starting from
scratch (a fresh invocation of my program) requires me to load a lot of extra
data and then iterate through hundreds of thousands of documents just to get
to the newest docs that I haven't processed yet.  I would rather just start
from the newest doc and go forward.

I am currently checking whether or not I've processed a Document by looking
up one of its fields in MongoDB, but is there a way I could reliably use the
document's id from the reader to check whether I've looked at it already?
I've heard that IndexReader.document() is slow, so I would like to skip that
call if I know I've processed the document already.

Any ideas?

Thanks,
Max

Re: Continuously iterate over documents in index

Posted by Erick Erickson <er...@gmail.com>.
Kiran:
Please start a new thread when asking a new question. From Hossman's apache
page:

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking


On Wed, Jul 14, 2010 at 10:56 AM, Kiran Kumar <mu...@gmail.com> wrote:

> All,
>
> Issue: Unable to get the proper results after searching. I added the sample
> code which I used in the application.
>
> If I use a *numHitPerPage* value of 1000, it gives the expected results.
> ex: the expected result is 32 docs, and it shows 32 docs.
> But if I use *numHitPerPage* as 2^32-1, it does not give the expected results.
> ex: the expected result is 32 docs, but it shows only 29 docs.
>
> Sample code below:
>
>
> StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
> QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, defField, analyzer);
> Query q = qp.parse(queryString);
> TopDocsCollector tdc = TopScoreDocCollector.create(numHitPerPage, true);
> is.search(q, tdc);
>
> ScoreDoc[] noDocs = tdc.topDocs().scoreDocs;
>
> Please let me know if there is any other way to search.
>
> Thanks.
> Kiran. M
>

Re: Continuously iterate over documents in index

Posted by Kiran Kumar <mu...@gmail.com>.
All,

Issue: Unable to get the proper results after searching. I added sample code
which I used in the application.

If I use a *numHitPerPage* value of 1000, it gives the expected results.
ex: the expected result is 32 docs, and it shows 32 docs.
But if I use *numHitPerPage* as 2^32-1, it does not give the expected results.
ex: the expected result is 32 docs, but it shows only 29 docs.

Sample code below:


StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, defField, analyzer);
Query q = qp.parse(queryString);
TopDocsCollector tdc = TopScoreDocCollector.create(numHitPerPage, true);
is.search(q, tdc);

ScoreDoc[] noDocs = tdc.topDocs().scoreDocs;

Please let me know if there is any other way to search.

Thanks.
Kiran. M

Re: Continuously iterate over documents in index

Posted by Max Lynch <ih...@gmail.com>.
Erick,
This is what I ended up doing.  I initially avoided it because I was storing
dates using Solr's date type, which AFAIK isn't usable in Lucene, but I ended
up using DateTools to store a Lucene-readable version, and that seems to work
well.
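
In case it helps anyone later, the indexing side now looks roughly like this
(untested sketch; the "created" field name and the writer variable are just
placeholders for whatever you use):

    Document doc = new Document();
    // DateTools renders the date as a lexicographically ordered string,
    // so it can be range-queried and sorted like any other term
    String created = DateTools.dateToString(new Date(), DateTools.Resolution.SECOND);
    doc.add(new Field("created", created, Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.addDocument(doc);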

Thanks!

On Wed, Jul 14, 2010 at 7:59 PM, Erick Erickson <er...@gmail.com> wrote:

> Hmmmm, if you somehow know the last date you processed, why wouldn't using
> a
> range query work for you? I.e.
> date:[<recorded last date> TO <new date to record (NOW?)>]?
>
> Best
> Erick
>
> On Wed, Jul 14, 2010 at 10:37 AM, Max Lynch <ih...@gmail.com> wrote:
>
> > > You could have a field within each doc, say "Processed", storing a
> > > Yes/No value; on the next run, a searcher query should give you the
> > > collection of unprocessed ones.
> > >
> >
> > That sounds like a reasonable idea, and I just realized that I could have
> > done that in a way specific to my application.  However, I already tried
> > doing something with a MatchAllDocsQuery with a custom collector and sort by
> > date.  I store the last date and time of a doc I processed and process only
> > newer ones.
> >
>

Re: Continuously iterate over documents in index

Posted by Erick Erickson <er...@gmail.com>.
Hmmmm, if you somehow know the last date you processed, why wouldn't using a
range query work for you? I.e.
date:[<recorded last date> TO <new date to record (NOW?)>]?
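
Something along these lines, say, assuming the date was indexed as a DateTools
string in a field called "created" (field and variable names are just
placeholders, untested):

    String lower = DateTools.dateToString(lastProcessedDate, DateTools.Resolution.SECOND);
    String upper = DateTools.dateToString(new Date(), DateTools.Resolution.SECOND);
    // exclude the lower bound so the last processed doc isn't picked up again
    Query q = new TermRangeQuery("created", lower, upper, false, true);
    TopDocs hits = searcher.search(q, 1000);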

Best
Erick

On Wed, Jul 14, 2010 at 10:37 AM, Max Lynch <ih...@gmail.com> wrote:

> > You could have a field within each doc, say "Processed", storing a
> > Yes/No value; on the next run, a searcher query should give you the
> > collection of unprocessed ones.
> >
>
> That sounds like a reasonable idea, and I just realized that I could have
> done that in a way specific to my application.  However, I already tried
> doing something with a MatchAllDocsQuery with a custom collector and sort by
> date.  I store the last date and time of a doc I processed and process only
> newer ones.
>

Re: Continuously iterate over documents in index

Posted by Max Lynch <ih...@gmail.com>.
> You could have a field within each doc, say "Processed", storing a
> Yes/No value; on the next run, a searcher query should give you the
> collection of unprocessed ones.
>

That sounds like a reasonable idea, and I just realized that I could have
done that in a way specific to my application.  However, I already tried
doing something with a MatchAllDocsQuery with a custom collector and sort by
date.  I store the last date and time of a doc I processed and process only
newer ones.
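
For reference, what I tried was roughly along these lines, here written with a
plain sorted search instead of the custom collector (untested sketch; the
"created" field and the lastProcessed string, a DateTools value saved from the
previous pass, are placeholders):

    Query all = new MatchAllDocsQuery();
    Sort byDate = new Sort(new SortField("created", SortField.STRING, true)); // newest first
    TopDocs top = searcher.search(all, 1000, byDate);
    for (ScoreDoc sd : top.scoreDocs) {
        Document d = searcher.doc(sd.doc);
        String created = d.get("created");
        if (created != null && created.compareTo(lastProcessed) <= 0) {
            break; // everything from here on has already been processed
        }
        // ... run the NLP step on d ...
    }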

Re: Continuously iterate over documents in index

Posted by Shashi Kant <sk...@sloan.mit.edu>.
On Tue, Jul 13, 2010 at 5:17 PM, Max Lynch <ih...@gmail.com> wrote:
> Hi,
> I would like to continuously iterate over the documents in my lucene index
> as the index is updated.  Kind of like a "stream" of documents.  Is there a
> way I can achieve this?
>
> Would something like this be sufficient (untested):
>
>  int currentDocId = 0;
>  while(true) {
>
>     for(; currentDocId < reader.maxDoc(); currentDocId++) {
>
>          if(!reader.isDeleted(currentDocId)) {
>               Document d = reader.document(currentDocId);
>          }
>     }
>
>     // Maybe sleep here or something
>
>     IndexReader newReader = reader.reopen();
>     if(newReader != reader) {
>          reader.close();
>          reader = newReader;
>     }
> }


Looks ok.

>
> Right now, I do some NLP on the index that would slow down my indexing if
> done at the same time, which is why I'm looking for a solution that works
> in the background like this.  Another concern I have is that starting from
> scratch (a fresh invocation of my program) requires me to load a lot of extra
> data and then iterate through hundreds of thousands of documents just to get
> to the newest docs that I haven't processed yet.  I would rather just start
> from the newest doc and go forward.
>
> I am currently checking whether or not I've processed a Document by looking
> up one of its fields in MongoDB, but is there a way I could reliably use the
> document's id from the reader to check whether I've looked at it already?
> I've heard that IndexReader.document() is slow, so I would like to skip that
> call if I know I've processed the document already.


You could have a field within each doc, say "Processed", storing a
Yes/No value; on the next run, a searcher query should give you the
collection of unprocessed ones.
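
Something roughly like this, assuming each doc also carries a unique "id" field
(names are illustrative, untested):

    // at index time: mark every new doc as unprocessed
    doc.add(new Field("processed", "no", Field.Store.YES, Field.Index.NOT_ANALYZED));

    // in the background pass: fetch only the unprocessed ones
    Query unprocessed = new TermQuery(new Term("processed", "no"));
    TopDocs hits = searcher.search(unprocessed, 1000);

    // after processing: re-add the doc with the flag flipped; Lucene cannot
    // update a single field in place, so the whole document is replaced
    doc.removeField("processed");
    doc.add(new Field("processed", "yes", Field.Store.YES, Field.Index.NOT_ANALYZED));
    writer.updateDocument(new Term("id", doc.get("id")), doc);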
