Posted to java-user@lucene.apache.org by Charlie Hubbard <ch...@gmail.com> on 2011/09/16 18:30:50 UTC

Extracting all documents for a given search

I'm trying to reimplement a feature I had under 2.x in 3.x.  I have a
feature where a zip file for all of the documents returned by a search can
be exported.  Now with the newer APIs you have to put an upper limit on the
search so it won't return more than X documents.  I'd like to extract all of
the documents matched by that search.  I've been trying to understand how
Collectors work, but I'm not sure I see the connection.  If I wanted to walk
over each document that matches the search and save the contents to the zip
how would that best be done?

Thanks
Charlie

Re: Extracting all documents for a given search

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.

On Sat, 2011-09-17 at 03:57 +0200, Charlie Hubbard wrote:
>  I really just want to be called back when a new document is found by the
> searcher, and I can load the Document, find my object, and drop that to a
> file.  I thought that's essentially what a Collector is, being an interface
> that is called back whenever it encounters a Document that matches a query.

That is correct. It is a simple API so implementing your own
ZIPCollector seems to be the best solution. Partly copied from Trejkaz:

    class ZIPCollector extends Collector {
        public void collect(int doc) {
            addZIPContent(resolveDocumentContent(doc));
        }
    }

    ZIPCollector collector = new ZIPCollector();
    getSearcher().search(query, filter, collector);
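As a rough illustration of the streaming side of such a collector, the sketch below writes one ZIP entry per hit as soon as the hit is reported, so no list of documents is ever built up. It deliberately avoids Lucene so that it runs on its own: HitCallback, ContentResolver and ZipHitWriter are hypothetical names invented here, with HitCallback standing in for Collector.collect(int) and ContentResolver for whatever resolveDocumentContent would do. Only the java.util.zip calls are real APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Streams each matching document into a ZIP as soon as it is reported. */
public class ZipExportSketch {

    /** Hypothetical stand-in for the per-hit callback; not a Lucene type. */
    interface HitCallback {
        void collect(int doc) throws IOException;
    }

    /** Hypothetical stand-in for loading a document's content by id. */
    interface ContentResolver {
        byte[] resolve(int doc) throws IOException;
    }

    /** Writes one ZIP entry per hit, so only one document is held at a time. */
    static final class ZipHitWriter implements HitCallback, AutoCloseable {
        private final ZipOutputStream zip;
        private final ContentResolver resolver;

        ZipHitWriter(OutputStream out, ContentResolver resolver) {
            this.zip = new ZipOutputStream(out);
            this.resolver = resolver;
        }

        @Override
        public void collect(int doc) throws IOException {
            // Entry name is an assumption; any unique name per document works.
            zip.putNextEntry(new ZipEntry("doc-" + doc + ".txt"));
            zip.write(resolver.resolve(doc));
            zip.closeEntry();
        }

        @Override
        public void close() throws IOException {
            zip.close(); // finishes the ZIP central directory
        }
    }

    /** Exports the given hits; the in-memory buffer is only for the demo. */
    static byte[] export(int[] hits) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipHitWriter writer = new ZipHitWriter(buffer,
                doc -> ("content of " + doc).getBytes(StandardCharsets.UTF_8))) {
            for (int doc : hits) {
                writer.collect(doc); // in Lucene, the searcher would drive this
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(export(new int[] {3, 7, 42}).length + " bytes written");
    }
}
```

In a real export the OutputStream would presumably be a FileOutputStream, and the ZIP must be closed after the search returns, since the collector itself has no end-of-search callback.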



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Extracting all documents for a given search

Posted by Charlie Hubbard <ch...@gmail.com>.

Ah because that will easily toss an out of memory exception.  Besides I
already tried it.  I don't want a huge array holding all of those documents.
 I really just want to be called back when a new document is found by the
searcher, and I can load the Document, find my object, and drop that to a
file.  I thought that's essentially what a Collector is, being an interface
that is called back whenever it encounters a Document that matches a query.
 Any elaboration on that?

Charlie

On Fri, Sep 16, 2011 at 2:30 PM, Eddie Drapkin <ed...@wolfram.com> wrote:

> On 9/16/2011 11:30 AM, Charlie Hubbard wrote:
>
>> I'm trying to reimplement a feature I had under 2.x in 3.x.  I have a
>> feature where a zip file for all of the documents returned by a search can
>> be exported.  Now with the newer APIs you have to put an upper limit on the
>> search so it won't return more than X documents.  I'd like to extract all of
>> the documents matched by that search.  I've been trying to understand how
>> Collectors work, but I'm not sure I see the connection.  If I wanted to walk
>> over each document that matches the search and save the contents to the zip
>> how would that best be done?
>>
>> Thanks
>> Charlie
>>
>>
> Why not just get Integer.MAX_VALUE records from the newer APIs?
>
> Thanks,
> Eddie
>

Re: Extracting all documents for a given search

Posted by Eddie Drapkin <ed...@wolfram.com>.

On 9/16/2011 11:30 AM, Charlie Hubbard wrote:
> I'm trying to reimplement a feature I had under 2.x in 3.x.  I have a
> feature where a zip file for all of the documents returned by a search can
> be exported.  Now with the newer APIs you have to put an upper limit on the
> search so it won't return more than X documents.  I'd like to extract all of
> the documents matched by that search.  I've been trying to understand how
> Collectors work, but I'm not sure I see the connection.  If I wanted to walk
> over each document that matches the search and save the contents to the zip
> how would that best be done?
>
> Thanks
> Charlie
>

Why not just get Integer.MAX_VALUE records from the newer APIs?

Thanks,
Eddie


RE: Extracting all documents for a given search

Posted by Uwe Schindler <uw...@thetaphi.de>.

> Ahem, sorry. I quoted an old answer of mine, but HitCollector has been gone
> for a while now...
> This is the modern version:
>
> final ArrayList<Document> docs = new ArrayList<Document>();
> searcher.search( query, new Collector() {
>    private int docBase;
>
>    // ignore scorer
>    public void setScorer(Scorer scorer) {
>    }
>
>    // accept docs out of order (for a BitSet it doesn't matter)
>    public boolean acceptsDocsOutOfOrder() {
>      return true;
>    }
>
>    public void collect(int doc) {
>      docs.add(searcher.doc(doc + docBase));
>    }
>
>    public void setNextReader(IndexReader reader, int docBase) {
>      this.docBase = docBase;
>    }
>
> });

This code surely works, but it looks a little strange: you use the docBase
to calculate a top-level document id, which IndexSearcher then has to drill
back down to the matching IndexReader.

The more native approach is to ignore the docBase and instead ask the
IndexReader passed in by setNextReader for the document directly:

final ArrayList<Document> docs = new ArrayList<Document>();
searcher.search( query, new Collector() {
   private IndexReader reader;
 
   public void setScorer(Scorer scorer) {
   }

   public boolean acceptsDocsOutOfOrder() {
     return true;
   }

   public void collect(int doc) {
     docs.add(reader.document(doc));
   }

   public void setNextReader(IndexReader reader, int docBase) {
     this.reader = reader;
   }

});
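
To make the round trip concrete, here is a small self-contained model of the id arithmetic. It is only an illustration under assumptions, not Lucene's implementation: DocBaseSketch, toTopLevel and segmentOf are hypothetical names, and the three segment start offsets are made up. The point is that adding docBase is cheap, but going from a top-level id back to a segment needs a lookup, which the per-segment reader.document(doc) call avoids.

```java
import java.util.Arrays;

/** Models how segment-relative doc ids relate to top-level ids via docBase. */
public class DocBaseSketch {

    /** Converts a segment-relative id to a top-level id. */
    static int toTopLevel(int docBase, int segmentDoc) {
        return docBase + segmentDoc;
    }

    /**
     * Finds the index of the segment that contains a top-level id.
     * Assumes docBases is sorted ascending, starts at 0 and has no duplicates.
     */
    static int segmentOf(int[] docBases, int topLevelDoc) {
        int pos = Arrays.binarySearch(docBases, topLevelDoc);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the owning
        // segment is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    public static void main(String[] args) {
        int[] docBases = {0, 100, 250};               // three hypothetical segments
        int topLevel = toTopLevel(docBases[1], 5);    // what collect() computes
        int segment = segmentOf(docBases, topLevel);  // the lookup on the way back
        int local = topLevel - docBases[segment];     // back to the original id
        // prints "105 -> segment 1, local id 5"
        System.out.println(topLevel + " -> segment " + segment + ", local id " + local);
    }
}
```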



Re: Extracting all documents for a given search

Posted by Israel Tsadok <it...@gmail.com>.

Ahem, sorry. I quoted an old answer of mine, but HitCollector has been gone
for a while now...
This is the modern version:

final ArrayList<Document> docs = new ArrayList<Document>();
searcher.search( query, new Collector() {
   private int docBase;

   // ignore scorer
   public void setScorer(Scorer scorer) {
   }

   // accept docs out of order (for a BitSet it doesn't matter)
   public boolean acceptsDocsOutOfOrder() {
     return true;
   }

   public void collect(int doc) {
     docs.add(searcher.doc(doc + docBase));
   }

   public void setNextReader(IndexReader reader, int docBase) {
     this.docBase = docBase;
   }

});

Re: Extracting all documents for a given search

Posted by Israel Tsadok <it...@gmail.com>.

If you just want to fetch all the matching documents for a given query,
implement a collector that just saves the document data.

final ArrayList<Document> docs = new ArrayList<Document>();
searcher.search( query, new HitCollector() {
    public void collect(int doc, float score) {
        docs.add(searcher.doc(doc));
    }
});


See also
http://stackoverflow.com/questions/973354/migrating-from-hit-hits-to-topdocs-topdoccollector

RE: Extracting all documents for a given search

Posted by Uwe Schindler <uw...@thetaphi.de>.

In recent Lucene versions there is an implementation of the mentioned collector to count hits, so there is no need to implement it:
http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/TotalHitCountCollector.html

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Monday, September 19, 2011 8:10 AM
> To: java-user@lucene.apache.org
> Subject: Re: Extracting all documents for a given search
> 
> On Mon, Sep 19, 2011 at 3:50 AM, Charlie Hubbard
> <ch...@gmail.com> wrote:
> > Here was the prior API I was calling:
> >
> >        Hits hits = getSearcher().search( query, filter, sort );
> >
> > The new API:
> >
> >        TopDocs hits = getSearcher().search( query, filter, startDoc + length, sort );
> >
> > So the question is what new API can I use that allows me to extract
> > all documents matching the query, sort, and filter in an efficient way?
> 
> How I do this:
> 
>     // 1. Figure out how many results there will be.
>     class CountCollector extends Collector {
>         int count;
>         public void collect(int doc) {
>             count++;
>         }
>         // ... other empty methods ...
>     }
>     CountCollector collector = new CountCollector();
>     getSearcher().search(query, filter, collector);
>     int hitCount = collector.count;
> 
>     // 2. Actually do the query.
>     hits = getSearcher().search(query, filter, hitCount, sort);
> 
> It is a bit unfortunate that there is no equivalent to TopDocs which can grow
> dynamically, but this way is still going to be faster than Hits was, for larger
> result sets.
> 
> TX
> 

Re: Extracting all documents for a given search

Posted by Trejkaz <tr...@trypticon.org>.

On Mon, Sep 19, 2011 at 3:50 AM, Charlie Hubbard
<ch...@gmail.com> wrote:
> Here was the prior API I was calling:
>
>        Hits hits = getSearcher().search( query, filter, sort );
>
> The new API:
>
>        TopDocs hits = getSearcher().search( query, filter, startDoc + length, sort );
>
> So the question is what new API can I use that allows me to extract all
> documents matching the query, sort, and filter in an efficient way?

How I do this:

    // 1. Figure out how many results there will be.
    class CountCollector extends Collector {
        int count;
        public void collect(int doc) {
            count++;
        }
        // ... other empty methods ...
    }
    CountCollector collector = new CountCollector();
    getSearcher().search(query, filter, collector);
    int hitCount = collector.count;

    // 2. Actually do the query.
    hits = getSearcher().search(query, filter, hitCount, sort);

It is a bit unfortunate that there is no equivalent to TopDocs which
can grow dynamically, but this way is still going to be faster than
Hits was, for larger result sets.

TX


Re: Extracting all documents for a given search

Posted by Charlie Hubbard <ch...@gmail.com>.

Here was the prior API I was calling:

        Hits hits = getSearcher().search( query, filter, sort );

The new API:

        TopDocs hits = getSearcher().search( query, filter, startDoc + length, sort );

So the question is what new API can I use that allows me to extract all
documents matching the query, sort, and filter in an efficient way?

Charlie

On Sun, Sep 18, 2011 at 1:01 AM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : I'm trying to reimplement a feature I had under 2.x in 3.x.  I have a
> : feature where a zip file for all of the documents returned by a search can
> : be exported.  Now with the newer APIs you have to put an upper limit on the
>
> if you start by explaining what API you were using before, people can
> better understand what exactly it is you were doing to offer you
> suggestions on how to achieve the same results with newer APIs
>
> (for that matter: the deprecations listed on the old APIs you were using
> before should give you the same info)
>
>
> -Hoss
>

Re: Extracting all documents for a given search

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm trying to reimplement a feature I had under 2.x in 3.x.  I have a
: feature where a zip file for all of the documents returned by a search can
: be exported.  Now with the newer APIs you have to put an upper limit on the

if you start by explaining what API you were using before, people can
better understand what exactly it is you were doing to offer you
suggestions on how to achieve the same results with newer APIs

(for that matter: the deprecations listed on the old APIs you were using 
before should give you the same info)


-Hoss
