Posted to java-user@lucene.apache.org by Robust Links <pe...@robustlinks.com> on 2015/04/29 16:41:32 UTC

custom collector

Hi

I need help porting my Lucene code from 4 to 5. In particular, I need to
customize a collector (to collect all doc IDs in the index, which can be
>30MM docs). Below is how I achieved this in Lucene 4. Are there any
guidelines on how to do this in Lucene 5, especially regarding the semantic
changes around AtomicReaderContext (which seems to be gone) and the new
LeafReaderContext?

thank you in advance


import java.io.IOException;
import java.util.HashSet;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.BytesRef;

public class CustomCollector extends Collector {

  private final HashSet<String> data = new HashSet<String>();
  private Scorer scorer;
  private int docBase;
  private BinaryDocValues dataList;

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  @Override
  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  @Override
  public void setNextReader(AtomicReaderContext ctx) throws IOException {
    this.docBase = ctx.docBase;
    // uninverted "title" values for this segment
    dataList = FieldCache.DEFAULT.getTerms(ctx.reader(), "title", false);
  }

  @Override
  public void collect(int doc) throws IOException {
    BytesRef t = new BytesRef();
    dataList.get(doc, t);
    if (t.bytes != BytesRef.EMPTY_BYTES) {
      data.add(t.utf8ToString());
    }
  }

  public void reset() {
    data.clear();
    dataList = null;
  }

  public HashSet<String> getData() {
    return data;
  }
}

Re: custom collector

Posted by Robust Links <pe...@robustlinks.com>.
Hi Erick

The index I am searching is a Lucene index. I am trying to perform some
operations over ALL the documents in that index. I could rebuild it as a
Solr index and then use the export functionality. Up to now I've been using
the Lucene IndexSearcher with a custom collector. Would the code below be
correct if I want to continue down the Lucene path?

thank you Erick

import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.SimpleCollector;

import com.google.common.collect.HashBiMap;

public class DocIDCollector extends SimpleCollector {

    // maps index-wide doc id -> numeric "id" doc value
    private final HashBiMap<Integer, Long> idSet = HashBiMap.create();

    private Scorer scorer;
    private NumericDocValues ids;
    private int docBase;

    @Override
    public boolean needsScores() {
        // scores are not used here; acceptsDocsOutOfOrder() is gone in 5.x
        return false;
    }

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
        docBase = context.docBase; // needed to turn per-segment doc ids into index-wide ids
        ids = DocValues.getNumeric(context.reader(), "id");
    }

    @Override
    public void collect(int doc) throws IOException {
        long wid = ids.get(doc);
        idSet.put(docBase + doc, wid);
    }

    public void reset() {
        idSet.clear();
    }

    public HashBiMap<Integer, Long> getWikiIds() {
        return idSet;
    }
}
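
For reference, this is roughly how I run the collector over the whole index
(just a sketch; the index path and searcher setup here are placeholders):

import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.FSDirectory;

// Open the index and visit every live document with the collector above.
DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
IndexSearcher searcher = new IndexSearcher(reader);

DocIDCollector collector = new DocIDCollector();
searcher.search(new MatchAllDocsQuery(), collector); // MatchAllDocsQuery matches every live doc

System.out.println("collected " + collector.getWikiIds().size() + " ids");
reader.close();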


Re: custom collector

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, it's not clear to me whether you're using Solr or not, but if you
are, have you considered using the export functionality? It is already
built to stream large result sets back to the client, and lately (5.1)
you can combine it with "streaming aggregation" to do some pretty cool
stuff.
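
If it helps, a request against the export handler looks (I believe) something
like http://localhost:8983/solr/<collection>/export?q=*:*&sort=id+asc&fl=id
where the collection and field names are placeholders, and the fields you
export and sort on have to be docValues fields.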

Not sure it applies in your situation as you didn't state the use-case
but thought I'd at least mention it.

Best,
Erick




Re: custom collector

Posted by Robust Links <pe...@robustlinks.com>.
Hi West

thank you for the help. I will try your suggestion.

thank you again

Peyman


Re: custom collector

Posted by west suhanic <we...@gmail.com>.
Hi Robust Links:

I think you want to build a class that implements the LeafCollector.
For example:

public class theLeafCollectorDocid implements LeafCollector
{
        private final int docBase;

        theLeafCollectorDocid( final LeafReaderContext context )
        {
                this.docBase = context.docBase;
        }

        @Override
        public void setScorer( Scorer scorer ) throws IOException
        {
        }

        @Override
        public void collect( int doc ) throws IOException
        {
                // docBase + doc is the index-wide doc id to record
        }
}

Once you have done this, build another class that implements Collector.
For example:

public class docCollectorKeyDocid implements Collector
{
        @Override
        public LeafCollector getLeafCollector( final LeafReaderContext context ) throws IOException
        {
                final LeafCollector tlc = new theLeafCollectorDocid( context );
                return tlc;
        }

        @Override
        public boolean needsScores()
        {
                return false;
        }
}

This will, I believe, allow you to realize your goal.
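
To make that a bit more concrete, here is a minimal sketch along those lines
(the class and field names are made up for illustration). The LeafCollector
keeps the segment's docBase so that the per-segment doc id passed to
collect() can be turned into an index-wide doc id:

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorer;

public class AllDocIdsCollector implements Collector
{
        // one bit per index-wide doc id
        private final BitSet docIds;

        public AllDocIdsCollector( final int maxDoc )
        {
                this.docIds = new BitSet( maxDoc );
        }

        @Override
        public LeafCollector getLeafCollector( final LeafReaderContext context ) throws IOException
        {
                final int docBase = context.docBase; // first index-wide doc id of this segment
                return new LeafCollector()
                {
                        @Override
                        public void setScorer( Scorer scorer ) throws IOException
                        {
                                // scores are not needed when only doc ids are collected
                        }

                        @Override
                        public void collect( int doc ) throws IOException
                        {
                                docIds.set( docBase + doc ); // record the index-wide doc id
                        }
                };
        }

        @Override
        public boolean needsScores()
        {
                return false;
        }

        public BitSet getDocIds()
        {
                return docIds;
        }
}

You would then pass it to IndexSearcher.search() with a MatchAllDocsQuery,
the same way as any other Collector.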

regards,

west suhanic

