Posted to java-user@lucene.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/02/01 11:42:59 UTC
Re: Group by in Lucene ?

Yep, you are correct. This is a lousy implementation, which I knew when I
wrote it.

I'm not interested in the entire document, just the grouping term and the
docId it is connected to.

So how do I get hold of the TermDocs for the grouping field?

I mean I probably first need to perform the query, searcher.search(...),
which would give me a set of doc ids. Then I need to group them all by, for
instance, "ip-address", save each ip-address in another set, and in the end
calculate the size of that set.

i.e. the equivalent of: select count(distinct(ipAddress)) from AccessLog where
date='2009-01-25' (optionally group by ipAddress?)
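
To make the steps above concrete, here is a minimal sketch in plain Java, with no Lucene calls. It assumes the field values are already available as an array indexed by doc id (roughly what a field cache would hand back) and that the query has produced a list of matching doc ids. The class name, method name and sample data are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: groups matching doc ids by a per-document field value. */
class GroupByField {

    /**
     * @param valueByDoc hypothetical array: valueByDoc[docId] is the grouping
     *                   value of that document, or null if it has none
     * @param hitDocIds  doc ids matched by the query
     * @return grouping value -> number of matching documents with that value
     */
    static Map<String, Integer> groupCounts(String[] valueByDoc, int[] hitDocIds) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int docId : hitDocIds) {
            String value = valueByDoc[docId];
            if (value == null) {
                continue; // document has no value for the grouping field
            }
            Integer previous = counts.get(value);
            counts.put(value, previous == null ? 1 : previous + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] ipByDoc = {"10.0.0.1", "10.0.0.2", "10.0.0.1", null, "10.0.0.3"};
        int[] hits = {0, 2, 3, 4};
        Map<String, Integer> counts = groupCounts(ipByDoc, hits);
        // The size of the map is the count(distinct ...) value.
        System.out.println("count(distinct ip) = " + counts.size());
        System.out.println("hits for 10.0.0.1 = " + counts.get("10.0.0.1"));
    }
}
```

Keying the map on the string itself, rather than on its hashCode(), avoids undercounting when two different values happen to share a hash.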


//Marcus





On Wed, Jan 28, 2009 at 3:02 PM, Erick Erickson <er...@gmail.com> wrote:

> At a quick glance, this line is really suspicious:
>
> Document document = this.indexReader.document(doc)
>
> From the Javadoc for HitCollector.collect:
>
> Note: This is called in an inner search loop. For good search performance,
> implementations of this method should not call Searcher.doc(int) or
> IndexReader.document(int) on every document number encountered. Doing so
> can slow searches by an order of magnitude or more.
>
> You're loading the document each time through the loop. I think you'd get
> much better performance by making sure that your groupField is indexed,
> then using TermDocs (TermEnum?) to get the value of the field.
>
> Best
> Erick
>
>
>
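
Erick's suggestion above, walking the terms of the grouping field instead of loading each document, can be sketched roughly as follows. This is plain Java, not Lucene code: the map of term to doc ids is a hypothetical stand-in for what enumerating the field's terms and their postings would yield (in Lucene 2.x, I believe that means IndexReader.terms(...) and IndexReader.termDocs(...)), and the BitSet stands for the doc ids a collector recorded for the query. All names and data are made up for illustration.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: counts the terms that occur in at least one hit. */
class TermWalkDistinctCount {

    /**
     * @param postings hypothetical stand-in for the index: term -> doc ids
     *                 containing that term in the grouping field
     * @param hits     doc ids matched by the query
     * @return number of distinct terms found among the matching documents
     */
    static int countDistinct(Map<String, int[]> postings, BitSet hits) {
        int distinct = 0;
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                if (hits.get(docId)) {
                    distinct++; // this term has at least one matching document
                    break;      // no need to scan the rest of its postings
                }
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<String, int[]>();
        postings.put("10.0.0.1", new int[] {0, 3});
        postings.put("10.0.0.2", new int[] {1});
        postings.put("10.0.0.3", new int[] {2, 4});

        BitSet hits = new BitSet();
        hits.set(0);
        hits.set(2);
        hits.set(3);

        System.out.println(countDistinct(postings, hits)); // prints 2
    }
}
```

The cost here depends on the number of terms and postings in the field, not on stored-document loads, which is the point of the suggestion. Whether it beats a per-document lookup depends on how many distinct terms the field has.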
> On Wed, Jan 28, 2009 at 6:43 AM, Marcus Herou <marcus.herou@tailsweep.com> wrote:
>
> > Hi.
> >
> > I think this is way too slow, since what you are describing is something I
> > already tested. However, I might be using the HitCollector badly.
> >
> > Please prove me wrong. Here is the code I tested this with.
> > It stores a hash of the value of the term in a TIntHashSet and just
> > calculates the size of that set.
> > This one takes approx. 3 sec on about 0.5M rows, which is way too slow.
> >
> >
> > main test class:
> > public class GroupingTest
> > {
> >    protected static final Log log = LogFactory.getLog(GroupingTest.class.getName());
> >    static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
> >    public static void main(String[] args)
> >    {
> >        Utils.initLogger();
> >        String[] fields = {"uid","ip","date","siteId","visits","countryCode"};
> >        try
> >        {
> >            IndexFactory fact = new IndexFactory();
> >            String d = "/tmp/csvtest";
> >            fact.initDir(d);
> >            IndexReader reader = fact.getReader(d);
> >            IndexSearcher searcher = fact.getSearcher(d, reader);
> >            QueryParser parser = new MultiFieldQueryParser(fields, fact.getAnalyzer());
> >            Query q = parser.parse("date:20090125");
> >
> >
> >            GroupingHitCollector coll = new GroupingHitCollector();
> >            coll.setDistinct(true);
> >            coll.setGroupField("uid");
> >            coll.setIndexReader(reader);
> >            long start = System.currentTimeMillis();
> >            searcher.search(q, coll);
> >            long stop = System.currentTimeMillis();
> >            System.out.println("Time: " + (stop-start)
> >                + ", distinct count(uid): " + coll.getDistinctCount()
> >                + ", count(uid): " + coll.getCount());
> >        }
> >        catch (Exception e)
> >        {
> >            log.error(e.toString(), e);
> >        }
> >    }
> > }
> >
> >
> > public class GroupingHitCollector extends HitCollector
> > {
> >    protected IndexReader indexReader;
> >    protected String groupField;
> >    protected boolean distinct;
> >    //protected TLongHashSet set;
> >    protected TIntHashSet set;
> >    protected int distinctSize;
> >
> >    int count = 0;
> >    int sum = 0;
> >
> >    public GroupingHitCollector()
> >    {
> >        set = new TIntHashSet();
> >    }
> >
> >    public String getGroupField()
> >    {
> >        return groupField;
> >    }
> >
> >    public void setGroupField(String groupField)
> >    {
> >        this.groupField = groupField;
> >    }
> >
> >    public IndexReader getIndexReader()
> >    {
> >        return indexReader;
> >    }
> >
> >    public void setIndexReader(IndexReader indexReader)
> >    {
> >        this.indexReader = indexReader;
> >    }
> >
> >    public boolean isDistinct()
> >    {
> >        return distinct;
> >    }
> >
> >    public void setDistinct(boolean distinct)
> >    {
> >        this.distinct = distinct;
> >    }
> >
> >    public void collect(int doc, float score)
> >    {
> >        if(distinct)
> >        {
> >            try
> >            {
> >                Document document = this.indexReader.document(doc);
> >                if(document != null)
> >                {
> >                    String s = document.get(groupField);
> >                    if(s != null)
> >                    {
> >                        set.add(s.hashCode());
> >                        //set.add(Crc64.generate(s));
> >                    }
> >                }
> >            }
> >            catch (IOException e)
> >            {
> >                e.printStackTrace();
> >            }
> >        }
> >        count++;
> >        // use it to avoid any possibility of being optimized away
> >        sum += doc;
> >    }
> >
> >    public int getCount() { return count; }
> >    public int getSum() { return sum; }
> >
> >    public int getDistinctCount()
> >    {
> >        distinctSize = set.size();
> >        return distinctSize;
> >    }
> > }
> >
> >
> > On Wed, Jan 28, 2009 at 10:51 AM, ninaS <ni...@gmx.de> wrote:
> >
> > >
> > > By the way: if you only need to count documents (count groups),
> > > HitCollector is a good choice. If you only count, you don't need to sort
> > > anything.
> > >
> > >
> > > ninaS wrote:
> > > >
> > > > Hello,
> > > >
> > > > yes, I tried HitCollector, but I am not satisfied with it because you
> > > > cannot use sorting with HitCollector unless you implement a way to use
> > > > TopFieldDocCollector. I did not manage to do that in a performant way.
> > > >
> > > > It is easier to first do a normal search and "group by" afterwards:
> > > >
> > > > Iterate through the result documents and take one of each group. Each
> > > > document has a groupingKey. I remember which groupingKeys are already
> > > > used and don't take another document of this group into the result list.
> > > >
> > > > Regards,
> > > > Nina
> > > >
> > >
> > >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/