Posted to java-user@lucene.apache.org by Marcus Herou <ma...@tailsweep.com> on 2009/02/01 11:42:59 UTC
Re: Group by in Lucene ?
Yep, you are correct. This is a lousy implementation, which I knew when I
wrote it.
I'm not interested in the entire document, just the grouping term and the
docId it is connected to.
So how do I get hold of the TermDocs for the grouping field?
I mean, I probably first need to perform the query, searcher.search(...),
which would give me a set of doc ids. Then I need to group them all by, for
instance, "ip-address", save each ip-address in another set, and in the end
calculate the size of that set,
i.e. the equivalent of: select count(distinct(ipAddress)) from AccessLog where
date='2009-01-25' (optionally group by ipAddress?)
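
To make the counting step concrete, here is a rough sketch in plain Java
with no Lucene calls at all. It assumes the grouping field has already been
turned into a per-document table (docToOrd, one int per doc id, -1 when the
doc has no value), which is roughly what walking the field's terms and
their doc ids would produce. The class and method names
(DistinctCountSketch, buildDocToOrdinal, countDistinct) and the sample data
are made up for illustration, and the int[] of matching doc ids stands in
for whatever the collector receives.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Illustrative only: plain-Java stand-ins, not Lucene API calls.
final class DistinctCountSketch {

    // postingsPerTerm.get(ord) holds the doc ids that contain term number ord
    // of the grouping field (assumed single-valued per document).
    static int[] buildDocToOrdinal(int maxDoc, List<int[]> postingsPerTerm) {
        int[] docToOrd = new int[maxDoc];
        Arrays.fill(docToOrd, -1); // -1 means the doc has no value for the field
        for (int ord = 0; ord < postingsPerTerm.size(); ord++) {
            for (int doc : postingsPerTerm.get(ord)) {
                docToOrd[doc] = ord;
            }
        }
        return docToOrd;
    }

    // Counts the distinct group values among the matching docs.
    static int countDistinct(int[] docToOrd, int[] matchingDocs, int termCount) {
        BitSet seen = new BitSet(termCount);
        for (int doc : matchingDocs) {
            int ord = docToOrd[doc];
            if (ord >= 0) {
                seen.set(ord);
            }
        }
        return seen.cardinality();
    }

    public static void main(String[] args) {
        // Three distinct ip values over five docs; doc 2 has no ip at all.
        List<int[]> postings = Arrays.asList(
            new int[] {0, 3},  // ord 0
            new int[] {1},     // ord 1
            new int[] {4});    // ord 2
        int[] docToOrd = buildDocToOrdinal(5, postings);
        int[] hits = {0, 1, 2, 3}; // doc ids matched by the query
        System.out.println(countDistinct(docToOrd, hits, postings.size())); // prints 2
    }
}
```

With a table like this, the per-hit work is an array lookup and a bit set
update, so no document has to be loaded inside the collect loop.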
//Marcus
On Wed, Jan 28, 2009 at 3:02 PM, Erick Erickson <er...@gmail.com> wrote:
> At a quick glance, this line is really suspicious:
>
> Document document = this.indexReader.document(doc)
>
> From the Javadoc for HitCollector.collect:
>
> Note: This is called in an inner search loop. For good search performance,
> implementations of this method should not call Searcher.doc(int) or
> IndexReader.document(int) on every document number encountered. Doing so
> can slow searches by an order of magnitude or more.
>
> You're loading the document each time through the loop. I think you'd get
> much better performance by making sure that your groupField is indexed,
> then using TermDocs (TermEnum?) to get the value of the field.
>
> Best
> Erick
>
>
>
> On Wed, Jan 28, 2009 at 6:43 AM, Marcus Herou <marcus.herou@tailsweep.com> wrote:
>
> > Hi.
> >
> > I think this is way too slow, since what you are describing is something
> > I have already tested. However, I might be using the HitCollector badly.
> >
> > Please prove me wrong. Here is the code I tested with.
> > It stores a hash of the value of the term in a TIntHashSet and just
> > calculates the size of that set.
> > This one takes approx. 3 sec on about 0.5M rows, which is way too slow.
> >
> >
> > main test class:
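> > import java.text.DateFormat;
> > import java.text.SimpleDateFormat;
> >
> > import org.apache.commons.logging.Log;
> > import org.apache.commons.logging.LogFactory;
> > import org.apache.lucene.index.IndexReader;
> > import org.apache.lucene.queryParser.MultiFieldQueryParser;
> > import org.apache.lucene.queryParser.QueryParser;
> > import org.apache.lucene.search.IndexSearcher;
> > import org.apache.lucene.search.Query;
> >
> > // Utils and IndexFactory are helper classes that are not shown in this thread.
> > public class GroupingTest
> > {
> >     protected static final Log log =
> >         LogFactory.getLog(GroupingTest.class.getName());
> >     static DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
> >
> >     public static void main(String[] args)
> >     {
> >         Utils.initLogger();
> >         String[] fields =
> >             {"uid","ip","date","siteId","visits","countryCode"};
> >         try
> >         {
> >             IndexFactory fact = new IndexFactory();
> >             String d = "/tmp/csvtest";
> >             fact.initDir(d);
> >             IndexReader reader = fact.getReader(d);
> >             IndexSearcher searcher = fact.getSearcher(d, reader);
> >             QueryParser parser = new MultiFieldQueryParser(fields,
> >                 fact.getAnalyzer());
> >             Query q = parser.parse("date:20090125");
> >
> >             // Collect every hit and track the distinct values of "uid".
> >             GroupingHitCollector coll = new GroupingHitCollector();
> >             coll.setDistinct(true);
> >             coll.setGroupField("uid");
> >             coll.setIndexReader(reader);
> >
> >             // Time only the search itself.
> >             long start = System.currentTimeMillis();
> >             searcher.search(q, coll);
> >             long stop = System.currentTimeMillis();
> >             System.out.println("Time: " + (stop-start)
> >                 + ", distinct count(uid):" + coll.getDistinctCount()
> >                 + ", count(uid): " + coll.getCount());
> >         }
> >         catch (Exception e)
> >         {
> >             log.error(e.toString(), e);
> >         }
> >     }
> > }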
> >
> >
> > import java.io.IOException;
> >
> > import gnu.trove.TIntHashSet; // Trove 2.x package name (assumed version)
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.index.IndexReader;
> > import org.apache.lucene.search.HitCollector;
> >
> > public class GroupingHitCollector extends HitCollector
> > {
> >     protected IndexReader indexReader;
> >     protected String groupField;
> >     protected boolean distinct;
> >     //protected TLongHashSet set;
> >     protected TIntHashSet set;
> >     protected int distinctSize;
> >
> >     int count = 0;
> >     int sum = 0;
> >
> >     public GroupingHitCollector()
> >     {
> >         set = new TIntHashSet();
> >     }
> >
> >     public String getGroupField()
> >     {
> >         return groupField;
> >     }
> >
> >     public void setGroupField(String groupField)
> >     {
> >         this.groupField = groupField;
> >     }
> >
> >     public IndexReader getIndexReader()
> >     {
> >         return indexReader;
> >     }
> >
> >     public void setIndexReader(IndexReader indexReader)
> >     {
> >         this.indexReader = indexReader;
> >     }
> >
> >     public boolean isDistinct()
> >     {
> >         return distinct;
> >     }
> >
> >     public void setDistinct(boolean distinct)
> >     {
> >         this.distinct = distinct;
> >     }
> >
> >     public void collect(int doc, float score)
> >     {
> >         if(distinct)
> >         {
> >             try
> >             {
> >                 // Loads the whole stored document for every hit.
> >                 Document document = this.indexReader.document(doc);
> >                 if(document != null)
> >                 {
> >                     String s = document.get(groupField);
> >                     if(s != null)
> >                     {
> >                         // Only the hash is kept, so two values whose
> >                         // hashes collide are counted as one.
> >                         set.add(s.hashCode());
> >                         //set.add(Crc64.generate(s));
> >                     }
> >                 }
> >             }
> >             catch (IOException e)
> >             {
> >                 e.printStackTrace();
> >             }
> >         }
> >         count++;
> >         // use it to avoid any possibility of being optimized away
> >         sum += doc;
> >     }
> >
> >     public int getCount() { return count; }
> >     public int getSum() { return sum; }
> >
> >     public int getDistinctCount()
> >     {
> >         distinctSize = set.size();
> >         return distinctSize;
> >     }
> > }
> >
> >
> > On Wed, Jan 28, 2009 at 10:51 AM, ninaS <ni...@gmx.de> wrote:
> >
> > >
> > > By the way: if you only need to count documents (count groups),
> > > HitCollector is a good choice. If you only count, you don't need to
> > > sort anything.
> > >
> > >
> > > ninaS wrote:
> > > >
> > > > Hello,
> > > >
> > > > Yes, I tried HitCollector, but I am not satisfied with it because you
> > > > cannot use sorting with HitCollector unless you implement a way to
> > > > use TopFieldDocCollector. I did not manage to do that in a
> > > > performant way.
> > > >
> > > > It is easier to first do a normal search and "group by" afterwards:
> > > >
> > > > Iterate through the result documents and take one of each group. Each
> > > > document has a groupingKey. I remember which groupingKey is already
> > > > used and don't take another document of this group into the result
> > > > list.
> > > >
> > > > Regards,
> > > > Nina
> > > >
> > >
> > > --
> > > View this message in context:
> > > http://www.nabble.com/Group-by-in-Lucene---tp13581760p21702742.html
> > > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Marcus Herou CTO and co-founder Tailsweep AB
> > +46702561312
> > marcus.herou@tailsweep.com
> > http://www.tailsweep.com/
> > http://blogg.tailsweep.com/
> >
>
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/