Posted to solr-user@lucene.apache.org by Alexey Kozhemiakin <Al...@epam.com> on 2014/02/10 17:29:26 UTC

Facet optimization for facet.method=enum and "exists" case

Dear All,

Background:
We have a dataset containing hundreds of millions of records. We facet on dozens of fields, many of them with facet excludes, and each field has a relatively small number of unique values, on the order of thousands.
Before executing a search, our users work with an "advanced search" form. The goal is to populate dozens of filters with only those values that are compatible with the values already selected, so this is basically a use case for facets with mincount=1 where the actual counts are not needed.
Our performance tests showed that facet.method=enum works much better than fc or fcs, probably because of our specific ratio of docset size to unique term count. For example, the average query execution time was 1500ms with fc, 2600ms with fcs, and 280ms with enum. Profiling indicated that the majority of the time for enum was spent intersecting docsets.
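
For illustration, a request of the kind described above might be assembled roughly as follows. This is only a sketch: the field names "color" and "brand" and the class name FacetRequestSketch are hypothetical, facet.limit=-1 is an assumption about wanting every value back, and the facet excludes mentioned above are left out for brevity.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the query string for a "values only" facet request.
final class FacetRequestSketch {

    static String facetQuery(String mainQuery, List<String> facetFields) {
        List<String> params = new ArrayList<>();
        add(params, "q", mainQuery);
        add(params, "rows", "0");            // no documents needed, only facet values
        add(params, "facet", "true");
        add(params, "facet.method", "enum"); // the method that performed best in the tests above
        add(params, "facet.mincount", "1");  // keep only values that match at least one document
        add(params, "facet.limit", "-1");    // assumption: return all values, not just the top N
        for (String field : facetFields) {
            add(params, "facet.field", field);
        }
        return String.join("&", params);
    }

    private static void add(List<String> params, String name, String value) {
        params.add(name + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "color" and "brand" are made-up field names.
        System.out.println(facetQuery("*:*", List.of("color", "brand")));
    }
}
```

The counts returned by such a request are simply ignored by the client; only the list of values per field is used to populate the filters.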

So...
We've implemented a patch that adds an extension to facet calculation for method=enum. Basically, it uses docSetA.intersects(docSetB) instead of docSetA.intersectionSize(docSetB), since we only need to know whether a match exists, not how many there are.
As a result, we were able to reduce our average query time from 280ms to 60ms.
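
The difference between the two calls can be sketched outside of Solr. The example below is not the patch itself: it uses java.util.BitSet as a stand-in for a Solr DocSet, and the class name EnumExistsSketch, the helper bits(), and the sample terms are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch: BitSet plays the role of a DocSet (one bit per document id).
final class EnumExistsSketch {

    // Counting variant: computes the full intersection size for every term.
    static Map<String, Integer> countFacets(Map<String, BitSet> termDocs, BitSet base) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : termDocs.entrySet()) {
            BitSet both = (BitSet) e.getValue().clone();
            both.and(base);                      // materialize the intersection
            int size = both.cardinality();       // then count every matching document
            if (size >= 1) {                     // mincount=1
                counts.put(e.getKey(), size);
            }
        }
        return counts;
    }

    // Existence variant: only asks whether the two sets share any document,
    // so the check can stop at the first match and allocates nothing.
    static List<String> existsFacets(Map<String, BitSet> termDocs, BitSet base) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, BitSet> e : termDocs.entrySet()) {
            if (e.getValue().intersects(base)) {
                terms.add(e.getKey());
            }
        }
        return terms;
    }

    // Convenience helper for building a set of document ids.
    static BitSet bits(int... docs) {
        BitSet b = new BitSet();
        for (int d : docs) {
            b.set(d);
        }
        return b;
    }

    public static void main(String[] args) {
        Map<String, BitSet> termDocs = new LinkedHashMap<>();
        termDocs.put("red", bits(0, 1, 2, 3));
        termDocs.put("green", bits(4, 5));
        termDocs.put("blue", bits(2, 6));
        BitSet base = bits(1, 2, 3, 6);          // documents matching the main query

        System.out.println(countFacets(termDocs, base));  // {red=3, blue=2}
        System.out.println(existsFacets(termDocs, base)); // [red, blue]
    }
}
```

Both variants return the same set of terms; the existence variant just skips the work of counting, which is where the profiling above showed most of the time going.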

What would you suggest as a name for such a parameter?
For now we call it "facet.enum.exists", but I'm not sure it's a good name.
Once this is settled, I'll create a JIRA issue and attach the patch for review. Is anybody willing to review and commit it?

Thanks,

Alexey

RE: Facet optimization for facet.method=enum and "exists" case

Posted by Alexey Kozhemiakin <Al...@epam.com>.

Hi Annette, 

You can find the initial version of the patch attached to https://issues.apache.org/jira/browse/SOLR-5725

I'd be happy to hear what performance improvement you see on your setup. Let me know if you need help patching your version of Solr.

--
Alexey 

-----Original Message-----
From: Annette Newton [mailto:annette.newton@servicetick.com] 
Sent: Thursday, February 13, 2014 13:46
To: solr-user@lucene.apache.org
Subject: Re: Facet optimization for facet.method=enum and "exists" case

Hi Alexey,

I would be very interested in your progress with this.  Your use case seems to match ours: we found enum to be much quicker than fc, particularly for multivalued fields.  We found that fc caused memory issues and frequently caused us to lose nodes.  Like you, we have no interest in the counts and just need a distinct list of values.

Thanks.

Netty Newton.


On 10 February 2014 19:30, Erick Erickson <er...@gmail.com> wrote:

> Alexey:
>
> There's no need to wait to create a JIRA! It's perfectly reasonable to 
> create it and attach a patch before it's completely polished. People 
> often include a note when posting the patch like "for review, not 
> ready for commit". Also, including comments in the code like 
> //nocommit will cause it to fail the "ant precommit" step. This is 
> often useful to get other eyeballs on the code early.
>
> But it's up to you.
>
> Best,
> Erick



-- 

Annette Newton

Database Administrator

ServiceTick Ltd



T:+44(0)1603 618326



Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ

www.servicetick.com

*www.sessioncam.com <http://www.sessioncam.com>*

--
*This message is confidential and is intended to be read solely by the addressee. The contents should not be disclosed to any other person or copies taken unless authorised to do so. If you are not the intended recipient, please notify the sender and permanently delete this message. As Internet communications are not secure ServiceTick accepts neither legal responsibility for the contents of this message nor responsibility for any change made to this message after it was forwarded by the original author.*

Re: Facet optimization for facet.method=enum and "exists" case

Posted by Annette Newton <an...@servicetick.com>.

Hi Alexey,

I would be very interested in your progress with this.  Your use case seems
to match ours: we found enum to be much quicker than fc, particularly for
multivalued fields.  We found that fc caused memory issues and frequently
caused us to lose nodes.  Like you, we have no interest in the counts and
just need a distinct list of values.

Thanks.

Netty Newton.



Re: Facet optimization for facet.method=enum and "exists" case

Posted by Erick Erickson <er...@gmail.com>.

Alexey:

There's no need to wait to create a JIRA! It's perfectly reasonable to
create it and attach a patch before it's completely polished. People often
include a note when posting the patch like "for review, not ready for
commit". Also, including comments in the code like
//nocommit
will cause it to fail the "ant precommit" step. This is often useful to get
other eyeballs on the code early.

But it's up to you.

Best,
Erick

