
Posted to solr-user@lucene.apache.org by Maria Muslea <ma...@gmail.com> on 2017/04/19 00:16:03 UTC

prefix facet performance

Hi,

I have ~40K documents in SOLR (not many) and a multivalued facet field that
contains at least 2K values per document.

The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and
I use facet.prefix.

q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/


with "concept" defined as:


<field name="concept" type="string" indexed="true" multiValued="true"/>


This generates the output that I am looking for, but it takes more than 10
seconds per query.


Is there any way that I could improve the facet query performance for this
example?


Thank you,

Maria

Re: prefix facet performance

Posted by Yonik Seeley <ys...@gmail.com>.

In SimpleFacets.getFacetTermEnumCounts, we seek to the first term
matching the prefix using the index and then for each term after
compare the prefix until it no longer matches.

-Yonik
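
The seek-then-scan behaviour described above can be sketched in a few lines of Python. This is only an illustration over an in-memory sorted list, not the code in SimpleFacets.getFacetTermEnumCounts; the function name terms_with_prefix and the sample terms are made up for the example.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Yield the terms that start with `prefix` from an already sorted list."""
    # Seek: binary search for the first term that sorts on or after the prefix.
    i = bisect_left(sorted_terms, prefix)
    # Scan: walk forward and stop at the first term that no longer matches.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        yield sorted_terms[i]
        i += 1


# Hypothetical sample terms, shaped like the values in the original question.
terms = sorted(["A/B", "A/C", "A/D", "C/E", "M/F"])
print(list(terms_with_prefix(terms, "A/")))  # ['A/B', 'A/C', 'A/D']
```

The scan touches only the matching terms plus the one term that ends the run, so its cost follows the number of terms under the prefix rather than the size of the whole term list.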


On Mon, Apr 24, 2017 at 5:04 AM, alessandro.benedetti
<a....@sease.io> wrote:
> Thanks Yonik and Maria.
> It makes sense: if we reduce the number of terms, term enum becomes a very
> good solution.
> @Yonik: do we still check the prefix against the term dictionary one term at
> a time, or is an FST used to identify the set of candidate terms?
>
> I will check the code later,
>
> Regards
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331553.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: prefix facet performance

Posted by "alessandro.benedetti" <a....@sease.io>.

Thanks Yonik and Maria.
It makes sense: if we reduce the number of terms, term enum becomes a very
good solution.
@Yonik: do we still check the prefix against the term dictionary one term at
a time, or is an FST used to identify the set of candidate terms?

I will check the code later,

Regards



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331553.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: prefix facet performance

Posted by Maria Muslea <ma...@gmail.com>.

I see. Once I specify a prefix the number of terms is MUCH smaller.

Thank you again for all your help.

Maria

On Fri, Apr 21, 2017 at 1:46 PM, Yonik Seeley <ys...@gmail.com> wrote:

> On Fri, Apr 21, 2017 at 4:25 PM, Maria Muslea <ma...@gmail.com>
> wrote:
> > The field is:
> >
> > <field name="concept" type="string" indexed="true" multiValued="true"/>
> >
> > and using unique() I found that it has 700K+ unique values.
> >
> > The query before (that takes ~10s):
> >
> > wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
> >
> > the query after (that is almost instant):
> >
> > wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum
>
> Ah, the fact that you specify a facet.prefix makes this perfectly
> aligned for the "enum" method, which can skip directly to the first
> term on-or-after "A/".
> facet.method=enum goes term-by-term, calculating the intersection with
> the facet domain.
> In this case, it's the number of terms that start with "A/" that
> matters, not the number of terms in the entire field (hence the
> speedup).
>
> -Yonik
>

Re: prefix facet performance

Posted by Yonik Seeley <ys...@gmail.com>.

On Fri, Apr 21, 2017 at 4:25 PM, Maria Muslea <ma...@gmail.com> wrote:
> The field is:
>
> <field name="concept" type="string" indexed="true" multiValued="true"/>
>
> and using unique() I found that it has 700K+ unique values.
>
> The query before (that takes ~10s):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
> the query after (that is almost instant):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum

Ah, the fact that you specify a facet.prefix makes this perfectly
aligned for the "enum" method, which can skip directly to the first
term on-or-after "A/".
facet.method=enum goes term-by-term, calculating the intersection with
the facet domain.
In this case, it's the number of terms that start with "A/" that
matters, not the number of terms in the entire field (hence the
speedup).

-Yonik
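
As a rough illustration of the term-by-term counting described above, here is a toy Python version. It assumes an in-memory map from each term to the set of document ids containing it and a set of ids for the facet domain; the names enum_prefix_facet, postings and domain are invented for the sketch, and real Solr works against the Lucene index rather than Python sets.

```python
from bisect import bisect_left


def enum_prefix_facet(postings, domain, prefix):
    """Count, for each term under `prefix`, the documents in `domain` that contain it.

    postings: dict mapping term -> set of doc ids (toy stand-in for posting lists)
    domain:   set of doc ids matching the query (the facet domain)
    """
    # A real index keeps its terms sorted; the toy version sorts on each call.
    terms = sorted(postings)
    counts = {}
    # Skip directly to the first term on or after the prefix.
    i = bisect_left(terms, prefix)
    # Visit only the terms under the prefix, one intersection per term.
    while i < len(terms) and terms[i].startswith(prefix):
        counts[terms[i]] = len(postings[terms[i]] & domain)
        i += 1
    return counts


# Hypothetical data: four terms over three documents.
postings = {"A/B": {1, 2}, "A/C": {2}, "C/E": {1, 3}, "M/F": {3}}
domain = {1, 2, 3}
print(enum_prefix_facet(postings, domain, "A/"))  # {'A/B': 2, 'A/C': 1}
```

The loop performs one intersection per term under the prefix, which matches the point that the number of "A/" terms is what matters, not the number of terms in the field.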

Re: prefix facet performance

Posted by Maria Muslea <ma...@gmail.com>.

The field is:

<field name="concept" type="string" indexed="true" multiValued="true"/>

and using unique() I found that it has 700K+ unique values.

The query before (that takes ~10s):

wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/

the query after (that is almost instant):

wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum

Maria
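
For anyone who wants to compare the two requests from a script, the following Python sketch builds both query strings. The base URL (host, port and the core name "mycore") is a placeholder, not something given in this thread, and prefix_facet_url is a helper name made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def prefix_facet_url(base_url, field, prefix, method=None):
    """Build a select URL for a prefix facet, optionally forcing facet.method."""
    params = [
        ("wt", "json"),
        ("indent", "true"),
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.prefix", prefix),
    ]
    if method is not None:
        params.append(("facet.method", method))
    # urlencode percent-escapes characters such as '*', ':' and '/'.
    return base_url + "?" + urlencode(params)


# Placeholder endpoint: adjust host, port and core name for your installation.
BASE = "http://localhost:8983/solr/mycore/select"
before = prefix_facet_url(BASE, "concept", "A/")
after = prefix_facet_url(BASE, "concept", "A/", method="enum")

# Round-trip check: decoding the query string gives back the raw values.
print(parse_qs(urlsplit(after).query)["facet.method"])  # ['enum']
```

The only difference between the two URLs is the extra facet.method=enum parameter.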

On Fri, Apr 21, 2017 at 8:59 AM, alessandro.benedetti <a....@sease.io>
wrote:

> That is quite interesting!
> You can use the stats module (in association with the JSON facets if you
> need it) to calculate an accurate approximation of the unique values [1] [2].
>
> Good to know it improved your scenario; I may need to update my knowledge of
> term enum internals!
> Can you describe your schema configuration for the field and the way you
> were faceting before, in comparison to the way you facet now (with the
> related benefit)?
>
> [1] https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> [2] http://yonik.com/solr-count-distinct/
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331309.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: prefix facet performance

Posted by "alessandro.benedetti" <a....@sease.io>.

That is quite interesting!
You can use the stats module (in association with the JSON facets if you
need it) to calculate an accurate approximation of the unique values [1] [2].

Good to know it improved your scenario; I may need to update my knowledge of
term enum internals!
Can you describe your schema configuration for the field and the way you
were faceting before, in comparison to the way you facet now (with the
related benefit)?

[1] https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
[2] http://yonik.com/solr-count-distinct/
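
A possible way to ask for the distinct count through the JSON Facet API is sketched below in Python. It only builds the request parameters; the result key "distinct_concepts" is an arbitrary label chosen for the example, and distinct_count_params is a made-up helper name. unique() is the function mentioned elsewhere in this thread, and hll() is the HyperLogLog-based estimate covered in [2].

```python
import json


def distinct_count_params(field, approximate=False):
    """Build request parameters asking for the number of distinct values in `field`."""
    # unique(field) counts distinct values; hll(field) returns an estimate instead.
    func = "hll" if approximate else "unique"
    facet = {"distinct_concepts": "%s(%s)" % (func, field)}
    return {"q": "*:*", "rows": "0", "json.facet": json.dumps(facet)}


params = distinct_count_params("concept")
print(params["json.facet"])  # {"distinct_concepts": "unique(concept)"}
```

These parameters can be sent to the same select handler used for the facet queries; the count should come back under the chosen label in the "facets" section of the response.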



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331309.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: prefix facet performance

Posted by Maria Muslea <ma...@gmail.com>.

Actually using facet.method=enum made a HUGE difference even in my case
where I have many unique values. I am happy with the query response time
now.

Is there a way in SOLR to count the unique values for a field? If not, I
could run the reindexing and count the unique values while I add them to
give you a more accurate count of how many I have (there is a good chance
that I have more than 500K).

Thanks,
Maria

On Fri, Apr 21, 2017 at 1:16 AM, alessandro.benedetti <a....@sease.io>
wrote:

> Hi Maria,
> If you have 100,000-500,000 unique values for the field you are interested in,
> and the cardinality of your search results is actually quite small in
> comparison, I am not sure term enum will help you that much...
>
> To simplify: with the term enum approach, you iterate over each unique
> value and, if it matches the prefix, you count the intersection of the
> result set with the posting list for that term.
> In your case, your result set is likely to be much smaller than the number
> of unique values.
> I would assume you are using the fc approach, which in my opinion was not a
> bad idea.
> Let's start from the algorithm you are using and the schema config for your
> field.
>
> Cheers
>
>
>
> -----
> ---------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331221.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: prefix facet performance

Posted by "alessandro.benedetti" <a....@sease.io>.

Hi Maria,
If you have 100,000-500,000 unique values for the field you are interested in,
and the cardinality of your search results is actually quite small in
comparison, I am not sure term enum will help you that much...

To simplify: with the term enum approach, you iterate over each unique
value and, if it matches the prefix, you count the intersection of the
result set with the posting list for that term.
In your case, your result set is likely to be much smaller than the number
of unique values.
I would assume you are using the fc approach, which in my opinion was not a
bad idea.
Let's start from the algorithm you are using and the schema config for your
field.

Cheers
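
To make the trade-off above concrete, here is a toy model in Python of the other strategy: walking the documents in the result set and counting their values. It is a simplification for illustration only, not how Solr's fc method is implemented, and per_document_prefix_facet and doc_values are invented names. Its work grows with the number of (document, value) pairs in the result set, whereas the term enum approach grows with the number of terms under the prefix.

```python
from collections import Counter


def per_document_prefix_facet(doc_values, domain, prefix):
    """Count prefix matches by visiting every value of every document in `domain`.

    Returns (counts, visited), where `visited` is the number of values inspected.
    """
    counts = Counter()
    visited = 0
    for doc_id in domain:
        for term in doc_values[doc_id]:
            visited += 1
            if term.startswith(prefix):
                counts[term] += 1
    return dict(counts), visited


# Hypothetical data: three documents with two values each.
doc_values = {1: ["A/B", "C/E"], 2: ["A/B", "A/C"], 3: ["C/E", "M/F"]}
counts, visited = per_document_prefix_facet(doc_values, {1, 2, 3}, "A/")
print(counts, visited)
```

With the figures from the original question, q=*:* matches all ~40K documents and each has at least 2K values, so a walk of this kind would inspect at least 40,000 x 2,000 = 80,000,000 values, however few of them start with "A/".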



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331221.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: prefix facet performance

Posted by Maria Muslea <ma...@gmail.com>.

Hmmm, not sure. Probably in the range of 100K-500K.

Before writing the email I was just looking at:
http://yonik.com/facet-performance/

Wow, using facet.method=enum makes a big difference. I will read on it to
understand what it does.

Thank you so much.

Maria

On Tue, Apr 18, 2017 at 5:21 PM, Yonik Seeley <ys...@gmail.com> wrote:

> How many unique values in the index?
> You could try facet.method=enum
>
> -Yonik
>
>
> On Tue, Apr 18, 2017 at 8:16 PM, Maria Muslea <ma...@gmail.com>
> wrote:
> > Hi,
> >
> > I have ~40K documents in SOLR (not many) and a multivalued facet field that
> > contains at least 2K values per document.
> >
> > The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and
> > I use facet.prefix.
> >
> > q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
> >
> >
> > with "concept" defined as:
> >
> >
> > <field name="concept" type="string" indexed="true" multiValued="true"/>
> >
> >
> > This generates the output that I am looking for, but it takes more than 10
> > seconds per query.
> >
> >
> > Is there any way that I could improve the facet query performance for this
> > example?
> >
> >
> > Thank you,
> >
> > Maria
>

Re: prefix facet performance

Posted by Yonik Seeley <ys...@gmail.com>.

How many unique values in the index?
You could try facet.method=enum

-Yonik


On Tue, Apr 18, 2017 at 8:16 PM, Maria Muslea <ma...@gmail.com> wrote:
> Hi,
>
> I have ~40K documents in SOLR (not many) and a multivalued facet field that
> contains at least 2K values per document.
>
> The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and
> I use facet.prefix.
>
> q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
>
> with "concept" defined as:
>
>
> <field name="concept" type="string" indexed="true" multiValued="true"/>
>
>
> This generates the output that I am looking for, but it takes more than 10
> seconds per query.
>
>
> Is there any way that I could improve the facet query performance for this
> example?
>
>
> Thank you,
>
> Maria