Posted to solr-user@lucene.apache.org by Andrew Nagy <an...@villanova.edu> on 2006/12/07 21:23:47 UTC

Facet Performance

In September there was a thread [1] on this list about heterogeneous
facets and their performance.  I am having a similar issue and am
unclear as to the resolution of that thread.

I performed a search against my dataset (492,000 records) and got the
results I am looking for in 0.3 seconds.  I then set facet to true and
got results in 16 seconds.  The facets also appear to include data that
is not in my result set but comes from the entire set.  How do I limit
the faceting to my result set and speed up the results?

Thanks!
Andrew

[1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg00955.html

Re: Facet Performance

Posted by Andrew Nagy <an...@villanova.edu>.
Erik Hatcher wrote:

> On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
>
>> My data is 492,000 records of book data.  I am faceting on 4  fields: 
>> author, subject, language, format.
>> Format and language are fairly simple as there are only a few  unique 
>> terms.  Author and subject however are much different in  that there 
>> are thousands of unique terms.
>
>
> When encountering difficult issues, I like to think in terms of the  
> user interface.  Surely you're not presenting 400k+ authors to the  
> users in one shot.  In Collex, we have put an AJAX drop-down that  
> shows the author facet (we call it name on the UI, with various roles  
> like author, painter, etc).  You can see this in action here:

In our data, we don't have unique authors for each record ... so let's 
say out of the 500,000 records ... we have 200,000 authors.  What I am 
trying to display is the top 10 authors from the results of a search.  
So I do a search for title:"Gone with the wind" and I would like to see 
the top 10 matching authors from these results.
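
As a rough illustration of the "top 10 authors from the results" idea,
here is a minimal Python sketch that counts a multivalued field over a
result set that has already been fetched.  It is not the custom facet
handler mentioned in this message (that code was not posted); the
function name, the document layout, and the sample records are all
assumptions made for the example.

```python
from collections import Counter


def top_facet_values(docs, field, limit=10):
    """Return the `limit` most frequent values of `field` across `docs`."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]  # treat a single-valued field like a one-item list
        # Count each distinct value once per document.
        counts.update(set(values))
    return counts.most_common(limit)


# Hypothetical result set for a title search; not real catalogue data.
results = [
    {"title": "Gone with the Wind", "author": ["Mitchell, Margaret"]},
    {"title": "Gone with the Wind: The Screenplay",
     "author": ["Howard, Sidney", "Mitchell, Margaret"]},
    {"title": "The Filming of Gone with the Wind", "author": ["Bridges, Herb"]},
]

print(top_facet_values(results, "author", limit=10))
```

Counting only over the matched documents keeps the work proportional to
the size of the result set rather than to the number of distinct
authors in the whole index.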

But no worries, I have written my own facet handler and I am now back to 
under a second with faceting!

Thanks for everyone's help and keep up the good work!

Andrew

Re: Facet Performance

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
> My data is 492,000 records of book data.  I am faceting on 4  
> fields: author, subject, language, format.
> Format and language are fairly simple as there are only a few  
> unique terms.  Author and subject however are much different in  
> that there are thousands of unique terms.

When encountering difficult issues, I like to think in terms of the  
user interface.  Surely you're not presenting 400k+ authors to the  
users in one shot.  In Collex, we have put an AJAX drop-down that  
shows the author facet (we call it name on the UI, with various roles  
like author, painter, etc).  You can see this in action here:

	http://www.nines.org/collex

Type "da" into the name field, for example.  I developed a custom request  
handler in Solr for returning these types of suggest interfaces  
complete with facet counts.  My code is very specific to our fields,  
so it's not usable in a general sense, but maybe this gives you some  
ideas on where to go with these large sets of facet values.

	Erik


Re: Facet Performance

Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, J.J. Larrea <jj...@panix.com> wrote:
> Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique:  If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method.

If anyone has time, some of this could be documented here:
http://wiki.apache.org/solr/SimpleFacetParameters
The wiki is open to all.

Or perhaps a new top level FacetedSearching page that references
SimpleFacetParameters

-Yonik

Re: Facet Performance

Posted by Funtick <fu...@efendi.ca>.
Hoss,

This is still an extremely interesting area for possible improvements; I simply
don't want the topic to die 
http://www.nabble.com/Facet-Performance-td7746964.html

http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669

I am currently using faceting on a single-valued _tokenized_ field with a huge
number of documents and an _unsynchronized_ version of FIFOCache; the average
response time is 1.5 seconds (for faceted queries only!)

I think we can use an additional cache for facet results (to store calculated
values!); Lucene's FieldCache can be used only for non-tokenized,
single-valued, non-boolean fields

-Fuad



hossman_lucene wrote:
> 
> 
> : Unfortunately which strategy will be chosen is currently undocumented
> : and control is a bit oblique:  If the field is tokenized or multivalued
> : or Boolean, the FilterQuery method will be used; otherwise the
> : FieldCache method.  I expect I or others will improve that shortly.
> 
> Bear in mind, what's provided out of the box is "SimpleFacets" ... it's
> designed to meet simple faceting needs ... when you start talking about
> 100s or thousands of constraints per facet, you are getting outside the
> scope of what it was intended to serve efficiently.
> 
> At a certain point the only practical thing to do is write a custom
> request handler that makes the best choices for your data.
> 
> For the record: a really simple patch someone could submit would be to
> add an optional field-based param indicating which type of faceting
> (termenum/fieldcache) should be used to generate the list of terms and
> then make SimpleFacets.getFacetFieldCounts use that and call the
> appropriate method instead of calling getTermCounts -- that way you could
> force one or the other if you know it's better for your data/query.
> 
> 
> 
> -Hoss
> 
> 
> 



Re: Facet Performance

Posted by Chris Hostetter <ho...@fucit.org>.
: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique:  If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method.  I expect I or others will improve that shortly.

Bear in mind, what's provided out of the box is "SimpleFacets" ... it's
designed to meet simple faceting needs ... when you start talking about
100s or thousands of constraints per facet, you are getting outside the
scope of what it was intended to serve efficiently.

At a certain point the only practical thing to do is write a custom
request handler that makes the best choices for your data.

For the record: a really simple patch someone could submit would be to
add an optional field-based param indicating which type of faceting
(termenum/fieldcache) should be used to generate the list of terms and
then make SimpleFacets.getFacetFieldCounts use that and call the
appropriate method instead of calling getTermCounts -- that way you could
force one or the other if you know it's better for your data/query.



-Hoss


Re: Facet Performance

Posted by Andrew Nagy <an...@villanova.edu>.
J.J. Larrea wrote:

>Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique:  If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method.  I expect I or others will improve that shortly.
>  
>
Good to hear, cause I can't really get away with not having a 
multi-valued field for author.

I'm really excited by Solr and really impressed so far.

Thanks!
Andrew

Re: Facet Performance

Posted by "J.J. Larrea" <jj...@panix.com>.
Andrew Nagy, ditto on what Yonik said.  Here is some further elaboration:

I am doing much the same thing (faceting on Author etc.). When my Author field was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it wasn't actually tokenized, the faceting code chose the QueryFilter approach, and faceting on Author for 100k+ documents took about 4 seconds.

When I changed the field to "string" e.g. solr.StrField, the faceting code recognized it as untokenized and used the FieldCache approach.  Times have dropped to about 120ms for the first query (when the FieldCache is generated) and < 10ms for subsequent queries returning a few thousand results.  Quite a difference.

The strategy must be chosen on a field-by-field basis.  While QueryFilter is excellent for fields with a small set of enumerated values such as Language or Format, it is inappropriate for large value sets such as Author.

Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique:  If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method.  I expect I or others will improve that shortly.
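
The selection rule described above can be written down as a tiny
decision function.  This Python sketch only restates the rule as given
in this message; it is not Solr's code, and the function name, the
returned labels, and the example field definitions are assumptions.

```python
def choose_facet_method(tokenized, multivalued, boolean):
    """Pick a faceting strategy following the rule described in this thread."""
    if tokenized or multivalued or boolean:
        # One cached filter per term in the field.
        return "filter"
    # Single-valued, untokenized, non-boolean field.
    return "fieldcache"


# Hypothetical field definitions, for illustration only.
fields = {
    "author_text": dict(tokenized=True, multivalued=False, boolean=False),
    "author_multi": dict(tokenized=False, multivalued=True, boolean=False),
    "format": dict(tokenized=False, multivalued=False, boolean=False),
}

for name, props in fields.items():
    print(name, "->", choose_facet_method(**props))
```

Under this rule a multivalued author field lands on the filter-based
path even when it is declared as "string", so its filter cache has to
hold one entry per distinct author.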

- J.J.

At 2:58 PM -0500 12/8/06, Yonik Seeley wrote:
>Right, if any of these are tokenized, then you could make them
>non-tokenized (use "string" type).  If they really need to be
>tokenized (author for example), then you could use copyField to make
>another copy to a non-tokenized field that you can use for faceting.
>
>After that, as Hoss suggests, run a single faceting query with all 4
>fields and look at the filterCache statistics.  Take the "lookups"
>number and multiply it by, say, 1.5 to leave some room for future
>growth, and use that as your cache size.  You probably want to bump up
>both initialSize and autowarmCount as well.
>
>The first query will still be slow.  The second should be relatively fast.
>You may hit an OOM error.  Increase the JVM heap size if this happens.
>
>-Yonik


Re: Facet Performance

Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, Chris Hostetter <ho...@fucit.org> wrote:
> : My data is 492,000 records of book data.  I am faceting on 4 fields:
> : author, subject, language, format.
> : Format and language are fairly simple as there are only a few unique
> : terms.  Author and subject however are much different in that there are
> : thousands of unique terms.
>
> by the looks of it, you have a lot more than a few thousand unique terms
> in those two fields ... are you tokenizing on these fields?  that's
> probably not what you want for fields you're going to facet on.

Right, if any of these are tokenized, then you could make them
non-tokenized (use "string" type).  If they really need to be
tokenized (author for example), then you could use copyField to make
another copy to a non-tokenized field that you can use for faceting.

After that, as Hoss suggests, run a single faceting query with all 4
fields and look at the filterCache statistics.  Take the "lookups"
number and multiply it by, say, 1.5 to leave some room for future
growth, and use that as your cache size.  You probably want to bump up
both initialSize and autowarmCount as well.
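
The sizing rule of thumb above is simple arithmetic.  A minimal Python
sketch follows; the function name and the headroom parameter are
assumptions, and the input is the "lookups" figure reported elsewhere
in this thread.

```python
import math


def suggested_filter_cache_size(lookups, headroom=1.5):
    """Scale the observed lookups count to leave room for growth."""
    return math.ceil(lookups * headroom)


# "lookups" value taken from the filterCache statistics posted in this thread.
print(suggested_filter_cache_size(1530036))
```

The result is only a starting point for the filterCache size setting;
as noted in this message, a cache that large may require more JVM heap.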

The first query will still be slow.  The second should be relatively fast.
You may hit an OOM error.  Increase the JVM heap size if this happens.

-Yonik

Re: solr-42

Posted by Yonik Seeley <yo...@apache.org>.
On 1/4/07, mirko@sas.upenn.edu <mi...@sas.upenn.edu> wrote:
>  If the HTMLStripReader would simply replace
> the HTML with spaces (same length as the removed HTML part) then the positions
> for the highlighter would be correct.  And most of the Tokenizers would
> be happy with this solution (except maybe the KeywordTokenizer).

Good idea Mirko, that's probably a much easier fix than the one I envisioned.

-Yonik

solr-42

Posted by mi...@sas.upenn.edu.
Hi,

I was wondering whether the highlighting problems with
HTMLStripWhitespaceTokenizerFactory (see
http://issues.apache.org/jira/browse/SOLR-42) could be resolved in
the following simple way.

The HTMLStripWhitespaceTokenizerFactory basically passes the input
through an HTMLStripReader, which removes the HTML, and then passes the
result to the WhitespaceTokenizer.  If the HTMLStripReader would simply replace
the HTML with spaces (same length as the removed HTML part) then the positions
for the highlighter would be correct.  And most of the Tokenizers would
be happy with this solution (except maybe the KeywordTokenizer).

mirko
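
To make the proposal concrete, here is a small Python sketch of the
"replace markup with the same number of spaces" idea.  It is not the
HTMLStripReader (which is Java code in Solr); the regular expression is
a deliberately naive stand-in that ignores entities, comments, and
attribute values containing ">", and the function name is an
assumption.

```python
import re

# Naive tag matcher, for illustration only.
TAG = re.compile(r"<[^>]*>")


def blank_out_tags(html):
    """Replace every tag with spaces of equal length so offsets are preserved."""
    return TAG.sub(lambda m: " " * len(m.group(0)), html)


html = "<p>Gone with the <b>Wind</b></p>"
stripped = blank_out_tags(html)

# Same length, and each word keeps its original character offset.
print(len(stripped) == len(html))
print(stripped.index("Wind") == html.index("Wind"))
print(stripped.split())
```

Because the text keeps its length, offsets computed by a whitespace
tokenizer on the stripped text still point at the right characters in
the original markup, which is what the highlighter needs.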

Re: Facet Performance

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:

> Are they multivalued, and do they need to be?
> Anything that is of type "string" and not multivalued will use the
> lucene FieldCache rather than the filterCache.

The author field is multivalued.  Will this be a strong performance issue?

I could make multiple author fields so as not to have the multivalued field 
and then only facet on the first author.

Thanks
Andrew



Re: Facet Performance

Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, Andrew Nagy <an...@villanova.edu> wrote:
> Chris Hostetter wrote:
>
> >: Could you suggest a better configuration based on this?
> >
> >If that's what your stats look like after a single request, then i would
> >guess you would need to make your cache size at least 1.6 million in order
> >for it to be of any use in improving your facet speed.
> >
> >
> Would this have any strong impacts on my system?  Should I just set it
> to an even 2 million to allow for growth?

Change the following in solrconfig.xml, and you should be fine with a
higher setting.
<useFilterForSortedQuery>true</useFilterForSortedQuery>
to
<useFilterForSortedQuery>false</useFilterForSortedQuery>

That will prevent the filterCache from being used for anything but
filters and faceting, so if you set it too high, it won't be utilized
anyway.

> >: My data is 492,000 records of book data.  I am faceting on 4 fields:
> >: author, subject, language, format.
> >: Format and language are fairly simple as there are only a few unique
> >: terms.  Author and subject however are much different in that there are
> >: thousands of unique terms.
> >
> >by the looks of it, you have a lot more than a few thousand unique terms
> >in those two fields ... are you tokenizing on these fields?  that's
> >probably not what you want for fields you're going to facet on.
> >
> >
> All of these fields are set as "string" in my schema

Are they multivalued, and do they need to be?
Anything that is of type "string" and not multivalued will use the
lucene FieldCache rather than the filterCache.

-Yonik

Re: Facet Performance

Posted by Andrew Nagy <an...@villanova.edu>.
Chris Hostetter wrote:

>: Could you suggest a better configuration based on this?
>
>If that's what your stats look like after a single request, then i would
>guess you would need to make your cache size at least 1.6 million in order
>for it to be of any use in improving your facet speed.
>  
>
Would this have any strong impacts on my system?  Should I just set it 
to an even 2 million to allow for growth?

>: My data is 492,000 records of book data.  I am faceting on 4 fields:
>: author, subject, language, format.
>: Format and language are fairly simple as there are only a few unique
>: terms.  Author and subject however are much different in that there are
>: thousands of unique terms.
>
>by the looks of it, you have a lot more than a few thousand unique terms
>in those two fields ... are you tokenizing on these fields?  that's
>probably not what you want for fields you're going to facet on.
>  
>
All of these fields are set as "string" in my schema, so if I understand 
the fields correctly, they are not being tokenized.  I also have an 
author field that is set as "text" for searching.

Thanks
Andrew

Re: Facet Performance

Posted by Chris Hostetter <ho...@fucit.org>.
: Here are the stats, I'm still a newbie to Solr, so I'm not totally sure
: what this all means:
: lookups : 1530036
: hits : 2
: hitratio : 0.00
: inserts : 1530035
: evictions : 1504435
: size : 25600

those numbers are telling you that your cache is capable of holding 25,600
items.  you have attempted to lookup something in the cache 1,530,036
times, and only 2 of those times did you get a hit.  you have
added 1,530,035 items to the cache, and 1,504,435 items have been removed
from your cache to make room for newer items.

in short: your cache isn't really helping you at all.
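
The reading of the statistics above can be double-checked with a few
lines of arithmetic.  This Python sketch uses the numbers quoted in
this message; the function name and the dictionary layout are
assumptions.

```python
def summarize_cache(stats):
    """Derive the hit ratio and the resident entry count from raw counters."""
    hit_ratio = stats["hits"] / stats["lookups"]
    # Entries still in the cache = everything inserted minus everything evicted.
    resident = stats["inserts"] - stats["evictions"]
    return {"hit_ratio": hit_ratio, "resident": resident}


# Counters as quoted in this thread.
stats = {
    "lookups": 1530036,
    "hits": 2,
    "inserts": 1530035,
    "evictions": 1504435,
    "size": 25600,
}

summary = summarize_cache(stats)
print(f"hit ratio: {summary['hit_ratio']:.8f}")
print("cache is full:", summary["resident"] == stats["size"])
```

Inserts minus evictions equals the reported size, so the cache is
completely full, yet almost every lookup misses.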

: Could you suggest a better configuration based on this?

If that's what your stats look like after a single request, then i would
guess you would need to make your cache size at least 1.6 million in order
for it to be of any use in improving your facet speed.

: My data is 492,000 records of book data.  I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple as there are only a few unique
: terms.  Author and subject however are much different in that there are
: thousands of unique terms.

by the looks of it, you have a lot more than a few thousand unique terms
in those two fields ... are you tokenizing on these fields?  that's
probably not what you want for fields you're going to facet on.



-Hoss


Re: Facet Performance

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:

> On 12/8/06, Andrew Nagy <an...@villanova.edu> wrote:
>
>> I changed the filterCache to the following:
>>     <filterCache
>>       class="solr.LRUCache"
>>       size="25600"
>>       initialSize="5120"
>>       autowarmCount="1024"/>
>>
>> However a search that normally takes .04s is taking 74 seconds once I
>> use the facets since I am faceting on 4 fields.
>
>
> The first time or subsequent times?
> Is your filterCache big enough yet?  What do you see for evictions and
> hit ratio?

Here are the stats, I'm still a newbie to Solr, so I'm not totally sure 
what this all means:
lookups : 1530036
hits : 2
hitratio : 0.00
inserts : 1530035
evictions : 1504435
size : 25600
cumulative_lookups : 1530036
cumulative_hits : 2
cumulative_hitratio : 0.00
cumulative_inserts : 1530035
cumulative_evictions : 1504435

Could you suggest a better configuration based on this?

>
>> Can you suggest a better configuration that would solve this performance
>> issue, or should I not use faceting?
>
>
> Faceting isn't something that will always be fast... one often needs
> to design things in a way that it can be fast.
>
> Can you give some examples of your faceted queries?
> Can you show the field and fieldtype definitions for the fields you
> are faceting on?
> For each field that you are faceting on, how many different terms are 
> in it?

My data is 492,000 records of book data.  I am faceting on 4 fields: 
author, subject, language, format.
Format and language are fairly simple as there are only a few unique 
terms.  Author and subject however are much different in that there are 
thousands of unique terms.

Thanks for your help!
Andrew

Re: Facet Performance

Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, Andrew Nagy <an...@villanova.edu> wrote:
> I changed the filterCache to the following:
>     <filterCache
>       class="solr.LRUCache"
>       size="25600"
>       initialSize="5120"
>       autowarmCount="1024"/>
>
> However a search that normally takes .04s is taking 74 seconds once I
> use the facets since I am faceting on 4 fields.

The first time or subsequent times?
Is your filterCache big enough yet?  What do you see for evictions and
hit ratio?

> Can you suggest a better configuration that would solve this performance
> issue, or should I not use faceting?

Faceting isn't something that will always be fast... one often needs
to design things in a way that it can be fast.

Can you give some examples of your faceted queries?
Can you show the field and fieldtype definitions for the fields you
are faceting on?
For each field that you are faceting on, how many different terms are in it?

> I figure I could run the query twice, once limited to 20 records and
> then again with the limit set to the total number of records and develop
> my own facets.  I have in fact done this before with a different back-end
> and my code is processed in under .01 seconds.
>
> Why is faceting so slow?

It's computationally expensive to get exact facet counts for a large
number of hits, and that is what the current faceting code is designed
to do.  No single method will be appropriate *and* fast for all
scenarios.

Another method that hasn't been implemented is some statistical
faceting based on the top hits, using stored fields or stored term
vectors.

-Yonik

Re: Facet Performance

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:

> 1) facet on single-valued strings if you can
> 2) if you can't do (1) then enlarge the filterCache so that the number
> of filters (one per possible term in the field you are faceting on)
> can fit.

I changed the filterCache to the following:
    <filterCache
      class="solr.LRUCache"
      size="25600"
      initialSize="5120"
      autowarmCount="1024"/>

However a search that normally takes .04s is taking 74 seconds once I 
use the facets since I am faceting on 4 fields.

Can you suggest a better configuration that would solve this performance 
issue, or should I not use faceting?
I figure I could run the query twice, once limited to 20 records and 
then again with the limit set to the total number of records and develop 
my own facets.  I have in fact done this before with a different back-end 
and my code is processed in under .01 seconds.

Why is faceting so slow?

Andrew

Re: Facet Performance

Posted by Chris Hostetter <ho...@fucit.org>.
: > This seems like a poor choice for an element
: > name.  Why not just name the element what is in the "name" attribute?
: > It would make parsing much easier!
:
: When the XML was first conceived, there was a preference for limiting
: the number of tags.
: The structure could have been inverted so that
: <lst name="myfieldname"> could have been <myfieldname type="lst">

...but then we couldn't support arbitrary field names, and it would be
impossible to validate the XML docs independent of the schema, see this
previous explanation...

http://www.nabble.com/Default-XML-Output-Schema-tf2312439.html#a6430000



-Hoss


Re: Facet Performance

Posted by Yonik Seeley <yo...@apache.org>.
On 12/7/06, Andrew Nagy <an...@villanova.edu> wrote:
> One complaint about the faceting though:  Why is the element that is
> returned called "1st"?

I think maybe you are seeing lst (it starts with an L, not a one).
It is short for NamedList, an ordered list whose elements are named.

> This seems like a poor choice for an element
> name.  Why not just name the element what is in the "name" attribute?
> It would make parsing much easier!

When the XML was first conceived, there was a preference for limiting
the number of tags.
The structure could have been inverted so that
<lst name="myfieldname"> could have been <myfieldname type="lst">

-Yonik
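
For readers parsing this response format, here is a minimal Python
sketch that reads facet counts out of nested "lst" elements by their
"name" attribute.  The sample document is a hand-written approximation
of a faceted response, not output captured from a server, and the
function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real responses contain more sections.
SAMPLE = """<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="author">
        <int name="Mitchell, Margaret">2</int>
        <int name="Howard, Sidney">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def facet_fields(xml_text):
    """Map each facet field name to an ordered list of (value, count) pairs."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    result = {}
    for field in root.findall(path):
        result[field.get("name")] = [
            (entry.get("name"), int(entry.text)) for entry in field.findall("int")
        ]
    return result


print(facet_fields(SAMPLE))
```

Selecting on the "name" attribute rather than on the tag name means the
same parsing code works for any field name, including names that would
not be legal XML element names.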

Re: Facet Performance

Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:

> 1) facet on single-valued strings if you can
> 2) if you can't do (1) then enlarge the filterCache so that the number
> of filters (one per possible term in the field you are faceting on)
> can fit.

I will try this out.

> 3) facet counts are limited to the results of the query, filtered by
> any filters.   Is there a reason you think they are not?

No, you are right.  I was thrown off at first.

One complaint about the faceting, though: why is the element that is 
returned called "1st"?  This seems like a poor choice for an element 
name.  Why not just name the element after what is in the "name" attribute?  
It would make parsing much easier!

Thanks!
Andrew

Re: Facet Performance

Posted by Yonik Seeley <yo...@apache.org>.
On 12/7/06, Andrew Nagy <an...@villanova.edu> wrote:
> In September there was a thread [1] on this list about heterogeneous
> facets and their performance.  I am having a similar issue and am
> unclear as to the resolution of this thread.
>
> I performed a search against my dataset (492,000 records) and got the
> results I am looking for in .3 seconds.  I then set facet to true and
> got results in 16 seconds and the facets include data that is not in my
> result set, it is from the entire set.  How do I limit the faceting to
> my results set and speed up the results?

1) facet on single-valued strings if you can
2) if you can't do (1) then enlarge the filterCache so that the number
of filters (one per possible term in the field you are faceting on)
can fit.
3) facet counts are limited to the results of the query, filtered by
any filters.   Is there a reason you think they are not?
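
Point 2 matters because a least-recently-used cache that is smaller
than the number of terms gets no benefit at all when every request
walks through all the terms in the same order.  The following Python
sketch simulates that effect; it is a toy model, not Solr's cache
implementation, and the class and function names are assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache that records lookups and hits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.lookups = 0
        self.hits = 0

    def get_or_insert(self, key):
        self.lookups += 1
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


def simulate(num_terms, capacity, passes=2):
    """Look up one filter per term, in order, for each faceting request."""
    cache = LRUCache(capacity)
    for _ in range(passes):
        for term in range(num_terms):
            cache.get_or_insert(term)
    return cache.hits, cache.lookups


print(simulate(num_terms=5, capacity=3))  # cache smaller than the term count
print(simulate(num_terms=5, capacity=5))  # cache large enough for every term
```

With the undersized cache each entry is evicted before it is needed
again, so the second request is as slow as the first; once the cache
holds every term, the second request hits on every lookup.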

-Yonik