Posted to solr-user@lucene.apache.org by Andrew Nagy <an...@villanova.edu> on 2006/12/07 21:23:47 UTC
Facet Performance
In September there was a thread [1] on this list about heterogeneous
facets and their performance. I am having a similar issue and am
unclear as to the resolution of that thread.
I performed a search against my dataset (492,000 records) and got the
results I am looking for in 0.3 seconds. I then set facet to true and
got results in 16 seconds. The facets also seem to include data that is
not in my result set but comes from the entire set. How do I limit the
faceting to my result set and speed up the results?
Thanks!
Andrew
[1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg00955.html
Re: Facet Performance
Posted by Andrew Nagy <an...@villanova.edu>.
Erik Hatcher wrote:
> On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
>
>> My data is 492,000 records of book data. I am faceting on 4 fields:
>> author, subject, language, format.
>> Format and language are fairly simple as there are only a few unique
>> terms. Author and subject however are much different in that there
>> are thousands of unique terms.
>
>
> When encountering difficult issues, I like to think in terms of the
> user interface. Surely you're not presenting 400k+ authors to the
> users in one shot. In Collex, we have put an AJAX drop-down that
> shows the author facet (we call it name on the UI, with various roles
> like author, painter, etc). You can see this in action here:
In our data, we don't have a unique author for each record ... so let's
say out of the 500,000 records ... we have 200,000 authors. What I am
trying to display is the top 10 authors from the results of a search.
So I do a search for title:"Gone with the wind" and I would like to see
the top 10 matching authors from these results.
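To make the goal concrete, here is a minimal Python sketch of "top N facet values over the result set only". It is an illustration under assumptions, not the custom handler mentioned in this message: `top_facet_values` is a hypothetical helper, and the documents are invented sample data represented as plain dicts.

```python
from collections import Counter


def top_facet_values(result_docs, field, limit=10):
    """Count the values of `field` across the matching documents only."""
    counts = Counter()
    for doc in result_docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            # Treat a single-valued field like a one-element list.
            values = [values]
        # set() so a document that lists the same value twice counts once.
        counts.update(set(values))
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Invented sample result set (hypothetical data, multi-valued author field).
results = [
    {"title": "Sample Title 1", "author": ["Author A"]},
    {"title": "Sample Title 2", "author": ["Author A", "Author B"]},
    {"title": "Sample Title 3", "author": "Author C"},
]
top_authors = top_facet_values(results, "author")
```

Because only the matching documents are visited, the cost of this approach grows with the number of hits rather than with the number of distinct authors in the whole index.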
But no worries, I have written my own facet handler and I am now back to
under a second with faceting!
Thanks for everyone's help and keep up the good work!
Andrew
Re: Facet Performance
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
> My data is 492,000 records of book data. I am faceting on 4
> fields: author, subject, language, format.
> Format and language are fairly simple as there are only a few
> unique terms. Author and subject however are much different in
> that there are thousands of unique terms.
When encountering difficult issues, I like to think in terms of the
user interface. Surely you're not presenting 400k+ authors to the
users in one shot. In Collex, we have put an AJAX drop-down that
shows the author facet (we call it name on the UI, with various roles
like author, painter, etc). You can see this in action here:
http://www.nines.org/collex
Type "da" into the name field, for example. I developed a custom request
handler in Solr for returning these types of suggest interfaces
complete with facet counts. My code is very specific to our fields,
so it's not usable in a general sense, but maybe this gives you some
ideas on where to go with these large sets of facet values.
Erik
Re: Facet Performance
Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, J.J. Larrea <jj...@panix.com> wrote:
> Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method.
If anyone has time, some of this could be documented here:
http://wiki.apache.org/solr/SimpleFacetParameters
The wiki is open to all.
Or perhaps a new top level FacetedSearching page that references
SimpleFacetParameters
-Yonik
Re: Facet Performance
Posted by Funtick <fu...@efendi.ca>.
Hoss,
This is still an extremely interesting area for possible improvements; I simply
don't want the topic to die.
http://www.nabble.com/Facet-Performance-td7746964.html
http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669
I am currently using faceting on a single-valued _tokenized_ field with a huge
number of documents and an _unsynchronized_ version of FIFOCache; the average
response time is 1.5 seconds (for faceted queries only!).
I think we can use an additional cache for facet results (to store calculated
values!); Lucene's FieldCache can be used only for non-tokenized,
single-valued, non-boolean fields.
-Fuad
hossman_lucene wrote:
>
> : Unfortunately which strategy will be chosen is currently undocumented
> : and control is a bit oblique: If the field is tokenized or multivalued
> : or Boolean, the FilterQuery method will be used; otherwise the
> : FieldCache method. I expect I or others will improve that shortly.
>
> Bear in mind, what's provided out of the box is "SimpleFacets" ... it's
> designed to meet simple faceting needs ... when you start talking about
> 100s or thousands of constraints per facet, you are getting outside the
> scope of what it was intended to serve efficiently.
>
> At a certain point the only practical thing to do is write a custom
> request handler that makes the best choices for your data.
>
> For the record: a really simple patch someone could submit would be to
> add an optional field-based param indicating which type of faceting
> (termenum/fieldcache) should be used to generate the list of terms and
> then make SimpleFacets.getFacetFieldCounts use that and call the
> appropriate method instead of calling getTermCounts -- that way you could
> force one or the other if you know it's better for your data/query.
>
> -Hoss
Re: Facet Performance
Posted by Chris Hostetter <ho...@fucit.org>.
: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique: If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method. I expect I or others will improve that shortly.
Bear in mind, what's provided out of the box is "SimpleFacets" ... it's
designed to meet simple faceting needs ... when you start talking about
100s or thousands of constraints per facet, you are getting outside the
scope of what it was intended to serve efficiently.
At a certain point the only practical thing to do is write a custom
request handler that makes the best choices for your data.
For the record: a really simple patch someone could submit would be to
add an optional field-based param indicating which type of faceting
(termenum/fieldcache) should be used to generate the list of terms and
then make SimpleFacets.getFacetFieldCounts use that and call the
appropriate method instead of calling getTermCounts -- that way you could
force one or the other if you know it's better for your data/query.
-Hoss
Re: Facet Performance
Posted by Andrew Nagy <an...@villanova.edu>.
J.J. Larrea wrote:
>Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly.
>
>
Good to hear, cause I can't really get away with not having a
multi-valued field for author.
I'm really excited by Solr and really impressed so far.
Thanks!
Andrew
Re: Facet Performance
Posted by "J.J. Larrea" <jj...@panix.com>.
Andrew Nagy, ditto on what Yonik said. Here is some further elaboration:
I am doing much the same thing (faceting on Author etc.). When my Author field was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it wasn't actually tokenized, the faceting code chose the QueryFilter approach, and faceting on Author for 100k+ documents took about 4 seconds.
When I changed the field to "string" (i.e. solr.StrField), the faceting code recognized it as untokenized and used the FieldCache approach. Times have dropped to about 120ms for the first query (when the FieldCache is generated) and < 10ms for subsequent queries returning a few thousand results. Quite a difference.
The strategy must be chosen on a field-by-field basis. While QueryFilter is excellent for fields with a small set of enumerated values such as Language or Format, it is inappropriate for large value sets such as Author.
Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly.
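The selection rule described above can be written down as a small Python sketch. This only restates the rule as given in this message; it is not Solr's actual code, and the function name and the "filter" / "fieldcache" labels are hypothetical.

```python
def choose_facet_method(tokenized, multivalued, is_boolean):
    """Restate the rule from the message: any of the three flags
    forces the per-term filter method; otherwise the FieldCache is used."""
    if tokenized or multivalued or is_boolean:
        return "filter"
    return "fieldcache"


# A single-valued "string" field (untokenized) takes the FieldCache path.
string_author = choose_facet_method(tokenized=False, multivalued=False, is_boolean=False)

# A TextField counts as tokenized for this purpose, even with a keyword
# tokenizer, which is the case reported above.
text_author = choose_facet_method(tokenized=True, multivalued=False, is_boolean=False)

# A multivalued "string" field also falls back to the filter method.
multi_author = choose_facet_method(tokenized=False, multivalued=True, is_boolean=False)
```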
- J.J.
At 2:58 PM -0500 12/8/06, Yonik Seeley wrote:
>Right, if any of these are tokenized, then you could make them
>non-tokenized (use "string" type). If they really need to be
>tokenized (author for example), then you could use copyField to make
>another copy to a non-tokenized field that you can use for faceting.
>
>After that, as Hoss suggests, run a single faceting query with all 4
>fields and look at the filterCache statistics. Take the "lookups"
>number and multiply it by, say, 1.5 to leave some room for future
>growth, and use that as your cache size. You probably want to bump up
>both initialSize and autowarmCount as well.
>
>The first query will still be slow. The second should be relatively fast.
>You may hit an OOM error. Increase the JVM heap size if this happens.
>
>-Yonik
Re: Facet Performance
Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, Chris Hostetter <ho...@fucit.org> wrote:
> : My data is 492,000 records of book data. I am faceting on 4 fields:
> : author, subject, language, format.
> : Format and language are fairly simple as there are only a few unique
> : terms. Author and subject however are much different in that there are
> : thousands of unique terms.
>
> by the looks of it, you have a lot more than a few thousand unique terms
> in those two fields ... are you tokenizing on these fields? that's
> probably not what you want for fields you're going to facet on.
Right, if any of these are tokenized, then you could make them
non-tokenized (use "string" type). If they really need to be
tokenized (author for example), then you could use copyField to make
another copy to a non-tokenized field that you can use for faceting.
After that, as Hoss suggests, run a single faceting query with all 4
fields and look at the filterCache statistics. Take the "lookups"
number and multiply it by, say, 1.5 to leave some room for future
growth, and use that as your cache size. You probably want to bump up
both initialSize and autowarmCount as well.
The first query will still be slow. The second should be relatively fast.
You may hit an OOM error. Increase the JVM heap size if this happens.
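As a rough sketch of that sizing rule of thumb in Python: take the "lookups" figure from the filterCache statistics and add headroom. The helper name is hypothetical, the 1.5 factor is the "say, 1.5" suggested above, and the lookups value is the one reported elsewhere in this thread.

```python
import math


def suggested_filter_cache_size(lookups, headroom=1.5):
    """Scale the observed lookups by a headroom factor and round up."""
    return math.ceil(lookups * headroom)


# lookups as reported in the filterCache statistics posted in this thread.
size = suggested_filter_cache_size(1530036)
```

The result is only a starting point; a cache this large holds one filter per term, which is why the heap may need to grow as well.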
-Yonik
Re: solr-42
Posted by Yonik Seeley <yo...@apache.org>.
On 1/4/07, mirko@sas.upenn.edu <mi...@sas.upenn.edu> wrote:
> If the HTMLStripReader would simply replace
> the HTML with spaces (same length as the removed HTML part) then the positions
> for the highlighter would be correct. And most of the Tokenizers would
> be happy with this solution (except maybe the KeywordTokenizer).
Good idea Mirko, that's probably a much easier fix than the one I envisioned.
-Yonik
solr-42
Posted by mi...@sas.upenn.edu.
Hi,
I was wondering if the highlighting problems with
HTMLStripWhitespaceTokenizerFactory (see
http://issues.apache.org/jira/browse/SOLR-42) could be resolved in
the following simple way.
The HTMLStripWhitespaceTokenizerFactory basically passes the
input through an HTMLStripReader which removes the HTML and then passes
it to the WhitespaceTokenizer. If the HTMLStripReader would simply replace
the HTML with spaces (same length as the removed HTML part) then the positions
for the highlighter would be correct. And most of the Tokenizers would
be happy with this solution (except maybe the KeywordTokenizer).
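The idea can be illustrated with a short Python sketch. It is not the HTMLStripReader itself: it assumes a deliberately crude regular expression for tags (no handling of entities, comments, or ">" inside attribute values), and `blank_out_tags` is a hypothetical name.

```python
import re

# Crude tag matcher, good enough to illustrate offset preservation.
TAG = re.compile(r"<[^>]*>")


def blank_out_tags(text):
    """Replace each tag with the same number of spaces so that the
    character offsets of the remaining text do not move."""
    return TAG.sub(lambda m: " " * len(m.group(0)), text)


html = "<b>Solr</b> rocks"
plain = blank_out_tags(html)
```

Since the output has the same length as the input, token offsets computed on the stripped text point at the right characters in the original, and a whitespace-based tokenizer simply skips the padding.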
mirko
Re: Facet Performance
Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
> Are they multivalued, and do they need to be?
> Anything that is of type "string" and not multivalued will use the
> lucene FieldCache rather than the filterCache.
The author field is multivalued. Will this be a strong performance issue?
I could make multiple author fields so as not to have the multivalued field
and then only facet on the first author.
Thanks
Andrew
Re: Facet Performance
Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, Andrew Nagy <an...@villanova.edu> wrote:
> Chris Hostetter wrote:
>
> >: Could you suggest a better configuration based on this?
> >
> >If that's what your stats look like after a single request, then i would
> >guess you would need to make your cache size at least 1.6 million in order
> >for it to be of any use in improving your facet speed.
> >
> >
> Would this have any strong impacts on my system? Should I just set it
> to an even 2 million to allow for growth?
Change the following in solrconfig.xml, and you should be fine with a
higher setting.
<useFilterForSortedQuery>true</useFilterForSortedQuery>
to
<useFilterForSortedQuery>false</useFilterForSortedQuery>
That will prevent the filterCache from being used for anything but
filters and faceting, so if you set it too high, it won't be utilized
anyway.
> >: My data is 492,000 records of book data. I am faceting on 4 fields:
> >: author, subject, language, format.
> >: Format and language are fairly simple as there are only a few unique
> >: terms. Author and subject however are much different in that there are
> >: thousands of unique terms.
> >
> >by the looks of it, you have a lot more than a few thousand unique terms
> >in those two fields ... are you tokenizing on these fields? that's
> >probably not what you want for fields you're going to facet on.
> >
> >
> All of these fields are set as "string" in my schema
Are they multivalued, and do they need to be?
Anything that is of type "string" and not multivalued will use the
lucene FieldCache rather than the filterCache.
-Yonik
Re: Facet Performance
Posted by Andrew Nagy <an...@villanova.edu>.
Chris Hostetter wrote:
>: Could you suggest a better configuration based on this?
>
>If that's what your stats look like after a single request, then i would
>guess you would need to make your cache size at least 1.6 million in order
>for it to be of any use in improving your facet speed.
>
>
Would this have any strong impact on my system? Should I just set it
to an even 2 million to allow for growth?
>: My data is 492,000 records of book data. I am faceting on 4 fields:
>: author, subject, language, format.
>: Format and language are fairly simple as there are only a few unique
>: terms. Author and subject however are much different in that there are
>: thousands of unique terms.
>
>by the looks of it, you have a lot more than a few thousand unique terms
>in those two fields ... are you tokenizing on these fields? that's
>probably not what you want for fields you're going to facet on.
>
>
All of these fields are set as "string" in my schema, so if I understand
the fields correctly, they are not being tokenized. I also have an
author field that is set as "text" for searching.
Thanks
Andrew
Re: Facet Performance
Posted by Chris Hostetter <ho...@fucit.org>.
: Here are the stats, I'm still a newbie to SOLR, so I'm not totally sure
: what this all means:
: lookups : 1530036
: hits : 2
: hitratio : 0.00
: inserts : 1530035
: evictions : 1504435
: size : 25600
those numbers are telling you that your cache is capable of holding 25,600
items. you have attempted to lookup something in the cache 1,530,036
times, and only 2 of those times did you get a hit. you have
added 1,530,035 items to the cache, and 1,504,435 items have been removed
from your cache to make room for newer items.
in short: your cache isn't really helping you at all.
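The same reading of the numbers, as a minimal Python sketch. The figures are the ones quoted above; `summarize_cache` and its output keys are hypothetical names chosen for illustration.

```python
def summarize_cache(lookups, hits, inserts, evictions, size):
    """Derive a few health indicators from raw cache counters."""
    return {
        # Fraction of lookups that found an entry.
        "hit_ratio": hits / lookups if lookups else 0.0,
        # Fraction of inserted entries that were later thrown out.
        "evicted_fraction": evictions / inserts if inserts else 0.0,
        # Entries still held equals capacity: the cache is full.
        "full": inserts - evictions >= size,
    }


stats = summarize_cache(
    lookups=1530036, hits=2, inserts=1530035, evictions=1504435, size=25600
)
```

A hit ratio near zero combined with almost every insert being evicted is the signature of a cache that is far too small for the working set.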
: Could you suggest a better configuration based on this?
If that's what your stats look like after a single request, then i would
guess you would need to make your cache size at least 1.6 million in order
for it to be of any use in improving your facet speed.
: My data is 492,000 records of book data. I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple as there are only a few unique
: terms. Author and subject however are much different in that there are
: thousands of unique terms.
by the looks of it, you have a lot more than a few thousand unique terms
in those two fields ... are you tokenizing on these fields? that's
probably not what you want for fields you're going to facet on.
-Hoss
Re: Facet Performance
Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
> On 12/8/06, Andrew Nagy <an...@villanova.edu> wrote:
>
>> I changed the filterCache to the following:
>> <filterCache
>> class="solr.LRUCache"
>> size="25600"
>> initialSize="5120"
>> autowarmCount="1024"/>
>>
>> However a search that normally takes .04s is taking 74 seconds once I
>> use the facets since I am faceting on 4 fields.
>
>
> The first time or subsequent times?
> Is your filterCache big enough yet? What do you see for evictions and
> hit ratio?
Here are the stats, I'm still a newbie to SOLR, so I'm not totally sure
what this all means:
lookups : 1530036
hits : 2
hitratio : 0.00
inserts : 1530035
evictions : 1504435
size : 25600
cumulative_lookups : 1530036
cumulative_hits : 2
cumulative_hitratio : 0.00
cumulative_inserts : 1530035
cumulative_evictions : 1504435
Could you suggest a better configuration based on this?
>
>> Can you suggest a better configuration that would solve this performance
>> issue, or should I not use faceting?
>
>
> Faceting isn't something that will always be fast... one often needs
> to design things in a way that it can be fast.
>
> Can you give some examples of your faceted queries?
> Can you show the field and fieldtype definitions for the fields you
> are faceting on?
> For each field that you are faceting on, how many different terms are
> in it?
My data is 492,000 records of book data. I am faceting on 4 fields:
author, subject, language, format.
Format and language are fairly simple as there are only a few unique
terms. Author and subject however are much different in that there are
thousands of unique terms.
Thanks for your help!
Andrew
Re: Facet Performance
Posted by Yonik Seeley <yo...@apache.org>.
On 12/8/06, Andrew Nagy <an...@villanova.edu> wrote:
> I changed the filterCache to the following:
> <filterCache
> class="solr.LRUCache"
> size="25600"
> initialSize="5120"
> autowarmCount="1024"/>
>
> However a search that normally takes .04s is taking 74 seconds once I
> use the facets since I am faceting on 4 fields.
The first time or subsequent times?
Is your filterCache big enough yet? What do you see for evictions and
hit ratio?
> Can you suggest a better configuration that would solve this performance
> issue, or should I not use faceting?
Faceting isn't something that will always be fast... one often needs
to design things in a way that it can be fast.
Can you give some examples of your faceted queries?
Can you show the field and fieldtype definitions for the fields you
are faceting on?
For each field that you are faceting on, how many different terms are in it?
> I figure I could run the query twice, once limited to 20 records and
> then again with the limit set to the total number of records and develop
> my own facets. I have in fact done this before with a different back-end
> and my code is processed in under .01 seconds.
>
> Why is faceting so slow?
It's computationally expensive to get exact facet counts for a large
number of hits, and that is what the current faceting code is designed
to do. No single method will be appropriate *and* fast for all
scenarios.
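To see why the cost depends on the method, here is a simplified Python model of the two strategies discussed in this thread. It is a sketch under assumptions, not Solr's implementation: document sets are plain Python sets, the function names are hypothetical, and the data is invented.

```python
from collections import Counter


def facet_by_filters(term_to_docs, result_docs):
    """One set intersection per unique term:
    work grows with the number of distinct terms in the field."""
    counts = {}
    for term, docs in term_to_docs.items():
        n = len(docs & result_docs)
        if n:
            counts[term] = n
    return counts


def facet_by_field_values(doc_to_value, result_docs):
    """One value lookup per matching document (single-valued field):
    work grows with the number of hits."""
    return dict(Counter(doc_to_value[d] for d in result_docs if d in doc_to_value))


# Invented index: document ids 1..5 with a single-valued language field.
term_to_docs = {"eng": {1, 2, 3}, "fre": {4}, "ger": {5}}
doc_to_value = {1: "eng", 2: "eng", 3: "eng", 4: "fre", 5: "ger"}
result_docs = {1, 2, 4}

by_filters = facet_by_filters(term_to_docs, result_docs)
by_values = facet_by_field_values(doc_to_value, result_docs)
```

Both give the same exact counts; the first suits fields with a handful of terms, while the second avoids doing one intersection for each of many thousands of terms.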
Another method that hasn't been implemented is some statistical
faceting based on the top hits, using stored fields or stored term
vectors.
-Yonik
Re: Facet Performance
Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
> 1) facet on single-valued strings if you can
> 2) if you can't do (1) then enlarge the fieldcache so that the number
> of filters (one per possible term in the field you are filtering on)
> can fit.
I changed the filterCache to the following:
<filterCache
class="solr.LRUCache"
size="25600"
initialSize="5120"
autowarmCount="1024"/>
However a search that normally takes .04s is taking 74 seconds once I
use the facets since I am faceting on 4 fields.
Can you suggest a better configuration that would solve this performance
issue, or should I not use faceting?
I figure I could run the query twice, once limited to 20 records and
then again with the limit set to the total number of records and develop
my own facets. I have in fact done this before with a different back-end
and my code is processed in under 0.01 seconds.
Why is faceting so slow?
Andrew
Re: Facet Performance
Posted by Chris Hostetter <ho...@fucit.org>.
: > This seems like a poor choice for an element
: > name. Why not just name the element what is in the "name" attribute?
: > It would make parsing much easier!
:
: When the XML was first conceived, there was a preference for limiting
: the number of tags.
: The structure could have been inverted so that
: <lst name="myfieldname"> could have been <myfieldname type="lst">
...but then we couldn't support arbitrary field names, and it would be
impossible to validate the XML docs independent of the schema, see this
previous explanation...
http://www.nabble.com/Default-XML-Output-Schema-tf2312439.html#a6430000
-Hoss
Re: Facet Performance
Posted by Yonik Seeley <yo...@apache.org>.
On 12/7/06, Andrew Nagy <an...@villanova.edu> wrote:
> One complaint about the faceting though: Why is the element that is
> returned called "1st"?
I think maybe you are seeing lst (it starts with an L, not a one).
It is short for NamedList, an ordered list whose elements are named.
> This seems like a poor choice for an element
> name. Why not just name the element what is in the "name" attribute?
> It would make parsing much easier!
When the XML was first conceived, there was a preference for limiting
the number of tags.
The structure could have been inverted so that
<lst name="myfieldname"> could have been <myfieldname type="lst">
-Yonik
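For anyone parsing this format, a minimal Python sketch of reading name-attributed lst elements follows. The XML fragment is hand-written for illustration and assumes each facet value appears as an int child carrying a name attribute; the counts are invented and `facet_counts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment with invented values (assumed layout).
SAMPLE = """<lst name="facet_fields">
  <lst name="language">
    <int name="English">12</int>
    <int name="French">3</int>
  </lst>
</lst>"""


def facet_counts(xml_text):
    """Map each field name to a dict of {value: count},
    keyed by the "name" attribute rather than the tag name."""
    root = ET.fromstring(xml_text)
    return {
        field.get("name"): {
            entry.get("name"): int(entry.text) for entry in field.findall("int")
        }
        for field in root.findall("lst")
    }


counts = facet_counts(SAMPLE)
```

Keying on the name attribute means field names that would not be legal XML tag names can still be represented.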
Re: Facet Performance
Posted by Andrew Nagy <an...@villanova.edu>.
Yonik Seeley wrote:
> 1) facet on single-valued strings if you can
> 2) if you can't do (1) then enlarge the fieldcache so that the number
> of filters (one per possible term in the field you are filtering on)
> can fit.
I will try this out.
> 3) facet counts are limited to the results of the query, filtered by
> any filters. Is there a reason you think they are not?
No, you are right. I was thrown off at first.
One complaint about the faceting though: Why is the element that is
returned called "1st"? This seems like a poor choice for an element
name. Why not just name the element what is in the "name" attribute?
It would make parsing much easier!
Thanks!
Andrew
Re: Facet Performance
Posted by Yonik Seeley <yo...@apache.org>.
On 12/7/06, Andrew Nagy <an...@villanova.edu> wrote:
> In September there was a thread [1] on this list about heterogeneous
> facets and their performance. I am having a similar issue and am
> unclear as the resolution of this thread.
>
> I performed a search against my dataset (492,000 records) and got the
> results I am looking for in .3 seconds. I then set facet to true and
> got results in 16 seconds and the facets include data that is not in my
> result set, it is from the entire set. How do I limit the faceting to
> my results set and speed up the results?
1) facet on single-valued strings if you can
2) if you can't do (1) then enlarge the filterCache so that the number
of filters (one per possible term in the field you are faceting on)
can fit.
3) facet counts are limited to the results of the query, filtered by
any filters. Is there a reason you think they are not?
-Yonik