Posted to solr-user@lucene.apache.org by Michael Imbeault <mi...@sympatico.ca> on 2006/09/19 04:07:04 UTC

Facet performance with heterogeneous 'facets'?

Been playing around with the new 'facet search' and it works very
well, but it's really slow for some particular applications. I've been
trying to use it to display the most frequent authors of articles; this
is from a huge (15 million articles) database, and author names are
rare and heterogeneous. On a query that takes (without facets) 0.1
seconds, it jumps to ~20 seconds with just 1% of the documents indexed
(I've been getting java.lang.OutOfMemoryError with the full index). ~40
seconds for a faceted search on 2 (string) fields. Range queries on a
slong field are more acceptable (even with a dozen of them, query time is
still in the subsecond range).

Am I trying to do something which isn't what faceted search was made
for? It would be understandable; after all, I guess the facet engine
has to check every doc in the index and sort... which shouldn't yield
good performance no matter what, sadly.

Is there any other way I could achieve what I'm trying to do? Just a 
list of the most frequent (top 5) authors present in the results of a query.

Thanks,

-- 
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Re: Facet performance with heterogeneous 'facets'?

Posted by Chris Hostetter <ho...@fucit.org>.

: > when we facet on the authors, we start with
: > that list and go in order, generating their facet constraint count using
: > the DocSet intersection just like we currently do ... if we reach our
: > facet.limit before we reach the end of the list and the lowest constraint
: > count is higher than the total doc count of the last author in the list,
: > then we know we don't need to bother testing any other Author, because no
: > other author can possibly have a higher facet constraint count than the
: > ones on our list
:
: This works OK if the intersection counts are high (as a percentage of
: the facet sets).  I'm not sure how often this will be the case though.

well, keep in mind "N" could be very big, big enough to store the full
list of Terms sorted in docFreq order (it shouldn't take up much space
since it's just the Term and an int) ... for any query that returns a
"large" number of results, you probably won't need to reach the end of the
list before you can tell that all the remaining Terms have a lower docFreq
than the current last constraint count in your facet.limit list.  For
queries that return a "small" number of results, it wouldn't be as
useful, but that's where a switch could be flipped to start with the values
mapped to the docs (using FieldCache -- assuming single-value fields)

: Another tradeoff is to allow getting inexact counts with multi-token fields by:
:  - simply faceting on the most popular values
:    OR
:  - do some sort of statistical sampling by reading term vectors for a
: fraction of the matching docs.

i loathe inexact counts ... i think of them as "Astrology" to the Astronomy
of true Faceted Searching ... but i'm sure they would be "good enough" for
some people's use cases.



-Hoss

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/21/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> Btw, Any plans for a facets cache?

Maybe a partial one (like caching top terms to implement some other
optimizations).  My general philosophy on caching in Solr has been to
cache things the client can't: elemental things, or *parts* of
requests to make many different requests faster (most
bang-for-the-buck).

Caching complete requests/responses is generally less useful since it
requires even more memory, has a worse hit ratio, and can be done
anyway by the client or a separate process like squid.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Michael Imbeault <mi...@sympatico.ca>.

Dude, stop being so awesome (and the whole Solr team). Seriously! Every
problem / request (MoreLikeThis class, change AND/OR preference
programmatically, etc) I've submitted to this mailing list has received a
quick, more-than-I-ever-expected answer.

I'll subscribe to the dev list (been reading it off and on), but I'm
afraid I couldn't code my way out of a paper bag in Java. I'll contribute to
the Solr wiki (the SolrPHP part in particular) as soon as I can. That's
the least I can do!

Btw, Any plans for a facets cache?

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:
> On 9/21/06, Michael Imbeault <mi...@sympatico.ca> wrote:
>> It turns out that journal_name has 17038 different tokens, which is
>> manageable, but first_author has > 400 000. I don't think this will ever
>> yield good performance, so i might only do journal_name facets.
>
> Hang in there Michael, a fix is on the way for your scenario (and
> subscribe to solr-dev if you want to stay on the bleeding edge):
>
> http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html 
>
>
> -Yonik
>

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/22/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> Excellent news; as you guessed, my schema was (for some reason) set to
> version 1.0.

Yeah, I just realized that having "version" right next to "name" would
lead people to think it's "their" version number, when it's really
Solr's version number.  I've added a comment to the example schema to
clarify that.

> But better yet, the 800-second query is now running in 0.5-2 seconds!
> Amazing optimization! I can now do faceting on journal title (17 000
> different titles) and last author (>400 000 authors), + 12 date range
> queries, in a very reasonable time (considering I'm on a test Windows
> desktop box and not a server).
>
> The only problem is if I add first author, I get a
> java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will
> go away on a server with more than the current 500 megs I can allocate
> to Tomcat.

Yes, the Lucene FieldCache takes up a lot of memory.  It basically
holds the entire field in a non-inverted form:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.StringIndex.html

It's currently also used for sorting, which also needs fast
document->fieldvalue lookups, rather than the inverted
term->documents_containing_that_term
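
To illustrate the two lookup directions described above, here is a minimal
sketch in Python (not Lucene code). The uninvert() helper and the toy index
are made up for illustration; the lookup/order names loosely mirror the two
arrays of FieldCache.StringIndex, and the sketch assumes a single-valued
field.

```python
def uninvert(inverted, max_doc):
    """Build a doc -> term-ordinal array from a term -> doc-ids index.

    Assumes at most one term per document (single-valued field).
    Ordinal 0 is reserved for "no value".
    """
    lookup = [None] + sorted(inverted)   # ordinal -> term, in sorted order
    order = [0] * max_doc                # doc id -> ordinal
    for ordinal, term in enumerate(lookup):
        if term is None:
            continue
        for doc in inverted[term]:
            order[doc] = ordinal
    return lookup, order


# Inverted form: term -> documents containing that term.
inverted = {"smith": [0, 3], "jones": [1], "garcia": [4]}
lookup, order = uninvert(inverted, max_doc=5)

# Non-inverted form: document -> field value, one array read per lookup.
value_of_doc_3 = lookup[order[3]]
```

In this toy model the non-inverted form costs one integer per document plus
one copy of every distinct term, which is at least consistent with the heap
pressure reported for a 15 million document index.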

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Michael Imbeault <mi...@sympatico.ca>.

Excellent news; as you guessed, my schema was (for some reason) set to
version 1.0. This also caused some of the problems I had with the
original SolrPHP (parsing the wrong response).

But better yet, the 800-second query is now running in 0.5-2 seconds!
Amazing optimization! I can now do faceting on journal title (17 000
different titles) and last author (>400 000 authors), + 12 date range
queries, in a very reasonable time (considering I'm on a test Windows
desktop box and not a server).

The only problem is if I add first author, I get a
java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will
go away on a server with more than the current 500 megs I can allocate
to Tomcat.

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:
> On 9/22/06, Michael Imbeault <mi...@sympatico.ca> wrote:
>> I upgraded to the most recent Solr build (9-22) and sadly it's still
>> really slow. An 800-second query with a single facet on first_author, 15
>> million documents total, the query returns 180. Maybe I'm doing
>> something wrong? Also, this is on my personal desktop; not on a server.
>> Still, I'm getting 0.1-second queries without facets, so I don't think
>> that's the cause. In the admin panel I can still see the filtercache
>> doing millions of lookups (and tons of evictions once it hits the
>> maxsize).
>
> The fact that you see all the filtercache usage means that the
> optimization didn't kick in for some reason.
>
>> Here's the field i'm using in schema.xml :
>> <field name ="first_author" type="string" indexed="true" stored="true"/>
>
> That looks fine...
>
>> This is the query :
>> q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false
>>
>
> That looks OK too.
> I assume that you didn't change the fieldtype definition for "string",
> and that the schema has version="1.1"?  Before 1.1, all fields were
> assumed to be multiValued (there was no checking or enforcement).
>
> -Yonik
>

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/22/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> I upgraded to the most recent Solr build (9-22) and sadly it's still
> really slow. An 800-second query with a single facet on first_author, 15
> million documents total, the query returns 180. Maybe I'm doing
> something wrong? Also, this is on my personal desktop; not on a server.
> Still, I'm getting 0.1-second queries without facets, so I don't think
> that's the cause. In the admin panel I can still see the filtercache
> doing millions of lookups (and tons of evictions once it hits the maxsize).

The fact that you see all the filtercache usage means that the
optimization didn't kick in for some reason.

> Here's the field i'm using in schema.xml :
> <field name ="first_author" type="string" indexed="true" stored="true"/>

That looks fine...

> This is the query :
> q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false

That looks OK too.
I assume that you didn't change the fieldtype definition for "string",
and that the schema has version="1.1"?  Before 1.1, all fields were
assumed to be multiValued (there was no checking or enforcement).

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Michael Imbeault <mi...@sympatico.ca>.

I upgraded to the most recent Solr build (9-22) and sadly it's still
really slow. An 800-second query with a single facet on first_author, 15
million documents total, the query returns 180. Maybe I'm doing
something wrong? Also, this is on my personal desktop; not on a server.
Still, I'm getting 0.1-second queries without facets, so I don't think
that's the cause. In the admin panel I can still see the filtercache
doing millions of lookups (and tons of evictions once it hits the maxsize).

Here's the field i'm using in schema.xml :
<field name ="first_author" type="string" indexed="true" stored="true"/>

This is the query :
q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false

I'll do more testing on the weekend,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:
> On 9/21/06, Yonik Seeley <yo...@apache.org> wrote:
>> Hang in there Michael, a fix is on the way for your scenario (and
>> subscribe to solr-dev if you want to stay on the bleeding edge):
>
> OK, the optimization has been checked in.  You can checkout from svn
> and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
> I'd be interested in hearing your results with it.
>
> The first facet request on a field will take longer than subsequent
> ones because the FieldCache entry is loaded on demand.  You can use a
> firstSearcher/newSearcher hook in solrconfig.xml to send a facet
> request so that a real user would never see this slower query.
>
> -Yonik
>

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/21/06, Yonik Seeley <yo...@apache.org> wrote:
> Hang in there Michael, a fix is on the way for your scenario (and
> subscribe to solr-dev if you want to stay on the bleeding edge):

OK, the optimization has been checked in.  You can checkout from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/21/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> It turns out that journal_name has 17038 different tokens, which is
> manageable, but first_author has > 400 000. I don't think this will ever
> yield good performance, so i might only do journal_name facets.

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):

http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Michael Imbeault <mi...@sympatico.ca>.

Thanks for all the great answers.

>> Quick Question: did you say you are faceting on the first name field
>> separately from the last name field? ... why?
You misunderstood. I'm doing faceting on the first author and the last author
of the list. Life science papers have author lists; the first one is
usually the guy who did most of the work, and the last one is usually
the boss of the lab. I already have untokenized author fields for that
using copyField.
>> Second: you mentioned increasing the size of your filterCache
>> significantly, but we don't really know how heterogeneous your index
>> is ... once you made that change did your filterCache hitrate
>> increase? .. do you have any evictions (you can check on the
>> "Statistics" page)
It was at the default (16000) and it hit the ceiling, so to speak. I set
maxSize=16000000 (for testing purposes) and now size is 17038 with 0
evictions. For a single facet field (journal name) with a limit of 5 and
12 faceted query fields (range on publication date), I now have 0.5-second
searches, which is not too bad. The filtercache size is pretty
much constant no matter how many queries I do.

However, if I try to add another facet field (such as first_author), 
something strange happens. 99% CPU, the filter cache is filling up 
really fast, hitratio goes to hell, no disk activity, and it can stay 
that way for at least 30 minutes (didn't test longer, no point really). 
It turns out that journal_name has 17038 different tokens, which is 
manageable, but first_author has > 400 000. I don't think this will ever 
yield good performance, so i might only do journal_name facets.

Any reason why faceting tries to preload every term in the field?

I have noticed that facets are not cached. Facets off, a cached query takes
0.01 seconds. Facets on, uncached and cached queries take 0.7 seconds.
Any plans for a facet cache? I know that faceting is still a very early
feature, but it's already awesome; maybe my application is unrealistic.

Thanks,
Michael

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/19/06, Chris Hostetter <ho...@fucit.org> wrote:
>
> Quick Question: did you say you are faceting on the first name field
> separately from the last name field? ... why?
>
> You'll probably see a sharp increase in performance if you have a single
> untokenized author field containing the full name and you facet on that --
> there will be a lot fewer unique terms to use when computing DocSets and
> intersections.
>
> Second: you mentioned increasing the size of your filterCache
> significantly, but we don't really know how heterogeneous your index is ...
> once you made that change did your filterCache hitrate increase? .. do you
> have any evictions (you can check on the "Statistics" page)
>
> : > Also, I was under the impression
> : > that it was only searching / sorting for authors that it knows are in
> : > the result set...
> :
> : That's the problem... it's not necessarily easy to know *what* authors
> : are in the result set.  If we could quickly determine that, we could
> : just count them and not do any intersections or anything at all.
>
> another way to look at it is that by looking at all the authors, the work
> done for generating the facet counts for query A can be completely reused
> for the next query B -- presuming your filterCache is large enough to hold
> all of the author filters.
>
> : There could be optimizations when docs_matching_query.size() is small,
> : so we start somehow with terms in the documents rather than terms in
> : the index.  That requires termvectors to be stored (medium speed), or
> : requires that the field be stored and that we re-analyze it (very
> : slow).
> :
> : More optimization of special cases hasn't been done simply because no
> : one has done it yet... (as you note, faceting is a new feature).
>
> the optimization i anticipated from the beginning would
> probably be useful in the situation Michael is describing ... if there is
> a "long tail" of authors (and in my experience, there typically is) we
> can cache an ordered list of the top N most prolific authors, along with
> the count of how many documents they have in the index (this info is easy
> to get from TermEnum.docFreq).

Yeah, I've thought about a fieldInfoCache too.  It could also cache
the total number of terms in order to make decisions about what
faceting strategy to follow.

> when we facet on the authors, we start with
> that list and go in order, generating their facet constraint count using
> the DocSet intersection just like we currently do ... if we reach our
> facet.limit before we reach the end of the list and the lowest constraint
> count is higher than the total doc count of the last author in the list,
> then we know we don't need to bother testing any other Author, because no
> other author can possibly have a higher facet constraint count than the
> ones on our list

This works OK if the intersection counts are high (as a percentage of
the facet sets).  I'm not sure how often this will be the case though.

Another tradeoff is to allow getting inexact counts with multi-token fields by:
 - simply faceting on the most popular values
   OR
 - do some sort of statistical sampling by reading term vectors for a
fraction of the matching docs.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Chris Hostetter <ho...@fucit.org>.

Quick Question: did you say you are faceting on the first name field
separately from the last name field? ... why?

You'll probably see a sharp increase in performance if you have a single
untokenized author field containing the full name and you facet on that --
there will be a lot fewer unique terms to use when computing DocSets and
intersections.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index is ...
once you made that change did your filterCache hitrate increase? .. do you
have any evictions (you can check on the "Statistics" page)

: > Also, I was under the impression
: > that it was only searching / sorting for authors that it knows are in
: > the result set...
:
: That's the problem... it's not necessarily easy to know *what* authors
: are in the result set.  If we could quickly determine that, we could
: just count them and not do any intersections or anything at all.

another way to look at it is that by looking at all the authors, the work
done for generating the facet counts for query A can be completely reused
for the next query B -- presuming your filterCache is large enough to hold
all of the author filters.

: There could be optimizations when docs_matching_query.size() is small,
: so we start somehow with terms in the documents rather than terms in
: the index.  That requires termvectors to be stored (medium speed), or
: requires that the field be stored and that we re-analyze it (very
: slow).
:
: More optimization of special cases hasn't been done simply because no
: one has done it yet... (as you note, faceting is a new feature).

the optimization i anticipated from the beginning would
probably be useful in the situation Michael is describing ... if there is
a "long tail" of authors (and in my experience, there typically is) we
can cache an ordered list of the top N most prolific authors, along with
the count of how many documents they have in the index (this info is easy
to get from TermEnum.docFreq).  when we facet on the authors, we start with
that list and go in order, generating their facet constraint count using
the DocSet intersection just like we currently do ... if we reach our
facet.limit before we reach the end of the list and the lowest constraint
count is higher than the total doc count of the last author in the list,
then we know we don't need to bother testing any other Author, because no
other author can possibly have a higher facet constraint count than the
ones on our list (since they haven't even written that many documents)
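
The early-termination idea above can be sketched roughly as follows. This
is illustrative Python, not Solr code: top_facets() and its arguments are
hypothetical names, Python sets stand in for DocSets, and the term list is
assumed to be pre-sorted by docFreq, highest first.

```python
import heapq


def top_facets(query_docs, terms_by_doc_freq, limit):
    """Return (top terms with counts, number of terms actually tested).

    terms_by_doc_freq: list of (term, docset) pairs sorted by
    len(docset) descending, i.e. in docFreq order.
    """
    if limit <= 0:
        return [], 0
    heap = []      # min-heap of (count, term): the current top `limit`
    tested = 0
    for term, docset in terms_by_doc_freq:
        # A term's count can never exceed its docFreq, so once the lowest
        # count we hold is higher than this docFreq, no later term can
        # make it onto the list.
        if len(heap) == limit and heap[0][0] > len(docset):
            break
        tested += 1
        count = len(query_docs & docset)
        if count == 0:
            continue
        if len(heap) < limit:
            heapq.heappush(heap, (count, term))
        elif count > heap[0][0]:
            heapq.heapreplace(heap, (count, term))
    top = sorted(heap, key=lambda ct: (-ct[0], ct[1]))
    return [(term, count) for count, term in top], tested


terms = [
    ("a", set(range(10))),        # docFreq 10
    ("b", {0, 1, 2, 3, 4, 5}),    # docFreq 6
    ("c", {7, 8}),                # docFreq 2
    ("d", {9}),                   # docFreq 1
]
large_result = top_facets(set(range(8)), terms, limit=2)
small_result = top_facets({9}, terms, limit=2)
```

With the large result set only two of the four terms are tested before the
cutoff applies; with the one-document result set every term is still
tested, so the cutoff buys nothing for small result sets.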



-Hoss

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/18/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> Yonik Seeley wrote:
> > For cases like "author", if there is only one value per document, then
> > a possible fix is to use the field cache.  If there can be multiple
> > occurrences, there doesn't seem to be a good way that preserves exact
> > counts, except maybe if the number of documents matching a query is
> > low.
> >
> I have one value per document (I have fields for authors, last_author
> and first_author, and I'm doing faceted search on first and last authors
> fields). How would I use the field cache to fix my problem?

Unless you want to dive into Solr development, you don't :-)
It requires extensive changes to the faceting code and doing things a
different way in some cases.

The FieldCache is the fastest way to "uninvert" single valued
fields... it's currently only used for Sorting, where one needs to
quickly know the field value given the document id.
The downside is high memory use, and that it's not a general
solution... it can't handle fields with multiple tokens (tokenized
fields or multi-valued fields).

So the strategy would be to step through the documents, get the value
for the field from the FieldCache, increment a counter for that value,
then find the top counters when we are done.
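
A minimal sketch of that strategy, in Python rather than Java and with
made-up names: field_values is a plain list indexed by document id that
stands in for the FieldCache entry of a single-valued field, and
matching_docs stands in for the ids of the documents matching the query.

```python
from collections import Counter


def facet_counts_by_field_value(matching_docs, field_values, limit):
    """Count field values over the matching docs and return the top `limit`."""
    counts = Counter()
    for doc in matching_docs:
        value = field_values[doc]      # doc id -> field value lookup
        if value is not None:          # skip docs with no value
            counts[value] += 1
    return counts.most_common(limit)


# doc id -> first_author (None means the document has no value)
field_values = ["smith", "jones", "smith", None, "garcia", "smith"]
top = facet_counts_by_field_value([0, 1, 2, 3, 5], field_values, limit=2)
```

The work here grows with the number of matching documents rather than with
the number of distinct authors, which is the opposite trade-off from the
intersection approach.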

> Also, would
> it be better to store a unique number (for each possible author) in an
> int field along with the string, and do the faceted searching on the int
> field?

It won't really help.  It wouldn't be faster, and it would require
only slightly less memory.

> >> Just a little follow-up - I did a little more testing, and the query
> >> takes 20 seconds no matter what - If there's one document in the results
> >> set, or if I do a query that returns all 130000 documents.
> >
> > Yes, currently the same strategy is always used.
> >   intersection_count(docs_matching_query, docs_matching_author1)
> >   intersection_count(docs_matching_query, docs_matching_author2)
> >   intersection_count(docs_matching_query, docs_matching_author3)
> >   etc...
> >
> > Normally, the docsets will be cached, but since the number of authors
> > is greater than the size of the filtercache, the effective cache hit
> > rate will be 0%
> >
> > -Yonik
> So more memory would fix the problem?

Yes, if your collection size isn't that large...  it's not a practical
solution for many cases though.

> Also, I was under the impression
> that it was only searching / sorting for authors that it knows are in
> the result set...

That's the problem... it's not necessarily easy to know *what* authors
are in the result set.  If we could quickly determine that, we could
just count them and not do any intersections or anything at all.

>  in the case of only one document (1 result), it seems
> strange that it takes the same time as for 130 000 results. It should
> just check the results, see that there's only one author, and return
> that? And in the case of 2 documents, just sort 2 authors (or 1 if
> they're the same)? I understand your answer (it does intersections), but
> I wonder why it's intersecting from the whole document set at first, and
> not docs_matching_query like you said.

It is just intersecting docs_matching_query.  The problem is that it's
intersecting that set with all possible author sets since it doesn't
know ahead of time what authors are in the docs that match the query.
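
As a rough illustration of why the cost does not depend on the size of the
result set, here is a toy version of the intersection strategy in Python.
The function name is hypothetical and Python sets stand in for DocSets.

```python
def facet_by_intersection(docs_matching_query, docs_by_author):
    """Intersect the query's docs with every author's docs.

    Returns the non-zero counts and the number of intersections performed.
    """
    counts = {}
    intersections = 0
    for author, docs_matching_author in docs_by_author.items():
        intersections += 1
        n = len(docs_matching_query & docs_matching_author)
        if n:
            counts[author] = n
    return counts, intersections


docs_by_author = {"smith": {0, 2, 5}, "jones": {1}, "garcia": {4}}
one_doc = facet_by_intersection({1}, docs_by_author)
many_docs = facet_by_intersection({0, 1, 2, 4, 5}, docs_by_author)
```

Both calls perform one intersection per author in the index, whether the
query matched one document or five.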

There could be optimizations when docs_matching_query.size() is small,
so we start somehow with terms in the documents rather than terms in
the index.  That requires termvectors to be stored (medium speed), or
requires that the field be stored and that we re-analyze it (very
slow).

More optimization of special cases hasn't been done simply because no
one has done it yet... (as you note, faceting is a new feature).

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Michael Imbeault <mi...@sympatico.ca>.

Yonik Seeley wrote:
> I noticed this too, and have been thinking about ways to fix it.
> The root of the problem is that lucene, like all full-text search
> engines, uses inverted indices.  It's fast and easy to get all
> documents for a particular term, but getting all terms for a
> document is either not possible, or not fast (assuming many documents
> match a query).
Yeah that's what I've been thinking; the index isn't built to handle 
such searches, sadly :( It would be very nice to be able to rapidly 
search by most frequent author, journal, etc.
> For cases like "author", if there is only one value per document, then
> a possible fix is to use the field cache.  If there can be multiple
> occurrences, there doesn't seem to be a good way that preserves exact
> counts, except maybe if the number of documents matching a query is
> low.
>
I have one value per document (I have fields for authors, last_author 
and first_author, and I'm doing faceted search on first and last authors 
fields). How would I use the field cache to fix my problem? Also, would 
it be better to store a unique number (for each possible author) in an 
int field along with the string, and do the faceted searching on the int 
field? Would this be faster / require less memory? I guess that yes, and 
I'll test that when I have the time.

>> Just a little follow-up - I did a little more testing, and the query
>> takes 20 seconds no matter what - If there's one document in the results
>> set, or if I do a query that returns all 130000 documents.
>
> Yes, currently the same strategy is always used.
>   intersection_count(docs_matching_query, docs_matching_author1)
>   intersection_count(docs_matching_query, docs_matching_author2)
>   intersection_count(docs_matching_query, docs_matching_author3)
>   etc...
>
> Normally, the docsets will be cached, but since the number of authors
> is greater than the size of the filtercache, the effective cache hit
> rate will be 0%
>
> -Yonik
So more memory would fix the problem? Also, I was under the impression
that it was only searching / sorting for authors that it knows are in
the result set... in the case of only one document (1 result), it seems
strange that it takes the same time as for 130 000 results. It should
just check the results, see that there's only one author, and return
that? And in the case of 2 documents, just sort 2 authors (or 1 if
they're the same)? I understand your answer (it does intersections), but
I wonder why it's intersecting from the whole document set at first, and
not docs_matching_query like you said.

Thanks for the support,

Michael

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/18/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> Been playing around with the news 'facets search' and it works very
> well, but it's really slow for some particular applications. I've been
> trying to use it to display the most frequent authors of articles

I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a
document is either not possible, or not fast (assuming many documents
match a query).

For cases like "author", if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Chris Hostetter <ho...@fucit.org>.

: I just updated the comments in solrconfig.xml:

I've tweaked the SolrCaching wiki page to include some of this info as
well, feel free to add any additional info you think would be helpful to
other people (or ask any questions about it if any of it still doesn't seem
clear to you)...

	http://wiki.apache.org/solr/SolrCaching

: > now, 400docs/sec!). However, I still don't have an idea what are these
: > values representing, and how I should estimate what values I should set
: > them to. Originally I thought it was the size of the cache in kb, and
: > someone on the list told me it was number of items, but I don't quite
: > get it. Better documentation on that would be welcomed :)



-Hoss

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

I just updated the comments in solrconfig.xml:

   <!-- Cache used by SolrIndexSearcher for filters (DocSets),
         unordered sets of *all* documents that match a query.
         When a new searcher is opened, its caches may be prepopulated
         or "autowarmed" using data from caches in the old searcher.
         autowarmCount is the number of items to prepopulate.  For LRUCache,
         the autowarmed items will be the most recently accessed items.
       Parameters:
         class - the SolrCache implementation (currently only LRUCache)
         size - the maximum number of entries in the cache
         initialSize - the initial capacity (number of entries) of
           the cache.  (see java.util.HashMap)
         autowarmCount - the number of entries to prepopulate from
           an old cache.
         -->
    <filterCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="256"/>
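
To make "size is a number of entries" concrete, here is a toy LRU cache in
Python. It is only a sketch and not Solr's LRUCache: the class, the run()
helper and the numbers are illustrative assumptions. Each "query" looks up
one entry per facet term, in the same order every time, which is roughly
what faceting by intersection does to the filterCache.

```python
from collections import OrderedDict


class ToyLRUCache:
    def __init__(self, size):
        self.size = size                  # maximum number of entries
        self.entries = OrderedDict()
        self.hits = self.lookups = self.evictions = 0

    def get(self, key, compute):
        self.lookups += 1
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        value = compute(key)
        self.entries[key] = value
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict least recently used
            self.evictions += 1
        return value


def run(cache_size, num_terms, num_queries):
    cache = ToyLRUCache(cache_size)
    for _ in range(num_queries):
        for term in range(num_terms):
            cache.get(term, lambda t: frozenset())  # placeholder DocSet
    return cache.hits, cache.lookups, cache.evictions


too_small = run(cache_size=512, num_terms=1000, num_queries=3)
big_enough = run(cache_size=1024, num_terms=1000, num_queries=3)
```

When the cache holds fewer entries than there are terms, every entry is
evicted before it is needed again, so the hit count stays at zero. Once the
size exceeds the number of terms, every lookup after the first query is a
hit and nothing is evicted.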

On 9/18/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> Another followup: I bumped all the caches in solrconfig.xml to
>
>       size="1600384"
>       initialSize="400096"
>       autowarmCount="400096"
>
> It seemed to fix the problem on a very small index (facets on last and
> first author fields, + 12 range date facets, sub 0.3 seconds for
> queries). I'll check on the full index tomorrow (it's indexing right
> now, 400docs/sec!). However, I still don't have an idea what these
> values represent, or how I should estimate what values I should set
> them to. Originally I thought it was the size of the cache in kb, and
> someone on the list told me it was number of items, but I don't quite
> get it. Better documentation on that would be welcome :)
>
> Also, are there any plans to add an option not to run a facet search if
> the result set is too big? To avoid 40-second queries if the docset is
> too large...

I'd like to speed up certain corner cases, but you can always set
timeouts in whatever frontend is making the request to Solr too.
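
A minimal sketch of such a frontend-side timeout, assuming a Python client that talks to Solr over HTTP. The function name fetch_with_timeout and the opener parameter are made up for the example; only urllib.request.urlopen and its timeout argument come from the standard library.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout_s, opener=urllib.request.urlopen):
    """Return the response body, or None if the request timed out.

    `opener` is injectable so the function can be exercised without a server.
    """
    try:
        with opener(url, timeout=timeout_s) as resp:
            return resp.read()
    except socket.timeout:
        # Raised when reading the response takes longer than timeout_s.
        return None
    except urllib.error.URLError as exc:
        # urlopen wraps connection-phase timeouts in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise
```

Note that this only stops the client from waiting; it does not cancel the work already started on the Solr side.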

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Joachim Martin <jm...@path-works.com>.

Michael Imbeault wrote:

> Also, are there any plans to add an option not to run a facet search if
> the result set is too big? To avoid 40-second queries if the docset
> is too large...

You could run one query with facet=false, check the result size and then 
run it again (should be fast because it is cached) with 
facet=true&rows=0 to get facet results only.
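
A rough sketch of that two-step approach in Python, building the two request URLs. The host and path in SOLR_SELECT, the field name "author", and the max_docs threshold are assumptions for illustration; facet, facet.field, facet.limit, and rows are the request parameters discussed in this thread.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def count_only_url(q):
    """First request: run the query without faceting to learn the result size."""
    return SOLR_SELECT + "?" + urlencode([("q", q), ("facet", "false")])


def facet_url(q, field, limit=5):
    """Second request: same query, no rows returned, facet counts only."""
    params = [
        ("q", q),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


def should_facet(num_found, max_docs):
    """Application-specific decision: skip faceting for very large result sets."""
    return num_found <= max_docs
```

The caller would fetch count_only_url(q), read the number of matching documents from the response, and request facet_url(q, "author") only if should_facet says so.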

I would think that the decision to run or skip facets would be highly 
specific to your collection and not easily turned into a configurable 
feature.

--Joachim

Re: Facet performance with heterogeneous 'facets'?

Posted by Michael Imbeault <mi...@sympatico.ca>.

Another followup: I bumped all the caches in solrconfig.xml to

      size="1600384"
      initialSize="400096"
      autowarmCount="400096"

It seemed to fix the problem on a very small index (facets on last and 
first author fields, + 12 range date facets, sub 0.3 seconds for 
queries). I'll check on the full index tomorrow (it's indexing right 
now, 400 docs/sec!). However, I still don't know what these values 
represent, or how I should estimate what to set them to. Originally I 
thought it was the size of the cache in KB, and someone on the list told 
me it was the number of items, but I don't quite get it. Better 
documentation on that would be welcome :)

Also, are there any plans to add an option not to run a facet search if 
the result set is too big? To avoid 40-second queries if the docset is 
too large...

Thanks,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:
> On 9/18/06, Michael Imbeault <mi...@sympatico.ca> wrote:
>> Just a little follow-up - I did a little more testing, and the query
>> takes 20 seconds no matter what, whether there's one document in the
>> result set or the query returns all 130000 documents.
>
> Yes, currently the same strategy is always used.
>   intersection_count(docs_matching_query, docs_matching_author1)
>   intersection_count(docs_matching_query, docs_matching_author2)
>   intersection_count(docs_matching_query, docs_matching_author3)
>   etc...
>
> Normally, the docsets will be cached, but since the number of authors
> is greater than the size of the filterCache, the effective cache hit
> rate will be 0%.
>
> -Yonik
>

Re: Facet performance with heterogeneous 'facets'?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/18/06, Michael Imbeault <mi...@sympatico.ca> wrote:
> Just a little follow-up - I did a little more testing, and the query
> takes 20 seconds no matter what, whether there's one document in the
> result set or the query returns all 130000 documents.

Yes, currently the same strategy is always used.
   intersection_count(docs_matching_query, docs_matching_author1)
   intersection_count(docs_matching_query, docs_matching_author2)
   intersection_count(docs_matching_query, docs_matching_author3)
   etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filterCache, the effective cache hit
rate will be 0%.
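
The strategy above can be sketched in Python as follows. This is an illustration only, not Solr's code: doc IDs are plain integers, docsets are Python sets, and the function name facet_counts and the sample authors are made up. It shows why the cost depends on the number of distinct authors and not on the size of the result set: one intersection is computed per author either way.

```python
def facet_counts(docs_matching_query, docs_by_author, limit=5):
    """Count, for each author, how many of the query's documents they appear in.

    One intersection is computed per distinct author, however small the
    query's result set is.
    """
    counts = {}
    for author, docs_matching_author in docs_by_author.items():
        n = len(docs_matching_query & docs_matching_author)
        if n > 0:
            counts[author] = n
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


if __name__ == "__main__":
    docs_by_author = {"smith": {1, 2, 3}, "jones": {2, 3}, "lee": {4}}
    print(facet_counts({2, 3, 4}, docs_by_author))
```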

-Yonik

Re: Facet performance with heterogeneous 'facets'?

Posted by Michael Imbeault <mi...@sympatico.ca>.

Just a little follow-up - I did a little more testing, and the query 
takes 20 seconds no matter what, whether there's one document in the 
result set or the query returns all 130000 documents.

It seems something isn't right... it looks like Solr is faceting over 
the whole index regardless of the result set when faceting on a string 
field. Am I doing something wrong?

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Michael Imbeault wrote:
> Been playing around with the new 'facets search' and it works very 
> well, but it's really slow for some particular applications. I've been 
> trying to use it to display the most frequent authors of articles; 
> this is from a huge (15 million articles) database and names of 
> authors are rare and heterogeneous. On a query that takes (without 
> facets) 0.1 seconds, it jumps to ~20 seconds with just 1% of the 
> documents indexed (I've been getting java.lang.OutOfMemoryError with 
> the full index). ~40 seconds for a faceted search on 2 (string) 
> fields. Range queries on a slong field are more acceptable (even with 
> a dozen of them, query time is still in the subsecond range).
>
> Am I trying to do something which isn't what faceted search was made 
> for? It would be understandable, after all, I guess the facets engine 
> has to check every doc in the index and sort... which shouldn't yield 
> good performance no matter what, sadly.
>
> Is there any other way I could achieve what I'm trying to do? Just a 
> list of the most frequent (top 5) authors present in the results of a 
> query.
>
> Thanks,
>