
Posted to solr-user@lucene.apache.org by Yonik Seeley <yo...@apache.org> on 2008/11/25 05:12:00 UTC

new faceting algorithm

A new faceting algorithm has been committed to the development version
of Solr, and should be available in the next nightly test build (will
be dated 11-25).  This change should generally improve field faceting
where the field has many unique values but relatively few values per
document.  This new algorithm is now the default for multi-valued
fields (including tokenized fields) so you shouldn't have to do
anything to enable it.  We'd love some feedback on how it works to
ensure that it actually is a win for the majority and should be the
default.

-Yonik

Re: new faceting algorithm

Posted by Yonik Seeley <yo...@apache.org>.

On Thu, Dec 4, 2008 at 2:57 PM, wojtekpia <wo...@hotmail.com> wrote:
> Yonik Seeley wrote:
>>
>> Are you doing commits at any time?
>> One possibility is the caching mechanism (weak-ref on the
>> IndexReader)... that's going to be changing soon hopefully.
>>
>> -Yonik
>>
>
> No commits during this test. Should I start looking into my heap size
> distribution and garbage collector selection?

Hmmm, OK.  The other big difference would then be that retrieving the
top facets requires creating a Lucene TermEnum (not all facet values
are stored in memory).  The Lucene version in Solr has changed since I
did long-running tests... with various Lucene changes to thread-local
caching, etc.  I'll try to reproduce.  Or maybe this is somehow a GC
bug just tickled by the current caching mechanism (a weak hash map)?

-Yonik

Re: new faceting algorithm

Posted by wojtekpia <wo...@hotmail.com>.


Yonik Seeley wrote:
> 
> 
> Are you doing commits at any time?
> One possibility is the caching mechanism (weak-ref on the
> IndexReader)... that's going to be changing soon hopefully.
> 
> -Yonik
> 


No commits during this test. Should I start looking into my heap size
distribution and garbage collector selection?
-- 
View this message in context: http://www.nabble.com/new-faceting-algorithm-tp20674902p20841219.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: new faceting algorithm

Posted by Yonik Seeley <yo...@apache.org>.

On Thu, Dec 4, 2008 at 2:28 PM, wojtekpia <wo...@hotmail.com> wrote:
> I'm seeing some strange behavior with my garbage collector that disappears
> when I turn off this optimization. I'm running load tests on my deployment.
> For the first few minutes, everything is fine (and this patch does make
> things faster - I haven't quantified the improvement yet). After that, the
> garbage collector stops collecting. Specifically, the new generation part of
> the heap is full, but never garbage collected, and the old generation is
> emptied, then never gets anything more.

Are you doing commits at any time?
One possibility is the caching mechanism (weak-ref on the
IndexReader)... that's going to be changing soon hopefully.

-Yonik


> This throttles Solr performance
> (average response times that used to be ~500ms are now ~25s).
>
> I described my deployment scenario in an earlier post:
> http://www.nabble.com/Throughput-Optimization-td20335132.html
>
> Does it sound like the new faceting algorithm could be the culprit?
>
>

Re: new faceting algorithm

Posted by wojtekpia <wo...@hotmail.com>.

It looks like my filterCache was too big. I reduced my filterCache size from
700,000 to 20,000 (without changing the heap size) and all my performance
issues went away. I experimented with various GC settings, but none of them
made a significant difference.
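
A rough sketch of why an oversized filterCache can swamp the heap. It
assumes the worst case, where every cached filter is a full bitset with one
bit per document; Solr stores small sets more compactly, so real usage
should be lower. The document count used in the usage note is hypothetical,
and the function names are made up for illustration.

```python
def bitset_bytes(max_doc):
    """Bytes for a bitset holding one bit per document, in 64-bit words."""
    words = (max_doc + 63) // 64
    return words * 8


def worst_case_filter_cache_bytes(entries, max_doc):
    """Upper bound on cache size if every entry were a full bitset."""
    return entries * bitset_bytes(max_doc)
```

For a hypothetical index of 1,000,000 documents, each full bitset is
125,000 bytes, so 700,000 entries could in the worst case need about
87.5 GB, while 20,000 entries would be bounded by about 2.5 GB.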

I see a 16% increase in throughput by applying this patch.

Yonik Seeley wrote:
> 
> ... This can be a big chunk of memory
> per-request, and is most likely what changed your GC profile (i.e.
> changing the GC settings may help).
> 
> 


Re: new faceting algorithm

Posted by Yonik Seeley <yo...@apache.org>.

On Thu, Dec 4, 2008 at 2:28 PM, wojtekpia <wo...@hotmail.com> wrote:
> I'm seeing some strange behavior with my garbage collector that disappears
> when I turn off this optimization.

I just changed the new faceting code to use a Solr cache.
Look for the "fieldValueCache" on the statistics page now.

It just occurred to me that there is a big difference in how memory is
used with facet.method=fc.
Since we traverse documents and count up terms, we need to allocate an
int[nTerms]
to accumulate those counts.  This can be a big chunk of memory
per-request, and is most likely what changed your GC profile (i.e.
changing the GC settings may help).
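
To make the memory cost concrete, here is a minimal sketch of the counting
step described above. It is not the actual Solr code: the function names and
the in-memory mapping from documents to term numbers are assumptions made
purely for illustration.

```python
from array import array


def count_facets(doc_term_ords, matching_docs, n_terms):
    """Count how often each term number occurs in the matching documents.

    doc_term_ords: hypothetical mapping of doc id -> list of term numbers.
    matching_docs: iterable of doc ids that matched the query.
    n_terms: number of unique terms in the field.
    """
    # One counter per unique term, allocated fresh for every request.
    # This plays the role of the int[nTerms] array.
    counts = array("i", [0]) * n_terms
    for doc_id in matching_docs:
        for term_ord in doc_term_ords[doc_id]:
            counts[term_ord] += 1
    return counts


def counts_array_bytes(n_terms, bytes_per_int=4):
    """Approximate size of the per-request counter array (4-byte ints)."""
    return n_terms * bytes_per_int
```

With roughly 2 million distinct values (the figure reported elsewhere in
this thread for an author field), a 4-byte counter per term comes to about
8 MB allocated per request, which is the kind of short-lived allocation
that can change a GC profile under load.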

-Yonik


> I'm running load tests on my deployment.
> For the first few minutes, everything is fine (and this patch does make
> things faster - I haven't quantified the improvement yet). After that, the
> garbage collector stops collecting. Specifically, the new generation part of
> the heap is full, but never garbage collected, and the old generation is
> emptied, then never gets anything more. This throttles Solr performance
> (average response times that used to be ~500ms are now ~25s).
>
> I described my deployment scenario in an earlier post:
> http://www.nabble.com/Throughput-Optimization-td20335132.html
>
> Does it sound like the new faceting algorithm could be the culprit?
>
>

Re: new faceting algorithm

Posted by wojtekpia <wo...@hotmail.com>.

I'm seeing some strange behavior with my garbage collector that disappears
when I turn off this optimization. I'm running load tests on my deployment.
For the first few minutes, everything is fine (and this patch does make
things faster - I haven't quantified the improvement yet). After that, the
garbage collector stops collecting. Specifically, the new generation part of
the heap is full, but never garbage collected, and the old generation is
emptied, then never gets anything more. This throttles Solr performance
(average response times that used to be ~500ms are now ~25s). 

I described my deployment scenario in an earlier post:
http://www.nabble.com/Throughput-Optimization-td20335132.html

Does it sound like the new faceting algorithm could be the culprit?

wojtekpia wrote:
> 
> Definitely, but it'll take me a few days. I'll also report findings on
> SOLR-465. (I've been on holiday for a few weeks)
> 
> 
> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>> 
>> wojtek, you can report back the numbers if possible
>> 
>> It would be nice to know how the new impl performs in real-world
>> 
>> 
>> 
> 
> 


Re: new faceting algorithm

Posted by wojtekpia <wo...@hotmail.com>.

Definitely, but it'll take me a few days. I'll also report findings on
SOLR-465. (I've been on holiday for a few weeks)


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> wojtek, you can report back the numbers if possible
> 
> It would be nice to know how the new impl performs in real-world
> 
> 
> 


Re: new faceting algorithm

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.

wojtek, please report back the numbers if possible.

It would be nice to know how the new implementation performs in real-world use.

On Tue, Dec 2, 2008 at 11:45 PM, Yonik Seeley <yo...@apache.org> wrote:
> On Tue, Dec 2, 2008 at 1:10 PM, wojtekpia <wo...@hotmail.com> wrote:
>> Is there a configurable way to switch to the previous implementation? I'd
>> like to see exactly how it affects performance in my case.
>
> Thanks for the reminder, I need to document this in the wiki.
>
> facet.method=enum  (enumerate terms and do intersections, the old default)
> facet.method=fc  (fieldcache method, the new default)
>
> -Yonik
>
>>
>> Yonik Seeley wrote:
>>>
>>> And if you want to verify that the new faceting code has indeed kicked
>>> in, some statistics are logged, like:
>>>
>>> Nov 24, 2008 11:14:32 PM org.apache.solr.request.UnInvertedField uninvert
>>> INFO: UnInverted multi-valued field features, memSize=14584, time=47,
>>> phase1=47,
>>>  nTerms=285, bigTerms=99, termInstances=186
>>>
>>> -Yonik
>>>
>>>
>>
>



-- 
--Noble Paul

Re: new faceting algorithm

Posted by Yonik Seeley <yo...@apache.org>.

On Tue, Dec 2, 2008 at 1:10 PM, wojtekpia <wo...@hotmail.com> wrote:
> Is there a configurable way to switch to the previous implementation? I'd
> like to see exactly how it affects performance in my case.

Thanks for the reminder, I need to document this in the wiki.

facet.method=enum  (enumerate terms and do intersections, the old default)
facet.method=fc  (fieldcache method, the new default)
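
As a rough illustration, a request that selects one method or the other
might be built as below. The host, handler path and field name in the usage
note are hypothetical placeholders; facet.method and its two values come
from the message above, while facet and facet.field are the usual Solr
faceting parameters.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, field, method):
    """Build a select URL that facets on one field with the given method."""
    if method not in ("enum", "fc"):
        raise ValueError("facet.method must be 'enum' or 'fc'")
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.method", method),  # enum = old default, fc = new default
    ]
    return base_url + "?" + urlencode(params)
```

Calling facet_query_url("http://localhost:8983/solr/select", "solr",
"author", "enum") and then the same with "fc" gives two URLs that differ
only in facet.method, which makes it easy to compare response times for the
same query.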

-Yonik

>
> Yonik Seeley wrote:
>>
>> And if you want to verify that the new faceting code has indeed kicked
>> in, some statistics are logged, like:
>>
>> Nov 24, 2008 11:14:32 PM org.apache.solr.request.UnInvertedField uninvert
>> INFO: UnInverted multi-valued field features, memSize=14584, time=47,
>> phase1=47,
>>  nTerms=285, bigTerms=99, termInstances=186
>>
>> -Yonik
>>
>>
>

Re: new faceting algorithm

Posted by wojtekpia <wo...@hotmail.com>.

Is there a configurable way to switch to the previous implementation? I'd
like to see exactly how it affects performance in my case.


Yonik Seeley wrote:
> 
> And if you want to verify that the new faceting code has indeed kicked
> in, some statistics are logged, like:
> 
> Nov 24, 2008 11:14:32 PM org.apache.solr.request.UnInvertedField uninvert
> INFO: UnInverted multi-valued field features, memSize=14584, time=47,
> phase1=47,
>  nTerms=285, bigTerms=99, termInstances=186
> 
> -Yonik
> 
> 


Re: new faceting algorithm

Posted by Yonik Seeley <yo...@apache.org>.

And if you want to verify that the new faceting code has indeed kicked
in, some statistics are logged, like:

Nov 24, 2008 11:14:32 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field features, memSize=14584, time=47, phase1=47,
 nTerms=285, bigTerms=99, termInstances=186
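
If you want to collect these statistics from a log file, a small sketch like
the following pulls the key=value pairs out of such a message. The function
name is made up for illustration, and it assumes the message keeps the
format shown above.

```python
import re


def parse_uninvert_stats(log_text):
    """Return the numeric key=value pairs from an uninvert log message."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", log_text)}


sample = (
    "INFO: UnInverted multi-valued field features, memSize=14584, time=47, phase1=47,\n"
    " nTerms=285, bigTerms=99, termInstances=186"
)
stats = parse_uninvert_stats(sample)
```

Here stats["nTerms"] is 285 and stats["memSize"] is 14584, matching the
sample message.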

-Yonik

On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley <yo...@apache.org> wrote:
> A new faceting algorithm has been committed to the development version
> of Solr, and should be available in the next nightly test build (will
> be dated 11-25).  This change should generally improve field faceting
> where the field has many unique values but relatively few values per
> document.  This new algorithm is now the default for multi-valued
> fields (including tokenized fields) so you shouldn't have to do
> anything to enable it.  We'd love some feedback on how it works to
> ensure that it actually is a win for the majority and should be the
> default.
>
> -Yonik
>

Re: new faceting algorithm

Posted by Andre Hagenbruch <An...@rub.de>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Till Kinstler wrote:

Hi,

> I just did a quick test using Solr nightly 2008-11-30. I have an index
> of about 2.9 mil bibliographic records, size: 16G. I tested faceting
> author names, each index document may contain multiple author names, so
> author names go into a multivalued field (not analyzed). Queries used
> for testing were extracted from log files of a prototype application.
> With facet.method=enum, 50 request threads, I get an average response
> time of about 190000(!) ms, no cache evictions. With 1 request thread:
> about 1800 ms.
> With facet.method=fc, 50 threads I get an average response time of
> around 300 ms. 1 thread: 16 ms.
> Seems to be a major improvement at first sight :-)

same here: multi valued author fields were the bottleneck with 1.3 for
us, too. I'm currently testing with 1.5 million records, ~1.2 million of
which have values for the author field, but with ~2 million distinct
values. With Solr 1.3 we had average response times of 15000-25000 ms
for 10 parallel requests (depending on cache settings), with 1.4 they
are now down to 230 ms...

Regards,

Andre
- --
Andre Hagenbruch
Projekt "Integriertes Bibliotheksportal"
Universitaetsbibliothek Bochum, Etage 4/Raum 6
Fon: +49 234 3229346, Fax: +49 234 3214736
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkk5G5kACgkQ3wuzs9k1icVbOACgta0COUoOJGRN93puG2LzBJZU
t1EAn3od/3CmD9zE0ioo/yjQ5YrHv+1m
=80sA
-----END PGP SIGNATURE-----

Re: new faceting algorithm

Posted by Till Kinstler <ki...@gbv.de>.

Yonik Seeley wrote:

> We'd love some feedback on how it works to
> ensure that it actually is a win for the majority and should be the
> default.

I just did a quick test using Solr nightly 2008-11-30. I have an index 
of about 2.9 mil bibliographic records, size: 16G. I tested faceting
author names, each index document may contain multiple author names, so 
author names go into a multivalued field (not analyzed). Queries used 
for testing were extracted from log files of a prototype application.
With facet.method=enum, 50 request threads, I get an average response 
time of about 190000(!) ms, no cache evictions. With 1 request thread: 
about 1800 ms.
With facet.method=fc, 50 threads I get an average response time of 
around 300 ms. 1 thread: 16 ms.
Seems to be a major improvement at first sight :-)

Regards,
Till

-- 
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinstler@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de

Re: new faceting algorithm

Posted by Rob Casson <ro...@gmail.com>.

very similar situation to those already reported.  2.9M bibliographic
records, with authors being the (previous) bottleneck, and the one
we're starting to test with the new algorithm.

so far, no load tests, but just in single requests i'm seeing the same
improvements...phenomenal improvements, btw, with most example queries
taking less than 1/100th of the time

always very impressed with this project/product, and just thought i'd
add a "me-too" to the list...cheers, and have a great weekend,

rob

On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley <yo...@apache.org> wrote:
> A new faceting algorithm has been committed to the development version
> of Solr, and should be available in the next nightly test build (will
> be dated 11-25).  This change should generally improve field faceting
> where the field has many unique values but relatively few values per
> document.  This new algorithm is now the default for multi-valued
> fields (including tokenized fields) so you shouldn't have to do
> anything to enable it.  We'd love some feedback on how it works to
> ensure that it actually is a win for the majority and should be the
> default.
>
> -Yonik
>

Re: new faceting algorithm

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

Peter,

It is the UnInvertedField class. See also:
https://issues.apache.org/jira/browse/SOLR-475


Peter Keegan wrote:
> Hi Yonik,
>
> May I ask in which class(es) this improvement was made? I've been using the
> DocSet, DocList, BitDocSet, HashDocSet from Solr from a few years ago with a
> Lucene based app. to do faceting.
>
> Thanks,
> Peter
>
>

Re: new faceting algorithm

Posted by Peter Keegan <pe...@gmail.com>.

Hi Yonik,

May I ask in which class(es) this improvement was made? I've been using the
DocSet, DocList, BitDocSet, HashDocSet from Solr from a few years ago with a
Lucene based app. to do faceting.

Thanks,
Peter

On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley <yo...@apache.org> wrote:

> A new faceting algorithm has been committed to the development version
> of Solr, and should be available in the next nightly test build (will
> be dated 11-25).  This change should generally improve field faceting
> where the field has many unique values but relatively few values per
> document.  This new algorithm is now the default for multi-valued
> fields (including tokenized fields) so you shouldn't have to do
> anything to enable it.  We'd love some feedback on how it works to
> ensure that it actually is a win for the majority and should be the
> default.
>
> -Yonik
>