You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Yonik Seeley <yo...@lucidimagination.com> on 2009/06/16 02:52:09 UTC

Re: UnInvertedField performance on faceted fields containing many unique terms

Great writeup Ken,

All the constants you see in UnInvertedField were a best guess - I
wasn't working with any real data.  It's surprising that a big array
allocation every 4096 terms is so significant - I had figured that the
work involved in processing that many terms would far outweigh
realloc+GC.

Could you open a JIRA issue with your recommended changes?  It's
simple enough we should have no problem getting it in for Solr 1.4.

Also, are you using a recent Solr build (within the last month)?
LUCENE-1596 should improve uninvert time for non-optimized indexes.

And don't forget to update http://wiki.apache.org/solr/PublicServers
when you go live!

-Yonik
http://www.lucidimagination.com



On Mon, Jun 15, 2009 at 7:43 PM, Kent Fitch<ke...@gmail.com> wrote:
> Hi,
>
> This may be of interest to other users of SOLR's UnInvertedField who
> have a very large number of unique terms in faceted fields.
>
> Our setup is :
>
> - about 34M lucene documents of bibliographic and full text content
> - index currently 115GB, will at least double over next 6 months
> - moving to support real-time-ish updates (maybe 5 min delay)
>
> We facet on 8 fields, 6 of which are "normal" with small numbers of
> distinct values.  But 2 faceted fields, creator and subject, are huge,
> with 18M and 9M terms respectively.  (Whether we should be faceting on
> such a huge number of values, and at the same time attempting to
> provide real time-ish updates is another question!  Whether facets
> derived from all of the hundreds of thousands of results regardless of
> match quality which typically happens in a large full text index is
> yet another question!).  The app is visible here:
> http://sbdsproto.nla.gov.au/
>
> On a server with 2xquad core AMD 2382 processors and 64GB memory, java
> 1.6.0_13-b03, 64 bit run with "-Xmx15192M -Xms6000M -verbose:gc", with
> the index on Intel X25M SSD, on start-up the elapsed time to create
> the 8 facets is 306 seconds (best time).  Following an index reopen,
> the time to recreate them in 318 seconds (best time).
>
> [We have made an independent experimental change to create the facets
> with 3 async threads, that is, in parallel, and also to decouple them
> from the underlying index, so our facets lag the index changes by the
> time to recreate the facets.  With our setup, the 3 threads reduced
> facet creation elapsed time from about 450 secs to around 320 secs,
> but this will depend a lot on IO capabilities of the device containing
> the index, amount of file system caching, load, etc]
>
> Anyway, we noticed that huge amounts of garbage were being collected
> during facet generation of the creator and subject fields, and tracked
> it down to this decision in UnInvertedField univert():
>
>      if (termNum >= maxTermCounts.length) {
>        // resize, but conserve memory by not doubling
>        // resize at end??? we waste a maximum of 16K (average of 8K)
>        int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
>        System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
>        maxTermCounts = newMaxTermCounts;
>      }
>
> So, we tried the obvious thing:
>
> - allocate 10K terms initially, rather than 1K
> - extend by doubling the current size, rather than adding a fixed 4K
> - free unused space at the end (but only if unused space is
> "significant") by reallocating the array to the exact required size
>
> And also:
>
> - created a static HashMap lookup keyed on field name which remembers
> the previous allocated size for maxTermCounts for that field, and
> initially allocates that size + 1000 entries
>
> The second change is a minor optimisation, but the first change, by
> eliminating thousands of array reallocations and copies, greatly
> improved load times, down from 306 to 124 seconds on the initial load
> and from 318 to 134 seconds on reloads after index updates.  About
> 60-70 secs is still spend in GC, but it is a significant improvement.
>
> Unless you have very large numbers of facet values, this change won't
> have any positive benefit.
>
> Regards,
>
> Kent Fitch
>

Re: UnInvertedField performance on faceted fields containing many unique terms

Posted by Kent Fitch <ke...@gmail.com>.
Hi Yonik,

On Tue, Jun 16, 2009 at 10:52 AM, Yonik
Seeley<yo...@lucidimagination.com> wrote:

> All the constants you see in UnInvertedField were a best guess - I
> wasn't working with any real data.  It's surprising that a big array
> allocation every 4096 terms is so significant - I had figured that the
> work involved in processing that many terms would far outweigh
> realloc+GC.

Well, they were pretty good guesses!  The code is extremely fast for
"reasonable" sized term lists.
I think with our 18M terms, the increasingly long array of ints was
being reallocated, copied and garbage collected 18M/4K = 4,500 times,
creating 4500x(18Mx4bytes)/2 = 162GB of garbage to collect.

> Could you open a JIRA issue with your recommended changes?  It's
> simple enough we should have no problem getting it in for Solr 1.4.

Thanks - just added SOLR-1220.  I havent mentioned the change to the
initial allocation on 10K (rather than 1024) because I dont think it
is significant.  I also havent mentioned the remembering of sizes to
initially allocate, because the improvement is marginal compared to
this big change, and for all I know, a static hashmap with fieldnames
could cause unwanted side effects from field name clashes if running
SOLR with multiple indices.

> Also, are you using a recent Solr build (within the last month)?
> LUCENE-1596 should improve uninvert time for non-optimized indexes.

We're not - but we'll upgrade to the latest version of 1.4 very soon.

> And don't forget to update http://wiki.apache.org/solr/PublicServers
> when you go live!

We will - thanks for your great work in improving SOLR performance
with 1.4 which makes such outrageous uses of facets even thinkable.

Regards,

Kent Fitch