Posted to solr-dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2009/06/16 17:43:07 UTC
[jira] Resolved: (SOLR-1220) UnInvertedField performance improvement on fields with an extremely large number of values
[ https://issues.apache.org/jira/browse/SOLR-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yonik Seeley resolved SOLR-1220.
--------------------------------
Resolution: Fixed
Fix Version/s: 1.4
Committed. Thanks Kent!
A small future optimization could collect a list of byte[] chunks for maxTermCounts and then do a single resize at the end.
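That single-resize-at-the-end idea could look roughly like the sketch below. This is a hypothetical illustration, not Solr code: the class and method names (ChunkedCounts, set, toArray) are invented here, and it stages counts in fixed-size int[] chunks (rather than byte[]) to match the int[] maxTermCounts type, flattening with one copy at the end instead of reallocating a growing array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: grow by appending fixed-size chunks (no copying on
// growth), then flatten into one exactly-sized int[] with a single pass.
class ChunkedCounts {
    private static final int CHUNK_SIZE = 4096;
    private final List<int[]> chunks = new ArrayList<>();
    private int size = 0;

    void set(int termNum, int count) {
        while (termNum >= chunks.size() * CHUNK_SIZE) {
            chunks.add(new int[CHUNK_SIZE]);   // growth never copies old data
        }
        chunks.get(termNum / CHUNK_SIZE)[termNum % CHUNK_SIZE] = count;
        size = Math.max(size, termNum + 1);
    }

    int[] toArray() {                          // the single resize at the end
        int[] out = new int[size];
        for (int i = 0; i < chunks.size(); i++) {
            int len = Math.min(CHUNK_SIZE, size - i * CHUNK_SIZE);
            System.arraycopy(chunks.get(i), 0, out, i * CHUNK_SIZE, len);
        }
        return out;
    }
}
```

The trade-off is an extra indirection per access while uninverting, in exchange for zero intermediate copies and an exactly-sized final array.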
> UnInvertedField performance improvement on fields with an extremely large number of values
> ------------------------------------------------------------------------------------------
>
> Key: SOLR-1220
> URL: https://issues.apache.org/jira/browse/SOLR-1220
> Project: Solr
> Issue Type: Improvement
> Components: search
> Affects Versions: 1.4
> Reporter: Kent Fitch
> Priority: Minor
> Fix For: 1.4
>
> Attachments: Udiff.txt
>
>
> Our setup is:
> - about 34M lucene documents of bibliographic and full text content
> - index currently 115GB, will at least double over next 6 months
> - moving to support real-time-ish updates (maybe 5 min delay)
> We facet on 8 fields, 6 of which are "normal" with small numbers of
> distinct values. But 2 faceted fields, creator and subject, are huge,
> with 18M and 9M terms respectively.
> On a server with 2xquad core AMD 2382 processors and 64GB memory, java
> 1.6.0_13-b03, 64 bit run with "-Xmx15192M -Xms6000M -verbose:gc", with
> the index on Intel X25M SSD, on start-up the elapsed time to create
> the 8 facets is 306 seconds (best time). Following an index reopen,
> the time to recreate them is 318 seconds (best time).
> [We have made an independent experimental change to create the facets
> with 3 async threads, that is, in parallel, and also to decouple them
> from the underlying index, so our facets lag the index changes by the
> time to recreate the facets. With our setup, the 3 threads reduced
> facet creation elapsed time from about 450 secs to around 320 secs,
> but this will depend a lot on IO capabilities of the device containing
> the index, amount of file system caching, load, etc]
> Anyway, we noticed that huge amounts of garbage were being collected
> during facet generation of the creator and subject fields, and tracked
> it down to this decision in UnInvertedField uninvert():
> if (termNum >= maxTermCounts.length) {
> // resize, but conserve memory by not doubling
> // resize at end??? we waste a maximum of 16K (average of 8K)
> int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
> System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
> maxTermCounts = newMaxTermCounts;
> }
> So, we tried the obvious thing:
> - allocate 10K terms initially, rather than 1K
> - extend by doubling the current size, rather than adding a fixed 4K
> - free unused space at the end (but only if unused space is
> "significant") by reallocating the array to the exact required size
> And also:
> - created a static HashMap lookup keyed on field name which remembers
> the previous allocated size for maxTermCounts for that field, and
> initially allocates that size + 1000 entries
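The per-field size cache described above could be sketched as follows. This is a hedged illustration under the assumptions stated in the text (static HashMap keyed on field name, 10K initial allocation, previous size + 1000 on reuse); the class and method names (FieldSizeHints, newMaxTermCounts, remember) are invented here and are not the actual patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: remember the final maxTermCounts size per field so
// the next uninvert of that field starts near the right size.
class FieldSizeHints {
    private static final Map<String, Integer> lastSize = new HashMap<>();
    private static final int DEFAULT_TERMS = 10 * 1024; // 10K initial allocation
    private static final int HEADROOM = 1000;           // "+ 1000 entries" on reuse

    // synchronized because uninverting different fields can run concurrently
    static synchronized int[] newMaxTermCounts(String field) {
        Integer prev = lastSize.get(field);
        return new int[prev == null ? DEFAULT_TERMS : prev + HEADROOM];
    }

    static synchronized void remember(String field, int numTermsInField) {
        lastSize.put(field, numTermsInField);
    }
}
```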
> The second change is a minor optimisation, but the first change, by
> eliminating thousands of array reallocations and copies, greatly
> improved load times, down from 306 to 124 seconds on the initial load
> and from 318 to 134 seconds on reloads after index updates. About
> 60-70 secs is still spent in GC, but it is a significant improvement.
> Unless you have very large numbers of facet values, this change won't
> yield any noticeable benefit.
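Some back-of-the-envelope arithmetic (an editorial illustration, not part of the patch) shows why the growth strategy dominates at this scale. With ~18M terms in the creator field, fixed +4096 growth triggers a reallocation every 4096 terms, each copying the whole array so far, while doubling from an initial 10K needs only a handful of reallocations:

```java
public class GrowthMath {
    // reallocations needed to reach `terms` entries growing by a fixed 4096
    static long fixedGrowthReallocs(long terms) {
        return terms / 4096;
    }

    // reallocations needed to reach `terms` entries by doubling from `start`
    static long doublingReallocs(long start, long terms) {
        long n = 0;
        for (long size = start; size < terms; size *= 2) n++;
        return n;
    }

    public static void main(String[] args) {
        long terms = 18_000_000L; // roughly the creator field above
        // ~4,394 reallocations, each copying on average ~9M ints
        System.out.println(fixedGrowthReallocs(terms) + " reallocations with +4096 growth");
        // 11 reallocations, total ints copied bounded by ~2x the final size
        System.out.println(doublingReallocs(10_240L, terms) + " reallocations with doubling from 10K");
    }
}
```

The fixed-increment scheme copies on the order of 4,400 x 9M = ~40 billion ints (~150GB of memory traffic plus the matching garbage), which is consistent with the GC pressure reported above; doubling bounds total copying to less than twice the final array size.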
> The core part of our change is reflected by this diff against revision 785058:
> ***************
> *** 222,232 ****
> int termNum = te.getTermNumber();
> if (termNum >= maxTermCounts.length) {
> ! // resize, but conserve memory by not doubling
> ! // resize at end??? we waste a maximum of 16K (average of 8K)
> ! int[] newMaxTermCounts = new int[maxTermCounts.length+4096];
> System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
> maxTermCounts = newMaxTermCounts;
> }
> --- 222,232 ----
> int termNum = te.getTermNumber();
> if (termNum >= maxTermCounts.length) {
> ! // resize by doubling - for very large number of unique terms, expanding
> ! // by 4K and resultant GC will dominate uninvert times. Resize at end if material
> ! int[] newMaxTermCounts = new int[maxTermCounts.length*2];
> System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, termNum);
> maxTermCounts = newMaxTermCounts;
> }
> ***************
> *** 331,338 ****
> --- 331,346 ----
> numTermsInField = te.getTermNumber();
> te.close();
> + // free space if outrageously wasteful (tradeoff memory/cpu)
> +
> + if ((maxTermCounts.length - numTermsInField) > 1024) { // too much waste!
> + int[] newMaxTermCounts = new int[numTermsInField];
> + System.arraycopy(maxTermCounts, 0, newMaxTermCounts, 0, numTermsInField);
> + maxTermCounts = newMaxTermCounts;
> + }
> +
> long midPoint = System.currentTimeMillis();
> if (termInstances == 0) {
> // we didn't invert anything
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.