Posted to solr-dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2007/04/29 07:01:15 UTC

[jira] Commented: (SOLR-221) faceting memory and performance improvement

    [ https://issues.apache.org/jira/browse/SOLR-221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492543 ] 

Yonik Seeley commented on SOLR-221:
-----------------------------------

The results are slightly surprising.

I made up a synthetic index in which each document contained 4 random numbers between 1 and 500,000.
This is not the distribution one would expect to see in a real index, but we can still learn much from it.

The synthetic index:
 maxDoc=500,000
 numDocs=393,566
 number of segments = 5
 number of unique facet terms = 490,903
 filterCache max size = 1,000,000 entries (more than enough)
 JVM=1.5.0_09 -server -Xmx200M
 System=WinXP, 3GHz P4, hyperthreaded, 1GB dual channel RAM
 facet type = facet.field, facet.sort=true, facet.limit=10
 maximum df of any term = 15
 warming times were not included... queries were run many times and the lowest time recorded.

Number of documents that match test "base" queries (for example, base query #1 matches 175K docs):
1) 175,000
2) 43,000
3) 8,682
4) 2,179
5) 422
6) 1

WITHOUT PATCH (milliseconds to facet each base query):
1578, 1578, 1547, 1485, 1484, 1422

WITH PATCH (min df comparison w/ term df,  minDfFilterCache=0) (all field cache)
 984,  1203, 1391, 1437, 1484, 1420

WITH PATCH (min df comp, minDfFilterCache=30)  (no fieldCache at all)
1406, 2344, 3125, 3015, 3172, 3172

CONCLUSION1: the min df comparison increases faceting speed by 60% when the base query matches many documents.  With a real term distribution, the gain could be even greater.
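
The min-df comparison can be sketched roughly as follows (a minimal Java illustration with hypothetical names, not the actual code in facet.patch): while collecting the top-N facet counts, a term's df is an upper bound on its intersection count, so any term whose df cannot beat the smallest count already in the priority queue can be skipped without computing the intersection at all.

```java
// Sketch of the df-pruning idea (hypothetical names, not the actual
// Solr code from facet.patch).
import java.util.PriorityQueue;

class FacetEntry {
    final String term;
    final int count;
    FacetEntry(String term, int count) { this.term = term; this.count = count; }
}

class DfPruningFacetCounter {
    interface TermSource {
        // df is a cheap upper bound on the intersection count
        int df(String term);
        // expensive: counts docs matching both the base query and the term
        int intersectionCount(String term);
    }

    static PriorityQueue<FacetEntry> topN(Iterable<String> terms, TermSource src, int limit) {
        // min-heap ordered by count, so peek() is the current cutoff
        PriorityQueue<FacetEntry> queue =
            new PriorityQueue<>(limit, (a, b) -> Integer.compare(a.count, b.count));
        for (String term : terms) {
            int minNeeded = queue.size() < limit ? 0 : queue.peek().count;
            // the term's count can never exceed its df, so skip the
            // expensive intersection when df can't beat the cutoff
            if (src.df(term) <= minNeeded) continue;
            int count = src.intersectionCount(term);
            if (count > minNeeded) {
                if (queue.size() >= limit) queue.poll();
                queue.add(new FacetEntry(term, count));
            }
        }
        return queue;
    }
}
```

This is why the savings grow with the size of the base query's result set: a large base query fills the queue with high counts quickly, raising the cutoff and letting more low-df terms be skipped.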

CONCLUSION2: opting not to use the fieldCache for smaller-df terms can save a lot of memory, but it slows faceting by up to 200% on our non-optimized index.
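
The minDfFilterCache trade-off can be sketched like this (again a hypothetical illustration, not the patch itself): terms below the df threshold are counted by walking their short postings lists directly, avoiding a cached filter per term; only frequent terms pay for (and benefit from) a filterCache intersection.

```java
// Sketch of the minDfFilterCache threshold (hypothetical names, not
// the actual Solr code from facet.patch).
import java.util.BitSet;

class ThresholdedCounter {
    interface Postings {
        int[] docs(String term);           // doc ids containing the term
        BitSet cachedFilter(String term);  // per-term filter from the filterCache
    }

    static int countForTerm(String term, BitSet baseDocSet, Postings postings,
                            int minDfFilterCache) {
        int[] docs = postings.docs(term);
        if (docs.length < minDfFilterCache) {
            // cheap path: walk the (short) postings list directly,
            // no filter is built or cached for this term
            int count = 0;
            for (int doc : docs) {
                if (baseDocSet.get(doc)) count++;
            }
            return count;
        }
        // frequent term: intersect its cached filter with the base doc set
        BitSet filter = (BitSet) postings.cachedFilter(term).clone();
        filter.and(baseDocSet);
        return filter.cardinality();
    }
}
```

The memory saving comes from never materializing filters for the long tail of rare terms; the slowdown measured above is the price of re-walking those postings on every request instead of reusing a cached filter.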

CONCLUSION3: using the field cache less can significantly speed up warming time (times not shown, but a full warming of the fieldCache took 33 sec)

======== now the same index, but optimized ===========
WITH PATCH (optimized, min df comparison w/ term df,  minDfFilterCache=0) (all field cache)
 172,  312,  485,  578,  610,  656

WITH PATCH (optimized, min df comp, minDfFilterCache=30)  (no fieldCache at all)
 265,  344,  422,  468,  500,  484  

CONCLUSION4: An optimized index increased faceting performance by 200-500%.

CONCLUSION5: The fact that the all-fieldCache option was significantly faster on an optimized index probably cannot be explained entirely by accurate dfs (an optimized index has no deleted documents to inflate the term df values); it means that just iterating over the terms is *much* faster in an optimized index (a potential Lucene area to look into).


> faceting memory and performance improvement
> -------------------------------------------
>
>                 Key: SOLR-221
>                 URL: https://issues.apache.org/jira/browse/SOLR-221
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Yonik Seeley
>         Assigned To: Yonik Seeley
>         Attachments: facet.patch
>
>
> 1) compare minimum count currently needed to the term df and avoid unnecessary intersection count
> 2) set a minimum term df in order to use the filterCache, otherwise iterate over TermDocs

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.