Posted to dev@lucene.apache.org by "Yonik Seeley (Commented) (JIRA)" <ji...@apache.org> on 2011/12/04 01:34:39 UTC

[jira] [Commented] (LUCENE-3584) bulk postings should be codec private

    [ https://issues.apache.org/jira/browse/LUCENE-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162246#comment-13162246 ] 

Yonik Seeley commented on LUCENE-3584:
--------------------------------------

I tested Solr's faceting code (the enum method that steps over terms and uses the filterCache), with minDf set high enough so that the filterCache wouldn't be used (i.e. it directly uses DocsEnum to calculate the count for each term).  The table shows the % increase when we were using the bulk API, comparing r208282 against trunk (i.e. performance is measured as change in throughput, so going from 400ms to 200ms is expressed as a 100% increase in throughput).
 
|number of terms|documents per term|bulk API throughput increase (%)|
|10000000|1|2.1|
|1000000|10|3.0|
|1000|10000|8.9|
|10|1000000|51.6|

So when terms match many documents, we've had quite a drop-off due to the removal of the bulk API.
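
For reference, here is a rough sketch of the hot loop being measured: when minDf keeps the enum facet method off the filterCache, it walks every term in the field and counts matches by stepping a DocsEnum one document at a time. This is not the actual Solr code; the class/method names and the field handling are made up, and the Lucene calls use approximate 4.x-era signatures.

{code:java}
import java.io.IOException;

import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

public class FacetCountSketch {
  // Hypothetical stand-in for the hot loop: one nextDoc() call per
  // matching document, for every term in the field.
  static void countPerTerm(IndexReader reader, String field) throws IOException {
    Terms terms = MultiFields.getTerms(reader, field);
    if (terms == null) {
      return;
    }
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    TermsEnum termsEnum = terms.iterator(null);
    DocsEnum docsEnum = null;
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
      // freqs aren't needed just to count docs; the exact docs(...)
      // signature was still in flux on trunk at the time, so treat
      // this call as approximate
      docsEnum = termsEnum.docs(liveDocs, docsEnum);
      int count = 0;
      while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
      }
      // hand (term, count) off to the facet accumulator here
    }
  }
}
{code}

The per-document nextDoc() call in the inner loop is exactly where a bulk read API helps most, which is consistent with the biggest gains showing up when terms match many documents.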
                
> bulk postings should be codec private
> -------------------------------------
>
>                 Key: LUCENE-3584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3584
>             Project: Lucene - Java
>          Issue Type: Task
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3584.patch
>
>
> In LUCENE-2723, a lot of work was done to speed up Lucene's bulk postings read API.
> There were some upsides:
> * you could specify things like 'I don't care about frequency data' up front.
>   This made things like multitermquery->filter and other consumers that don't
>   care about freqs faster. But this is unrelated to 'bulkness', and we now have a
>   separate patch for this on LUCENE-2929.
> * the buffer size for the standard codec was increased to 128, improving performance
>   for TermQueries, but this too was unrelated to bulkness.
> But there were serious downsides/nocommits:
> * the API was hairy because it tried to be 'one-size-fits-all'. This made consumer code crazy.
> * the API could not really be specialized to your codec: e.g. it could never take advantage of the fact that docs and freqs are aligned.
> * the API forced codecs to implement delta encoding for things like documents and positions.
>   But how it wants to encode is totally up to the codec! Some codecs might not use delta encoding at all.
> * using such an API for positions was only theoretical; it would have been super complicated, and I doubt
>   it would ever have been performant or maintainable.
> * there was a regression with advance(), probably because the API forced you to first do a linear scan through
>   the remaining buffer, then refill...
> I think a cleaner approach is to let codecs do whatever they want to implement the DISI
> contract. This lets codecs have the freedom to implement whatever compression/buffering they want
> for the best performance, and keeps consumers simple. If a codec uses delta encoding, or if it wants
> to defer this to the last possible minute or do it at decode time, that's its own business. Maybe a codec
> doesn't want to do any buffering at all.
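
To make the 'DISI contract' mentioned in the description concrete: consumers only see DocIdSetIterator's docID()/nextDoc()/advance() methods, so a codec-private postings enum can buffer and decode however it likes behind them. A minimal sketch follows; MyCodecDocsEnum and decodeNextDoc() are hypothetical names, not actual Lucene classes.

{code:java}
import java.io.IOException;

import org.apache.lucene.index.DocsEnum;

// Hypothetical skeleton of a codec-private postings enum. Only the
// DocIdSetIterator contract (docID/nextDoc/advance) is visible to
// consumers; how docids are buffered or delta-decoded stays internal.
// freq() is omitted here; a real enum would implement it as well.
abstract class MyCodecDocsEnum extends DocsEnum {
  private int doc = -1;

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() throws IOException {
    // decode however this codec likes: block-decode into a buffer,
    // delta-decode lazily, or do no buffering at all
    doc = decodeNextDoc();
    return doc;
  }

  @Override
  public int advance(int target) throws IOException {
    // a codec is free to use skip data here instead of the forced
    // "linear scan through the buffer, then refill" pattern
    while (doc < target) {
      doc = decodeNextDoc();
    }
    return doc;
  }

  // hypothetical hook standing in for the codec's real decode logic;
  // must return increasing docids and NO_MORE_DOCS when exhausted
  protected abstract int decodeNextDoc() throws IOException;
}
{code}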

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org