Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2012/06/25 19:51:44 UTC

[jira] [Updated] (LUCENE-4161) Make PackedInts usable by codecs

     [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4161:
---------------------------------

    Attachment: LUCENE-4161.patch

First version of the patch.

A few things that were internal now need to be exposed, so I tried to do some cleanup:
 * {{CODEC_NAME}} and CODEC_VERSION{START,CURRENT} are public,
 * the format is an enum (PackedInts.Format.{PACKED,PACKED_SINGLE_BLOCK}),
 * improved docs overall.

There are new factory methods get{Reader,ReaderIterator,Writer}NoHeader that do the same as their get{Reader,ReaderIterator,Writer} counterparts, but neither write nor check a header.
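To illustrate what skipping the header buys, here is a minimal sketch. It is not the Lucene PackedInts API: the class NoHeaderSketch, its methods, the header layout and the low-bits-first packing are all assumptions made for the example. The point is only that the no-header variant stores nothing but the packed blocks, so the caller has to supply bitsPerValue and the value count from wherever it keeps its own metadata.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration only; not the Lucene PackedInts API.
final class NoHeaderSketch {
    static final String CODEC_NAME = "PackedSketch"; // illustrative name
    static final int VERSION = 0;                    // illustrative version

    // Packs values back to back, bitsPerValue bits each, lowest bits first.
    // Assumes every value fits in bitsPerValue bits.
    static long[] pack(long[] values, int bitsPerValue) {
        long[] blocks = new long[(int) (((long) values.length * bitsPerValue + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= values[i] << shift;
            if (shift + bitsPerValue > 64) {
                // The value straddles two blocks: spill the high bits into the next one.
                blocks[block + 1] |= values[i] >>> (64 - shift);
            }
        }
        return blocks;
    }

    // Reads the value at the given index back out of the packed blocks.
    static long get(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            v |= blocks[block + 1] << (64 - shift);
        }
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        return v & mask;
    }

    // Serializes the packed blocks, optionally preceded by a self-describing header.
    static byte[] write(long[] values, int bitsPerValue, boolean withHeader) {
        long[] blocks = pack(values, bitsPerValue);
        byte[] name = CODEC_NAME.getBytes(StandardCharsets.UTF_8);
        // Header: name length, name, version, bitsPerValue, valueCount.
        int headerSize = withHeader ? 4 + name.length + 4 + 4 + 4 : 0;
        ByteBuffer buf = ByteBuffer.allocate(headerSize + Long.BYTES * blocks.length);
        if (withHeader) {
            buf.putInt(name.length).put(name).putInt(VERSION)
               .putInt(bitsPerValue).putInt(values.length);
        }
        for (long b : blocks) {
            buf.putLong(b);
        }
        return buf.array();
    }

    // No-header read: the caller supplies the metadata it stored elsewhere.
    static long[] readNoHeader(byte[] data, int bitsPerValue, int valueCount) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        long[] blocks = new long[data.length / Long.BYTES];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = buf.getLong();
        }
        long[] values = new long[valueCount];
        for (int i = 0; i < valueCount; i++) {
            values[i] = get(blocks, bitsPerValue, i);
        }
        return values;
    }
}
```

In this sketch the header costs a fixed number of bytes per stream regardless of how few values follow it, which is exactly the per-term overhead a codec would rather pay once in its own file header.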

Improved performance of Reader/Mutable bulk methods (using code generation, see http://people.apache.org/~jpountz/packed_ints.html vs. http://people.apache.org/~jpountz/packed_ints2.html).

{{ReaderIterator}} and {{Writer}} now use the same code as {{Reader}}/{{Mutable}} bulk methods so they are likely to be much faster too. In addition, ReaderIterator now allows consumers to retrieve several values at the same time.
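As a rough picture of what retrieving several values at once looks like from the consumer side, here is a small sketch. The class BulkIteratorSketch and its next(int count) method are hypothetical and are not the signature used in the patch; the low-bits-first block layout is also an assumption chosen to keep the example short.

```java
// Hypothetical iterator for illustration only; not the ReaderIterator from the patch.
final class BulkIteratorSketch {
    private final long[] blocks;    // packed 64-bit blocks, lowest bits first (assumed layout)
    private final int bitsPerValue;
    private final int valueCount;
    private int position = 0;       // index of the next value to return

    BulkIteratorSketch(long[] blocks, int bitsPerValue, int valueCount) {
        this.blocks = blocks;
        this.bitsPerValue = bitsPerValue;
        this.valueCount = valueCount;
    }

    // Returns up to count values in one call; the result is shorter
    // (possibly empty) when fewer values remain.
    long[] next(int count) {
        int n = Math.min(count, valueCount - position);
        long[] out = new long[n];
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        for (int i = 0; i < n; i++, position++) {
            long bitPos = (long) position * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = blocks[block] >>> shift;
            if (shift + bitsPerValue > 64) {
                // Value straddles two blocks: take its high bits from the next block.
                v |= blocks[block + 1] << (64 - shift);
            }
            out[i] = v & mask;
        }
        return out;
    }
}
```

A consumer would call next(n) in a loop until it gets back an empty array, instead of paying one method call per value.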

{{Direct*}} and {{Packed*ThreeBlocks}} had a lot of duplicate code that could not be factored out, so I created scripts to generate them.

Something that might still slow down ReaderIterator (probably the most useful class for codecs) a bit is that it always reads one long at a time. Adding a method to bulk-read longs to DataInput (similar to readBytes) might improve performance. This probably deserves another issue in JIRA and can be done later.
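To make the bulk-read idea concrete, here is a minimal sketch written against java.io.DataInput rather than Lucene's DataInput, so readFully stands in for readBytes. The class BulkLongReadSketch and both method names are hypothetical; no such method exists in the patch.

```java
import java.io.DataInput;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; uses java.io.DataInput, not Lucene's DataInput.
final class BulkLongReadSketch {

    // Baseline: one readLong() call per value.
    static void readLongsOneByOne(DataInput in, long[] dst, int off, int len) throws IOException {
        for (int i = 0; i < len; i++) {
            dst[off + i] = in.readLong();
        }
    }

    // Bulk variant: a single readFully() call, then decode the bytes in memory.
    // ByteBuffer is big-endian by default, which matches DataInput.readLong().
    // Assumes len * 8 fits in an int.
    static void readLongsBulk(DataInput in, long[] dst, int off, int len) throws IOException {
        byte[] buf = new byte[len * Long.BYTES];
        in.readFully(buf);
        ByteBuffer.wrap(buf).asLongBuffer().get(dst, off, len);
    }
}
```

Both methods fill dst with the same values; the second one just makes a single call into the underlying input, which is where any gain would come from. Whether it is actually faster would have to be measured.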
                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.
> The problem is that the serialization format is self-contained, and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings list, this is a lot of overhead (at least ~60 bytes per term if you use only one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write them once (or more!) per term.
