You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2012/06/21 15:42:42 UTC

[jira] [Created] (LUCENE-4161) Make PackedInts usable by codecs

Adrien Grand created LUCENE-4161:
------------------------------------

             Summary: Make PackedInts usable by codecs
                 Key: LUCENE-4161
                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
             Project: Lucene - Java
          Issue Type: Improvement
          Components: core/store
            Reporter: Adrien Grand
            Assignee: Adrien Grand
            Priority: Minor


Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.

The problem is that the serialization format is self contained, and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings list, this is a lot of overhead (at least ~60 bytes per term, in case you only use one Writer per term, more otherwise).

Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-4161) Make PackedInts usable by codecs

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13398473#comment-13398473 ] 

Michael McCandless commented on LUCENE-4161:
--------------------------------------------

+1

We could really use this on LUCENE-3892 where the FOR postings format is really just packed ints.  Would be best if we could simply efficiently use oal.util.packed.* for this.
                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.
> The problem is that the serialization format is self contained, and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings list, this is a lot of overhead (at least ~60 bytes per term, in case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-4161) Make PackedInts usable by codecs

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404656#comment-13404656 ] 

Michael McCandless commented on LUCENE-4161:
--------------------------------------------

bq. The meaning of n is actually a bit complicated.

Thank you for the explanation!  That makes sense.  I think "iterations" is good?  Or ... maybe we simply leave it as n and then put this nice explanation in there as a comment?

Naming is the hardest part :)

bq. What additional methods do you think we need?

I'm not sure off-hand yet ... we've been iterating in LUCENE-3892 to find the least-cost way to decode from the underlying byte based storage from the IndexInput, but with no real clear fastest solution yet.  Logically we are currently storing an int[] and decoding into int[], so I guess encode/decode to/from int[]?  We should probably try long[] as the backing too ... but, I think we should explore this (adding int[] based methods) under a new issue?  This patch is already great progress.
                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.
> The problem is that the serialization format is self contained, and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings list, this is a lot of overhead (at least ~60 bytes per term, in case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-4161) Make PackedInts usable by codecs

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-4161:
---------------------------------

    Attachment: LUCENE-4161.patch

First version of the patch.

A few things that were internal now need to be exposed, so I tried to do some clean up:
 * {{CODEC_NAME}} and CODEC_VERSION{START,CURRENT} are public,
 * the format is an enum (PackedInts.Format.{PACKED,PACKED_SINGLE_BLOCK}),
 * improved docs overall.

There are new factory methods get{Reader,ReaderIterator,Writer}NoHeader that do the same as their get{Reader,ReaderIterator,Writer} counterpart, but with no header writing/checking.

Improved performance of Reader/Mutable bulk methods (using code generation, see http://people.apache.org/~jpountz/packed_ints.html vs. http://people.apache.org/~jpountz/packed_ints2.html).

{{ReaderIterator}} and {{Writer}} now use the same code as {{Reader}}/{{Mutable}} bulk methods so they are likely to be much faster too. In addition, ReaderIterator now allows consumers to retrieve several values at the same time.

{{Direct*}} and {{Packed*ThreeBlocks}} had a lot of duplicate code that was not factorizable so I created scripts to generate them.

Something that might still slow down ReaderIterator (probably the most useful class for codecs) a bit is that ReaderIterator always reads one long at a time. Adding a method to bulk-read longs to DataInput (similarly to readBytes) might improve performance. This probably deserves an other issue in JIRA and can be done later.
                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.
> The problem is that the serialization format is self contained, and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings list, this is a lot of overhead (at least ~60 bytes per term, in case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-4161) Make PackedInts usable by codecs

Posted by "Adrien Grand (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403910#comment-13403910 ] 

Adrien Grand commented on LUCENE-4161:
--------------------------------------

bq. Can we find a better name for computeN?

The meaning of {{n}} is actually a bit complicated. For every number of bits per value, there is a minimum number of blocks (b) / values (v) you need to write in order to reach the next block boundary:
 * 16 bits per value -> b=1, v=4
 * 24 bits per value -> b=3, v=8
 * 50 bits per value -> b=25, v=32
 * 63 bits per value -> b=63, v = 64
 * ...

A bulk read consists in copying {{n*v}} values that are contained in {{n*b}} blocks into a long[] (higher values of {{n}} are likely to yield a better throughput) => this requires {{n * (b + v)}} longs in memory, this is why I compute {{n}} as {{ramBudget / (8 * (b + v))}} (since a long is 8 bytes). I called it {{n}} in the method name because I have no idea how to name it... "iterations", maybe?

bq. I suspect, to use these for codecs, we will want to have versions that work on int[] values instead (everything we encode are ints: docIDs/deltas, term freqs, offsets, positions).

I hesitated to do this since it would involve some code duplication, but I guess it can't be avoided if we want this API to be actually used... What additional methods do you think we need?
  * {{PackedReaderIterator.nextInts(int count)}}
  * others?

bq. [static computeN], [code style]

You are right, I will fix it!

bq. Does this change the on-disk format?

No, it doesn't. I will add unit tests for that...
                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.
> The problem is that the serialization format is self contained, and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings list, this is a lot of overhead (at least ~60 bytes per term, in case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-4161) Make PackedInts usable by codecs

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403843#comment-13403843 ] 

Michael McCandless commented on LUCENE-4161:
--------------------------------------------

Wow, this patch is impressive!  Lots of amazing changes... very cool
how you factored out a simple common API for bulk write/read of all
the formats.

Should computeN be non-static method on BulkOperation base class?
(Just seems odd to make a static method whose first arg is an instance
of that class anyway...).

Can we find a better name for computeN?  I think n is the number of
blocks we can buffer up given the RAM "budget"?  computeNumBlocks?
computeNumBufferedBlocks?  computeBufferedBlocksCount?  Something
else...?

I suspect, to use these for codecs, we will want to have versions that
work on int[] values instead (everything we encode are ints:
docIDs/deltas, term freqs, offsets, positions).

Code styling: can we use three lines, ie:
{noformat}
-    if (valueCount > MAX_SIZE) {
-      throw new ArrayIndexOutOfBoundsException("MAX_SIZE exceeded");
-    }
{noformat}
instead of one line:
{noformat}
+    if (valueCount > MAX_SIZE) { throw new ArrayIndexOutOfBoundsException("MAX_SIZE exceeded"); }
{noformat}
in general?

Does this change the on-disk format?  I think no?  (if those
Format.getId()s match?)  If it does change we need back compat (4.0.0 alpha
has left the station...).

                
> Make PackedInts usable by codecs
> --------------------------------
>
>                 Key: LUCENE-4161
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4161
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/store
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>         Attachments: LUCENE-4161.patch
>
>
> Some codecs might be interested in using PackedInts.{Writer,Reader,ReaderIterator} to read and write fixed-size values efficiently.
> The problem is that the serialization format is self contained, and always writes the name of the codec, its version, its number of bits per value and its format. For example, if you want to use packed ints to store your postings list, this is a lot of overhead (at least ~60 bytes per term, in case you only use one Writer per term, more otherwise).
> Users should be able to externalize the storage of metadata to save space. For example, to use PackedInts to store a postings list, one should be able to store the codec name, its version and the number of bits per doc in the header of the terms+postings list instead of having to write it once (or more!) per term.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org