Posted to dev@lucene.apache.org by "Shai Erera (JIRA)" <ji...@apache.org> on 2013/01/17 08:28:14 UTC

[jira] [Commented] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

    [ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555948#comment-13555948 ] 

Shai Erera commented on LUCENE-4609:
------------------------------------

How come the order of the lines in the two tables is different? At first I thought you misread the table because the results don't look that bad :).

You can very easily push DGap into the encoder. But judging from LUCENE-4686, where doing so bought us a ~6% improvement, it's not likely to double the decoding speed here.
Also, I see that you have a static MASK table. Given the recent thread about FixedBitSet and static mask tables (people were against them because a couple of bitwise ops do better), I wonder if getting rid of it would improve perf.
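For reference, this is roughly what the table-free variant would look like; a generic sketch (class and method names are mine, not from the patch), computing the mask with a shift and a subtract instead of an array load:

```java
// Hypothetical helper: compute an n-bit mask on the fly instead of
// reading it from a static MASK[] table. Two ops (shift + subtract),
// which the FixedBitSet thread suggested may beat a table lookup.
public class MaskOps {
  static long mask(int bitsPerValue) {
    // valid for bitsPerValue in [1..63]; 64 would need a special case
    return (1L << bitsPerValue) - 1L;
  }

  // extract the value stored at bit offset 'shift' with width 'bpv'
  static long extract(long block, int shift, int bpv) {
    return (block >>> shift) & mask(bpv);
  }
}
```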

I think that the main problem with these ordinal encoders is that we usually encode small groups of integers. So just for fun, I ran EncodingSpeed with 4 encoders/decoders, encoding 2430 values in one go (a single call to encode). The results are below:

{noformat}
Encoder                            Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                   [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorting(Unique(DGap(PackedInts)))  11.0189                 8561                    85.6105                  590                     5.9000
Sorting(Unique(DGap(Packed)))      11.0058                 8206                    82.0605                  806                     8.0601
Sorting(Unique(DGap(VInt8)))        8.5597                 7133                    71.3305                  263                     2.6300
Sorting(Unique(DGapVInt8))          8.5597                 7208                    72.0805                  251                     2.5100
{noformat}

* PackedInts is an encoder similar to what Gilad wrote here, which uses PackedInts.Writer/ReaderNoHeader (writes a 1-byte header itself).
* Packed is Mike's encoder
* I wrapped all encoders with DGap (except for the last one, which is specialized)
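For readers not familiar with the wrappers, this is roughly what the Sorting(Unique(DGap(...))) composition does to the ordinals before the inner encoder sees them; a simplified sketch, not the actual facet module code:

```java
import java.util.Arrays;

// Sketch of the Sorting(Unique(DGap(...))) transform: sort the
// ordinals, drop duplicates, then replace each value with its gap
// from the previous one. The first value is kept as-is, which is why
// it is typically the largest number in the chunk.
public class DGapTransform {
  static int[] sortUniqueDGap(int[] ords) {
    int[] sorted = ords.clone();
    Arrays.sort(sorted);
    int[] out = new int[sorted.length];
    int n = 0;
    int prev = 0;                          // first value stored as-is
    for (int ord : sorted) {
      if (n > 0 && ord == prev) continue;  // Unique: drop duplicates
      out[n++] = ord - prev;               // DGap: delta from previous
      prev = ord;
    }
    return Arrays.copyOf(out, n);
  }
}
```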

Notice that:
* Both Packed versions achieve worse compression than VInt (which confirms Mike's DV files results above)
* Both VInt versions decode much faster than the Packed versions.
* Surprisingly (though this may be a micro-benchmarking artifact), Packed (Mike's specialized encoder) decodes slower than PackedInts. I wonder if luceneutil would confirm that too.

I think that if we had some statistics about the resulting Wikipedia index, e.g. min/max ord per document (the final integers, after dgap), we could tell better whether there's any point in continuing these investigations. Since luceneutil search perf is happier with VInt and, more so, the resulting index is smaller by ~25%, I tend to think that packed ints won't get us far here.

With postings, I think that the numbers are smaller? e.g. for dense posting lists, with docID gaps, you'll get a small bpv?
Mike, is it possible to print the min/max bpv used during the indexing? For instance, I printed that info when running EncodingSpeed, and got 11 (there's a single chunk to encode, so min=max). I wonder what the result will be for the index. I added these as public static members to PackedIntEncoder, so I think you can do the same for the index.

Also, 11 bpv means that every value is encoded in ~1.5 bytes, plus 1 header byte. So for 25 ords, we're looking at ~39 bytes per document. If most of the values are small though (and this happens with ordinals, even with dgap, because the first value is the largest), we pay a high penalty for each value, because that one large value forces a large bpv for the whole chunk.
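As a sanity check on that arithmetic, here is the size estimate as a tiny sketch (the byte-aligned layout is my assumption; rounding each 11-bit value up to 1.5 whole bytes, as above, gives ~39 bytes, while exact bit-packing comes to 36):

```java
// Back-of-the-envelope size of one packed chunk: numValues packed at
// bitsPerValue bits each, plus a 1-byte bpv header (layout assumption:
// values are bit-packed contiguously and the chunk is byte-aligned).
public class PackedSize {
  static int bytes(int numValues, int bitsPerValue) {
    int bits = numValues * bitsPerValue;
    return (bits + 7) / 8 + 1; // round up to whole bytes, +1 header
  }
}
```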

Maybe in order to beat VInt we need a smarter encoder. E.g. the facets package has FourFlags and EightFlags encoders, which are very good for really small values (FourFlags for 1-3, EightFlags for 1). Gilad had an idea above to use a static bpv=6 and encode all values that are 1-63 as 0-62 in 6 bits, while values that are 64+ are encoded as 63 (all 1's) followed by a VInt. I ran a short analysis using EncodingSpeed (I should say here that these 2430 integers are actual category ordinals of a real document from one application, so these are not just made-up numbers):

* maxValue=1928 (requires 11 bits)
* numValuesLessThan64=2169 (out of 2430!)

So if we had encoded using Gilad's idea, we would end up with:
* 261 "large" values, each a 6-bit escape slot plus at most a 2-byte VInt = 22 bits per value, i.e. 5742 bits
* 2169 values encoded inline at 6 bpv = 13014 bits (~1627 bytes)
* avg bits/int = 18756 / 2430 = ~7.72

Which is better than VInt. I omitted the 1-byte header here, because of the large number of values that are encoded at once, but it would add some overhead for documents with very few categories (e.g. OrdinalPolicy.NO_PARENTS). Also, this is the "best" compression that we could get for these numbers; if we really encoded them I suspect it would be worse, because we cannot reorder the values after sorting+dgap. So e.g. if we have a series of 5-1 values (5 small, 1 large), we would waste 2 bits on each of the 5 small ones. 2 bits here, 2 bits there, they add up.
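For concreteness, here is a minimal sketch of that escape scheme as a size calculator (class and method names are mine; whether the VInt stores the raw value or value-64 isn't specified above, so I assume the raw value):

```java
// Sketch of Gilad's idea: fixed 6-bit slots, values 1..63 stored
// inline as value-1, and anything >= 64 stored as the all-ones escape
// (63) followed by the value as a VInt. This only counts bits; a real
// encoder would pack the slots into a byte[]/long[].
public class SixBitEscape {
  static final int ESCAPE = 63;

  // total encoded size in bits for the given (already dgap'd, >= 1)
  // values
  static long encodedBits(int[] values) {
    long bits = 0;
    for (int v : values) {
      if (v >= 1 && v <= 63) {
        bits += 6;                     // stored inline as v - 1
      } else {
        bits += 6 + 8L * vIntBytes(v); // escape slot + VInt payload
      }
    }
    return bits;
  }

  // number of bytes a VInt needs for v (7 payload bits per byte)
  static int vIntBytes(int v) {
    int n = 1;
    while ((v >>>= 7) != 0) n++;
    return n;
  }
}
```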

Still, I wonder what the speed of such a decoder would be, and what the statistics are for the real Wikipedia index, with its 25 ords per doc. Encoders like the one above benefit greatly when there is a large number of values (because there are more chances for small values).
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org