You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Gilad Barkai (JIRA)" <ji...@apache.org> on 2013/02/07 16:29:13 UTC

[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder for facets

     [ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gilad Barkai updated LUCENE-4609:
---------------------------------

    Attachment: SemiPackedEncoder.patch

Finally figured out I was doing things completely wrong.. instead of having a super smart optimizing code for semi-packed encoder - there's now a strait forward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself) and larger values are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5 of VInt).

The idea, is to pay the penalty for variable length encoding for large values which should be less common in a sort-uniq-dgap scenario.

Wrote two versions w and w/o dgap specialization, though I'm not sure how useful is the non-specialized code.

I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the categories) - but I ran the EncodingSpeed test - and was surprised.
While the encoding is on a little worse (or on par) with dgap-vint, the decoding speed is significantly faster. The new encode is the only (?!) encoder to beat {code}SimpleIntEncoder{code} (which writes plain 32 bits per value) in decoding time.

Those values being used in the EncodingSpeed are real scenario, but I'm not sure how much they represent a "common" case (e.g wikipedia). 

Mike - could you please try this encoder? I guess it only makes sense to run the specialized {code}DGapSemiPackedEncoder{code}.
Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or unique). It would be interesting to test it as well. We will pay in more I/O and much larger file size (~4 times larger..) but it doesn't mean it will be any slower.

Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.

Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  190                     1.9000                  165                     1.6500
VInt8                                                           18.4955                  436                     4.3600                  359                     3.5900
Sorting(Unique(VInt8))                                          18.4955                 3557                    35.5702                  314                     3.1400
Sorting(Unique(DGap(VInt8)))                                     8.5597                 3485                    34.8502                  270                     2.7000
Sorting(Unique(DGapVInt8))                                       8.5597                 3434                    34.3402                  192                     1.9200
Sorting(Unique(DGap(SemiPacked)))                                8.6453                 3386                    33.8602                  156                     1.5600
Sorting(Unique(DGapSemiPacked))                                  8.6453                 3397                    33.9702                   99                     0.9900
Sorting(Unique(DGap(EightFlags(VInt))))                          4.9679                 4002                    40.0203                  381                     3.8100
Sorting(Unique(DGap(FourFlags(VInt))))                           4.8198                 3972                    39.7203                  399                     3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                4.5794                 4448                    44.4803                  645                     6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                4.5794                 4461                    44.6103                  641                     6.4100


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.

Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  169                     1.6900                  163                     1.6300
VInt8                                                           18.2673                  421                     4.2100                  374                     3.7400
Sorting(Unique(VInt8))                                          18.2673                 2230                    22.3001                  337                     3.3700
Sorting(Unique(DGap(VInt8)))                                     8.9456                 2257                    22.5701                  273                     2.7300
Sorting(Unique(DGapVInt8))                                       8.9456                 2214                    22.1401                  192                     1.9200
Sorting(Unique(DGap(SemiPacked)))                                9.2787                 2162                    21.6201                  180                     1.8000
Sorting(Unique(DGapSemiPacked))                                  9.2787                 2148                    21.4801                  120                     1.2000
Sorting(Unique(DGap(EightFlags(VInt))))                          5.7542                 2937                    29.3701                  395                     3.9500
Sorting(Unique(DGap(FourFlags(VInt))))                           5.5447                 2768                    27.6801                  407                     4.0700
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                5.3566                 3294                    32.9401                  651                     6.5100
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                5.3996                 3318                    33.1801                  662                     6.6200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 5555555 times.

Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  316                     3.1600                  318                     3.1800
VInt8                                                           20.8889                  569                     5.6900                  496                     4.9600
Sorting(Unique(VInt8))                                          20.8889                 1196                    11.9600                  488                     4.8800
Sorting(Unique(DGap(VInt8)))                                    12.0000                 1151                    11.5100                  481                     4.8100
Sorting(Unique(DGapVInt8))                                      12.0000                 1100                    11.0000                  352                     3.5200
Sorting(Unique(DGap(SemiPacked)))                               14.6667                 1107                    11.0700                  507                     5.0700
Sorting(Unique(DGapSemiPacked))                                 14.6667                 1037                    10.3700                  439                     4.3900
Sorting(Unique(DGap(EightFlags(VInt))))                         10.2222                 1315                    13.1500                  656                     6.5600
Sorting(Unique(DGap(FourFlags(VInt))))                          10.2222                 1408                    14.0800                  675                     6.7500
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                9.7778                 1617                    16.1700                  990                     9.9000
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))               10.2222                 1708                    17.0800                  992                     9.9200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 957) 104493 times.

Encoder                                                        Bits/Int          Encode Time                Encode Time          Decode Time                Decode Time
                                                                              [milliseconds]        [microsecond / int]       [milliseconds]        [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000                  181                     1.8100                  168                     1.6800
VInt8                                                           16.5768                  428                     4.2800                  351                     3.5100
Sorting(Unique(VInt8))                                          16.5768                 1586                    15.8600                  330                     3.3000
Sorting(Unique(DGap(VInt8)))                                     8.4848                 1477                    14.7700                  267                     2.6700
Sorting(Unique(DGapVInt8))                                       8.4848                 1435                    14.3500                  193                     1.9300
Sorting(Unique(DGap(SemiPacked)))                                8.6771                 1424                    14.2400                  159                     1.5900
Sorting(Unique(DGapSemiPacked))                                  8.6771                 1402                    14.0200                  100                     1.0000
Sorting(Unique(DGap(EightFlags(VInt))))                          4.4138                 1603                    16.0300                  376                     3.7600
Sorting(Unique(DGap(FourFlags(VInt))))                           4.1797                 1700                    17.0000                  382                     3.8200
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                3.8955                 2011                    20.1100                  602                     6.0200
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                3.8871                 2024                    20.2400                  594                     5.9400


{noformat}


I have another version which uses two bytes before using the variable length (e.g values smaller than 0x10000 are encoded as is in two bytes, other values are encoded as two leading zero bytes and the VInt) - but I did not optimize it yet, nor I'm sure it's very useful. 
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org