You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Gilad Barkai (JIRA)" <ji...@apache.org> on 2013/02/07 16:29:13 UTC
[jira] [Updated] (LUCENE-4609) Write a PackedIntsEncoder/Decoder
for facets
[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gilad Barkai updated LUCENE-4609:
---------------------------------
Attachment: SemiPackedEncoder.patch
Finally figured out I was doing things completely wrong.. instead of having a super smart optimizing code for semi-packed encoder - there's now a strait forward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself) and larger values are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5 of VInt).
The idea, is to pay the penalty for variable length encoding for large values which should be less common in a sort-uniq-dgap scenario.
Wrote two versions w and w/o dgap specialization, though I'm not sure how useful is the non-specialized code.
I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the categories) - but I ran the EncodingSpeed test - and was surprised.
While the encoding is on a little worse (or on par) with dgap-vint, the decoding speed is significantly faster. The new encode is the only (?!) encoder to beat {code}SimpleIntEncoder{code} (which writes plain 32 bits per value) in decoding time.
Those values being used in the EncodingSpeed are real scenario, but I'm not sure how much they represent a "common" case (e.g wikipedia).
Mike - could you please try this encoder? I guess it only makes sense to run the specialized {code}DGapSemiPackedEncoder{code}.
Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or unique). It would be interesting to test it as well. We will pay in more I/O and much larger file size (~4 times larger..) but it doesn't mean it will be any slower.
Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times.
Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time
[milliseconds] [microsecond / int] [milliseconds] [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000 190 1.9000 165 1.6500
VInt8 18.4955 436 4.3600 359 3.5900
Sorting(Unique(VInt8)) 18.4955 3557 35.5702 314 3.1400
Sorting(Unique(DGap(VInt8))) 8.5597 3485 34.8502 270 2.7000
Sorting(Unique(DGapVInt8)) 8.5597 3434 34.3402 192 1.9200
Sorting(Unique(DGap(SemiPacked))) 8.6453 3386 33.8602 156 1.5600
Sorting(Unique(DGapSemiPacked)) 8.6453 3397 33.9702 99 0.9900
Sorting(Unique(DGap(EightFlags(VInt)))) 4.9679 4002 40.0203 381 3.8100
Sorting(Unique(DGap(FourFlags(VInt)))) 4.8198 3972 39.7203 399 3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 4.5794 4448 44.4803 645 6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 4.5794 4461 44.6103 641 6.4100
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times.
Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time
[milliseconds] [microsecond / int] [milliseconds] [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000 169 1.6900 163 1.6300
VInt8 18.2673 421 4.2100 374 3.7400
Sorting(Unique(VInt8)) 18.2673 2230 22.3001 337 3.3700
Sorting(Unique(DGap(VInt8))) 8.9456 2257 22.5701 273 2.7300
Sorting(Unique(DGapVInt8)) 8.9456 2214 22.1401 192 1.9200
Sorting(Unique(DGap(SemiPacked))) 9.2787 2162 21.6201 180 1.8000
Sorting(Unique(DGapSemiPacked)) 9.2787 2148 21.4801 120 1.2000
Sorting(Unique(DGap(EightFlags(VInt)))) 5.7542 2937 29.3701 395 3.9500
Sorting(Unique(DGap(FourFlags(VInt)))) 5.5447 2768 27.6801 407 4.0700
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 5.3566 3294 32.9401 651 6.5100
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 5.3996 3318 33.1801 662 6.6200
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 5555555 times.
Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time
[milliseconds] [microsecond / int] [milliseconds] [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000 316 3.1600 318 3.1800
VInt8 20.8889 569 5.6900 496 4.9600
Sorting(Unique(VInt8)) 20.8889 1196 11.9600 488 4.8800
Sorting(Unique(DGap(VInt8))) 12.0000 1151 11.5100 481 4.8100
Sorting(Unique(DGapVInt8)) 12.0000 1100 11.0000 352 3.5200
Sorting(Unique(DGap(SemiPacked))) 14.6667 1107 11.0700 507 5.0700
Sorting(Unique(DGapSemiPacked)) 14.6667 1037 10.3700 439 4.3900
Sorting(Unique(DGap(EightFlags(VInt)))) 10.2222 1315 13.1500 656 6.5600
Sorting(Unique(DGap(FourFlags(VInt)))) 10.2222 1408 14.0800 675 6.7500
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 9.7778 1617 16.1700 990 9.9000
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 10.2222 1708 17.0800 992 9.9200
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 957) 104493 times.
Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time
[milliseconds] [microsecond / int] [milliseconds] [microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple 32.0000 181 1.8100 168 1.6800
VInt8 16.5768 428 4.2800 351 3.5100
Sorting(Unique(VInt8)) 16.5768 1586 15.8600 330 3.3000
Sorting(Unique(DGap(VInt8))) 8.4848 1477 14.7700 267 2.6700
Sorting(Unique(DGapVInt8)) 8.4848 1435 14.3500 193 1.9300
Sorting(Unique(DGap(SemiPacked))) 8.6771 1424 14.2400 159 1.5900
Sorting(Unique(DGapSemiPacked)) 8.6771 1402 14.0200 100 1.0000
Sorting(Unique(DGap(EightFlags(VInt)))) 4.4138 1603 16.0300 376 3.7600
Sorting(Unique(DGap(FourFlags(VInt)))) 4.1797 1700 17.0000 382 3.8200
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 3.8955 2011 20.1100 602 6.0200
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 3.8871 2024 20.2400 594 5.9400
{noformat}
I have another version which uses two bytes before using the variable length (e.g values smaller than 0x10000 are encoded as is in two bytes, other values are encoded as two leading zero bytes and the VInt) - but I did not optimize it yet, nor I'm sure it's very useful.
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
> Key: LUCENE-4609
> URL: https://issues.apache.org/jira/browse/LUCENE-4609
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/facet
> Reporter: Shai Erera
> Priority: Minor
> Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the category ordinals. We have several such encoders, including VInt (default), and block encoders.
> It would be interesting to implement and benchmark a PackedIntsEncoder/Decoder, with potentially two variants: (1) receives bitsPerValue up front, when you e.g. know that you have a small taxonomy and the max value you can see and (2) one that decides for each doc on the optimal bitsPerValue, writes it as a header in the byte[] or something.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org