You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/08/20 12:43:20 UTC
BinStorage doesn't use efficient integer encoding?
I'm looking at BinStorage which I believe if I've read correct is used for
all Pig intermediate files.
… so any optimizations here would be transparent to the user.
I just did a simple STORE using BinStorage and the format doesn't appear
amazingly concise.
for example:
0 {(1),(2),(3),(4),(1000000000)}
is the following in BinStorage:
n20xn21n22n23n24n2
1000000000
or
00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00
|...n....2....0x.|
00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00
|.......n....2...|
00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00
|.1n....2....2n..|
00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00
|..2....3n....2..|
00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30
|..4n....2....100|
00000050 30 30 30 30 30 30 30 |0000000|
00000057
… now, efficient integer storage is a controversial topic.
if you have short integers representing them as four bytes will waste a ton
of space.
implementing them as varints is usually a good compromise:
http://code.google.com/apis/protocolbuffers/docs/encoding.html
7 bits of each byte are used to represent the int. One of the bits is used
to signal whether there is a next bit which needs to be read.
In my job , most of my ints will be stored in 4-8 bytes….. however, in
varint encoding they would only be 2 bytes.
A 4x savings in disk space could significantly improve performance.
I haven't benchmarked CPU of variants vs Integer.toString() though …. which
I might do now.
Still…. even if varint encoding is slower, using 4 bytes for some uses could
be a win.
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
Skype-in: *(415) 871-0687*
Re: BinStorage doesn't use efficient integer encoding?
Posted by Thejas Nair <th...@hortonworks.com>.
The intermediate serialization format was improved in as part of -
https://issues.apache.org/jira/browse/PIG-1472, it includes the move to
varint like format.
BinStorage is also used by users as regular load/store func, so its
binary format could not be changed, and a new internal load/store func
was introduced.
Thanks,
Thejas
On 8/20/11 10:41 AM, Kevin Burton wrote:
> 0.9.0 ... I will investigate ....
>
> --
> Kevin Burton
>
>
> On Aug 20, 2011, at 8:31 AM, Alan Gates<ga...@hortonworks.com> wrote:
>
>> What version of Pig are you using? I believe we moved off of BinStorage in 0.8 and started using varint around the same time.
>>
>> Alan.
>>
>> On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:
>>
>>> I ran the benchmark now and results are pretty interesting. Varint is about
>>> 2x performance and 1/2 the size of using integer.toString…
>>>
>>>
>>>
>>> varint encoding
>>> duration: 37,355 ms
>>> string encoding
>>> duration: 74,178 ms
>>>
>>> varint total bytes: 2,229,450,884
>>> string total bytes: 4,388,888,890
>>>
>>>
>>>
>>> long before,after;
>>> int max = 500000000;
>>>
>>> System.gc();
>>>
>>> before = System.currentTimeMillis();
>>>
>>> System.out.printf( "varint encoding\n" );
>>>
>>> VarintReader reader = new VarintReader();
>>> VarintWriter writer = new VarintWriter();
>>>
>>> for ( int i = 0; i< max; ++i ) {
>>> reader.read( writer.write( i ) );
>>> }
>>>
>>> after = System.currentTimeMillis();
>>>
>>> System.gc();
>>>
>>> System.out.printf( "duration: %,d ms\n", (after-before) );
>>>
>>> before = System.currentTimeMillis();
>>>
>>> System.out.printf( "string encoding\n" );
>>>
>>> for ( int i = 0; i< max; ++i ) {
>>> Integer.parseInt( Integer.toString( i ) );
>>> }
>>>
>>> after = System.currentTimeMillis();
>>>
>>> System.gc();
>>>
>>> System.out.printf( "duration: %,d ms\n", (after-before) );
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton<bu...@spinn3r.com> wrote:
>>>
>>>> I'm looking at BinStorage which I believe if I've read correct is used for
>>>> all Pig intermediate files.
>>>>
>>>> … so any optimizations here would be transparent to the user.
>>>>
>>>> I just did a simple STORE using BinStorage and the format doesn't appear
>>>> amazingly concise.
>>>>
>>>> for example:
>>>>
>>>> 0 {(1),(2),(3),(4),(1000000000)}
>>>>
>>>> is the following in BinStorage:
>>>>
>>>> n20xn21n22n23n24n2
>>>> 1000000000
>>>>
>>>> or
>>>>
>>>> 00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00
>>>> |...n....2....0x.|
>>>> 00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00
>>>> |.......n....2...|
>>>> 00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00
>>>> |.1n....2....2n..|
>>>> 00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00
>>>> |..2....3n....2..|
>>>> 00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30
>>>> |..4n....2....100|
>>>> 00000050 30 30 30 30 30 30 30 |0000000|
>>>> 00000057
>>>>
>>>> … now, efficient integer storage is a controversial topic.
>>>>
>>>> if you have short integers representing them as four bytes will waste a ton
>>>> of space.
>>>>
>>>> implementing them as varints is usually a good compromise:
>>>>
>>>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>>>>
>>>> 7 bits of each byte are used to represent the int. One of the bits is used
>>>> to signal whether there is a next bit which needs to be read.
>>>>
>>>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>>>> varint encoding they would only be 2 bytes.
>>>>
>>>> A 4x savings in disk space could significantly improve performance.
>>>>
>>>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>>>> I might do now.
>>>>
>>>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>>>> could be a win.
>>>>
>>>> --
>>>>
>>>> Founder/CEO Spinn3r.com
>>>>
>>>> Location: *San Francisco, CA*
>>>> Skype: *burtonator*
>>>>
>>>> Skype-in: *(415) 871-0687*
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Founder/CEO Spinn3r.com
>>>
>>> Location: *San Francisco, CA*
>>> Skype: *burtonator*
>>>
>>> Skype-in: *(415) 871-0687*
>>
Re: BinStorage doesn't use efficient integer encoding?
Posted by Kevin Burton <bu...@gmail.com>.
0.9.0 ... I will investigate ....
--
Kevin Burton
On Aug 20, 2011, at 8:31 AM, Alan Gates <ga...@hortonworks.com> wrote:
> What version of Pig are you using? I believe we moved off of BinStorage in 0.8 and started using varint around the same time.
>
> Alan.
>
> On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:
>
>> I ran the benchmark now and results are pretty interesting. Varint is about
>> 2x performance and 1/2 the size of using integer.toString…
>>
>>
>>
>> varint encoding
>> duration: 37,355 ms
>> string encoding
>> duration: 74,178 ms
>>
>> varint total bytes: 2,229,450,884
>> string total bytes: 4,388,888,890
>>
>>
>>
>> long before,after;
>> int max = 500000000;
>>
>> System.gc();
>>
>> before = System.currentTimeMillis();
>>
>> System.out.printf( "varint encoding\n" );
>>
>> VarintReader reader = new VarintReader();
>> VarintWriter writer = new VarintWriter();
>>
>> for ( int i = 0; i < max; ++i ) {
>> reader.read( writer.write( i ) );
>> }
>>
>> after = System.currentTimeMillis();
>>
>> System.gc();
>>
>> System.out.printf( "duration: %,d ms\n", (after-before) );
>>
>> before = System.currentTimeMillis();
>>
>> System.out.printf( "string encoding\n" );
>>
>> for ( int i = 0; i < max; ++i ) {
>> Integer.parseInt( Integer.toString( i ) );
>> }
>>
>> after = System.currentTimeMillis();
>>
>> System.gc();
>>
>> System.out.printf( "duration: %,d ms\n", (after-before) );
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <bu...@spinn3r.com> wrote:
>>
>>> I'm looking at BinStorage which I believe if I've read correct is used for
>>> all Pig intermediate files.
>>>
>>> … so any optimizations here would be transparent to the user.
>>>
>>> I just did a simple STORE using BinStorage and the format doesn't appear
>>> amazingly concise.
>>>
>>> for example:
>>>
>>> 0 {(1),(2),(3),(4),(1000000000)}
>>>
>>> is the following in BinStorage:
>>>
>>> n20xn21n22n23n24n2
>>> 1000000000
>>>
>>> or
>>>
>>> 00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00
>>> |...n....2....0x.|
>>> 00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00
>>> |.......n....2...|
>>> 00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00
>>> |.1n....2....2n..|
>>> 00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00
>>> |..2....3n....2..|
>>> 00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30
>>> |..4n....2....100|
>>> 00000050 30 30 30 30 30 30 30 |0000000|
>>> 00000057
>>>
>>> … now, efficient integer storage is a controversial topic.
>>>
>>> if you have short integers representing them as four bytes will waste a ton
>>> of space.
>>>
>>> implementing them as varints is usually a good compromise:
>>>
>>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>>>
>>> 7 bits of each byte are used to represent the int. One of the bits is used
>>> to signal whether there is a next bit which needs to be read.
>>>
>>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>>> varint encoding they would only be 2 bytes.
>>>
>>> A 4x savings in disk space could significantly improve performance.
>>>
>>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>>> I might do now.
>>>
>>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>>> could be a win.
>>>
>>> --
>>>
>>> Founder/CEO Spinn3r.com
>>>
>>> Location: *San Francisco, CA*
>>> Skype: *burtonator*
>>>
>>> Skype-in: *(415) 871-0687*
>>>
>>>
>>
>>
>> --
>>
>> Founder/CEO Spinn3r.com
>>
>> Location: *San Francisco, CA*
>> Skype: *burtonator*
>>
>> Skype-in: *(415) 871-0687*
>
Re: BinStorage doesn't use efficient integer encoding?
Posted by Alan Gates <ga...@hortonworks.com>.
What version of Pig are you using? I believe we moved off of BinStorage in 0.8 and started using varint around the same time.
Alan.
On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:
> I ran the benchmark now and results are pretty interesting. Varint is about
> 2x performance and 1/2 the size of using integer.toString…
>
>
>
> varint encoding
> duration: 37,355 ms
> string encoding
> duration: 74,178 ms
>
> varint total bytes: 2,229,450,884
> string total bytes: 4,388,888,890
>
>
>
> long before,after;
> int max = 500000000;
>
> System.gc();
>
> before = System.currentTimeMillis();
>
> System.out.printf( "varint encoding\n" );
>
> VarintReader reader = new VarintReader();
> VarintWriter writer = new VarintWriter();
>
> for ( int i = 0; i < max; ++i ) {
> reader.read( writer.write( i ) );
> }
>
> after = System.currentTimeMillis();
>
> System.gc();
>
> System.out.printf( "duration: %,d ms\n", (after-before) );
>
> before = System.currentTimeMillis();
>
> System.out.printf( "string encoding\n" );
>
> for ( int i = 0; i < max; ++i ) {
> Integer.parseInt( Integer.toString( i ) );
> }
>
> after = System.currentTimeMillis();
>
> System.gc();
>
> System.out.printf( "duration: %,d ms\n", (after-before) );
>
>
>
>
>
>
>
> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <bu...@spinn3r.com> wrote:
>
>> I'm looking at BinStorage which I believe if I've read correct is used for
>> all Pig intermediate files.
>>
>> … so any optimizations here would be transparent to the user.
>>
>> I just did a simple STORE using BinStorage and the format doesn't appear
>> amazingly concise.
>>
>> for example:
>>
>> 0 {(1),(2),(3),(4),(1000000000)}
>>
>> is the following in BinStorage:
>>
>> n20xn21n22n23n24n2
>> 1000000000
>>
>> or
>>
>> 00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00
>> |...n....2....0x.|
>> 00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00
>> |.......n....2...|
>> 00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00
>> |.1n....2....2n..|
>> 00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00
>> |..2....3n....2..|
>> 00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30
>> |..4n....2....100|
>> 00000050 30 30 30 30 30 30 30 |0000000|
>> 00000057
>>
>> … now, efficient integer storage is a controversial topic.
>>
>> if you have short integers representing them as four bytes will waste a ton
>> of space.
>>
>> implementing them as varints is usually a good compromise:
>>
>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>>
>> 7 bits of each byte are used to represent the int. One of the bits is used
>> to signal whether there is a next bit which needs to be read.
>>
>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>> varint encoding they would only be 2 bytes.
>>
>> A 4x savings in disk space could significantly improve performance.
>>
>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>> I might do now.
>>
>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>> could be a win.
>>
>> --
>>
>> Founder/CEO Spinn3r.com
>>
>> Location: *San Francisco, CA*
>> Skype: *burtonator*
>>
>> Skype-in: *(415) 871-0687*
>>
>>
>
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*
Re: BinStorage doesn't use efficient integer encoding?
Posted by Kevin Burton <bu...@spinn3r.com>.
I ran the benchmark now and results are pretty interesting. Varint is about
2x performance and 1/2 the size of using integer.toString…
varint encoding
duration: 37,355 ms
string encoding
duration: 74,178 ms
varint total bytes: 2,229,450,884
string total bytes: 4,388,888,890
long before,after;
int max = 500000000;
System.gc();
before = System.currentTimeMillis();
System.out.printf( "varint encoding\n" );
VarintReader reader = new VarintReader();
VarintWriter writer = new VarintWriter();
for ( int i = 0; i < max; ++i ) {
reader.read( writer.write( i ) );
}
after = System.currentTimeMillis();
System.gc();
System.out.printf( "duration: %,d ms\n", (after-before) );
before = System.currentTimeMillis();
System.out.printf( "string encoding\n" );
for ( int i = 0; i < max; ++i ) {
Integer.parseInt( Integer.toString( i ) );
}
after = System.currentTimeMillis();
System.gc();
System.out.printf( "duration: %,d ms\n", (after-before) );
On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <bu...@spinn3r.com> wrote:
> I'm looking at BinStorage which I believe if I've read correct is used for
> all Pig intermediate files.
>
> … so any optimizations here would be transparent to the user.
>
> I just did a simple STORE using BinStorage and the format doesn't appear
> amazingly concise.
>
> for example:
>
> 0 {(1),(2),(3),(4),(1000000000)}
>
> is the following in BinStorage:
>
> n20xn21n22n23n24n2
> 1000000000
>
> or
>
> 00000000 01 02 03 6e 00 00 00 02 32 00 00 00 01 30 78 00
> |...n....2....0x.|
> 00000010 00 00 00 00 00 00 05 6e 00 00 00 01 32 00 00 00
> |.......n....2...|
> 00000020 01 31 6e 00 00 00 01 32 00 00 00 01 32 6e 00 00
> |.1n....2....2n..|
> 00000030 00 01 32 00 00 00 01 33 6e 00 00 00 01 32 00 00
> |..2....3n....2..|
> 00000040 00 01 34 6e 00 00 00 01 32 00 00 00 0a 31 30 30
> |..4n....2....100|
> 00000050 30 30 30 30 30 30 30 |0000000|
> 00000057
>
> … now, efficient integer storage is a controversial topic.
>
> if you have short integers representing them as four bytes will waste a ton
> of space.
>
> implementing them as varints is usually a good compromise:
>
> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>
> 7 bits of each byte are used to represent the int. One of the bits is used
> to signal whether there is a next bit which needs to be read.
>
> In my job , most of my ints will be stored in 4-8 bytes….. however, in
> varint encoding they would only be 2 bytes.
>
> A 4x savings in disk space could significantly improve performance.
>
> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
> I might do now.
>
> Still…. even if varint encoding is slower, using 4 bytes for some uses
> could be a win.
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*
>
>
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
Skype-in: *(415) 871-0687*