You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Kevin Burton <bu...@spinn3r.com> on 2011/08/20 12:43:20 UTC

BinStorage doesn't use efficient integer encoding?

I'm looking at BinStorage which I believe if I've read correct is used for
all Pig intermediate files.

… so any optimizations here would be transparent to the user.

I just did a simple STORE using BinStorage and the format doesn't appear
amazingly concise.

for example:

0 {(1),(2),(3),(4),(1000000000)}

is the following in BinStorage:

n20xn21n22n23n24n2
1000000000

or

00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
 |...n....2....0x.|
00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
 |.......n....2...|
00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
 |.1n....2....2n..|
00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
 |..2....3n....2..|
00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
 |..4n....2....100|
00000050  30 30 30 30 30 30 30                              |0000000|
00000057

… now, efficient integer storage is a controversial topic.

if you have short integers representing them as four bytes will waste a ton
of space.

implementing them as varints is usually a good compromise:

http://code.google.com/apis/protocolbuffers/docs/encoding.html

7 bits of each byte are used to represent the int.  One of the bits is used
to signal whether there is a next bit which needs to be read.

In my job , most of my ints will be stored in 4-8 bytes….. however, in
varint encoding they would only be 2 bytes.

A 4x savings in disk space could significantly improve performance.

I haven't benchmarked CPU of variants vs Integer.toString() though …. which
I might do now.

Still…. even if varint encoding is slower, using 4 bytes for some uses could
be a win.

-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Re: BinStorage doesn't use efficient integer encoding?

Posted by Thejas Nair <th...@hortonworks.com>.
The intermediate serialization format was improved in as part of - 
https://issues.apache.org/jira/browse/PIG-1472, it includes the move to 
varint like format.

BinStorage is also used by users as regular load/store func, so its 
binary format could not be changed, and a new internal load/store func 
was introduced.

Thanks,
Thejas


On 8/20/11 10:41 AM, Kevin Burton wrote:
> 0.9.0 ... I will investigate ....
>
> --
> Kevin Burton
>
>
> On Aug 20, 2011, at 8:31 AM, Alan Gates<ga...@hortonworks.com>  wrote:
>
>> What version of Pig are you using?  I believe we moved off of BinStorage in 0.8 and started using varint around the same time.
>>
>> Alan.
>>
>> On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:
>>
>>> I ran the benchmark now and results are pretty interesting.  Varint is about
>>> 2x performance and 1/2 the size of using integer.toString…
>>>
>>>
>>>
>>> varint encoding
>>> duration: 37,355 ms
>>> string encoding
>>> duration: 74,178 ms
>>>
>>> varint total bytes: 2,229,450,884
>>> string total bytes: 4,388,888,890
>>>
>>>
>>>
>>>        long before,after;
>>>        int max = 500000000;
>>>
>>>        System.gc();
>>>
>>>        before = System.currentTimeMillis();
>>>
>>>        System.out.printf( "varint encoding\n" );
>>>
>>>        VarintReader reader = new VarintReader();
>>>        VarintWriter writer = new VarintWriter();
>>>
>>>        for ( int i = 0; i<  max; ++i ) {
>>>            reader.read( writer.write( i ) );
>>>        }
>>>
>>>        after = System.currentTimeMillis();
>>>
>>>        System.gc();
>>>
>>>        System.out.printf( "duration: %,d ms\n", (after-before) );
>>>
>>>        before = System.currentTimeMillis();
>>>
>>>        System.out.printf( "string encoding\n" );
>>>
>>>        for ( int i = 0; i<  max; ++i ) {
>>>            Integer.parseInt( Integer.toString( i ) );
>>>        }
>>>
>>>        after = System.currentTimeMillis();
>>>
>>>        System.gc();
>>>
>>>        System.out.printf( "duration: %,d ms\n", (after-before) );
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton<bu...@spinn3r.com>  wrote:
>>>
>>>> I'm looking at BinStorage which I believe if I've read correct is used for
>>>> all Pig intermediate files.
>>>>
>>>> … so any optimizations here would be transparent to the user.
>>>>
>>>> I just did a simple STORE using BinStorage and the format doesn't appear
>>>> amazingly concise.
>>>>
>>>> for example:
>>>>
>>>> 0 {(1),(2),(3),(4),(1000000000)}
>>>>
>>>> is the following in BinStorage:
>>>>
>>>> n20xn21n22n23n24n2
>>>> 1000000000
>>>>
>>>> or
>>>>
>>>> 00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
>>>> |...n....2....0x.|
>>>> 00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
>>>> |.......n....2...|
>>>> 00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
>>>> |.1n....2....2n..|
>>>> 00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
>>>> |..2....3n....2..|
>>>> 00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
>>>> |..4n....2....100|
>>>> 00000050  30 30 30 30 30 30 30                              |0000000|
>>>> 00000057
>>>>
>>>> … now, efficient integer storage is a controversial topic.
>>>>
>>>> if you have short integers representing them as four bytes will waste a ton
>>>> of space.
>>>>
>>>> implementing them as varints is usually a good compromise:
>>>>
>>>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>>>>
>>>> 7 bits of each byte are used to represent the int.  One of the bits is used
>>>> to signal whether there is a next bit which needs to be read.
>>>>
>>>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>>>> varint encoding they would only be 2 bytes.
>>>>
>>>> A 4x savings in disk space could significantly improve performance.
>>>>
>>>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>>>> I might do now.
>>>>
>>>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>>>> could be a win.
>>>>
>>>> --
>>>>
>>>> Founder/CEO Spinn3r.com
>>>>
>>>> Location: *San Francisco, CA*
>>>> Skype: *burtonator*
>>>>
>>>> Skype-in: *(415) 871-0687*
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Founder/CEO Spinn3r.com
>>>
>>> Location: *San Francisco, CA*
>>> Skype: *burtonator*
>>>
>>> Skype-in: *(415) 871-0687*
>>


Re: BinStorage doesn't use efficient integer encoding?

Posted by Kevin Burton <bu...@gmail.com>.
0.9.0 ... I will investigate .... 

--
Kevin Burton


On Aug 20, 2011, at 8:31 AM, Alan Gates <ga...@hortonworks.com> wrote:

> What version of Pig are you using?  I believe we moved off of BinStorage in 0.8 and started using varint around the same time.
> 
> Alan.
> 
> On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:
> 
>> I ran the benchmark now and results are pretty interesting.  Varint is about
>> 2x performance and 1/2 the size of using integer.toString…
>> 
>> 
>> 
>> varint encoding
>> duration: 37,355 ms
>> string encoding
>> duration: 74,178 ms
>> 
>> varint total bytes: 2,229,450,884
>> string total bytes: 4,388,888,890
>> 
>> 
>> 
>>       long before,after;
>>       int max = 500000000;
>> 
>>       System.gc();
>> 
>>       before = System.currentTimeMillis();
>> 
>>       System.out.printf( "varint encoding\n" );
>> 
>>       VarintReader reader = new VarintReader();
>>       VarintWriter writer = new VarintWriter();
>> 
>>       for ( int i = 0; i < max; ++i ) {
>>           reader.read( writer.write( i ) );
>>       }
>> 
>>       after = System.currentTimeMillis();
>> 
>>       System.gc();
>> 
>>       System.out.printf( "duration: %,d ms\n", (after-before) );
>> 
>>       before = System.currentTimeMillis();
>> 
>>       System.out.printf( "string encoding\n" );
>> 
>>       for ( int i = 0; i < max; ++i ) {
>>           Integer.parseInt( Integer.toString( i ) );
>>       }
>> 
>>       after = System.currentTimeMillis();
>> 
>>       System.gc();
>> 
>>       System.out.printf( "duration: %,d ms\n", (after-before) );
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <bu...@spinn3r.com> wrote:
>> 
>>> I'm looking at BinStorage which I believe if I've read correct is used for
>>> all Pig intermediate files.
>>> 
>>> … so any optimizations here would be transparent to the user.
>>> 
>>> I just did a simple STORE using BinStorage and the format doesn't appear
>>> amazingly concise.
>>> 
>>> for example:
>>> 
>>> 0 {(1),(2),(3),(4),(1000000000)}
>>> 
>>> is the following in BinStorage:
>>> 
>>> n20xn21n22n23n24n2
>>> 1000000000
>>> 
>>> or
>>> 
>>> 00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
>>> |...n....2....0x.|
>>> 00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
>>> |.......n....2...|
>>> 00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
>>> |.1n....2....2n..|
>>> 00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
>>> |..2....3n....2..|
>>> 00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
>>> |..4n....2....100|
>>> 00000050  30 30 30 30 30 30 30                              |0000000|
>>> 00000057
>>> 
>>> … now, efficient integer storage is a controversial topic.
>>> 
>>> if you have short integers representing them as four bytes will waste a ton
>>> of space.
>>> 
>>> implementing them as varints is usually a good compromise:
>>> 
>>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>>> 
>>> 7 bits of each byte are used to represent the int.  One of the bits is used
>>> to signal whether there is a next bit which needs to be read.
>>> 
>>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>>> varint encoding they would only be 2 bytes.
>>> 
>>> A 4x savings in disk space could significantly improve performance.
>>> 
>>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>>> I might do now.
>>> 
>>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>>> could be a win.
>>> 
>>> --
>>> 
>>> Founder/CEO Spinn3r.com
>>> 
>>> Location: *San Francisco, CA*
>>> Skype: *burtonator*
>>> 
>>> Skype-in: *(415) 871-0687*
>>> 
>>> 
>> 
>> 
>> -- 
>> 
>> Founder/CEO Spinn3r.com
>> 
>> Location: *San Francisco, CA*
>> Skype: *burtonator*
>> 
>> Skype-in: *(415) 871-0687*
> 

Re: BinStorage doesn't use efficient integer encoding?

Posted by Alan Gates <ga...@hortonworks.com>.
What version of Pig are you using?  I believe we moved off of BinStorage in 0.8 and started using varint around the same time.

Alan.

On Aug 20, 2011, at 4:03 AM, Kevin Burton wrote:

> I ran the benchmark now and results are pretty interesting.  Varint is about
> 2x performance and 1/2 the size of using integer.toString…
> 
> 
> 
> varint encoding
> duration: 37,355 ms
> string encoding
> duration: 74,178 ms
> 
> varint total bytes: 2,229,450,884
> string total bytes: 4,388,888,890
> 
> 
> 
>        long before,after;
>        int max = 500000000;
> 
>        System.gc();
> 
>        before = System.currentTimeMillis();
> 
>        System.out.printf( "varint encoding\n" );
> 
>        VarintReader reader = new VarintReader();
>        VarintWriter writer = new VarintWriter();
> 
>        for ( int i = 0; i < max; ++i ) {
>            reader.read( writer.write( i ) );
>        }
> 
>        after = System.currentTimeMillis();
> 
>        System.gc();
> 
>        System.out.printf( "duration: %,d ms\n", (after-before) );
> 
>        before = System.currentTimeMillis();
> 
>        System.out.printf( "string encoding\n" );
> 
>        for ( int i = 0; i < max; ++i ) {
>            Integer.parseInt( Integer.toString( i ) );
>        }
> 
>        after = System.currentTimeMillis();
> 
>        System.gc();
> 
>        System.out.printf( "duration: %,d ms\n", (after-before) );
> 
> 
> 
> 
> 
> 
> 
> On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <bu...@spinn3r.com> wrote:
> 
>> I'm looking at BinStorage which I believe if I've read correct is used for
>> all Pig intermediate files.
>> 
>> … so any optimizations here would be transparent to the user.
>> 
>> I just did a simple STORE using BinStorage and the format doesn't appear
>> amazingly concise.
>> 
>> for example:
>> 
>> 0 {(1),(2),(3),(4),(1000000000)}
>> 
>> is the following in BinStorage:
>> 
>> n20xn21n22n23n24n2
>> 1000000000
>> 
>> or
>> 
>> 00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
>> |...n....2....0x.|
>> 00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
>> |.......n....2...|
>> 00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
>> |.1n....2....2n..|
>> 00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
>> |..2....3n....2..|
>> 00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
>> |..4n....2....100|
>> 00000050  30 30 30 30 30 30 30                              |0000000|
>> 00000057
>> 
>> … now, efficient integer storage is a controversial topic.
>> 
>> if you have short integers representing them as four bytes will waste a ton
>> of space.
>> 
>> implementing them as varints is usually a good compromise:
>> 
>> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>> 
>> 7 bits of each byte are used to represent the int.  One of the bits is used
>> to signal whether there is a next bit which needs to be read.
>> 
>> In my job , most of my ints will be stored in 4-8 bytes….. however, in
>> varint encoding they would only be 2 bytes.
>> 
>> A 4x savings in disk space could significantly improve performance.
>> 
>> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
>> I might do now.
>> 
>> Still…. even if varint encoding is slower, using 4 bytes for some uses
>> could be a win.
>> 
>> --
>> 
>> Founder/CEO Spinn3r.com
>> 
>> Location: *San Francisco, CA*
>> Skype: *burtonator*
>> 
>> Skype-in: *(415) 871-0687*
>> 
>> 
> 
> 
> -- 
> 
> Founder/CEO Spinn3r.com
> 
> Location: *San Francisco, CA*
> Skype: *burtonator*
> 
> Skype-in: *(415) 871-0687*


Re: BinStorage doesn't use efficient integer encoding?

Posted by Kevin Burton <bu...@spinn3r.com>.
I ran the benchmark now and results are pretty interesting.  Varint is about
2x performance and 1/2 the size of using integer.toString…



varint encoding
duration: 37,355 ms
string encoding
duration: 74,178 ms

varint total bytes: 2,229,450,884
string total bytes: 4,388,888,890



        long before,after;
        int max = 500000000;

        System.gc();

        before = System.currentTimeMillis();

        System.out.printf( "varint encoding\n" );

        VarintReader reader = new VarintReader();
        VarintWriter writer = new VarintWriter();

        for ( int i = 0; i < max; ++i ) {
            reader.read( writer.write( i ) );
        }

        after = System.currentTimeMillis();

        System.gc();

        System.out.printf( "duration: %,d ms\n", (after-before) );

        before = System.currentTimeMillis();

        System.out.printf( "string encoding\n" );

        for ( int i = 0; i < max; ++i ) {
            Integer.parseInt( Integer.toString( i ) );
        }

        after = System.currentTimeMillis();

        System.gc();

        System.out.printf( "duration: %,d ms\n", (after-before) );







On Sat, Aug 20, 2011 at 3:43 AM, Kevin Burton <bu...@spinn3r.com> wrote:

> I'm looking at BinStorage which I believe if I've read correct is used for
> all Pig intermediate files.
>
> … so any optimizations here would be transparent to the user.
>
> I just did a simple STORE using BinStorage and the format doesn't appear
> amazingly concise.
>
> for example:
>
> 0 {(1),(2),(3),(4),(1000000000)}
>
> is the following in BinStorage:
>
> n20xn21n22n23n24n2
> 1000000000
>
> or
>
> 00000000  01 02 03 6e 00 00 00 02  32 00 00 00 01 30 78 00
>  |...n....2....0x.|
> 00000010  00 00 00 00 00 00 05 6e  00 00 00 01 32 00 00 00
>  |.......n....2...|
> 00000020  01 31 6e 00 00 00 01 32  00 00 00 01 32 6e 00 00
>  |.1n....2....2n..|
> 00000030  00 01 32 00 00 00 01 33  6e 00 00 00 01 32 00 00
>  |..2....3n....2..|
> 00000040  00 01 34 6e 00 00 00 01  32 00 00 00 0a 31 30 30
>  |..4n....2....100|
> 00000050  30 30 30 30 30 30 30                              |0000000|
> 00000057
>
> … now, efficient integer storage is a controversial topic.
>
> if you have short integers representing them as four bytes will waste a ton
> of space.
>
> implementing them as varints is usually a good compromise:
>
> http://code.google.com/apis/protocolbuffers/docs/encoding.html
>
> 7 bits of each byte are used to represent the int.  One of the bits is used
> to signal whether there is a next bit which needs to be read.
>
> In my job , most of my ints will be stored in 4-8 bytes….. however, in
> varint encoding they would only be 2 bytes.
>
> A 4x savings in disk space could significantly improve performance.
>
> I haven't benchmarked CPU of variants vs Integer.toString() though …. which
> I might do now.
>
> Still…. even if varint encoding is slower, using 4 bytes for some uses
> could be a win.
>
> --
>
> Founder/CEO Spinn3r.com
>
> Location: *San Francisco, CA*
> Skype: *burtonator*
>
> Skype-in: *(415) 871-0687*
>
>


-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*