Posted to dev@pig.apache.org by Thejas Nair <te...@yahoo-inc.com> on 2010/06/29 02:08:31 UTC

Avoiding serialization/de-serialization in pig

I have created a wiki page that puts together some ideas that can help in
improving performance by avoiding/delaying serialization/de-serialization.

http://wiki.apache.org/pig/AvoidingSedes

These are ideas that don't involve changes to the optimizer. Most of them
involve changes in the load/store functions.

Your feedback is welcome.

Thanks,
Thejas

Re: Avoiding serialization/de-serialization in pig

Posted by Russell Jurney <ru...@gmail.com>.

I don't fully understand the repercussions of this, but I like it.  We're
moving from our VoldemortStorage stuff to Avro and it would be great to pipe
Avro all the way through.

Russ

On Mon, Jun 28, 2010 at 5:51 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> For what it's worth, I saw very significant speed improvements (order of
> magnitude for wide tables with few projected columns) when I implemented (2)
> for our protocol buffer - based loaders.
>
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?
>
> -D
>
> On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
> > I have created a wiki which puts together some ideas that can help in
> > improving performance by avoiding/delaying serialization/de-serialization.
> >
> > http://wiki.apache.org/pig/AvoidingSedes
> >
> > These are ideas that don't involve changes to optimizer. Most of them
> > involve changes in the load/store functions.
> >
> > Your feedback is welcome.
> >
> > Thanks,
> > Thejas
> >
> >
>

Re: Avoiding serialization/de-serialization in pig

Posted by Jeff Zhang <zj...@gmail.com>.

Agreed. I have compared the performance of Hive and Pig using some simple
scripts, and Hive performs much better than Pig. The mapper task times of Pig
and Hive are almost the same. The time difference is caused almost entirely by
the reduce task, and much of that time is spent transferring data from mapper
to reducer. This is because Pig transfers much more data than Hive. Hive uses
another binary format
(HIVE-640, https://issues.apache.org/jira/browse/HIVE-640)
which can reduce the intermediate data between mapper and reducer. Avro is
very similar to this, and it is more compact. I believe it will improve Pig's
performance.

On Tue, Jun 29, 2010 at 8:51 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> For what it's worth, I saw very significant speed improvements (order of
> magnitude for wide tables with few projected columns) when I implemented (2)
> for our protocol buffer - based loaders.
>
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?
>
> -D
>
> On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
>> I have created a wiki which puts together some ideas that can help in
>> improving performance by avoiding/delaying serialization/de-serialization.
>>
>> http://wiki.apache.org/pig/AvoidingSedes
>>
>> These are ideas that don't involve changes to optimizer. Most of them
>> involve changes in the load/store functions.
>>
>> Your feedback is welcome.
>>
>> Thanks,
>> Thejas
>>
>>
>

-- 
Best Regards

Jeff Zhang

Re: Avoiding serialization/de-serialization in pig

Posted by Alan Gates <ga...@yahoo-inc.com>.

On Jun 28, 2010, at 5:51 PM, Dmitriy Ryaboy wrote:

> For what it's worth, I saw very significant speed improvements (order of
> magnitude for wide tables with few projected columns) when I implemented (2)
> for our protocol buffer - based loaders.
>
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
>
> Thoughts on just using Avro for the internal PigStorage?

I've been trying to play with this in my spare time but haven't gotten far
yet.  We're certainly open to looking at it and seeing how it performs.

Alan.

>
> -D
>
> On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair <te...@yahoo-inc.com> wrote:
>
>> I have created a wiki which puts together some ideas that can help in
>> improving performance by avoiding/delaying serialization/de-serialization.
>>
>> http://wiki.apache.org/pig/AvoidingSedes
>>
>> These are ideas that don't involve changes to optimizer. Most of them
>> involve changes in the load/store functions.
>>
>> Your feedback is welcome.
>>
>> Thanks,
>> Thejas
>>
>>

Re: Avoiding serialization/de-serialization in pig

Posted by Thejas Nair <te...@yahoo-inc.com>.

On 6/28/10 5:51 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

> 
> I have a feeling that propagating schemas when known, and using them for
> (de)serialization instead of reflecting every field, would also be a big
> win.
> 
> Thoughts on just using Avro for the internal PigStorage?

When I profiled Pig queries, I didn't see much time being spent in
DataType.findType(Object o), where the type of an object is determined using
"instanceof". (I am assuming that is what you were referring to.)

But we can still optimize the cases where the schema is known (i.e. all rows
have the same schema) by not storing the type with each field in the
serialization format. Avro stores the schema separately, so I assume it has
this optimization. But in the case where the schema is not known, we would
need to store the type information for every row.
When the query plan is generated, we would need to determine which
serialization format is to be used.

-Thejas
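
A minimal sketch of the trade-off described above, written in Python only for
illustration. It is not Pig's or Avro's actual serialization format: the tag
values, the function names and the three supported types are all assumptions
made for this example. One encoder writes a type tag in front of every field
(the schema-unknown case); the other relies on a schema that is kept
separately and writes only the values.

```python
import struct

# Hypothetical one-byte type tags; these are NOT Pig's actual DataType codes.
TAG_INT, TAG_DOUBLE, TAG_STR = 1, 2, 3


def encode_tagged(row):
    """Self-describing encoding: every field is preceded by its type tag."""
    out = bytearray()
    for value in row:
        if isinstance(value, int):
            out += struct.pack(">Bq", TAG_INT, value)
        elif isinstance(value, float):
            out += struct.pack(">Bd", TAG_DOUBLE, value)
        elif isinstance(value, str):
            data = value.encode("utf-8")
            out += struct.pack(">BI", TAG_STR, len(data)) + data
        else:
            raise TypeError("unsupported type: %r" % type(value))
    return bytes(out)


def decode_tagged(buf):
    """Read the tag of each field to find out how to decode it."""
    row, pos = [], 0
    while pos < len(buf):
        tag = buf[pos]
        pos += 1
        if tag == TAG_INT:
            (value,) = struct.unpack_from(">q", buf, pos)
            pos += 8
        elif tag == TAG_DOUBLE:
            (value,) = struct.unpack_from(">d", buf, pos)
            pos += 8
        elif tag == TAG_STR:
            (size,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            value = buf[pos:pos + size].decode("utf-8")
            pos += size
        else:
            raise ValueError("unknown tag: %d" % tag)
        row.append(value)
    return row


def encode_with_schema(row, schema):
    """Schema-driven encoding: the types live in `schema`, not in the bytes."""
    out = bytearray()
    for value, kind in zip(row, schema):
        if kind == "int":
            out += struct.pack(">q", value)
        elif kind == "double":
            out += struct.pack(">d", value)
        elif kind == "str":
            data = value.encode("utf-8")
            out += struct.pack(">I", len(data)) + data
        else:
            raise ValueError("unknown schema type: %r" % kind)
    return bytes(out)


def decode_with_schema(buf, schema):
    """Decode using the separately stored schema; no per-field tags to read."""
    row, pos = [], 0
    for kind in schema:
        if kind == "int":
            (value,) = struct.unpack_from(">q", buf, pos)
            pos += 8
        elif kind == "double":
            (value,) = struct.unpack_from(">d", buf, pos)
            pos += 8
        elif kind == "str":
            (size,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            value = buf[pos:pos + size].decode("utf-8")
            pos += size
        else:
            raise ValueError("unknown schema type: %r" % kind)
        row.append(value)
    return row
```

In this toy format the schema-driven encoding saves exactly one byte per
field, and the decoder no longer branches on data read from the stream. The
real savings in Pig would depend on the actual format chosen.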

Re: Avoiding serialization/de-serialization in pig

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

For what it's worth, I saw very significant speed improvements (order of
magnitude for wide tables with few projected columns) when I implemented (2)
for our protocol buffer-based loaders.

I have a feeling that propagating schemas when known, and using them for
(de)serialization instead of reflecting every field, would also be a big
win.

Thoughts on just using Avro for the internal PigStorage?

-D

On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair <te...@yahoo-inc.com> wrote:

> I have created a wiki which puts together some ideas that can help in
> improving performance by avoiding/delaying serialization/de-serialization.
>
> http://wiki.apache.org/pig/AvoidingSedes
>
> These are ideas that don't involve changes to optimizer. Most of them
> involve changes in the load/store functions.
>
> Your feedback is welcome.
>
> Thanks,
> Thejas
>
>
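
The general idea of delaying de-serialization, and why it can pay off for
wide tables with few projected columns, can be sketched as follows. This is a
Python illustration only, not Pig's loader API and not necessarily what item
(2) on the wiki describes (the wiki text is not reproduced in this thread).
The class name, the tab-delimited record layout and the converter functions
are all assumptions made for the example.

```python
class LazyRow:
    """Keeps the raw fields of a record and converts one only when it is read."""

    def __init__(self, raw, converters, delimiter=b"\t"):
        # Splitting is cheap; no field is converted to a typed value yet.
        self._fields = raw.split(delimiter)
        self._converters = converters
        self._cache = {}
        self.conversions = 0  # counts how many fields were actually converted

    def get(self, index):
        """Return the typed value of a field, converting it on first access."""
        if index not in self._cache:
            convert = self._converters[index]
            self._cache[index] = convert(self._fields[index])
            self.conversions += 1
        return self._cache[index]


def to_int(raw):
    return int(raw.decode("ascii"))


def to_float(raw):
    return float(raw.decode("ascii"))


def to_str(raw):
    return raw.decode("utf-8")
```

A query that projects a single column out of a four-column record then pays
for one conversion instead of four:

    row = LazyRow(b"1\t2.5\tpig\t99", [to_int, to_float, to_str, to_int])
    row.get(3)        # converts only the last field
    row.conversions   # 1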