Posted to user@pig.apache.org by Corbin Hoenes <co...@tynt.com> on 2010/08/25 22:58:55 UTC

COUNT(A.field1)

Wondering about performance and count...
A =  load 'test.csv' as (a1,a2,a3); 
B = GROUP A by a1;
-- which is preferred?
C = FOREACH B GENERATE COUNT(A);
-- or would this only send a single field through the COUNT and be more performant? 
C = FOREACH B GENERATE COUNT(A.a2); 



Re: COUNT(A.field1)

Posted by Thejas M Nair <te...@yahoo-inc.com>.
In the case of COUNT(A) or COUNT(A.a2), the combiner gets used, so the value
sent from map to reduce is only the result of COUNT for each group on a1 in
the map. That is, the data transferred will be the same in both cases.

However, Pig can tell the loader that it needs only column a2 if you are
using COUNT(A.a2) in your query. If the loader has optimizations (selective
deserialization or columnar storage) that reduce cost when fewer columns are
requested by Pig, then you will benefit from using COUNT(A.a2).
But in the case of GROUP, I think column pruning does not work across it,
and (if so) that should change in a future release.
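A workaround, sketched here with the same relations as the original script,
is to project the needed columns yourself before the GROUP, so that no
pruning across the group is required:

A  = load 'test.csv' as (a1,a2,a3);
-- keep only the columns used later; a3 is dropped up front
A2 = FOREACH A GENERATE a1, a2;
B  = GROUP A2 BY a1;
C  = FOREACH B GENERATE group, COUNT(A2.a2);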



-Thejas




Re: COUNT(A.field1)

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks Thejas!


Re: COUNT(A.field1)

Posted by Thejas M Nair <te...@yahoo-inc.com>.
Yes, Zebra has a columnar storage format.
Regarding selective deserialization (i.e. deserializing only the columns that
are actually needed for the Pig query): as I understand it, elephant-bird has
a protocol-buffer-based loader that does lazy deserialization.
PigStorage does something similar: when PigStorage is used to load data, it
returns the bytearray type, and Pig adds a type-casting FOREACH after the
load that converts only the fields required in the rest of the query.
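
As a rough sketch of that last point (assuming a comma-delimited file and
made-up field types), the casts declared at load time only have to be applied
to the fields the rest of the script actually uses:

A = LOAD 'test.csv' USING PigStorage(',') AS (a1:chararray, a2:int, a3:chararray);
B = GROUP A BY a1;
-- only a1 and a2 are used downstream, so only they need to be converted
C = FOREACH B GENERATE group, COUNT(A.a2);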

-Thejas




Re: COUNT(A.field1)

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Thanks Dmitriy! Hey, a couple of final questions please.
Which deserializers implement this selective deserialization?
And is the columnar storage that is used Zebra?
Thanks again for the great replies.

Renato M.


Re: COUNT(A.field1)

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Pig has selective deserialization and columnar storage if the loader you are
using implements it. So that depends on what you are doing. Naturally, if
your data is not stored in a way that separates the columns, Pig can't
magically read them separately :).

You should try to always use combiners.
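
One way to check that the combiner is actually used for a script like the one
above is EXPLAIN, which prints the plan Pig will run (a sketch; the exact
output varies by Pig version -- look for a combine plan in the MapReduce plan):

A = load 'test.csv' as (a1,a2,a3);
B = GROUP A by a1;
C = FOREACH B GENERATE group, COUNT(A.a2);
EXPLAIN C;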

-D


Re: COUNT(A.field1)

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
So in terms of performance it is the same whether I count just a single column
or the whole data set, right?
But what Thejas said about the loader having optimizations (selective
deserialization or columnar storage) -- is that something Pig actually has, or
is it something planned for the future?
And shouldn't using a combiner be something we try to avoid? I mean, for the
COUNT case a combiner is needed, but are there any other operations that get
put into that combiner, like trying to reuse the computation being made?
Thanks for the replies (=

Renato M.



Re: COUNT(A.field1)

Posted by Corbin Hoenes <co...@tynt.com>.
Wow...thanks for all the discussion and insight guys.


Re: COUNT(A.field1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.

The reason why COUNT(a.field1) would have better performance is that Pig does
not 'know' what is required from a tuple in the case of COUNT(a).
In a custom map-reduce job we can optimize this so that only the single
required field is projected out, but that is obviously not possible here
(COUNT is a UDF), so the entire tuple is deserialized from the input.

Of course, the performance difference, as Dmitriy noted, would not be very
high.


Regards,
Mridul



Re: COUNT(A.field1)

Posted by Renato Marroquín Mogrovejo <re...@gmail.com>.
Hi, this is also interesting and kinda confusing for me too (=
From the DB world, the second one would have better performance, but Pig
doesn't keep statistics on the data, so it has to read the whole file anyway.
And since the count operation is mainly done on the map side, all attributes
will be read anyway; the ones that are not interesting for us are just
dismissed and not passed to the reduce part of the job. Besides, wouldn't the
presence of null values affect the performance? For example, if a2 had many
null values, then fewer values would be passed too, right?

Renato M.
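
On the null question: in recent Pig releases the builtin COUNT skips tuples
whose first field is null, while COUNT_STAR counts every tuple, so COUNT(A.a2)
can indeed return a smaller number than COUNT_STAR(A) when a2 has many nulls.
A small sketch over the same grouped relation:

C1 = FOREACH B GENERATE group, COUNT(A.a2);    -- non-null a2 values per group
C2 = FOREACH B GENERATE group, COUNT_STAR(A);  -- every tuple in the group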


Re: COUNT(A.field1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
On second thoughts, that part is obvious - duh

- Mridul


Re: COUNT(A.field1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
But it does for COUNT(A.a2)?
That is interesting, and somewhat weird :)

Thanks !
Mridul


Re: COUNT(A.field1)

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
I think if you do COUNT(A), Pig will not realize it can ignore a2 and a3, and
will project all of them.


Re: COUNT(A.field1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
I am not sure why the second option is better - in both cases, you are
shipping only the combined counts from map to reduce.
On the other hand, the first could be better since it means we need to project
only 'a1' - and none of the other fields.

Or did I miss something here?
I am not very familiar with what Pig does in this case right now.

Regards,
Mridul


Re: COUNT(A.field1)

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Generally speaking, the second option will be more performant as it might
let you drop column a3 early. In most cases the magnitude of this is likely
to be very small as COUNT is an algebraic function, so most of the work is
done map-side anyway, and only partial, pre-aggregated counts are shipped
from mappers to reducers. However, if A is very wide, or a column store, or
has non-negligible deserialization cost that can be offset by only
deserializing a few fields -- the second option is better.

-D
