Posted to user@pig.apache.org by Corbin Hoenes <co...@tynt.com> on 2010/09/02 20:09:47 UTC

Re: COUNT(A.field1)

Wow... thanks for all the discussion and insight, guys.

On Aug 29, 2010, at 10:01 AM, Mridul Muralidharan wrote:

> 
> 
> The reason COUNT(a.field1) would perform better is that Pig does not 'know' which fields of a tuple are required in the case of COUNT(a).
> In a custom MapReduce job, we could optimize this so that only the single required field is projected out, but that is not possible here (COUNT is a UDF), so the entire tuple is deserialized from the input.
> 
> Of course, the performance difference, as Dmitriy noted, would not be very large.
> 
> 
> Regards,
> Mridul
> 
> 
> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
>> Hi, this is interesting and kind of confusing for me too (=
>> From the DB world, the second one would perform better, but Pig
>> doesn't keep statistics on the data, so it has to read the whole file
>> anyway. Since the count operation is mainly done on the map side, all
>> attributes will be read anyway, but the ones we are not interested in
>> will be discarded and not passed to the reduce side of the job.
>> Besides, wouldn't the presence of null values affect performance? For
>> example, if a2 had many null values, then fewer values would be passed
>> on too, right?
>> 
>> Renato M.
>> 
>> 2010/8/27 Mridul Muralidharan <mr...@yahoo-inc.com>
>> 
>>> 
>>> On second thoughts, that part is obvious - duh
>>> 
>>> - Mridul
>>> 
>>> 
>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
>>> 
>>>> 
>>>> But it does for COUNT(A.a2) ?
>>>> That is interesting, and somehow weird :)
>>>> 
>>>> Thanks !
>>>> Mridul
>>>> 
>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
>>>> 
>>>>> I think if you do COUNT(A), Pig will not realize it can ignore a2 and
>>>>> a3, and will project all of them.
>>>>> 
>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
>>>>> <mr...@yahoo-inc.com> wrote:
>>>>> 
>>>>> 
>>>>>     I am not sure why the second option is better - in both cases, you are
>>>>>     shipping only the combined counts from map to reduce.
>>>>>     On the other hand, the first could be better since it means we need to
>>>>>     project only 'a1' - and none of the other fields.
>>>>> 
>>>>>     Or did I miss something here?
>>>>>     I am not very familiar with what Pig does in this case right now.
>>>>> 
>>>>>     Regards,
>>>>>     Mridul
>>>>> 
>>>>> 
>>>>>     On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
>>>>> 
>>>>>         Generally speaking, the second option will be more performant as it
>>>>>         might let you drop column a3 early. In most cases the magnitude of
>>>>>         this is likely to be very small as COUNT is an algebraic function,
>>>>>         so most of the work is done map-side anyway, and only partial,
>>>>>         pre-aggregated counts are shipped from mappers to reducers. However,
>>>>>         if A is very wide, or a column store, or has non-negligible
>>>>>         deserialization cost that can be offset by only deserializing a few
>>>>>         fields -- the second option is better.
>>>>> 
>>>>>         -D
>>>>> 
>>>>>         On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes <corbin@tynt.com> wrote:
>>>>> 
>>>>>             Wondering about performance and COUNT...
>>>>>             A = LOAD 'test.csv' AS (a1, a2, a3);
>>>>>             B = GROUP A BY a1;
>>>>>             -- which is preferred?
>>>>>             C = FOREACH B GENERATE COUNT(A);
>>>>>             -- or would this send only a single field through COUNT and be more performant?
>>>>>             C = FOREACH B GENERATE COUNT(A.a2);
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>
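
For anyone who finds this thread later: one way to avoid depending on what Pig infers is to project the needed columns explicitly before the GROUP. The sketch below is only an illustration, not something any poster in the thread ran. The relation name P, the output aliases, and the comma delimiter passed to PigStorage are assumptions (the original script used the default loader). It also touches on the null question raised above: per the Pig documentation, COUNT skips tuples whose first field is null, while COUNT_STAR counts every tuple, so COUNT(A) and COUNT(A.a2) can return different numbers as well as cost different amounts.

```pig
-- Assumption: test.csv is comma-delimited; adjust or drop the USING clause otherwise.
A = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3);

-- Explicit projection: only a1 and a2 are carried into the GROUP; a3 is dropped early.
P = FOREACH A GENERATE a1, a2;
B = GROUP P BY a1;

-- COUNT(P.a2) counts tuples with a non-null a2; COUNT_STAR(P) counts all tuples in the group.
C = FOREACH B GENERATE group, COUNT(P.a2) AS non_null_a2, COUNT_STAR(P) AS all_rows;
DUMP C;
```

Whether the explicit projection makes a measurable difference depends on the Pig version and the loader, as discussed above; comparing the EXPLAIN output of both variants is probably the most reliable check.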