Posted to user@pig.apache.org by Corbin Hoenes <co...@tynt.com> on 2010/09/02 20:09:47 UTC
Re: COUNT(A.field1)
Wow...thanks for all the discussion and insight guys.
On Aug 29, 2010, at 10:01 AM, Mridul Muralidharan wrote:
>
>
> The reason COUNT(a.field1) would perform better is that Pig does not 'know' what is required from a tuple in the case of COUNT(a).
> In a custom MapReduce job, we can optimize it away so that only the single required field is projected out, but that is obviously not possible here (COUNT is a UDF), so the entire tuple is deserialized from input.
>
> Of course, the performance difference, as Dmitriy noted, would not be very high.
>
>
> Regards,
> Mridul
>
>
> On Sunday 29 August 2010 01:14 AM, Renato Marroquín Mogrovejo wrote:
>> Hi, this is interesting and kind of confusing for me too (=
>> From the DB world, the second one would perform better. But Pig doesn't
>> keep statistics on the data, so it has to read the whole file anyway.
>> And since the count operation is mainly done on the map side, all
>> attributes will be read anyway; the ones we are not interested in will
>> be dropped and not passed to the reduce side of the job. Besides,
>> wouldn't the presence of null values affect performance? For example,
>> if a2 had many null values, then fewer values would be passed along
>> too, right?
>>
>> Renato M.
>>
>> 2010/8/27 Mridul Muralidharan <mr...@yahoo-inc.com>
>>
>>>
>>> On second thoughts, that part is obvious - duh
>>>
>>> - Mridul
>>>
>>>
>>> On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
>>>
>>>>
>>>> But it does for COUNT(A.a2) ?
>>>> That is interesting, and somehow weird :)
>>>>
>>>> Thanks !
>>>> Mridul
>>>>
>>>> On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
>>>>
>>>>> I think if you do COUNT(A), Pig will not realize it can ignore a2 and
>>>>> a3, and will project all of them.
>>>>>
>>>>> On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
>>>>> <mr...@yahoo-inc.com> wrote:
>>>>>
>>>>>
>>>>> I am not sure why the second option is better - in both cases, you are
>>>>> shipping only the combined counts from map to reduce.
>>>>> On the other hand, the first could be better since it means we need to
>>>>> project only 'a1' - and none of the other fields.
>>>>>
>>>>> Or did I miss something here ?
>>>>> I am not very familiar with what Pig does in this case right now.
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>>
>>>>> On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
>>>>>
>>>>> Generally speaking, the second option will be more performant as it
>>>>> might let you drop column a3 early. In most cases the magnitude of this
>>>>> is likely to be very small as COUNT is an algebraic function, so most
>>>>> of the work is done map-side anyway, and only partial, pre-aggregated
>>>>> counts are shipped from mappers to reducers. However, if A is very
>>>>> wide, or a column store, or has non-negligible deserialization cost
>>>>> that can be offset by only deserializing a few fields -- the second
>>>>> option is better.
>>>>>
>>>>> -D
>>>>>
>>>>> On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes <corbin@tynt.com> wrote:
>>>>>
>>>>> Wondering about performance and count...
>>>>> A = load 'test.csv' as (a1,a2,a3);
>>>>> B = GROUP A by a1;
>>>>> -- which preferred?
>>>>> C = FOREACH B GENERATE COUNT(A);
>>>>> -- or would this only send a single field through the COUNT and be more performant?
>>>>> C = FOREACH B GENERATE COUNT(A.a2);
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>
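
For reference, the early projection discussed in this thread can also be written out explicitly instead of relying on Pig to infer it. The following is only a sketch built on the script from the original question: the alias A1, the comma delimiter passed to PigStorage, and the DUMP at the end are assumptions added for illustration, not something any poster confirmed.

```pig
-- Load the three columns from the original question.
-- Assumption: the file is comma-delimited (the original LOAD used the default loader).
A = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3);

-- Keep only the grouping key and the counted field, so a3 is dropped before the GROUP.
-- A1 is a hypothetical alias introduced for this sketch.
A1 = FOREACH A GENERATE a1, a2;

B = GROUP A1 BY a1;

-- COUNT is algebraic, so partial counts are computed map-side and summed on the reducers.
C = FOREACH B GENERATE group, COUNT(A1.a2);

DUMP C;
```

On the null question raised above: as far as the Pig documentation describes it, COUNT skips tuples whose first field is null, so COUNT(A) and COUNT(A.a2) may return different numbers when a2 contains nulls, while COUNT_STAR counts every tuple.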