Posted to user@pig.apache.org by "Uppuluri, Rohini" <ro...@teamaol.com> on 2010/10/29 14:01:16 UTC

Regarding Multifile InputFormat patch

Hi,

I came across this patch
(https://issues.apache.org/jira/browse/PIG-1518), which adds support for a multifile input format from Pig 0.8 onwards.

A patch is also available for Pig 0.7. I was wondering if anyone has tried the patch with Pig 0.7 and could share any notes on the performance improvements it gave.

Thanks,
-Rohini


Re: Regarding Multifile InputFormat patch

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
It would be a tradeoff between data locality and the number of tasks 
executed. In some of our experiments it performed much worse (I don't 
have actual numbers, but it was in the 2x ballpark, IIRC). Of course, 
ours was a highly constrained and specialized experiment anyway!

On the other hand, the reduction in the number of tasks can be 
extremely useful for job times. In particular, in environments with 
quotas on the number of tasks, or on the number of files (for map-only 
output), the benefits can be pretty good.


I am yet to look at the patch in detail, but from what I recall, 
performance could be improved by clustering splits more intelligently, 
based on the 'locations' returned for the combined multiple-split, so 
as to ensure maximal data locality for the contained splits.
Not sure if that is in the final version ...
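
To make the idea of clustering splits by location concrete, here is a minimal 
sketch in Python. It is not the logic of the PIG-1518 patch. The function 
name combine_splits, the (name, size, hosts) tuple layout and the 
max_combined_size parameter are all assumptions made for illustration. It 
greedily packs splits that share a host into one combined split, up to a 
size cap, so that each combined split can still run on a node holding all 
of its data.

```python
from collections import defaultdict


def combine_splits(splits, max_combined_size):
    """Greedily group splits that share a host (hypothetical helper).

    splits: list of (name, size_bytes, hosts) tuples, where hosts lists
    the nodes holding a replica of that split.
    Returns a list of (host, [split names]) groups. host is None for a
    group that was assembled without regard to locality.
    """
    # Index the splits by every host that holds a replica of them.
    by_host = defaultdict(list)
    for name, size, hosts in splits:
        for host in hosts:
            by_host[host].append((name, size))

    assigned = set()
    groups = []
    # Visit hosts holding the most splits first so large local groups form early.
    for host in sorted(by_host, key=lambda h: (-len(by_host[h]), h)):
        current, current_size = [], 0
        for name, size in by_host[host]:
            if name in assigned:
                continue
            # Close the current group once adding this split would exceed the cap.
            if current and current_size + size > max_combined_size:
                groups.append((host, current))
                current, current_size = [], 0
            current.append(name)
            current_size += size
            assigned.add(name)
        if current:
            groups.append((host, current))

    # Splits that reported no location are packed without regard to locality.
    current, current_size = [], 0
    for name, size, hosts in splits:
        if name in assigned:
            continue
        if current and current_size + size > max_combined_size:
            groups.append((None, current))
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append((None, current))
    return groups


# Four small splits on three hosts, with a cap of 20 bytes per combined split.
example = [
    ("a", 10, ["h1", "h2"]),
    ("b", 10, ["h1"]),
    ("c", 10, ["h3"]),
    ("d", 10, ["h3"]),
]
print(combine_splits(example, 20))
# [('h1', ['a', 'b']), ('h3', ['c', 'd'])]
```

In this example four map tasks become two, and both combined splits remain 
local to a single host. The greedy pass is deliberately naive: with other 
replica layouts it can leave small groups behind that a smarter clustering 
would merge, which is where the locality-versus-task-count tradeoff shows up.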

Regards,
Mridul

