Posted to dev@pig.apache.org by Thejas Nair <te...@yahoo-inc.com> on 2009/11/04 01:28:37 UTC

LoadFunc.skipNext() function for faster sampling ?

In the new implementation of the SampleLoader subclasses (used by order-by,
skew-join, etc.), written as part of the loader redesign, we not only read
every input record but also parse each one into a Pig tuple.

This is because the SampleLoaders are wrappers around the actual input
loaders specified in the query. We could make sampling much faster by adding
a skipNext() function (or skipNext(int numSkip)) that skips a record without
parsing it into a Pig tuple.
A LoadFunc could optionally implement this function (it is easy to implement
and would be part of an interface) to speed up queries such as order-by.
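
To make the proposal concrete, here is a rough sketch of what an optional
skip interface might look like. It uses no Pig classes: Loader,
SkippableLoader, CsvLoader and sample() are hypothetical names invented for
illustration, and a List<String> stands in for a Pig tuple. The sampler uses
skipNext() when the loader offers it and otherwise falls back to parsing and
discarding records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SkipNextSketch {

    /** Hypothetical stand-in for a loader; this is NOT Pig's real LoadFunc. */
    interface Loader {
        /** Reads and parses the next record, or returns null at end of input. */
        List<String> getNext();
    }

    /** Hypothetical optional interface carrying the proposed method. */
    interface SkippableLoader extends Loader {
        /** Advances past up to numSkip records without parsing them; returns the number skipped. */
        int skipNext(int numSkip);
    }

    /** Toy loader over in-memory comma-separated lines (assumption: parsing is the costly step). */
    static class CsvLoader implements SkippableLoader {
        private final Iterator<String> lines;
        int parsedCount = 0; // how many records were actually parsed

        CsvLoader(List<String> input) {
            this.lines = input.iterator();
        }

        @Override
        public List<String> getNext() {
            if (!lines.hasNext()) {
                return null;
            }
            parsedCount++;
            return Arrays.asList(lines.next().split(","));
        }

        @Override
        public int skipNext(int numSkip) {
            int skipped = 0;
            while (skipped < numSkip && lines.hasNext()) {
                lines.next(); // consume the raw record, no parsing
                skipped++;
            }
            return skipped;
        }
    }

    /** Keeps one record after every `gap` skipped records. */
    static List<List<String>> sample(Loader loader, int gap) {
        List<List<String>> out = new ArrayList<>();
        while (true) {
            if (loader instanceof SkippableLoader) {
                ((SkippableLoader) loader).skipNext(gap);
            } else {
                // Fallback: the loader can only skip by parsing and discarding.
                for (int i = 0; i < gap; i++) {
                    if (loader.getNext() == null) {
                        return out;
                    }
                }
            }
            List<String> tuple = loader.getNext();
            if (tuple == null) {
                return out;
            }
            out.add(tuple);
        }
    }
}
```

With six input records and a gap of 2, this sketch parses only the two
sampled records instead of all six.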

-Thejas


Re: LoadFunc.skipNext() function for faster sampling ?

Posted by Thejas Nair <te...@yahoo-inc.com>.
Yes, that should work. I will use InputFormat.getNext from the SampleLoader
to skip the records.
Thanks,
Thejas


On 11/3/09 6:39 PM, "Alan Gates" <ga...@yahoo-inc.com> wrote:

> We definitely want to avoid parsing every tuple when sampling.  But do
> we need to implement a special function for it?  Pig will have access
> to the InputFormat instance, correct?  Can it not call
> InputFormat.getNext the desired number of times (which will not parse
> the tuple) and then call LoadFunc.getNext to get the next parsed tuple?
> 
> Alan.
> 


Re: LoadFunc.skipNext() function for faster sampling ?

Posted by Alan Gates <ga...@yahoo-inc.com>.
We definitely want to avoid parsing every tuple when sampling.  But do  
we need to implement a special function for it?  Pig will have access  
to the InputFormat instance, correct?  Can it not call  
InputFormat.getNext the desired number of times (which will not parse  
the tuple) and then call LoadFunc.getNext to get the next parsed tuple?
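
A minimal sketch of this idea, under some assumptions: the sampler and the
loader share one raw record source, the sampler advances that source directly
to skip records, and only then asks the loader for a parsed tuple. RawReader,
ParsingLoader and sample() are hypothetical stand-ins, not Pig or Hadoop
classes, and how the raw reader is actually exposed to the SampleLoader is
not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReaderSkipSketch {

    /** Hypothetical raw record source; advancing it does no parsing. */
    static class RawReader {
        private final List<String> records;
        private int pos = -1;

        RawReader(List<String> records) {
            this.records = records;
        }

        /** Moves to the next raw record; returns false at end of input. */
        boolean next() {
            if (pos + 1 >= records.size()) {
                return false;
            }
            pos++;
            return true;
        }

        /** Returns the raw record at the current position. */
        String current() {
            return records.get(pos);
        }
    }

    /** Hypothetical loader that parses records pulled from the shared reader. */
    static class ParsingLoader {
        private final RawReader reader;
        int parsedCount = 0; // how many records were actually parsed

        ParsingLoader(RawReader reader) {
            this.reader = reader;
        }

        /** Advances the reader by one record and parses it; null at end of input. */
        List<String> getNext() {
            if (!reader.next()) {
                return null;
            }
            parsedCount++;
            return Arrays.asList(reader.current().split(","));
        }
    }

    /** Skips `gap` raw records via the reader, then takes one parsed tuple from the loader. */
    static List<List<String>> sample(RawReader reader, ParsingLoader loader, int gap) {
        List<List<String>> out = new ArrayList<>();
        while (true) {
            for (int i = 0; i < gap; i++) {
                if (!reader.next()) {
                    return out; // input exhausted while skipping
                }
            }
            List<String> tuple = loader.getNext();
            if (tuple == null) {
                return out;
            }
            out.add(tuple);
        }
    }
}
```

In this arrangement the loader needs no new method; the cost is that the
sampler must be able to reach the same reader instance the loader uses.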

Alan.
