Posted to user@pig.apache.org by Andrew Rothstein <an...@gmail.com> on 2010/04/29 22:13:26 UTC

LoadFunc.bindTo in pig 0.6.0

I'm writing a user defined LoadFunc. In the bindTo function the
fileName parameter appears as the verbatim text passed as the
parameter to the LOAD function in my script. In the case where I'm
processing multiple files from a directory, is there a way I can
determine the name of the underlying data file that the LoadFunc
instance is bound to?

regards, Andrew

Re: LoadFunc.bindTo in pig 0.6.0

Posted by Andrew Rothstein <an...@gmail.com>.
Thanks for the tip. I'm reading over the example in the UDF manual and
trying to make this work. The RangeSlicer example is sufficiently
contrived that I'm not sure how to extend it to my problem.

Ultimately I want to split up input files but inject into the tuple
stream the name of the file that each record was read from.

cat 20100427.TXT;
foo
kmee
cat 20100426.TXT;
foo
bar
A = LOAD '*.TXT' using MySlicer;
DUMP A;
('20100427', 'foo')
('20100427', 'kmee')
('20100426', 'foo')
('20100426', 'bar')

It's not really clear to me what the responsibility of the Slicer is
versus the Slice, nor how either interacts with the LoadFunc interface.
I took my LoadFunc implementation and added Slicer to its list of
implemented interfaces, then added stub implementations of the slice and
validate functions to try to determine the relationship empirically.
These functions do not appear to be invoked at all, and I don't know
how to make sense of that.
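
To make the goal concrete, here is a minimal sketch in plain Java of the
behavior described above: one slice per input file, with each slice tagging
its records with the name of the file they came from. It deliberately does
not use Pig's Slicer, Slice, or LoadFunc interfaces. FileTaggingSketch,
FileSlice, slice, stem, and records are all hypothetical names, and the
in-memory map stands in for the files matched by '*.TXT'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustration only. These types are hypothetical stand-ins and are NOT
 * Pig's Slicer/Slice interfaces; they only model the intended split of work.
 */
public class FileTaggingSketch {

    /** Stand-in for a slice: one unit of input that remembers its source file. */
    public static final class FileSlice {
        private final String fileName;
        private final List<String> lines;

        public FileSlice(String fileName, List<String> lines) {
            this.fileName = fileName;
            this.lines = lines;
        }

        /** Emits one (fileStem, record) pair per input line. */
        public List<List<String>> records() {
            String tag = stem(fileName);
            List<List<String>> out = new ArrayList<>();
            for (String line : lines) {
                out.add(Arrays.asList(tag, line));
            }
            return out;
        }
    }

    /** Drops any directory prefix and the final extension: "a/20100427.TXT" becomes "20100427". */
    public static String stem(String fileName) {
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String base = fileName.substring(slash + 1);
        int dot = base.lastIndexOf('.');
        return dot > 0 ? base.substring(0, dot) : base;
    }

    /** Stand-in for a slicer: produces one slice per matched file. */
    public static List<FileSlice> slice(Map<String, List<String>> files) {
        List<FileSlice> slices = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : files.entrySet()) {
            slices.add(new FileSlice(e.getKey(), e.getValue()));
        }
        return slices;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the two sample files shown earlier.
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("20100427.TXT", Arrays.asList("foo", "kmee"));
        files.put("20100426.TXT", Arrays.asList("foo", "bar"));

        for (FileSlice s : slice(files)) {
            for (List<String> record : s.records()) {
                System.out.println(record);
            }
        }
    }
}
```

The point of the sketch is only that the file name is known at the moment a
slice is created, so whatever produces the records can carry it along.
Whether Pig 0.6 divides the work this way is exactly the open question.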

Am I reinventing the wheel here? Has anyone implemented anything like
what I'm talking about already?

-Andrew

On Thu, Apr 29, 2010 at 4:26 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> Andrew, in 0.6 you can use slices for that. Check out the elephant bird
> code, which does this for lzo files:
>
> http://www.github.com/kevinweil/elephant-bird
>
> Specifically you want to look at LzoBaseLoadFunc and LzoSlice
>
> -D

Re: LoadFunc.bindTo in pig 0.6.0

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Andrew, in 0.6 you can use slices for that. Check out the elephant bird
code, which does this for lzo files:

http://www.github.com/kevinweil/elephant-bird

Specifically, you want to look at LzoBaseLoadFunc and LzoSlice.

-D
