Posted to user@pig.apache.org by Kim Vogt <ki...@simplegeo.com> on 2011/02/04 00:52:32 UTC

Use Filename in Tuple

Hey,

I have a bunch of files where the filename is significant.  I'm loading the
files by supplying the top level directory that contains the files.  Is
there a way to capture the filename of each file and append it to the tuples
of data that come from that file?

-Kim

Re: Use Filename in Tuple

Posted by Kim Vogt <ki...@simplegeo.com>.
I switched to using the CSVLoader in piggybank, and appended the filepath to
the current RecordReader instead.

-Kim

On Thu, Feb 3, 2011 at 10:11 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> There's a CSV loader in the piggybank that does proper CSV escaping,
> if you are interested.

Re: Use Filename in Tuple

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
There's a CSV loader in the piggybank that does proper CSV escaping,
if you are interested.

On Thu, Feb 3, 2011 at 9:53 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> And to include the filename in the tuple with the data, I copied PigStorage
> (I'm loading csv), created a private PigSplit object, set this object in
> "prepareToRead", and added this code before returning the tuple in
> "getNext",
>
> if (mSplit != null) {
>    FileSplit fs = (FileSplit) mSplit.getWrappedSplit();
>    Path p = fs.getPath();
>    mProtoTuple.add(p.toString());
> }
>
> And it works!  Thanks again :-)
>
> -Kim

Re: Use Filename in Tuple

Posted by Kim Vogt <ki...@simplegeo.com>.
And to include the filename in the tuple with the data, I copied PigStorage
(I'm loading csv), created a private PigSplit object, set this object in
"prepareToRead", and added this code before returning the tuple in
"getNext",

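// mSplit is the PigSplit saved in prepareToRead
if (mSplit != null) {
    // Unwrap the underlying Hadoop split to get at the file path
    FileSplit fs = (FileSplit) mSplit.getWrappedSplit();
    Path p = fs.getPath();
    // Append the full path as the last field of the tuple
    mProtoTuple.add(p.toString());
}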

And it works!  Thanks again :-)

-Kim
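
The pattern described above (remember the split in "prepareToRead", append its path in "getNext") can be sketched without any Pig or Hadoop dependency. This is only a model under stated assumptions: Split, Loader, and readAll are hypothetical stand-ins, not Pig classes, and the comma split is naive (no CSV escaping).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Dependency-free model of the pattern; none of these are Pig or Hadoop classes.
final class PathTaggingSketch {

    // Hypothetical stand-in for a file split: a path plus its raw lines.
    static final class Split {
        final String path;
        final List<String> lines;

        Split(String path, List<String> lines) {
            this.path = path;
            this.lines = lines;
        }
    }

    // Hypothetical stand-in for the copied loader.
    static final class Loader {
        private Split split; // plays the role of mSplit
        private int next;

        // Counterpart of prepareToRead: remember which split is being read.
        void prepareToRead(Split s) {
            split = s;
            next = 0;
        }

        // Counterpart of getNext: parse one line, then append the path.
        List<String> getNext() {
            if (split == null || next >= split.lines.size()) {
                return null; // no more records in this split
            }
            String line = split.lines.get(next++);
            List<String> tuple = new ArrayList<>(Arrays.asList(line.split(",", -1)));
            tuple.add(split.path);
            return tuple;
        }
    }

    // Drive the loader over several splits, one file per split.
    static List<List<String>> readAll(List<Split> splits) {
        Loader loader = new Loader();
        List<List<String>> out = new ArrayList<>();
        for (Split s : splits) {
            loader.prepareToRead(s);
            for (List<String> t = loader.getNext(); t != null; t = loader.getNext()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Split> splits = Arrays.asList(
            new Split("/data/a.csv", Arrays.asList("aa,bb", "cc,dd")),
            new Split("/data/b.csv", Arrays.asList("ee,ff")));
        for (List<String> t : readAll(splits)) {
            System.out.println(t);
        }
    }
}
```

The model only holds while each split maps to exactly one file, which is the situation the thread assumes once split combination is off.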

On Thu, Feb 3, 2011 at 9:43 PM, Dexin Wang <wa...@gmail.com> wrote:

> wow, I almost got it right. Double quote, fails. Single quote, works.
>
> Thanks.

Re: Use Filename in Tuple

Posted by Dexin Wang <wa...@gmail.com>.
wow, I almost got it right. Double quote, fails. Single quote, works.

Thanks.

On Thu, Feb 3, 2011 at 9:40 PM, Kim Vogt <ki...@simplegeo.com> wrote:

> This should work:
>
> grunt> B = FOREACH A GENERATE f1, 'filename-2011-02-03';
>
> or
>
> grunt> B = FOREACH A GENERATE f1, '$paramName';
>
> -Kim

Re: Use Filename in Tuple

Posted by Kim Vogt <ki...@simplegeo.com>.
This should work:

grunt> B = FOREACH A GENERATE f1, 'filename-2011-02-03';

or

grunt> B = FOREACH A GENERATE f1, '$paramName';

-Kim
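
As far as I know, Pig substitutes -param values into the script text before the script is parsed, which is why the single quotes have to be in the script itself: after substitution, '$paramName' is just an ordinary quoted string literal. The sketch below is a simplified model of that textual step, not Pig's own preprocessor; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model of textual parameter substitution; not Pig's own preprocessor.
final class ParamSubstitutionSketch {
    // A parameter reference is modeled as '$' followed by an identifier.
    private static final Pattern PARAM = Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    // Replace each $name with its value; unknown names are left untouched.
    static String substitute(String script, Map<String, String> params) {
        Matcher m = PARAM.matcher(script);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("paramName", "filename-2011-02-03");
        // The quotes around $paramName survive substitution and make the
        // value a string literal in the resulting statement.
        System.out.println(substitute("B = FOREACH A GENERATE f1, '$paramName';", params));
    }
}
```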

On Thu, Feb 3, 2011 at 8:32 PM, Dexin Wang <wa...@gmail.com> wrote:

> Similarly, is it possible to insert some literal values to a tuple stream?
>
> For example, when I invoke my Pig script, I already know what data source is
> (say, it's from filename_2011-02-03), so I can just pass it to Pig using
> -param, and I want to insert this known file name to the tuple stream. How
> can I do that?
>
> Example, I have:
>
> grunt> A = LOAD 'aa' AS (f1, f2);
> grunt> DUMP A;
> (aa,bb)
> (cc,dd)
>
> I want to do something like:
>
> grunt> B = FOREACH A GENERATE f1, "filename-2011-02-03";
>
> Thanks.

Re: Use Filename in Tuple

Posted by Dexin Wang <wa...@gmail.com>.
Similarly, is it possible to insert some literal values into a tuple stream?

For example, when I invoke my Pig script, I already know what the data source
is (say, it's from filename_2011-02-03), so I can just pass it to Pig using
-param, and I want to insert this known file name into the tuple stream. How
can I do that?

Example, I have:

grunt> A = LOAD 'aa' AS (f1, f2);
grunt> DUMP A;
(aa,bb)
(cc,dd)

I want to do something like:

grunt> B = FOREACH A GENERATE f1, "filename-2011-02-03";

Thanks.

On Thu, Feb 3, 2011 at 7:49 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> In pig 6, you can hook into bindTo() and save the file name.
>
> In pig 8 you have to find your way to the underlying InputSplit via
> PigSplit.getWrappedSplit(), cast it as FileSplit, and call getPath()
> on it.. I think. Haven't done this.
>
> This will totally break if you have splitCombination turned on, of
> course, as pig can silently move to a different file under you, so
> you'd have to turn that off.
>
> D

Re: Use Filename in Tuple

Posted by Kim Vogt <ki...@simplegeo.com>.
Thanks Dmitriy!

I'm using pig 8 and no splitCombination (I don't think). I accept this challenge and will keep you pig'ites updated.

-Kim

On Feb 3, 2011, at 7:49 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> In pig 6, you can hook into bindTo() and save the file name.
> 
> In pig 8 you have to find your way to the underlying InputSplit via
> PigSplit.getWrappedSplit(), cast it as FileSplit, and call getPath()
> on it.. I think. Haven't done this.
> 
> This will totally break if you have splitCombination turned on, of
> course, as pig can silently move to a different file under you, so
> you'd have to turn that off.
> 
> D

Re: Use Filename in Tuple

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
In Pig 0.6, you can hook into bindTo() and save the file name.

In Pig 0.8 you have to find your way to the underlying InputSplit via
PigSplit.getWrappedSplit(), cast it to FileSplit, and call getPath()
on it... I think. Haven't done this.

This will totally break if you have splitCombination turned on, of
course, as pig can silently move to a different file under you, so
you'd have to turn that off.

D
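
For turning split combination off, a minimal sketch follows. It assumes the controlling property is named pig.splitCombination, which is how I recall it being documented for Pig 0.8; please verify the name against your version. The helper class is hypothetical and only builds a java.util.Properties object; how that object reaches Pig (for example through an embedding API or a -D option on the command line) is left as an assumption.

```java
import java.util.Properties;

// Hypothetical helper; only the property name is meant to correspond to Pig.
final class SplitCombinationSketch {
    // Assumed property name (as recalled from the Pig 0.8 docs); verify for your version.
    static final String KEY = "pig.splitCombination";

    // Return a copy of the given properties with split combination disabled.
    static Properties withCombinationDisabled(Properties base) {
        Properties copy = new Properties();
        copy.putAll(base);
        copy.setProperty(KEY, "false");
        return copy;
    }

    public static void main(String[] args) {
        Properties props = withCombinationDisabled(new Properties());
        System.out.println(KEY + "=" + props.getProperty(KEY));
    }
}
```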

On Thu, Feb 3, 2011 at 3:52 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> Hey,
>
> I have a bunch of files where the filename is significant.  I'm loading the
> files by supplying the top level directory that contains the files.  Is
> there a way to capture the filename of the file and append to the tuple of
> data that's in that file?
>
> -Kim
>