Posted to user@pig.apache.org by Kim Vogt <ki...@simplegeo.com> on 2011/02/04 00:52:32 UTC
Use Filename in Tuple
Hey,
I have a bunch of files where the filename is significant. I'm loading the
files by supplying the top level directory that contains the files. Is
there a way to capture the filename of the file and append to the tuple of
data that's in that file?
-Kim
Re: Use Filename in Tuple
Posted by Kim Vogt <ki...@simplegeo.com>.
I switched to using the CSVLoader in piggybank, and appended the filepath to
the current RecordReader instead.
-Kim
On Thu, Feb 3, 2011 at 10:11 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> There's a CSV loader in the piggybank that does proper CSV escaping,
> if you are interested.
Re: Use Filename in Tuple
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
There's a CSV loader in the piggybank that does proper CSV escaping,
if you are interested.
On Thu, Feb 3, 2011 at 9:53 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> And to include the filename in the tuple with the data, I copied PigStorage
> (I'm loading csv), created a private PigSplit object, set this object in
> "prepareToRead", and added this code before returning the tuple in
> "getNext",
>
> if (mSplit != null) {
> FileSplit fs = (FileSplit) mSplit.getWrappedSplit();
> Path p = fs.getPath();
> mProtoTuple.add(p.toString());
> }
>
> And it works! Thanks again :-)
>
> -Kim
Re: Use Filename in Tuple
Posted by Kim Vogt <ki...@simplegeo.com>.
And to include the filename in the tuple with the data, I copied PigStorage
(I'm loading CSV), added a private PigSplit field, set it in
"prepareToRead", and added this code before returning the tuple in
"getNext":
    // mSplit is the PigSplit saved in prepareToRead()
    if (mSplit != null) {
        // Assumes the wrapped split is a FileSplit (file-based input)
        FileSplit fs = (FileSplit) mSplit.getWrappedSplit();
        Path p = fs.getPath();
        // Append the full file path as the last field of the tuple
        mProtoTuple.add(p.toString());
    }
And it works! Thanks again :-)
-Kim
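For anyone following along, using such a loader from a script might look roughly like the sketch below. The jar name, the class name myudfs.FilenameStorage, the input directory, and the field names are all hypothetical; only the approach (a copy of PigStorage that appends the file path as the last field) comes from the message above.

```pig
-- Hypothetical jar containing the modified copy of PigStorage.
REGISTER myudfs.jar;

-- Load every file under the top-level directory. The loader appends the
-- source file path as an extra trailing field, so name it in the schema.
A = LOAD 'logs' USING myudfs.FilenameStorage(',') AS (f1, f2, filepath);

-- Keep one data field alongside the path of the file it came from.
B = FOREACH A GENERATE f1, filepath;
DUMP B;
```

The schema is left untyped here, as in the other examples in this thread, since the extra field is added by the loader as a plain Java String rather than as raw bytes.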
On Thu, Feb 3, 2011 at 9:43 PM, Dexin Wang <wa...@gmail.com> wrote:
> wow, I almost got it right. Double quote, fails. Single quote, works.
>
> Thanks.
Re: Use Filename in Tuple
Posted by Dexin Wang <wa...@gmail.com>.
wow, I almost got it right. Double quote, fails. Single quote, works.
Thanks.
On Thu, Feb 3, 2011 at 9:40 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> This should work:
>
> grunt> B = FOREACH A GENERATE f1, 'filename-2011-02-03';
>
> or
>
> grunt> B = FOREACH A GENERATE f1, '$paramName';
>
> -Kim
Re: Use Filename in Tuple
Posted by Kim Vogt <ki...@simplegeo.com>.
This should work:
grunt> B = FOREACH A GENERATE f1, 'filename-2011-02-03';
or
grunt> B = FOREACH A GENERATE f1, '$paramName';
-Kim
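A sketch of how the second form might be used from a script file. The script name tag.pig and the alias "source" are made up for illustration; -param and $paramName substitution are the Pig features mentioned in this thread.

```pig
-- tag.pig (hypothetical script name). Invoke with, for example:
--   pig -param paramName=filename-2011-02-03 tag.pig
A = LOAD 'aa' AS (f1, f2);

-- Pig substitutes the parameter textually before parsing the script, so
-- the single quotes are what turn the substituted value into a string
-- constant.
B = FOREACH A GENERATE f1, '$paramName' AS source;
DUMP B;
```

Every tuple in B then carries the same constant value in its second field, which is enough when one run of the script corresponds to one known input file.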
On Thu, Feb 3, 2011 at 8:32 PM, Dexin Wang <wa...@gmail.com> wrote:
> Similarly, is it possible to insert some literal values to a tuple stream?
>
> For example, when I invoke my Pig script, I already know what data source
> is
> (say, it's from filename_2011-02-03), so I can just pass it to Pig using
> -param, and I want to insert this known file name to the tuple stream. How
> can I do that?
>
> Example, I have:
>
> grunt> A = LOAD 'aa' AS (f1, f2);
> grunt> DUMP A;
> (aa,bb)
> (cc,dd)
>
> I want to do something like:
>
> grunt> B = FOREACH A GENERATE f1, "filename-2011-02-03";
>
> Thanks.
Re: Use Filename in Tuple
Posted by Dexin Wang <wa...@gmail.com>.
Similarly, is it possible to insert literal values into a tuple stream?
For example, when I invoke my Pig script, I already know what the data
source is (say, it's from filename_2011-02-03), so I can just pass it to Pig
using -param, and I want to insert this known file name into the tuple
stream. How can I do that?
Example, I have:
grunt> A = LOAD 'aa' AS (f1, f2);
grunt> DUMP A;
(aa,bb)
(cc,dd)
I want to do something like:
grunt> B = FOREACH A GENERATE f1, "filename-2011-02-03";
Thanks.
On Thu, Feb 3, 2011 at 7:49 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> In pig 6, you can hook into bindTo() and save the file name.
>
> In pig 8 you have to find your way to the underlying InputSplit via
> PigSplit.getWrappedSplit(), cast it as FileSplit, and call getPath()
> on it.. I think. Haven't done this.
>
> This will totally break if you have splitCombination turned on, of
> course, as pig can silently move to a different file under you, so
> you'd have to turn that off.
>
> D
Re: Use Filename in Tuple
Posted by Kim Vogt <ki...@simplegeo.com>.
Thanks Dmitriy!
I'm using Pig 0.8 and, as far as I know, no splitCombination. I accept this challenge and will keep you pig'ites updated.
-Kim
On Feb 3, 2011, at 7:49 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> In pig 6, you can hook into bindTo() and save the file name.
>
> In pig 8 you have to find your way to the underlying InputSplit via
> PigSplit.getWrappedSplit(), cast it as FileSplit, and call getPath()
> on it.. I think. Haven't done this.
>
> This will totally break if you have splitCombination turned on, of
> course, as pig can silently move to a different file under you, so
> you'd have to turn that off.
>
> D
Re: Use Filename in Tuple
Posted by Dmitriy Ryaboy <dv...@gmail.com>.
In Pig 0.6, you can hook into bindTo() and save the file name.
In Pig 0.8, you have to find your way to the underlying InputSplit via
PigSplit.getWrappedSplit(), cast it to FileSplit, and call getPath()
on it... I think. I haven't done this.
This will totally break if you have splitCombination turned on, of
course, as Pig can silently move to a different file under you, so
you'd have to turn that off.
D
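For reference, split combination is controlled by the pig.splitCombination property (on by default, as far as I know). Below is a minimal sketch of turning it off. Whether SET accepts arbitrary properties depends on the Pig version, so treat the in-script form as an assumption and fall back to the command-line form if it is rejected. The loader class name is hypothetical.

```pig
-- Alternative, passed as a Java system property on the command line:
--   pig -Dpig.splitCombination=false myscript.pig
SET pig.splitCombination 'false';

-- With combination off, each split should come from a single file, which
-- is what a path-capturing loader relies on.
A = LOAD 'logs' USING myudfs.FilenameStorage(',') AS (f1, f2, filepath);
```

The trade-off is more map tasks when the input consists of many small files, since each file then gets at least one split of its own.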
On Thu, Feb 3, 2011 at 3:52 PM, Kim Vogt <ki...@simplegeo.com> wrote:
> Hey,
>
> I have a bunch of files where the filename is significant. I'm loading the
> files by supplying the top level directory that contains the files. Is
> there a way to capture the filename of the file and append to the tuple of
> data that's in that file?
>
> -Kim
>