Posted to user@pig.apache.org by Latha <us...@gmail.com> on 2008/10/05 19:35:24 UTC

How to access filenames after loading a directory to an Alias [pig scripting]

Greetings!
Hi, when I load a directory (from HDFS) into an alias and try to dump it, I
find the lines of the various files in that directory appearing one after
another.
However, I am not able to figure out how to access the filenames from the
alias. I tried studying script1-hadoop.pig, but still could not work it out.

A = load 'inputDir' using PigStorage();
dump A;
Output:
------------------------------------------------
( line1 from inputDir/insideDir/file1.txt)
( line 2 from inputDir/insideDir/file1.txt)
.
(line 1 from inputDir/insideDir/innermost/fileone.txt)
...
etc.,
------------------------------------------------

I am interested in file-wise results, where I can retain the filename and
get the results grouped by file:

filename1
( line1 )
( line2 )

filename2
(line 1)
(line 2)
etc.,

Is there any way I can access filenames from an alias into which a directory
is loaded? The requirement is to iterate through all the files and, in each
file, process every line. Please point me to the right approach.

Regards,
Srilatha

RE: How to access filenames after loading a directory to an Alias [pig scripting]

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
That is true. Pig currently does not support that.

Olga 

> -----Original Message-----
> From: Latha [mailto:uslatha@gmail.com]
> Sent: Monday, October 06, 2008 11:52 AM
> To: pig-user@incubator.apache.org
> Subject: Re: How to access filenames after loading a
> directory to an Alias [pig scripting]
>
> How can I achieve loading individual files from a directory
> structure at grunt shell?
> [...]
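Since Pig, as of this thread, drops the filename at LOAD time, one workaround is to tag every line with its source path before the data reaches Pig at all, so a plain PigStorage() load keeps the filename as an ordinary field. A minimal Python sketch of that preprocessing step (the "inputDir" name and the tab-separated layout are assumptions for illustration, not something from the thread):

```python
import os

def tag_lines_with_filename(input_dir):
    """Yield (relative_path, line) pairs for every line of every
    file under input_dir, walking subdirectories recursively."""
    for root, _dirs, files in os.walk(input_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            rel = os.path.relpath(path, input_dir)
            with open(path) as handle:
                for line in handle:
                    # Strip the newline; the caller decides the
                    # output layout (e.g. tab-separated for Pig).
                    yield rel, line.rstrip("\n")

if __name__ == "__main__" and os.path.isdir("inputDir"):
    # Tab-separated so PigStorage() can split the filename into
    # its own field after loading the preprocessed output.
    for rel, line in tag_lines_with_filename("inputDir"):
        print(rel + "\t" + line)
```

The tagged output would then be copied back into HDFS and loaded with PigStorage(), after which the first field of every record is the originating filename.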

Re: How to access filenames after loading a directory to an Alias [pig scripting]

Posted by Latha <us...@gmail.com>.
Hi Olga,

How can I load individual files from a directory structure at the
grunt shell?

"bin/hadoop dfs -lsr" lists all the files in HDFS irrespective of the
depth of the file in the directory tree.
[it also lists directories :(   ]

However, the Pig grunt shell supports the "ls" command but not "lsr", so
it is not possible to get all the filenames there; it lists only the
top-level directories and files in HDFS.
Please correct me if I am wrong.

Rgds,
Srilatha


On Mon, Oct 6, 2008 at 9:11 PM, Olga Natkovich <ol...@yahoo-inc.com> wrote:

> Metadata like filename is not preserved when the data is loaded. You can
> load individual files and then use union command but that will run
> slower because of extra processing steps.
>
> Olga
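Pending recursive listing in grunt, the "-lsr" output can be filtered outside Pig to recover just the plain files. A minimal Python sketch; the column layout assumed here (Unix-style permission string first, path last, directories marked with a leading "d") matches Hadoop releases of that era and may need adjusting:

```python
def files_from_lsr(lsr_output):
    """Extract plain-file paths from `hadoop dfs -lsr` output,
    keeping only entries whose permission string marks a regular
    file (leading '-') and skipping directories and any other
    status lines the shell prints."""
    paths = []
    for line in lsr_output.splitlines():
        fields = line.split()
        # A file entry starts with a permission string like
        # "-rw-r--r--"; directories start with "d".
        if fields and fields[0].startswith("-"):
            paths.append(fields[-1])  # path is the last column
    return paths
```

The resulting list of paths could then feed the load-each-file-and-union approach suggested earlier in the thread.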

RE: How to access filenames after loading a directory to an Alias [pig scripting]

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
Metadata such as the filename is not preserved when the data is loaded.
You can load individual files and then use the union command, but that
will run slower because of the extra processing steps.

Olga 

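The load-individual-files-then-union suggestion above can be mechanized by generating the Pig script from a list of file paths. A hedged Python sketch; the alias names and the "fname" field are made up for illustration, and whether a constant expression inside GENERATE is accepted depends on the Pig release:

```python
def union_script(paths):
    """Build a Pig Latin script that loads each file into its own
    alias, tags every record with its filename as a constant first
    field, and unions the tagged relations."""
    assert paths, "need at least one file path"
    lines = []
    tagged_aliases = []
    for i, path in enumerate(paths):
        alias = "f%d" % i
        tagged = "t%d" % i
        lines.append("%s = LOAD '%s' USING PigStorage();" % (alias, path))
        # Prepend the filename as a constant so it survives the union.
        lines.append("%s = FOREACH %s GENERATE '%s' AS fname, *;"
                     % (tagged, alias, path))
        tagged_aliases.append(tagged)
    if len(tagged_aliases) > 1:
        lines.append("all_files = UNION %s;" % ", ".join(tagged_aliases))
    else:
        lines.append("all_files = %s;" % tagged_aliases[0])
    lines.append("DUMP all_files;")
    return "\n".join(lines)
```

Grouping all_files by fname afterwards would then give the file-wise results the original question asked for.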