Posted to user@pig.apache.org by Ricky Ho <rh...@adobe.com> on 2009/05/19 07:36:15 UTC

Q on loading data from a directory

There are a bunch of files in a directory.  My goal is to process these files to compute their TF/IDF.

I am looking for something like the following ...

A = LOAD 'input/dir'  as  (filename, text);
DUMP A;
(file1, Line one of file1)
(file1, Line two of file1)
(file2, Line one of file2)
(file2, Line two of file2)

Note that I want the filename to appear automatically in each record.  Is this doable?

Rgds,
Ricky

Re: Q on loading data from a directory

Posted by Alan Gates <ga...@yahoo-inc.com>.
You would need to write your own loader function to do that.  The
filename is passed in the bindTo call.  Your loader could cache the
filename and then use it to construct the first element of each output
tuple in getNext.  You should be able to use PigStorage as a guide on
how to write the loader.

Alan.
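
The approach described above (cache the filename in bindTo, then prepend
it to every tuple in getNext) might be sketched roughly as follows. This
is a minimal, standalone illustration and not a working Pig loader: the
class name FilenameTaggingLoader is hypothetical, it does not implement
Pig's LoadFunc interface, and the bindTo and getNext signatures are
simplified assumptions. A real loader would follow the signatures used
by PigStorage in your Pig version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a custom loader. It only mimics the
// bindTo/getNext pattern and does NOT implement Pig's LoadFunc.
class FilenameTaggingLoader {
    private String fileName;       // cached in bindTo, reused for every record
    private BufferedReader reader; // line reader over the current file

    // Called once per input file (assumed, simplified signature):
    // remember the filename and open a reader on the stream.
    void bindTo(String fileName, InputStream in) {
        this.fileName = fileName;
        this.reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Returns the next record as (filename, line), or null at end of input.
    List<String> getNext() {
        try {
            String line = reader.readLine();
            if (line == null) {
                return null;
            }
            return Arrays.asList(fileName, line);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a real loader built along these lines, the script would name it in
a USING clause, for example LOAD 'input/dir' USING MyLoader() AS
(filename, text); where MyLoader is again a hypothetical class name.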

On May 18, 2009, at 10:36 PM, Ricky Ho wrote:

> There are a bunch of files in a directory.  My goal is to process  
> these files to compute their TF/IDF.
>
> I am looking for something like the following ...
>
> A = LOAD 'input/dir'  as  (filename, text);
> DUMP A;
> (file1, Line one of file1)
> (file1, Line two of file1)
> (file2, Line one of file2)
> (file2, Line two of file2)
>
> Note that I want the filename to appear automatically in each  
> record.  Is this doable ?
>
> Rgds,
> Ricky