Posted to user@pig.apache.org by Ricky Ho <rh...@adobe.com> on 2009/05/19 07:36:15 UTC
Q on loading data from a directory
There are a bunch of files in a directory. My goal is to process these files to compute their TF/IDF.
I am looking for something like the following ...
A = LOAD 'input/dir' as (filename, text);
DUMP A;
(file1, Line one of file1)
(file1, Line two of file1)
(file2, Line one of file2)
(file2, Line two of file2)
Note that I want the filename to appear automatically in each record. Is this doable?
Rgds,
Ricky
Re: Q on loading data from a directory
Posted by Alan Gates <ga...@yahoo-inc.com>.
You would need to write your own loader function to do that. The
filename is passed in the bindTo call. Your loader could cache the
filename and then use it to construct the first element of each output
tuple in getNext. You should be able to use PigStorage as a guide on
how to write the loader.
Alan.
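A rough, standalone sketch of the pattern described above (cache the filename in bindTo, then prepend it to every record in getNext). This is not written against Pig's actual LoadFunc interface: the class name FilenameTaggingLoader, the simplified bindTo(String, InputStream) signature, and the use of a List<String> in place of a Pig tuple are all assumptions made for illustration. A real loader would implement the interface shipped with your Pig version, using PigStorage as the model.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Standalone sketch only: mirrors the bindTo/getNext pattern but does NOT
// implement Pig's real LoadFunc interface. All names here are hypothetical.
class FilenameTaggingLoader {
    private String fileName;       // cached in bindTo, reused for every record
    private BufferedReader reader; // line reader over the current file

    // Simplified, assumed signature: called once per input file.
    public void bindTo(String fileName, InputStream in) {
        this.fileName = fileName;
        this.reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Returns (filename, line) for the next line, or null at end of input.
    public List<String> getNext() throws IOException {
        String line = reader.readLine();
        if (line == null) {
            return null;
        }
        return Arrays.asList(fileName, line);
    }

    // Small demonstration using an in-memory "file".
    public static void main(String[] args) throws IOException {
        byte[] data = "Line one of file1\nLine two of file1\n"
                .getBytes(StandardCharsets.UTF_8);
        FilenameTaggingLoader loader = new FilenameTaggingLoader();
        loader.bindTo("file1", new ByteArrayInputStream(data));
        for (List<String> t = loader.getNext(); t != null; t = loader.getNext()) {
            System.out.println(t);
        }
    }
}
```

The point of the design is that the filename is only available when the loader is bound to a file, so it has to be stored as a field at that moment for getNext to use it later.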
On May 18, 2009, at 10:36 PM, Ricky Ho wrote:
> There are a bunch of files in a directory. My goal is to process
> these files to compute their TF/IDF.
>
> I am looking for something like the following ...
>
> A = LOAD 'input/dir' as (filename, text);
> DUMP A;
> (file1, Line one of file1)
> (file1, Line two of file1)
> (file2, Line one of file2)
> (file2, Line two of file2)
>
> Note that I want the filename to appear automatically in each
> record. Is this doable ?
>
> Rgds,
> Ricky