Posted to user@pig.apache.org by fraz man <fr...@gmail.com> on 2012/09/26 02:39:00 UTC

Help writing first program.

Hi,
   I have a quick question.



I have two types of files.
dirA/  --> file_a , file_b, file_c

dirB/  --> another_file_a, another_file_b...

Files in directory A contain transaction information.

So something like:


       id, time_stamp
       1 , some_time_stamp
       2 , some_another_time_stamp
       1  , another_time_stamp

This kind of information is scattered across all the files in dirA.
The first step: I give a time frame (let's say last week) and I
want to find all the unique ids that appear within that time
frame, and save them to a file.

The files in dirB contain the address information.
Something like:

        id, address, zip code
         1, fooadd, 12345
         and so on

I then take all the unique ids output by the first step as
input and find the address and zip code for each.


Basically, the final output is like a SQL merge (join).
After reading a lot of documentation, I wrote a script,
foo.pig (a plain text file), to merge the two sets of files.
Here is the script:

times = LOAD 'dirA' USING PigStorage(', ') AS (id:int, time:chararray);
addresses = LOAD 'dirB' USING PigStorage(', ') AS (id:int,
address:chararray, zipcode:chararray);
filtered_times = FILTER times BY (time >= $START_TIME) AND (time <= $END_TIME);
just_ids = FOREACH filtered_times GENERATE id;
distinct_ids = DISTINCT just_ids;
result = JOIN distinct_ids BY id, addresses BY id;

But I am now stuck.

How do I save the result to "/foo/bar/results.txt"?

How do I execute this script? When I run

pig foo.pig -param START_TIME="2012-08-2" -param END_TIME= "2012-09-1"

it throws an error that foo.pig is not executable. It is actually a
text file, saved with a .pig extension.
Also, am I passing the params correctly? In the file they are just
strings of the format yyyy-mm-dd.

I am anticipating a lot of errors here, so feel free to blast my code.

Re: Help writing first program.

Posted by TianYi Zhu <ti...@facilitatedigital.com>.

Hi Fraz,

To store your result:
STORE result INTO '/foo/bar/results.txt' using PigStorage(',');


Then change
filtered_times = FILTER times BY (time >= $START_TIME) AND (time <=
$END_TIME);
to
filtered_times = FILTER times BY (time >= '$START_TIME') AND (time <=
'$END_TIME');
so that the substituted parameter values become quoted string literals,
and run your script with
pig -f foo.pig -p START_TIME="2012-08-02" -p END_TIME="2012-09-01"

That may make your script work.

Pig doesn't have a built-in datetime type, so it compares your time column
as a string. Writing a Java/Python user-defined function, or strictly using
the same standard date format ('yyyy-mm-dd', zero-padded) in both your
script and your data, will help you get the correct answer.
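
To illustrate why the date format matters, here is a small sketch. It is written in Python rather than Pig, on the assumption that comparing chararray values orders them character by character, the same way Python compares strings. The function name in_range and the sample dates are made up for this example.

```python
def in_range(value, start, end):
    """Plain string comparison, with no date parsing involved."""
    return start <= value <= end


# Zero-padded dates: string order matches calendar order.
padded = in_range("2012-08-10", "2012-08-02", "2012-09-01")

# Unpadded bounds: "2012-08-10" sorts before "2012-08-2" because the
# character '1' is less than '2', so August 10 is wrongly excluded.
unpadded = in_range("2012-08-10", "2012-08-2", "2012-09-1")

print(padded)    # True
print(unpadded)  # False
```

This is probably why START_TIME="2012-08-2" would give surprising results even once the script runs: the bounds and the data need the same zero-padded layout for a string comparison to behave like a date comparison.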

Thanks,
TianYi
