Posted to user@pig.apache.org by fraz man <fr...@gmail.com> on 2012/09/26 02:39:00 UTC
Help writing first program.
Hi,
I have a quick question.
I have two types of files.
dirA/ --> file_a , file_b, file_c
dirB/ --> another_file_a, another_file_b...
Files in directory A contain transaction information.
So something like:
id, time_stamp
1 , some_time_stamp
2 , some_another_time_stamp
1 , another_time_stamp
So, this kind of information is scattered across all the files in dirA.
Now the first thing to do is: I give a time frame (let's say last week) and I
want to find all the unique ids which are present within that time
frame.
Then I save them to a file.
Now, dirB files contain the address information.
Something like:
id, address, zip code
1, fooadd, 12345
and so on
So all the unique ids output to the first file.. I take them as
input and then find the address and zip code.
Basically the final output is like the SQL merge.
After following a lot of documentation.. I wrote a script
foo.pig (which is a text file) to basically merge two files..
Here is the script
times = LOAD 'dirA' USING PigStorage(', ') AS (id:int, time:chararray);
addresses = LOAD 'dirB' USING PigStorage(', ') AS (id:int,
address:chararray, zipcode:chararray);
filtered_times = FILTER times BY (time >= $START_TIME) AND (time <= $END_TIME);
just_ids = FOREACH filtered_times GENERATE id;
distinct_ids = DISTINCT just_ids;
result = JOIN distinct_ids BY id, addresses BY id;
But I am now stuck.
How do I save this result to a folder "/foo/bar/results.txt"?
How do I execute this script? When I run pig foo.pig -param
START_TIME="2012-08-2" -param END_TIME= "2012-09-1"
it throws an error that foo.pig is not executable. It is actually a
text file, saved with a .pig extension.
Also, am I passing the params correctly? In the file they are just
strings of the format yyyy-mm-dd.
I am anticipating a lot of errors here.. feel free to blast my code
Re: Help writing first program.
Posted by TianYi Zhu <ti...@facilitatedigital.com>.
Hi Fraz,
To store your result:
STORE result INTO '/foo/bar/results.txt' using PigStorage(',');
Change
filtered_times = FILTER times BY (time >= $START_TIME) AND (time <=
$END_TIME);
to
filtered_times = FILTER times BY (time >= '$START_TIME') AND (time <=
'$END_TIME');
and run your script with
pig -f foo.pig -p START_TIME="2012-08-02" -p END_TIME="2012-09-01"
That may make your script work.
Pig doesn't have a built-in datetime type, so it compares your time column as
a string. Writing a Java/Python user-defined function, or strictly using the
same standard, zero-padded date format ('yyyy-mm-dd') in both your script and
your data, will help you get the correct answer.
Thanks,
TianYi
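The string-comparison point above can be illustrated with a minimal sketch in
plain Python. This is not code from the thread and it is not a registered Pig
UDF; the function names normalize_date and in_range are hypothetical, and the
sketch assumes dates arrive as year-month-day strings that may lack zero
padding. It shows why "2012-08-2" gives a wrong result when compared as a
string, and how normalizing to a zero-padded form avoids that.

```python
from datetime import datetime


def normalize_date(raw):
    """Return a year-month-day string in zero-padded form.

    Hypothetical helper: strptime accepts month and day with or
    without a leading zero, and strftime always writes two digits.
    """
    return datetime.strptime(raw.strip(), "%Y-%m-%d").strftime("%Y-%m-%d")


def in_range(value, start, end):
    """Inclusive range check using plain string comparison,
    which is how a chararray column is compared."""
    return start <= value <= end


# Unpadded bounds: "2012-08-2" sorts after "2012-08-15" as a string,
# so a date that is really inside the range is rejected.
print(in_range("2012-08-15", "2012-08-2", "2012-09-1"))  # False

# Zero-padded bounds: string order now matches calendar order.
start = normalize_date("2012-08-2")
end = normalize_date("2012-09-1")
print(in_range("2012-08-15", start, end))  # True
```

The same normalization would need to be applied to the time column in the data
as well as to the parameters, since both sides of the comparison are strings.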