Posted to user@pig.apache.org by "Katukuri, Jay" <jk...@ebay.com> on 2010/04/27 04:41:39 UTC

chaining pig scripts

Hello,
I have two Pig scripts and a Java program that need to be chained in the following order: Pig-Script1 --> Java Program --> Pig-Script2

That is, the output of Pig-Script1 in HDFS is the input to the Java program. The Java program (not a MapReduce job) does some analysis and writes its results to HDFS.
The output generated by the Java program is then the input to Pig-Script2.

What are the different ways I can chain them in the order Pig-Script1 --> Java Program --> Pig-Script2?

One approach I can think of is the following:

Before executing, the Java program checks whether the file written by Pig-Script1 exists in HDFS; if the file is not found, it blocks.
Pig-Script2 likewise checks whether the file written by the Java program exists in HDFS; if the file is not found, it blocks.

Obviously this is not a correct approach, because a job (in my case Pig-Script1) may still be in the process of writing the file, and the next job (the Java program) could start reading it before Pig-Script1 completes.

Can anyone please suggest better alternatives?


Thanks,
Jay


Re: chaining pig scripts

Posted by Thejas Nair <te...@yahoo-inc.com>.


On 4/27/10 5:51 PM, "Katukuri, Jay" <jk...@ebay.com> wrote:

> Thanks Alan,
> Looking at the Javadoc for PigServer, I believe I can use
> "ExecJob.hasCompleted" to find if the Pig latin script finished.
> 
> I ran into the following compilation problem in converting my original pig
> latin script into Embedded pig.
> 

> String guidGroupStats = "guid_group_stats = FOREACH guid_group {a_bag = FILTER
> itemfiltered BY VisitCount > 1;\n"+ "GENERATE group as Guid, COUNT (a_bag) as
> a_sum, SUM (itemfiltered.VisitCount)- COUNT(itemfiltered)  as ax_sum,
> COUNT(itemfiltered) as b_sum;\n}";
> 
> pigServer.registerQuery(guidGroupStats);
> 
> It complains that it Encountered "<EOF>" Was expecting one of:
> "parallel" ...
>     ";" ..


Based on the error message, it looks like adding a semicolon at the end of the
statement should fix it -

String guidGroupStats = "guid_group_stats = FOREACH guid_group {a_bag = FILTER itemfiltered BY VisitCount > 1;\n"
    + "GENERATE group as Guid, COUNT(a_bag) as a_sum, SUM(itemfiltered.VisitCount) - COUNT(itemfiltered) as ax_sum, COUNT(itemfiltered) as b_sum;\n} ;";
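Written out one clause per line for readability (a sketch only; the alias names come from Jay's script, and the key point is the statement-terminating semicolon after the closing brace):

```java
public class GuidGroupQuery {

    // Builds the Pig Latin statement. The semicolon after the closing
    // brace of the nested FOREACH block terminates the statement.
    public static String build() {
        return "guid_group_stats = FOREACH guid_group {\n"
             + "  a_bag = FILTER itemfiltered BY VisitCount > 1;\n"
             + "  GENERATE group AS Guid,\n"
             + "           COUNT(a_bag) AS a_sum,\n"
             + "           SUM(itemfiltered.VisitCount) - COUNT(itemfiltered) AS ax_sum,\n"
             + "           COUNT(itemfiltered) AS b_sum;\n"
             + "};";
    }

    public static void main(String[] args) {
        // In the real program this string would be passed to
        // pigServer.registerQuery(GuidGroupQuery.build());
        System.out.println(build());
    }
}
```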


RE: chaining pig scripts

Posted by "Katukuri, Jay" <jk...@ebay.com>.
Thanks Alan,
Looking at the Javadoc for PigServer, I believe I can use "ExecJob.hasCompleted" to find out whether the Pig Latin script has finished.

I ran into the following compilation problem while converting my original Pig Latin script to embedded Pig.

Embedded Pig complains about the following query, while my standalone Pig Latin script compiles and runs fine:

guid_group_stats = FOREACH guid_group {
                          a_bag = FILTER itemfiltered BY VisitCount > 1;

                          GENERATE group as Guid, COUNT (a_bag) as a_sum, SUM (itemfiltered.VisitCount)- COUNT(itemfiltered)  as ax_sum,  COUNT(itemfiltered) as b_sum;
}

The above query, converted to embedded Pig, looks like this:

String guidGroupStats = "guid_group_stats = FOREACH guid_group {a_bag = FILTER itemfiltered BY VisitCount > 1;\n"+ "GENERATE group as Guid, COUNT (a_bag) as a_sum, SUM (itemfiltered.VisitCount)- COUNT(itemfiltered)  as ax_sum,  COUNT(itemfiltered) as b_sum;\n}";

pigServer.registerQuery(guidGroupStats);

It fails with: Encountered "<EOF>". Was expecting one of:
    "parallel" ...
    ";" ...

Is it not possible to use the nested FOREACH { } construct in embedded Pig?

Any help is greatly appreciated.

P.S.: I am using Pig 0.4.0

Thanks,
Jay


-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com] 
Sent: Tuesday, April 27, 2010 9:10 AM
To: Katukuri, Jay
Cc: pig-user@hadoop.apache.org
Subject: Re: chaining pig scripts

Use the PigServer interface from Java.  This way your Pig Latin and  
Java can be intermixed.  You will be guaranteed that your middle Java  
code will start immediately after PigServer finishes running the first  
Pig Latin script.

Alan.

On Apr 26, 2010, at 7:41 PM, Katukuri, Jay wrote:

> Hello,
> I have two pig scripts and a java program that need to be chained in  
> the following order: Pig-Script1 --> Java Program --> Pig-Script2
>
> That is the output of the Pig-Script1 in HDFS is the input for the  
> java program. The java program ( not a Map-reduce job)  does some  
> analysis and writes to HDFS.
> The output generated by the java program is given as the input to  
> Pig-Script2.
>
> What are the different ways I can chain them in the order
> Pig-Script1 --> Java Program --> Pig-Script2
>
> One approach I can think of is the following:
>
> The java program verifies if the file is written (exists) in HDFS by  
> Pig-Script1 before executing. If the file is not found it blocks.
> Pig-script2 verifies if the file is written (exists) in HDFS by the  
> java program. If the file is not found it blocks.
>
> Obviously this is not a good (correct) approach because a job ( in   
> my case pig-script1)  may be in the process of writing the file and  
> the next job (java program) will start reading it before pig-script1  
> completes.
>
> Can anyone please suggest better alternatives?
>
>
> Thanks,
> Jay
>


Re: chaining pig scripts

Posted by hc busy <hc...@gmail.com>.
Has anybody had success with a 'touch' statement inside a Pig script and a
separate process polling to see when that file appears?

script 1 touches /markers/stage_one_complete
then the java program kicks in
when java completes, script 2 starts
script 2 touches /markers/stage_two_complete
... and so on and so forth
The only problem is that Pig doesn't have a wait-for command.

If *while-loop*, `*sleep*`, and `*test*` were introduced as Pig Latin
commands, or if there were a simple `*wait*` command that waits for a file to
come into existence on HDFS... that would be pretty cool!
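The polling half of that idea can be sketched in plain Java. This version runs against the local filesystem so it is self-contained; against HDFS the loop would call `FileSystem.exists(new Path(marker))` from the Hadoop API instead of `Files.exists`. The marker paths and timeouts are made-up illustration values:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class WaitForMarker {

    // Polls until the marker file exists or the timeout expires.
    // Returns true if the marker appeared, false on timeout.
    public static boolean waitFor(Path marker, long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (Files.exists(marker)) {
                return true;
            }
            Thread.sleep(pollMs);
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for /markers/stage_one_complete on HDFS.
        Path marker = Files.createTempFile("stage_one_complete", null);
        System.out.println(waitFor(marker, 1000, 50)); // marker already exists -> prints true
    }
}
```

A timeout keeps the waiter from blocking forever if the upstream stage dies. Note this only detects existence, not completeness, so the upstream job must touch the marker only after its real output is fully written.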

On Tue, Apr 27, 2010 at 9:10 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

> Use the PigServer interface from Java.  This way your Pig Latin and Java
> can be intermixed.  You will be guaranteed that your middle Java code will
> start immediately after PigServer finishes running the first Pig Latin
> script.
>
> Alan.
>
> On Apr 26, 2010, at 7:41 PM, Katukuri, Jay wrote:
>
>  Hello,
>> I have two pig scripts and a java program that need to be chained in the
>> following order: Pig-Script1 --> Java Program --> Pig-Script2
>>
>> That is the output of the Pig-Script1 in HDFS is the input for the java
>> program. The java program ( not a Map-reduce job)  does some analysis and
>> writes to HDFS.
>> The output generated by the java program is given as the input to
>> Pig-Script2.
>>
>> What are the different ways I can chain them in the order Pig-Script1
>> --> Java Program --> Pig-Script2
>>
>> One approach I can think of is the following:
>>
>> The java program verifies if the file is written (exists) in HDFS by
>> Pig-Script1 before executing. If the file is not found it blocks.
>> Pig-script2 verifies if the file is written (exists) in HDFS by the java
>> program. If the file is not found it blocks.
>>
>> Obviously this is not a good (correct) approach because a job ( in  my
>> case pig-script1)  may be in the process of writing the file and the next
>> job (java program) will start reading it before pig-script1 completes.
>>
>> Can anyone please suggest better alternatives?
>>
>>
>> Thanks,
>> Jay
>>
>>
>

Re: chaining pig scripts

Posted by Alan Gates <ga...@yahoo-inc.com>.
Use the PigServer interface from Java.  This way your Pig Latin and  
Java can be intermixed.  You will be guaranteed that your middle Java  
code will start immediately after PigServer finishes running the first  
Pig Latin script.

Alan.

On Apr 26, 2010, at 7:41 PM, Katukuri, Jay wrote:

> Hello,
> I have two pig scripts and a java program that need to be chained in  
> the following order: Pig-Script1 --> Java Program --> Pig-Script2
>
> That is the output of the Pig-Script1 in HDFS is the input for the  
> java program. The java program ( not a Map-reduce job)  does some  
> analysis and writes to HDFS.
> The output generated by the java program is given as the input to  
> Pig-Script2.
>
> What are the different ways I can chain them in the order
> Pig-Script1 --> Java Program --> Pig-Script2
>
> One approach I can think of is the following:
>
> The java program verifies if the file is written (exists) in HDFS by  
> Pig-Script1 before executing. If the file is not found it blocks.
> Pig-script2 verifies if the file is written (exists) in HDFS by the  
> java program. If the file is not found it blocks.
>
> Obviously this is not a good (correct) approach because a job ( in   
> my case pig-script1)  may be in the process of writing the file and  
> the next job (java program) will start reading it before pig-script1  
> completes.
>
> Can anyone please suggest better alternatives?
>
>
> Thanks,
> Jay
>
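Alan's suggestion, in outline. This is only a sketch against the Pig Java API of that era and cannot run outside a Hadoop/Pig installation; the script names, HDFS paths, and the `MyAnalysis` class are hypothetical placeholders, not anything from the thread:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        PigServer pigServer = new PigServer(ExecType.MAPREDUCE);

        // Stage 1: registering a script executes its STORE statements,
        // so control returns here only after the first job has finished.
        pigServer.registerScript("script1.pig");

        // Stage 2: plain Java analysis over the HDFS output of stage 1.
        // MyAnalysis is a placeholder for Jay's analysis program.
        MyAnalysis.run("/output/of/script1", "/input/for/script2");

        // Stage 3: the second script reads what the Java stage wrote.
        pigServer.registerScript("script2.pig");
    }
}
```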