Posted to dev@pig.apache.org by "Viraj Bhat (JIRA)" <ji...@apache.org> on 2010/01/29 03:29:35 UTC

[jira] Created: (PIG-1211) Pig script runs half way after which it reports syntax error

Pig script runs half way after which it reports syntax error
------------------------------------------------------------

                 Key: PIG-1211
                 URL: https://issues.apache.org/jira/browse/PIG-1211
             Project: Pig
          Issue Type: Improvement
          Components: impl
    Affects Versions: 0.6.0
            Reporter: Viraj Bhat
             Fix For: 0.8.0


I have a Pig script which is structured in the following way:

{code}
register cp.jar

dataset = load '/data/dataset/' using PigStorage('\u0001') as (col1, col2, col3, col4, col5);

filtered_dataset = filter dataset by (col1 == 1);

proj_filtered_dataset = foreach filtered_dataset generate col2, col3;

rmf $output1;

store proj_filtered_dataset into '$output1' using PigStorage();

second_stream = foreach filtered_dataset  generate col2, col4, col5;

group_second_stream = group second_stream by col4;

output2 = foreach group_second_stream {
 a =  second_stream.col2;
 b =   distinct second_stream.col5;
 c = order b by $0;
 generate 1 as key, group as keyword, MYUDF(c, 100) as finalcalc;
}

rmf  $output2;

--syntax error here
store output2 to '$output2' using PigStorage();

{code}

I run this script with the multi-query option. It runs successfully up to the first store but then fails with a syntax error.

The use of the HDFS command "rmf" causes the first store to execute.

The only workarounds I have are running an explain before running the script

grunt> explain -script myscript.pig -out explain.out

or moving the rmf statements to the top of the script
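
The second workaround can be applied mechanically. What follows is only a rough sketch and not part of Pig: the shell function hoist_rmf (a made-up name) prints a script with every line that begins with rmf moved to the top. It assumes each rmf statement sits on its own line and that nothing earlier in the script reads from the directories being removed.

```shell
#!/bin/sh
# hoist_rmf (hypothetical helper, not part of Pig): print a Pig script with
# every line that starts with "rmf" moved to the top. All other lines keep
# their original order. Line-based only: an rmf statement that shares a line
# with other statements is not detected.
hoist_rmf() {
    script=$1
    # First pass: emit the rmf lines in their original order.
    # "|| true" keeps the function successful when no line matches.
    grep -E '^[[:space:]]*rmf[[:space:]]' "$script" || true
    # Second pass: emit everything else, also in original order.
    grep -vE '^[[:space:]]*rmf[[:space:]]' "$script" || true
}
```

For example, "hoist_rmf myscript.pig > myscript.rmf-first.pig" writes the reordered copy to a new file (the output file name is arbitrary). With no rmf left between the store statements, the whole script should be parsed before anything executes.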

Here are some questions:

a) Can we have an option to do something like "checkscript" instead of explain to get the same syntax error? That way I can make sure I do not run for 3-4 hours before encountering a syntax error.
b) Can Pig not figure out a way to re-order the rmf statements, since all the store directories are variables?

Thanks
Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861106#action_12861106 ] 

Viraj Bhat commented on PIG-1211:
---------------------------------

Ashutosh, yes: as more and more people adopt Pig, they expect some type of guarantee, since Pig is designed to help people with no experience in writing M/R programs.

If I am a novice user and I have a small typo, do I wait for 3-4 hours to discover that there is a syntax error? That wastes not only CPU cycles but also the user's productivity.

The problem here is that dump and hadoop shell commands are treated differently in Pig scripts, and multi-query optimizations are ignored.

I have listed what Milind and Dmitry are suggesting. Maybe this is the way a future Pig language will compile to give you a hadoop jar file, in sequence or as a DAG.

Pigcc -L myScript.pig -> parses pig script, generates logical plan, and stores it in myScript.pig.l

Pigcc -P myScript.pig.l -> produces physical plan from the logical plan, and stores it in myScript.pig.p

Pigcc -M myScript.pig.p -> produces map-reduce plan, myScript.pig.m

Pig myScript.pig.m -> interprets the MR plan. This can also be split into multiple sequential MR job plans, myScript.pig.m.{1,2,3..}, so that one way to execute the pig script is to run

Hadoop jar pigRT.jar myScript.pig.m.1
Hadoop jar pigRT.jar myScript.pig.m.2
Hadoop jar pigRT.jar myScript.pig.m.3
Hadoop jar pigRT.jar myScript.pig.m.4
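
If the plans were run this way, the sequence would presumably need to stop at the first failing job. A minimal sketch of such a driver follows. It is purely illustrative: pigRT.jar and the myScript.pig.m.N plan files are the hypothetical artifacts from the proposal above and do not exist in any Pig release, and run_plans / run_one_plan are made-up names.

```shell
#!/bin/sh
# Hypothetical: pigRT.jar and the plan files come from the proposal above;
# they are not real Pig artifacts.
run_one_plan() {
    hadoop jar pigRT.jar "$1"
}

# Run the given plan files in order and stop at the first failure, so that
# later jobs never start on top of incomplete output.
run_plans() {
    for plan in "$@"; do
        echo "running $plan"
        if ! run_one_plan "$plan"; then
            echo "plan $plan failed; stopping" >&2
            return 1
        fi
    done
}
```

Usage would be "run_plans myScript.pig.m.1 myScript.pig.m.2 myScript.pig.m.3 myScript.pig.m.4". Listing the plans explicitly avoids relying on glob ordering, which would sort myScript.pig.m.10 before myScript.pig.m.2.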

Thanks Viraj




[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864120#action_12864120 ] 

Hadoop QA commented on PIG-1211:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12443635/PIG-1211.patch
  against trunk revision 941005.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 release audit.  The applied patch generated 530 release audit warnings (more than the trunk's current 529 warnings).

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/308/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/308/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/308/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/308/console

This message is automatically generated.



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860614#action_12860614 ] 

Ashutosh Chauhan commented on PIG-1211:
---------------------------------------

Oh, I got confused. From your earlier comment, I understood you to be saying that we should add a -checkscript command line option. From your latest comment, are you suggesting that we should add a syntax checker which always runs (i.e., without needing any command line directive) before the query starts to execute, thereby catching as many user errors as possible? I think this is a reasonable ask and will be useful to users. This might be the first step towards making the distinction between Pig compile time and run time explicit to the user. If we go the full length here, we might as well do what Milind suggested earlier (and in a recent mail thread). We can add a "compilation" phase which first runs a syntax checker, then generates "object code" (essentially a job jar) from the pig script. This compiled object can then be handed over to the run-time (the hadoop cluster). Wow, pig-latin is evolving towards a "true language" :)



[jira] Updated: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1211:
--------------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Incompatible change, Reviewed]
    Release Note: -c (-cluster) was earlier documented as the option to provide cluster information - this was not being used in the Pig code though - with PIG-1211, "-c" is being reused as the option to check syntax of the pig script 
      Resolution: Fixed

Patch committed to trunk
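
With this option in place, a launcher can check a script before submitting it. The following is a minimal sketch under stated assumptions: check_then_run is a made-up name, it assumes "pig" is on the PATH, and it assumes pig exits with a non-zero status when the -c check fails.

```shell
#!/bin/sh
# check_then_run (hypothetical wrapper): syntax-check a Pig script with -c and
# submit it only if the check passes. Extra arguments (for example
# -param name=value) are passed to both invocations so that parameter
# substitution sees the same values in the check and in the real run.
# Assumption: pig returns a non-zero exit status when the check fails.
check_then_run() {
    script=$1
    shift
    if ! pig "$@" -c "$script"; then
        echo "syntax check failed for $script; not submitting" >&2
        return 1
    fi
    pig "$@" "$script"
}
```

For the script in this issue, that could look like "check_then_run myscript.pig -param output1=/tmp/out1 -param output2=/tmp/out2" (the paths are placeholders).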



[jira] Updated: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1211:
--------------------------------

    Attachment: PIG-1211.patch

The attached patch addresses the issue by adding support for a check script option. For this purpose, the "-c" command line option is reused, which also fixes https://issues.apache.org/jira/browse/PIG-1382 (Command line option -c doesn't work ...Currently this option is not used...).

The implementation of this check option piggybacks on "explain -script" and just modifies the GruntParser code to suppress the explain output.



[jira] Updated: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pradeep Kamath updated PIG-1211:
--------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Ashutosh Chauhan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12859462#action_12859462 ] 

Ashutosh Chauhan commented on PIG-1211:
---------------------------------------

bq. Can we have an option to do something like "checkscript" instead of explain to get the same syntax error? In this way I can ensure that I do not run for 3-4 hours before encountering a syntax error

Though it's possible to add something like checkscript, it would be syntactic sugar, since it would do exactly what explain does (minus printing the plan at the end). So I am thinking: shall we tell users to run explain to catch syntax errors instead of adding this new command line option? What do others think?



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Thejas M Nair (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864589#action_12864589 ] 

Thejas M Nair commented on PIG-1211:
------------------------------------

+1



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Viraj Bhat (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860419#action_12860419 ] 

Viraj Bhat commented on PIG-1211:
---------------------------------

Ashutosh, I feel that users may not be interested in first running their script through explain to find syntax errors and then running it again to get their results.
They expect Pig to report all errors up front, before submitting an M/R job.

Explain was not designed for checking syntax errors in scripts.

I believe that if you have a dump statement, explain -script will cause the script to run.

Is it not possible for Pig to find out that there is an error in the "store" syntax?

Viraj



[jira] Commented: (PIG-1211) Pig script runs half way after which it reports syntax error

Posted by "Pradeep Kamath (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12864386#action_12864386 ] 

Pradeep Kamath commented on PIG-1211:
-------------------------------------

Core unit tests pass on my local machine; the errors reported above seem to be related to the environment. The release audit warning is due to an HTML file change and can be ignored. The patch is ready for review.
