You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2008/01/10 23:28:33 UTC

[jira] Created: (PIG-58) parameterized Pig scripts

parameterized Pig scripts
-------------------------

                 Key: PIG-58
                 URL: https://issues.apache.org/jira/browse/PIG-58
             Project: Pig
          Issue Type: New Feature
            Reporter: Olga Natkovich


This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.

Example:
=======
Pig script myscript.pig:

A = load '/data/mydata/%date%';
B = filter A by $0>'5';
.....

Pig command line:

pig -param date='20080110' myscript.pig

Proposed interface and implementation:

Interface:
=======
(0) Substitution will be only supported with pig script files.
(1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
(2) Default values for the parameters can be specified within the script via decare statement:
decare <param>=<value>
(3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.

Implementation:
============

Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:

- create a new file in temp location
- build a hash of parameters from command line and declare statement
- for each line in the original script
  if this is a declare line, skip it
  else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
- pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-58:
------------------------------

    Status: Patch Available  (was: Open)

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566376#action_12566376 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

Alan,

Your comments just make me think again that C-style preprocessor does not really make sense for Pig. So I think in my last proposal, I will change from #define to declare but with extensions that I proposed for #define.

I like to prepend $ to the var names to make it more readable and easier parsable. Also, if pig ever has full support for variables, this would be useful as well.

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-58:
------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

patch committed. Thanks Pi and Alan for reviews

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2, PIG-58_v3.patch
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587391#action_12587391 ] 

Alan Gates commented on PIG-58:
-------------------------------

The code as checked in included an import that was java 1.6 specific.  The import was spurious, so I've checked in a fix that resolved it (revision 646546).  But if you see the following error:

  [javac] Compiling 229 source files to /home/gates/src/pig/PIG-111/trunk/build/classes
    [javac] /home/gates/src/pig/PIG-111/trunk/src/org/apache/pig/tools/parameters/ParameterSubstitutionPreprocessor.java:31: cannot find symbol
    [javac] symbol  : class Console
    [javac] location: package java.io
    [javac] import java.io.Console;
    [javac]                ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 1 error

The solution is to either update to the listed revision or remove the import for java.io.Console from src/org/apache/pig/tools/parameters/ParameterSubstitutionPreprocessor.jav

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2, PIG-58_v3.patch
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586876#action_12586876 ] 

Alan Gates commented on PIG-58:
-------------------------------

Comments:

1) Dryrun isn't a good name for the command line option.  It's far to generic.
Knowing nothing about parameter substitution and looking at the usage
statement of pig I'd assume that dryrun meant that pig was going to parse my
query but not run it.  I would suggest a name like preproc or preproconly.

2) Why did we choose to put all of the data for the unit tests in separate
files in a test/data directory?  To date the approach has been to have the
tests themselves generate the data they need on the fly.  There are pros and
cons to switching, but I think we should discuss and have a policy of how unit
tests handle their data before we start adding a directory with a lot of files
in it.

3) A number of the public functions do not have javadoc comments.

4) In general more comments throughout the code on what it is doing would be
helpful.  For example, in UtilFunctions.substitute, it is totally non-obvious
what the line "replaced_line = replaced_line.replaceAll("\\\\\\$","\\$");

5) PigFileParser.jj doesn't skip over commented lines in the pig code.  It should ignore anything on a line after --




> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565977#action_12565977 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

This is in response to comment https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565959#action_12565959

>> 1. The parameter usage notation seems to arcane
Hopefully, the new proposal to use #define to declare all variables addresses this issue

>> 2. "-param" key seems redundant and will cause confusion
I usually prefer explicit rather than implicit specifications. Also, the -param approach separates pigs command line namespace from user parameter one. But I could go either way on this.

>> 3.  We need an ability to specify "DFS working directory" for the whole Pig job.
There is already way in pig to do that by issuing CD command

>> 4. Need to specify the precedence rules
I thought I already specified this in the proposal. Please, let me know which parts are unclear.

>> 5. Can parameters be used in RHS of declare?

My plan was to avoid it for the initial release to keep things simple since most of the use cases I saw did not need that. However, I don't think it would be difficult to implement so we can put that in.








> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586909#action_12586909 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

In reply to last comments from Pi:

>>I am not sure but seems like all parser we have do that.I think others throw ParseException. 

I checked - they throw both.

>>Here is comment on static structure:- Having getHash() and setHash() in both PigFileParser and ParamLoader pointing to UtilFunctions.param_val is a kind of weird. If you try to substitute vars multiple times in the same JVM, this can lead to side effects (I don't know if such scenarios exist).  I think for the sake of clarity and reusability, UtilFunctions.param_val should be separated as non-static member in a new ParamSubstituteContext class. PigFileParser and ParamLoader then accept the context as constructor parameter.

Ok, I will look into making UtilFunctions a class without static and just pass it around. 


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565975#action_12565975 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

This is in reply to https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565956#action_12565956

>> Given the use of a distinct preprocessor, are there any existing ones that make sense?

I found this one: http://antenna.sourceforge.net/wtkpreprocess.php which basically implements C-style preprocessor for JAVA but I am not sure that we want/need to go this far. See my comments on CPP preprocessor.

>> Does the pig language support literals with whitespace, carriage- returns and so on? If so, how does one define a parameter in a file with such characters in them?

I believe that pig allows string constants to contain arbitrary text including white spaces and carriage-returns. To put it in the parameter file you would enclose the value in quotes and we can just allow literals to span multiple lines. I can make clarification in the document.

>> How does this approach compare to solutions used in SQL?

As far as I could tell this is not part of standard SQL. In MySQL, user can define variables using set statement and then use them in their SQL  statement: http://dev.mysql.com/doc/refman/5.0/en/user-variables.html. 


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Benjamin Reed (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557828#action_12557828 ] 

Benjamin Reed commented on PIG-58:
----------------------------------

We should also allow positional parameters since nothing following the script:

pig -param date='aoeuaoeu' myscript.pig firstPos secondPos

We also don't need to generate a temp file. We can do the substitution on the fly.


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565971#action_12565971 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

This is in response to https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565958#action_12565958
=============================================================================================

This is response to Alan's comment: https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565958#action_12565958
=========================================================================================================

I think it is an interesting idea and a reasonable approach but I have a few concerns:

(1) I don't think that C-style preprocessor (CPP) is necessarily that well known and understood among our users and developers
(2) CPP is complex to use and implement and might be too heavy weight for what we are trying to do. Briefly looked at the CPP code and it is very involved. Translating or writing on our own chunks of it would be a fairly large project.
(3) For C, CPP is used to influence how code is compiled. For pig we are trying to influence the run time behavior of the pig program and there are other ways to do it.  One way to do it is to embed Pig in languages such as Perl, Python or C/C++ which would take care of code inclusion, conditional execution and more. Similarly, we might decide to have this things in pig language but I don't think they belong in our preprocessor. (What's the difference between "if" and "#if" would be in pig.) So my approach is (a) simple things like parameter substitution can be done in preprocessor. (b) more complex things happen in the language itself or in the language in which Pig is embedded.
(4) CPP does not provide support for command execution which users asked for and just forcing them to run it from command line has limitation in terms of parameterizing command line and also harvesting return codes and error messages.

There are a couple of things I like from this proposal and would like to use:

(1) use #define rather than declare
(2) extend #define to also declare commands
(3) We can later further expand #define to include more things as we need them.

This way only variable names would be used outside of define which is nice since if pig later support variables such as for scalars they would have consistent representation.

So my examples from the document would now look as follows:

(1)
 A = load '/data/mydata/$date';

(2) 
#define CMD `generate_date`
A = load '/data/mydata/$CMD';

(3)
#define CMD `generate_name $date`
A = load '/data/mydata/$CMD';

(4)
#define CMD `$cmd $date`;
A = load '/data/mydata/$CMD';

I think this also addresses some of the concerns from 

https://issues.apache.org/jira/browse/PIG-58?focusedCommentId=12565959#action_12565959

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586465#action_12586465 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

Pi, thanks for your prompt feedback. Couple of follow up questions/comments.

>> 1. You assume no escaping in shell command, right?

Yes, I assume that the command is formed such that it can run as is. At this point it does not allow backticks inside of the command. If this turned out to be needed we can change that. The only escaping that is done is for $identifier to prevent the substitution. This is done by the substitution code itself.

>> 2. The name "UtilFunctions" implies it does not hold state (even global state). From the way it is used, we should have a better name or refactor is needed.

This is how the code was submitted to me and renaming things is not one of things I like to do :).  Since it is not an interface point, I don't really want to change names.

>> 3. PigFileParser.unquote still doesn't do escaping.

There are two types of escaping supported for the lines: for quotes and for $identifier. Since the preprocessor does not interpret any other data, I don't think anything else is needed. 

>> 4. <DEFALT> token in PigFileParser misspelled

Will fix that :).

>> 5. Why ParamLoader.Parse() throw IOException?

I am not sure but seems like all parser we have do that.

>> 6. In UtilFunctions.substitute, what does "replaced_line = replaced_line.replaceAll("\\\\\\$","\\$");" do?

The indent is to allow escaping $identifier. I am not exactly sure why it requires so many escapes. I can try and see if there is another more intuitive way to do that.

>> 7. Shouldn't logger be declared "private final Logger logger = Logger.getLogger("org.apache.pig.preprocessor.log");" (everything in one line) to make it consistent?

Will fix that

>> 1. I prefer HashMap to HashTable

Any particular reason?

2. Pattern identifier in UtilFunctions.substitute can be made static, this makes it 1 microsecond faster :D

Will fix that :).

I also realized that I we don't have to require that single word parameters have to be quoted. I will make that change in the parser, make the updates that you suggested and submit new patch



> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565959#action_12565959 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

Comments from a user:

1.  The parameter usage notation seems to arcane.
      Why both %% and $?   Why both %% and ``?
      It looks like you need %..% only to be able to put a command inside them -- they (%..%) do not seem necessary for variable substitutions.
      Similarly, `..` with this meaning can only be right inside %...%
      So why don't you simplify it to the following:
Example 1: parameter name specification
A = load '/data/mydata/$date';
Example 2: command specification
A = load '/data/mydata/%generate_date%';
Example 3: command taking a parameter
A = load '/data/mydata/%generate_name $date%';
Example 4: command is passed as a parameter:
A = load '/data/mydata/%$cmd $date%';


I would not want to write
B = filter A by $0>'%$N%';
(assuming I have N=5 in the command line...)
I'd prefer
B = filter A by $0>$N;


2. (repeated from another message)
"-param" key seems redundant and will cause confusion.
     Why not assume that any command line argument that has = is a parameter specification?  Indeed, we can get rid of -option syntax at all.
     Another way to think about this:  "X=Y" and "-X Y" are two ways to specify parameters.  Use one.


3. We need an ability to specify "DFS working directory" for the whole Pig job.
   This is the directory where all the relative Record Set path names (in "load" and "store") are rooted.
   By default, it is the users "home DFS directory".
   It is very convenient to be able to specify it when a script is run.


4. Need to specify the precedence rules: if for the same parameter name 3 values are specified -- in on command line, in "declare", and in parameters file, which one wins?  (probable, the order is command line, parameters file, declare ?)


5.  Can parameters be used in RHS of declare?  The documents implicitly says "NO",  but is a very convenient feature.   (I could give examples, but you can come up with plenty on your own.)


6. Typo:
The fault parameter values can be specified


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-58:
------------------------------

    Attachment: PIG-58_v3.patch

I made all the changes requested in the feedback by Alan and Pi. Please, take another look. Unless I hear any objections, I will commit the patch at noon PST tomorrow.

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2, PIG-58_v3.patch
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586897#action_12586897 ] 

Alan Gates commented on PIG-58:
-------------------------------

[Alan before] 1) Dryrun isn't a good name for the command line option. It's far to generic. Knowing nothing about parameter substitution and looking at the usage statement of pig I'd assume that dryrun meant that pig was going to parse my query but not run it. I would suggest a name like preproc or preproconly

[Olga] I was hoping that we would extend dryrun over time to include things that you were talking about. I did not mean it to stay parameter substitution specific. What do you think?

Seems fine.  We should put in a comment articulating that in the command line option.

[Alan before] 2) Why did we choose to put all of the data for the unit tests in separate files in a test/data directory? To date the approach has been to have the
tests themselves generate the data they need on the fly. There are pros and cons to switching, but I think we should discuss and have a policy of how unit tests handle their data before we start adding a directory with a lot of files in it.

[Olga] This is how the test cases were created when they were given to me. SInce I did not have strong oppinion and did not have time to redo them I left them as is. I am not really sure that we need a policy on this. What are your concerns with different tests doing what makes most sense for the kind of things they are testing? Let me know if you think it is ok to leave it as is or if you think that it needs to be reworked before committing the changes.

I guess I like having one way to do things instead of 3 (does that mean I'm more of python programmer than perl programmer?).  But it isn't worth delaying getting this in for.

[Alan before] 5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after -

[Olga] The logic is to not change any lines including comments that don't have parameters to be substituted. Grun parser will skip the comments.

I was referring to that fact that it will do substitutions in comments.  If we had comments that could be terminated, this could lead to strange behavior (consider C
style /* comment */ comments and then defining a parameter that contained */).  But our comments are currently to line end only (--) so it's should be fine.



> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565958#action_12565958 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

comments from Alan Gates:  
======================

What if instead we just built a C style preprocessor exactly, with #define, #include, #ifdef, #ifndef, #if, #else, and #endif?

This would meet requirements 1, 2, 3, 5, and 6.  For 4, people would have to generate the values with a script or binary themselves before calling the command line, but this should be trivial:  pig -Dbla=`xxx`

This has the advantage that it is well understood by all, implementers and users.  No one has to develop a new set or semantics and we don't have to deal with user questions of how it works.

Alan.


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586879#action_12586879 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

Alan, thanks for your feedback. My comments below:

>>> 1) Dryrun isn't a good name for the command line option. It's far to generic. Knowing nothing about parameter substitution and looking at the usage statement of pig I'd assume that dryrun meant that pig was going to parse my query but not run it. I would suggest a name like preproc or preproconly

I was hoping that we would extend dryrun over time to include things that you were talking about. I did not mean it to stay parameter substitution specific. What do you think?

>>> 2) Why did we choose to put all of the data for the unit tests in separate files in a test/data directory? To date the approach has been to have the
tests themselves generate the data they need on the fly. There are pros and cons to switching, but I think we should discuss and have a policy of how unit tests handle their data before we start adding a directory with a lot of files in it.

This is how the test cases were created when they were given to me. SInce I did not have strong oppinion and did not have time to redo them I left them as is. I am not really sure that we need a policy on this. What are your concerns with different tests doing what makes most sense for the kind of things they are testing? Let me know if you think it is ok to leave it as is or if you think that it needs to be reworked before committing the changes.

>>> 3) A number of the public functions do not have javadoc comments.

Will fix that

>>>  4) In general more comments throughout the code on what it is doing would be helpful. For example, in UtilFunctions.substitute, it is totally non-obvious what the line "replaced_line = replaced_line.replaceAll("\\\\\\$","$");

Will update that

>>> 5) PigFileParser.jj doesn't skip over commented lines in the pig code. It should ignore anything on a line after -

The logic is to not change any lines including comments that don't have parameters to be substituted. Grun parser will skip the comments.









> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586403#action_12586403 ] 

Pi Song commented on PIG-58:
----------------------------

A quick review:-

1. You assume no escaping in shell command, right?

2. The name "UtilFunctions" implies it does not hold state (even global state). From the way it is used, we should have a better name or refactor is needed.

3. PigFileParser.unquote still doesn't do escaping.

4. <DEFALT> token in PigFileParser misspelled

5. Why ParamLoader.Parse() throw IOException?

6. In UtilFunctions.substitute, what does "replaced_line = replaced_line.replaceAll("\\\\\\$","\\$");" do?

7. Shouldn't logger be declared "private final Logger logger = Logger.getLogger("org.apache.pig.preprocessor.log");" (everything in one line) to make it consistent?

Trivial things:-
1. I prefer HashMap to HashTable
2. Pattern identifier in UtilFunctions.substitute can be made static, this makes it 1 microsecond faster :D



> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561770#action_12561770 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

We have 2 additional requirements:

(1) Specify the substitutions via input file. The input file will have key/value pairs.
(2) Substitutions, can be strings or an executable that generates the string. Eg 20080122 or a script that generates the date.


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-58:
------------------------------

    Attachment: PIG-58_v2

Updated patch:

(1) Changed name for default token. It conflicted with reserved word in javacc. Now it is called PIGDEFAULT
(2) Made log declaration consistent with other modules
(3) Made identifier pattern static
(4) Changed grammar to no longer require quotes around single word values


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-58:
------------------------------

    Attachment: PIG-58_v1.patch

The patch contains the initial implementation provided by a tean at Yahoo. I have made some modification and also integrated it into pig.

The code includes an extenssive set of unit tests which all pass. Also, all existing unit and end-to-end tests are passing. I ran several end-to-end tests and they also pass. More end-to-end testing is on the way.

Also, update the specification: http://wiki.apache.org/pig/ParameterSubstitution

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586776#action_12586776 ] 

Pi Song commented on PIG-58:
----------------------------

>>>> 5. Why ParamLoader.Parse() throw IOException?
>>I am not sure but seems like all parser we have do that.
I think others throw ParseException. 
>>>> 1. I prefer HashMap to HashTable
Hashtable just gives me a feeling of old-fashioned programmers :)


Here is comment on static structure:-

Having getHash() and setHash() in both PigFileParser and ParamLoader pointing to UtilFunctions.param_val is a kind of weird. If you try to substitute vars multiple times in the same JVM, this can lead to side effects (I don't know if such scenarios exist). 
I think for the sake of clarity and reusability, UtilFunctions.param_val should be separated as non-static member in a new ParamSubstituteContext class. PigFileParser and ParamLoader then accept the context as constructor parameter.


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>         Attachments: PIG-58_v1.patch, PIG-58_v2
>
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565954#action_12565954 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

Moved the proposal into a separate document:

http://wiki.apache.org/pig/ParameterSubstitution

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566215#action_12566215 ] 

Alan Gates commented on PIG-58:
-------------------------------

A couple of comments on using #define

1) Why do we need $ in our variables after #define?  In you example you say:

#define CMD `generate_date`

a = load '/data/$CMD';

It seems the second line should be:

a = load '/data/CMD';

People used to using #define in Cpp won't be thinking they need a variable designator such as $ in there.

2) How will default values be done? I think the proposal was to have values from the command line override any #defines in the script.  If we're going to follow the syntax of Cpp we better follow it's semantics as well, else we'll be violating the law of least astonishment.  In Cpp #defines are not overridden by values supplied at compile time with -D.  Instead, it's done as:

#ifndef X

#define X

#endif

That way if X is provided on the command line via -D it remains, otherwise a default value is given.  We could implement #ifndef without bogging down in a full #ifdef implementation.

> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-58) parameterized Pig scripts

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-58?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565956#action_12565956 ] 

Olga Natkovich commented on PIG-58:
-----------------------------------

comments from erik14:
===================

This approach seems fairly brittle.  For example whitespace in a parameter can have unexpected results.

Given the use of a distinct preprocessor, are there any existing ones that make sense?

Does the pig language support literals with whitespace, carriage- returns and so on?  If so, how does one define a parameter in a file with such characters in them?

How does this approach compare to solutions used in SQL?

In Hadoop, the discussion would be in a jira.  Does moving the discussion to a jira make sense?


> parameterized Pig scripts
> -------------------------
>
>                 Key: PIG-58
>                 URL: https://issues.apache.org/jira/browse/PIG-58
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Olga Natkovich
>
> This feature has been requested by several users and would be very useful in conjunction with streaming. The feature would allow pig script to include parameters that are replaced at run time. For instance, if your script needs to run on a daily basis over the data of the previous day, you would be able to use the script and providing a date as a run-time parameter to it.
> Example:
> =======
> Pig script myscript.pig:
> A = load '/data/mydata/%date%';
> B = filter A by $0>'5';
> .....
> Pig command line:
> pig -param date='20080110' myscript.pig
> Proposed interface and implementation:
> Interface:
> =======
> (0) Substitution will be only supported with pig script files.
> (1) Parameters are specified on the command line via -param <param>=<val> construct. Multiple parameters can be specified. They are applied to the script in the order they are specified on the command line
> (2) Default values for the parameters can be specified within the script via decare statement:
> decare <param>=<value>
> (3) Withint the script the parameter will be enclosed in %%. \% can be used te escape.
> Implementation:
> ============
> Use preprocessor to do the substitution. The preprocessor would be invoced by Main before grunt is instanciated and do the following:
> - create a new file in temp location
> - build a hash of parameters from command line and declare statement
> - for each line in the original script
>   if this is a declare line, skip it
>   else for each unescaped pattern %<identifie>% look for a match in the hash. Replace, if found.  Write the line to the temp file.
> - pass the temp file to grunt.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.