Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/04/06 20:07:35 UTC

[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

------------------------------------------------------------------------------
  
  == Motivation ==
  
- This document describes a proposal for implementing parameter substitution in pig. This proposal is motivated by multiple requests from users who would like to create a template pig script and then use it with different parameters on a regular basis. For instance, if you have daily processing that is identical every day except the date it needs to process, it would be very convenient to put a placeholder for the date and provide the actual value at run time.
+ This document describes a proposal for implementing parameter substitution in pig. This proposal is motivated by multiple requests from users who would like to create a  
+ template pig script and then use it with different parameters on a regular basis. For instance, if you have daily processing that is identical every day except the date it  
+ needs to process, it would be very convenient to put a placeholder for the date and provide the actual value at run time.
  
  == Requirements ==
  
@@ -17, +19 @@

  
  == Interface ==
  
- === Parameter Specification ===
+ === Using Parameters ===
  
- Parameters in a pig script will be of the form `$<identifier>`. 
+ Parameters in a pig script are in the form of `$<identifier>`. 
  
  {{{
  A = load '/data/mydata/$date';
@@ -27, +29 @@

  .....
  }}}
  
- For this example, pig would expect `date` to be passed from pig command line or from a parameter file. The value would be substituted prior to running the load statement.
+ In this example, the value of the `date` parameter is expected to be passed on each invocation of the script and is substituted in before running the pig script. An error  
+ is generated if the value for any parameter is not found.
  
- In addition to supplying parameter value, a user can supply a command to execute to generate a parameter value. This can be done using `declare` statement. 
+ A parameter name has the structure of a standard language identifier: it must start with a letter or underscore, followed by any number of letters, digits, and underscores. The
+ names are case insensitive. A name can be escaped with `\`, in which case substitution does not take place.
+ 
+ In the initial version of the software, parameters are only allowed when a pig script is specified. They are disabled with the `-e` switch and in interactive mode.
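
The naming and escaping rules above can be illustrated with a minimal Python sketch. It is not part of the proposal: the `substitute` helper and `PARAM_RE` pattern are hypothetical, and dropping the backslash from an escaped name is an assumption, since the text only says that substitution does not take place. Because a name cannot start with a digit, positional references such as `$0` are left alone.

```python
import re

# Hypothetical pattern: an optional backslash, then $name, where name starts
# with a letter or underscore and continues with letters, digits, underscores.
PARAM_RE = re.compile(r"(\\?)\$([A-Za-z_][A-Za-z0-9_]*)")


def substitute(line, params):
    """Replace $name references in line; params is keyed by lower-case name."""
    def replace(match):
        escaped, name = match.group(1), match.group(2)
        if escaped:
            # Assumption: \$name is emitted as the literal text $name.
            return "$" + name
        key = name.lower()  # parameter names are case insensitive
        if key not in params:
            # A missing value is an error, as described above.
            raise KeyError("undefined parameter: " + name)
        return params[key]
    return PARAM_RE.sub(replace, line)


print(substitute("A = load '/data/mydata/$date';", {"date": "20080201"}))
```

Keeping the lookup key lower-cased is one simple way to make `$date` and `$DATE` refer to the same parameter.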
+ 
+ === Specifying Parameters ===
+ 
+ A parameter value can be supplied in four different ways.
+ 
+ ==== Command Line ====
+ 
+ Parameters can be passed via the pig command line using the `-param <param>=<val>` construct. Multiple parameters can be specified. If the same parameter is specified multiple
+ times, the last value will be used and a warning will be generated.
+ 
+ The command line for Example 4 above would look as follows:
+ 
+ {{{
+ pig -param date='20080201'
+ }}}
+ 
+ ==== Parameter File ====
+ 
+ Parameters can also be specified in a file that is passed to pig using the `-param_file <file>` construct. Multiple files can be specified. If the same parameter is present
+ multiple times in a file, the last value will be used and a warning will be generated. If a parameter is present in multiple files, the value from the last file will be used
+ and a warning will be generated.
+ 
+ A parameter file contains one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. Comments must take a full line and `#` must be
+ the first character on the line. Each parameter line is of the form: `<param_name>=<param_value>`. White space around `=` is allowed but optional.
+ 
+ {{{
+ # my parameters
+ 
+ date = '20080201'
+ cmd = `generate_name`
+ }}}
+ 
+ Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters.
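
As an illustration of the file format described above, here is a minimal Python sketch of a parser. The function name `parse_param_file` is hypothetical and the sketch keeps values in their raw form (quotes or backticks included); resolving them is a separate step. Values from the command line would be applied on top of the returned dictionary so that they take precedence.

```python
import warnings


def parse_param_file(text):
    """Parse parameter-file text into a dict of lower-case name -> raw value."""
    params = {}
    for line in text.splitlines():
        # Empty lines and full-line comments (# in the first column) are skipped.
        if not line.strip() or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected <param_name>=<param_value>: " + line)
        key = name.strip().lower()  # white space around = is optional
        if key in params:
            warnings.warn("duplicate parameter: " + key)
        params[key] = value.strip()  # the last value wins
    return params


example = "# my parameters\n\ndate = '20080201'\ncmd = `generate_name`\n"
print(parse_param_file(example))
```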
+ 
+ ==== Declare Statement ====
+ 
+ The `declare` command can be used from within a pig script. Its use case is to define one parameter in terms of other(s).
+ 
+ {{{
+ %declare CMD `$mycmd $date`
+ A = load '/data/mydata/$CMD';
+ B = filter A by $0>'5';
+ .....
+ }}}
+ 
+ The format is `%declare <param> <value>`
+ 
+ `declare` command starts with `%` to indicate that this is a preprocessor command that is processed prior to executing pig script. It takes the highest precedence. The  
+ scope of parameter value defined via `declare` is all the lines following `declare` command until the next `declare` command that defines this parameter is encountered.
+ 
+ ==== Default Statement ====
+ 
+ The `default` command can be used to provide a default value for a parameter. This value is used if the parameter has no value defined by any other means. (`default` has the
+ lowest priority.)
+ 
+ `default` has the same format and scoping rules as `declare`.
+ 
+ {{{
+ %default DATE '20080101'
+ }}}
+ 
+ ==== Processing Order ====
+ 
+  1. Configuration files will be scanned in the order they are specified on the command line. Within each file, the parameters are processed in the order they are  
+ specified.
+  2. Command line parameters will be scanned in the order they are specified on the command line.
+  3. declare/default commands will be processed in the order they appear in the pig script.
+ 
+ ==== Value Format ====
+ 
+ The value format is identical regardless of how the parameter is specified, and a value can be of two types. The first is a sequence of characters enclosed in single or double quotes. In
+ this case the unquoted version of the value is used during substitution. Quotes within the value can be escaped.
+ 
+ {{{
+ %declare DESC 'Joe\'s URL'
+ A = load 'data' as (name, desc, url);
+ B = FILTER A by desc eq '$DESC';
+ }}} 
+ 
+ Note that the constant given to the filter needs to be enclosed in quotes because the parameter value is the unquoted version of the string.
+ 
+ The second is a command enclosed in backticks. In this case, the command is executed and its `stdout` is used as the parameter value:
  
  {{{
  %declare CMD `generate_date`
@@ -38, +126 @@

  .....
  }}}
  
- For this example, pig would execute `generate_date` command when it encounters the `declare` statement and assigns the result (stdout) to parameter `CMD`. The value of `CMD` is substituted prior to running the load statement.
+ The values of both types can be expressed in terms of other parameters as long as the values of the dependent parameters are defined prior to this value.
  
- `declare` statement starts with `%` to indicate that it is part of the preprocessor that performs parameter substitution rather than Pig language itself. The declare statement runs till the end of the line unless the value is a literal in which case it can take multiple lines.
- 
- `declare` can also be used to define one parameter in terms of others:
- 
- {{{
- %declare param1 ($param2 + $param3)
- }}}
- 
- With exception to string literals that can span multiple lines, for initial release, `declare` is a single-line command.
- 
- The command specified within `declare` statement can take parameters which need to be substituted as well.
- 
- {{{
- %declare CMD `generate_date $date`
- A = load '/data/mydata/$CMD';
- B = filter A by $0>'5';
- .....
- }}}
- 
- For this example, parameter `date` is substituted first when `declare` statement is encountered. Then `generate_name` command is executed passing value of `date` as a parameter to it. Its output (stdout) is assigned to `CMD` which is used in the load statement prior to its execution.
- 
- Note that variables passed on the command line must be resolved prior to the declare statement. The following sequence would cause an error:
- 
- {{{
- %declare A `cmd1 $B`
- %declare $B `cmd2`
- }}}
- 
- Command name itself can be a parameter.
  
  {{{
  %declare CMD `$mycmd $date`
@@ -77, +136 @@

  .....
  }}}
  
- In this example, parameters `mycmd` and `date` are substituted first when `declare` statement is encountered. Then the resulting command is executed and its stdout is placed into the path prior to running the load statement.
+ In this example, parameters `mycmd` and `date` are substituted first when `declare` statement is encountered. Then the resulting command is executed and its stdout is  
+ placed into the path prior to running the load statement.
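
The two value types can be sketched in Python as follows. This is an illustration under stated assumptions, not the proposed implementation: `resolve_value` and `run_command` are hypothetical names, stripping surrounding white space from the command output is an assumption, and so is passing through a value that is neither quoted nor in backticks (which is what a shell delivers for `-param date='20080201'`). Any `$name` references inside the raw value are assumed to have been substituted before this step.

```python
import subprocess


def run_command(command):
    """Run command through the shell and return its stdout."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # A non-zero exit status aborts processing.
        raise RuntimeError("command failed: " + command)
    return result.stdout.strip()  # assumption: surrounding white space is dropped


def resolve_value(raw, run=run_command):
    """Turn a raw parameter value into the string used for substitution."""
    if len(raw) >= 2 and raw[0] == raw[-1] == "`":
        # Command in backticks: its stdout becomes the value.
        return run(raw[1:-1])
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        quote = raw[0]
        # Quoted literal: strip the quotes and unescape embedded ones.
        return raw[1:-1].replace("\\" + quote, quote)
    # Assumption: anything else is used as is.
    return raw


print(resolve_value("'Joe\\'s URL'"))
```

Passing the command runner in as an argument keeps the quoting logic testable without executing anything.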
  
- Note that parameter names are case insensitive and $cmd and $CMD means the same thing. This is to match the rest of Pig Latin.
- 
- === Parameter Passing ===
- 
- Parameters can be specified on pig command line using `-param <param>=<val>` construct. Multiple parameters can be specified. If the same parameter is specified multiple times, the last value will be used and a warning will be generated.
- 
- The command line for Example 4 above would look as follows:
- 
- {{{
- pig -param date='20080201' -param cmd='generate_name'
- }}}
- 
- Parameters can also be specified in a file that can be passed to pig using `-param_file <file>` construct. Multiple files can be specified. If the same parameter is present multiple times in the file, the last value will be used and a warning will be generated. If a parameter present in multiple files, the value from the last file will be used and a warning will be generated.
- 
- A parameter file will contain one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. Comments must take a full line and `#` must be the first character on the line. Each parameter line will be of the form: `<param_name>=<param_value>`. White spaces around `=` are allowed but are optional. A parameter value can include white spaces. There is no need to quote the value and the quotes will be considered part of the value.
- 
- The parameter file for Example 4 above would look as follows:
- 
- {{{
- # my parameters
- 
- date = 20080201
- cmd = generate_name
- }}}
- 
- Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters.
- 
- `declare` command takes the highest precedence. The scope of parameter value defined via `declare` is all the lines following `declare` command until the next `declare` command that defines this parameter is encountered.
- 
- Default parameter values can be specified in a script using `%default <param> <value>` statement. This statement is identical to `declare` except that it has the lowest precedence meaning that its value is only used if it has not been defined before. Only first `default` statement for a particular parameter is meaningful. The rest are warned on and are ignored.
- 
- {{{
- %default cmd=generate_name
- }}}
- 
- Values specified from the command line as well as configuration file can be commands or expressions including other parameters. Their format is identical to `declare` and `default` format. Also, the same rule that variables need to be resolved before they can be used applies. The following order will be used:
- 
-  1. Configuration files will be scanned in the order they are specified on the command line. Within each file, the parameteres are processed in the order they are specified.
-  2. Command line parameters will be scanned in the order they are specified on the command line.
-  3. declare/default commands will be processed in the order they appear in the pig script.
  
  === Debugging ===
  
  If -debug option is specified to pig, it will produce fully substituted pig script in the current working directory named `<original name>.substituted`
  
- A -dryrun option will be added to pig in which case no execution is performed and substituted script is produced. We can also use the same option to produce just the execution plan.
+ A -dryrun option will be added to pig in which case no execution is performed and substituted script is produced. We can also use the same option to produce just the  
+ execution plan.
  
  === Logging === 
  
- Pig uses apache commons(http://commons.apache.org/logging/) in conjunction with log4j(http://logging.apache.org/log4j/) and we should to the same in the parameter substitution code.
+ Pig uses apache commons(http://commons.apache.org/logging/) in conjunction with log4j(http://logging.apache.org/log4j/) and we should do the same in the parameter
+ substitution code.
  
  The following code can be used to instantiate a logger:
  
@@ -146, +168 @@

  
  Note that this code will work once we integrate this into Pig.
  
- Pig uses INFO as the default log level. Any messages that you want users to see during normal operation should be logged at this level. Anything that is only useful for debugging, should be logged at DEBUG level. Warnings should be logged at WARN level.
+ Pig uses INFO as the default log level. Any messages that you want users to see during normal operation should be logged at this level. Anything that is only useful for  
+ debugging, should be logged at DEBUG level. Warnings should be logged at WARN level.
  
  === Error Handling ===
  
@@ -156, +179 @@

  
   * ParseExceptions - for any errors due to parsing command line or config file parameters or pig script.
   * If the underlying code throws an exception and the exception is derived from RuntimeException - just let it propagate
-  * If the underlying code throws an exception that is not derived from RuntimeException, catch it and throw a RuntimeException with the original exception as the cause. (We want to make sure that we don't have to declare additional exceptions in our APIs.)
+  * If the underlying code throws an exception that is not derived from RuntimeException, catch it and throw a RuntimeException with the original exception as the cause. (We  
+ want to make sure that we don't have to declare additional exceptions in our APIs.)
   * Any exception that the code originates should be either RuntimeException or its derivation if appropriate.
  
  == Design ==
@@ -167, +191 @@

   2. Create `parameter hash` that maps parameter names to parameter values.
   3. Read parameters from files in the order they are specified on the command line
   4. `Resolve each parameter`:
-   * search the parameter value for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the correspondent parameter is not found in the parameter hash.
+   * search the parameter value for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the  
-   * if the parameter value is enclosed in backticks, run the command and capture its stdout. If the command succeeds (returns 0), store the parameter in the hash with the value equal to stdout of the command. If the command fails (returns non-0 value), report the error and abort the processing.
+ corresponding parameter is not found in the parameter hash.
+   * if the parameter value is enclosed in backticks, run the command and capture its stdout. If the command succeeds (returns 0), store the parameter in the hash with the  
+ value equal to stdout of the command. If the command fails (returns non-0 value), report the error and abort the processing.
    * if the value is not a command, store it in the parameter hash.
    * if this is a duplicate parameter, warn and replace the old value with newly generated one.
   5. Resolve each command line parameter in the order they are specified on the command line
    * use the same resolution steps as for parameters passed in a file
   6. For each line in the input script
    * if comment or empty line, copy over
-   * if declare line resolve the paramter using the same steps as for parameters passed in a file
+   * if declare line resolve the parameter using the same steps as for parameters passed in a file
-   * if default line is encountered, the parameter defined is looked up in the parameter hash. If the parameter is not found, processing identical to declare line is performed; otherwise, the line is skipped.
+   * if default line is encountered, the parameter defined is looked up in the parameter hash. If the parameter is not found, processing identical to declare line is  
+ performed; otherwise, the line is skipped.
    * for all other lines
-    * search the line for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the correspondent parameter is not found in the parameter hash. (Reuse the code from the parameter substitution in declare statement.)
+    * search the line for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the corresponding
+ parameter is not found in the parameter hash. (Reuse the code from the parameter substitution in declare statement.)
     * place the substituted line into the output file.
   4. If -dryrun is not specified, pass the output file to grunt to execute. Otherwise, print the name of the file and exit.
   5. if neither -debug nor -dryrun are specified, remove the output file.
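
The steps above can be sketched end to end in Python. This is a simplified illustration, not the planned code: all names (`preprocess`, `resolve`, `substitute`) are hypothetical, parameters are passed in as already parsed `(name, raw_value)` pairs, the command runner is supplied by the caller, escaping is omitted, treating lines that start with `--` as comments is an assumption, and the result is returned as a string instead of being written to an output file.

```python
import re
import warnings

PARAM_RE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")
DIRECTIVE_RE = re.compile(r"%(declare|default)\s+(\w+)\s+(.*)$", re.IGNORECASE)


def substitute(text, params):
    """Replace $name references, aborting if a parameter is not in the hash."""
    def replace(match):
        key = match.group(1).lower()
        if key not in params:
            raise KeyError("undefined parameter: " + match.group(1))
        return params[key]
    return PARAM_RE.sub(replace, text)


def resolve(name, raw, params, run):
    """Resolve one parameter (step 4) and store it in the parameter hash."""
    value = substitute(raw.strip(), params)
    if len(value) >= 2 and value[0] == value[-1] == "`":
        value = run(value[1:-1])  # command: keep its stdout
    elif len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]  # quoted literal: keep the unquoted text
    key = name.lower()
    if key in params:
        warnings.warn("duplicate parameter: " + key)
    params[key] = value


def preprocess(script, file_params, cli_params, run):
    """Return the substituted script text."""
    params = {}
    # Files first, then the command line, each in the order given.
    for name, raw in list(file_params) + list(cli_params):
        resolve(name, raw, params, run)
    output = []
    for line in script.splitlines():
        stripped = line.strip()
        match = DIRECTIVE_RE.match(stripped)
        if match is None:
            if not stripped or stripped.startswith("--"):
                output.append(line)  # empty line or comment: copy over
            else:
                output.append(substitute(line, params))
        elif match.group(1).lower() == "declare":
            resolve(match.group(2), match.group(3), params, run)
        elif match.group(2).lower() not in params:
            # default: only used when the parameter is not yet defined
            resolve(match.group(2), match.group(3), params, run)
    return "\n".join(output)


script = "%default DATE '20080101'\nA = load '/data/mydata/$DATE';"
print(preprocess(script, [], [("date", "20080201")], run=lambda c: c))
```

In this sketch a `declare` line overrides earlier values (with a warning), while a `default` line is skipped when the parameter is already in the hash, which mirrors the precedence rules described earlier.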
  
- == Code Integration ==
+ == Future Features ==
  
- TBW
+ One nice feature to add later is to be able to constrain parameter names. For instance in the statement below the intent might be to replace only `$date` and leave `latest`  
+ in the path.
  
+ {{{
+ A = load 'data/$date_latest';
+ ...
+ }}}
+ 
+ This can be specified with perl-style syntax:
+ 
+ {{{
+ A = load 'data/${date}_latest';
+ ...
+ }}}
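
One possible way to support both forms is a single pattern that accepts either `${name}` or `$name`. The Python sketch below is only an illustration of that idea; the pattern and the `substitute` helper are hypothetical and no syntax has been decided.

```python
import re

# Hypothetical pattern: ${name} (group 1) or bare $name (group 2).
BRACED_OR_BARE = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def substitute(line, params):
    """Replace ${name} and $name references; params is keyed by lower-case name."""
    def replace(match):
        name = match.group(1) or match.group(2)
        return params[name.lower()]  # raises KeyError if the parameter is missing
    return BRACED_OR_BARE.sub(replace, line)


print(substitute("A = load 'data/${date}_latest';", {"date": "20080201"}))
```

With the bare form, `$date_latest` is still read as a single parameter named `date_latest`, so the braces are what limit the name to `date`.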
+