Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/04/01 22:27:03 UTC

[Pig Wiki] Update of "PigStreamingFunctionalSpec" by Arun C Murthy

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

------------------------------------------------------------------------------
  
  Pig will not ship the files but would expect the files to be available on the compute nodes.
  
- If the cache clause has a `#<name>`, then Hadoop's DistributedCache will a create a symlink in the task's cwd for the cached file. So, one can use this to distribute binaries too.
+ If the cache clause has a `#<name>`, then Hadoop's !DistributedCache will create a symlink in the task's cwd for the cached file. So, one can use this to distribute binaries too.
  
  {{{
  define X `./stream.pl` cache('/home/joe/foo#stream.pl')
@@ -306, +306 @@

  
  ==== 4.3 Ability to process binary data ====
  
- Sometimes, applications need to consume the entire data file without any parsing. All we would need in this case is to provide a custom loader function that just reads the entire data.
+ Sometimes, applications need to consume an entire data file without any parsing. In those cases, applications can specify the ''split by 'file' '' option to the LoadFunc being used. They can additionally use ''BinaryStorage'' to tell Pig not to parse the data at all, so that the streaming command receives the raw data directly.
  
  {{{
- A = load 'data' using AsIsLoader();
+ A = load 'data' using BinaryStorage() split by 'file';
  B = stream A through `stream.pl`;
  }}}
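
On the script side, a command that receives raw data through BinaryStorage only has to read bytes from stdin and write bytes to stdout. The following is a minimal sketch in Python standing in for `stream.pl`; the function name `copy_raw` and the chunk size are illustrative assumptions and are not part of this spec.

```python
import sys

CHUNK_SIZE = 64 * 1024  # illustrative value; the spec does not prescribe one


def copy_raw(src, dst, chunk_size=CHUNK_SIZE):
    """Copy bytes from src to dst unchanged and return the number of bytes copied.

    src and dst are binary file-like objects. No decoding, line splitting or
    field parsing is done, mirroring the "raw data" behaviour described above.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes object means end of input
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def main():
    # Use the binary layers of stdin/stdout so no text decoding happens.
    copy_raw(sys.stdin.buffer, sys.stdout.buffer)
    sys.stdout.buffer.flush()
```

A real streaming script would end with a call to `main()`, and would replace the plain copy with whatever processing the application needs on the raw bytes.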
  
@@ -345, +345 @@

  
  We should have a performance target in mind as compared to Hadoop streaming. I think for the initial release it would make sense to aim for '''30%''' overhead for streaming in Pig.
  
+ ==== 5.1 Load/Stream and Stream/Store optimizations ====
+ 
+ When the STREAM operator immediately follows the LOAD operator, or directly precedes the STORE operator, and the two have the '''same''' LoadFunc/StoreFunc specifications, Pig will try to optimize away the interpretation of data in the LoadFunc/StoreFunc (i.e. the need to break up raw input into ''Tuples'') by replacing the equivalent {Load|Store}Funcs with !BinaryStorage. For the LOAD/STREAM case, the caveat is that this is feasible only when each individual task processes all of the data in a given input file (i.e. the split by 'file' option is specified to the LOAD operator).
+ 
+ For example, Pig will optimize:
+ {{{
+ IP = load 'data' split by 'file';
+ OP = stream IP through `myscript`;
+ store OP into 'output';
+ }}}
+ into
+ {{{
+ define CMD `myscript` input(stdin using BinaryStorage()) output(stdout using BinaryStorage());
+ IP = load 'data' using BinaryStorage() split by 'file';
+ OP = stream IP through CMD;
+ store OP into 'output' using BinaryStorage();
+ }}}
+ 
+ However,
+ {{{
+ IP = load 'data' using PigStorage(',') split by 'file';
+ OP = stream IP through `myscript`;
+ store OP into 'output';
+ }}}
+ 
+ cannot be fully optimized: the LOAD/STREAM pair is left as is because the two operators have different !LoadFuncs (LOAD uses !PigStorage(',') while STREAM uses the default !PigStorage()). The STREAM/STORE pair will still be optimized to use !BinaryStorage.
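
The eligibility rules in this section can be summarised as two small predicates. This is only a sketch of the reasoning, not Pig's implementation: the `Pipeline` class, its field names, and the comparison of specifications as plain strings are all hypothetical, and an unspecified function is assumed to default to `PigStorage()` as in the examples above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed default when no function is specified, as in the examples above.
DEFAULT_FUNC = "PigStorage()"


@dataclass
class Pipeline:
    """Hypothetical model of a LOAD -> STREAM -> STORE script."""
    load_func: Optional[str] = None      # e.g. "PigStorage(',')"; None means default
    split_by_file: bool = False          # True if LOAD uses split by 'file'
    stream_input: Optional[str] = None   # input spec of the streaming command
    stream_output: Optional[str] = None  # output spec of the streaming command
    store_func: Optional[str] = None     # None means default


def _spec(func: Optional[str]) -> str:
    """Resolve an unspecified function to the assumed default."""
    return func if func is not None else DEFAULT_FUNC


def can_optimize_load_stream(p: Pipeline) -> bool:
    # Needs identical specifications AND whole-file splits.
    return p.split_by_file and _spec(p.load_func) == _spec(p.stream_input)


def can_optimize_stream_store(p: Pipeline) -> bool:
    # Only needs identical specifications on both sides.
    return _spec(p.stream_output) == _spec(p.store_func)
```

Under these assumptions, `Pipeline(split_by_file=True)` satisfies both predicates, while `Pipeline(load_func="PigStorage(',')", split_by_file=True)` satisfies only the STREAM/STORE one, matching the two examples above.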
+ 
  [[Anchor(Referencs)]]
  == References ==