Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/01 21:01:25 UTC
[Pig Wiki] Update of "PigJournal" by AlanGates

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=5&rev2=6

--------------------------------------------------

  || Multiquery support                                   || 0.3                  || ||
  || Add skewed join                                      || 0.4                  || ||
  || Add merge join                                       || 0.4                  || ||
+ || Add Zebra as contrib project                         || 0.4                  || ||
  || Support Hadoop 0.20                                  || 0.5                  || ||
  || Improved Sampling                                    || 0.6                  || There is still room for improvement for order by sampling ||
  || Change bags to spill after reaching fixed size       || 0.6                  || Also created bag backed by Hadoop iterator for single UDF cases ||
@@ -32, +33 @@

  || Switch local mode to Hadoop local mode               || 0.6                  || ||
  || Outer join for default, fragment-replicate, skewed   || 0.6                  || ||
  || Make configuration available to UDFs                 || 0.6                  || ||
+ || Load Store Redesign                                  || 0.7                  || ||
+ || Add Owl as contrib project                           || not yet released     || ||
+ || Pig Mix 2.0                                          || not yet released     || ||
  
  == Work in Progress ==
  This covers work that is currently being done.  For each entry the main JIRA for the work is referenced.
  
- || Feature                                                  || JIRA                                                       || Comments ||
+ || Feature                                  || JIRA                                                         || Comments ||
- || Metadata                                                 || [[http://issues.apache.org/jira/browse/PIG-823|PIG-823]]   || ||
+ || Boolean Type                             || [[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
- || Query Optimizer                                          || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || ||
+ || Query Optimizer                          || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]   || ||
- || Load Store Redesign                                      || [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]]   || ||
- || Add SQL Support                                          || [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]]   || ||
- || Change Pig internal representation of chararray to Text  || [[http://issues.apache.org/jira/browse/PIG-1017|PIG-1017]] || Patch ready, unclear when to commit to minimize disruption to users and destabilization to code base. ||
- || Integration with Zebra                                   || [[http://issues.apache.org/jira/browse/PIG-833|PIG-833]]   || ||
+ || Cleanup of javadocs                      || [[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
+ || UDFs in scripting languages              || [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
+ || Ability to specify a custom partitioner  || [[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
+ || Pig usage stats collection               || [[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], [[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], [[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], [[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
+ || Make Pig available via Maven             || [[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
  
  
  == Proposed Future Work ==
@@ -68, +73 @@

  Within each subsection order is alphabetical and does not imply priority.
  
  === Agreed Work, Agreed Approach ===
- ==== Boolean Type ====
- Boolean is currently supported internally as a type in Pig, but it is not exposed to users.  Data cannot be of type boolean, nor can UDFs (other than
- !FilterFuncs) return boolean.  Users have repeatedly requested that boolean be made a full type.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''  Will affect all !LoadCasters, as they will have to provide byteToBoolean methods.
- 
- '''References:'''
- 
- '''Estimated Development Effort:'''  small
- 
  ==== Combiner Not Used with Limit or Filter ====
  Pig Scripts that have a foreach with a nested limit or filter do not use the combiner even when they could.  Not all filters can use the combiner, but in some cases
  they can.  I think all limits could at least apply the limit in the combiner, though the UDF itself may only be executed in the reducer. 
@@ -226, +219 @@

  
  '''Estimated Development Effort:'''  small
  
- ==== Pig Mix 2.0 ====
- Pig Mix has been a very useful tool for Pig to test performance from version to version and to communicate the results of those tests to users.  However, it was
- developed prior to release 0.3, and does not test any functionality included with 0.4 or later.  Also the current
- Pig Mix tests only latency and not scalability.  A new version of Pig Mix is needed that tests additional Pig
- functionality such as outer joins, new join implementations, use of the accumulator interface, etc.  Scalability tests also need to be
- added to Pig Mix 2.0, or a separate scalability benchmark developed, so that Pig developers can measure Pig's scalability as changes are
- made.
- 
- '''Category:'''  Development
- 
- '''Dependency:'''
- 
- '''References:''' [[http://wiki.apache.org/pig/PigMix|Pig Mix]]
- 
- '''Estimated Development Effort:'''  medium
- 
  ==== Pig Server ====
  Currently Pig runs as a "fat client" where all of the front end processing is done on the user's machine.  This has the advantage that it requires no
  installation and no maintenance of a server.  However, it has the drawback that upgrades require upgrading every client machine, users may be using 
@@ -286, +263 @@

  
  '''Estimated Development Effort:'''  medium
  
- ==== UDF Support in Other Languages ====
- Currently Pig users must implement UDFs in Java.  We would like to extend this to allow !EvalFuncs and !FilterFuncs to be implemented in scripting languages.
- There seems to be consensus that implementing this in one of the frameworks that compiles scripting languages down to Java bytecode would be simpler than
- supporting any number of languages and also would provide sufficient scripting support.  Specifically, Python, Ruby, and Groovy can all be supported in this
- manner, though Perl and C cannot.  Which framework to use for this is not clear.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''
- 
- '''References:'''  [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]
- 
- '''Estimated Development Effort:'''  medium
- 
  === Agreed Work, Unknown Approach ===
- 
  ==== Clarify Pig Latin Semantics ====
  There are areas of Pig Latin semantics that are not clear or not consistent.  Take for example, a script like:
  
@@ -383, +345 @@

  
  '''Estimated Development Effort:'''  depends on how much SQL we decide to implement
  
- ==== Standard UDFs Should Pig Provide ====
+ ==== Standard UDFs Pig Should Provide ====
  There are a number of UDFs in Piggybank that might be considered standard, such as UPPER, LOWER, etc.  These could be moved into Pig proper so that they are better tested
  and maintained.  Also the Pig team should consider what additional UDFs should be added as standard.  Categories for consideration
  include string functions, math functions, statistics functions, date and time functions.  We should also consider if there are
@@ -412, +374 @@

  '''Estimated Development Effort:'''  medium
  
  
- ==== Statistics on Usage ====
- It would be very useful for Pig developers if Pig collected statistics of how users used Pig.  This could include what scripts were run, basic characteristics of
- the data, etc.  Note that this is separate from collecting statistics about data for the optimizer, though the two may share some functionality.  Also, this will
- raise security concerns (who gets to see who ran what) and thus will have to be configurable from site to site.  This has been placed in the unknown approach
- section because no design of how to collect statistics, where to store them, etc. has been proposed.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''
- 
- '''References:'''
- 
- '''Estimated Development Effort:'''  medium
- 
  === Experimental ===
+ ==== Add Scalars To Pig Latin ====
+ Users have repeatedly requested the ability to do something like this:
- ==== Custom Partitioner ====
- Hadoop allows !MapReduce users to set a custom partitioner between Map and Reduce phases.  Users would like to use these
- partitioners in their Pig scripts.  In some situations Pig sets its own custom partitioner (order by, skew join), so
- users would not be able to override the partitioner in these cases.  
  
- '''Category:'''  New Functionality
+ {{{
+     A = load 'myfile';
+     B = group A all;
+     C = foreach B generate COUNT(A); -- notice that this produces a relation with one row and one column
+     D = load 'myotherfile';
+     E = group D by $0;
+     F = foreach E generate group, SUM(D.$1) / C;
+ }}}
  
- '''Dependency:'''
+ Pig Latin does not currently allow this since C is a relation (or a bag, if you prefer).  But it is guaranteed to be a relation with one row and one column.  So it
+ should be possible to do something like:
  
- '''References:''' [[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]
+ {{{
+     A = load 'myfile';
+     B = group A all;
+     C = foreach B generate COUNT(A); -- notice that this produces a relation with one row and one column
+     D = load 'myotherfile';
+     E = group D by $0;
+     F = foreach E generate group, SUM(D.$1) / (long)C;
+ }}}
  
+ The planner would have to detect this case and insert a store between C and D, so that C could be reloaded when F is computed.  The parser should also make some effort at
+ guaranteeing that C will produce a single value, though this will not be bulletproof (e.g. checking that the foreach generates only one column is easy, checking
+ that it produces only one row is harder).  If C is reloaded and contains more than one row or column, a runtime error would occur.
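
As an illustration only, and not Pig's actual implementation, the runtime check described above can be sketched in Python. Relations are modeled here as lists of tuples; the names `ScalarError` and `relation_to_scalar` and the sample data are hypothetical and exist only for this sketch.

```python
from typing import Any, Sequence


class ScalarError(RuntimeError):
    """Raised when a relation used as a scalar is not exactly 1 row x 1 column."""


def relation_to_scalar(rows: Sequence[Sequence[Any]]) -> Any:
    """Return the single value of a one-row, one-column relation.

    Mirrors the proposed runtime behavior: any other shape is an error.
    """
    if len(rows) != 1:
        raise ScalarError(f"expected 1 row, found {len(rows)}")
    if len(rows[0]) != 1:
        raise ScalarError(f"expected 1 column, found {len(rows[0])}")
    return rows[0][0]


# Hypothetical stand-in for C after it has been stored and reloaded.
c = [(4,)]
total = relation_to_scalar(c)

# Hypothetical stand-in for E: each group key with the D.$1 values in that group.
e = {"a": [2, 6], "b": [4]}

# Equivalent of F: the per-group sum divided by the scalar taken from C.
f = [(group, sum(values) / total) for group, values in e.items()]
```

The column count could be verified earlier, when the script is parsed, but the row count is only known once C has actually been computed, which is why the sketch performs both checks at reload time.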
+ 
+ '''Category:'''  New Functionality
+ 
+ '''Dependency:'''
+ 
+ '''References:'''
+ 
- '''Estimated Development Effort:'''  small
+ '''Estimated Development Effort:'''  Small
+ 
+ ==== Add List Datatype ====
+ Pig has tuples (roughly equivalent to structs or records in many languages).  Bags, which are roughly equivalent to lists, have the restriction that they can only
+ contain tuples.  This means that users have modeled lists as bags of tuples of a single element.  This is confusing to users and wastes memory.  Changing bags to
+ take any type would be very disruptive, since much existing Pig code is built around the assumption that bags only contain tuples.  Additionally bags contain
+ extensive functionality to handle memory management, spilling, etc.  A list type need not offer all these features.  Therefore the best route to adding this
+ functionality may be to add a list type to Pig Latin.
+ 
+ '''Category:'''  New Feature
+ 
+ '''Dependency:'''
+ 
+ '''References:'''
+ 
+ '''Estimated Development Effort:'''  Medium
  
  ==== Automated Hadoop Tuning ====
  Hadoop has many configuration parameters that can affect the latency and scalability of a job.  For different types of jobs, different configurations will yield