Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2009/01/14 01:48:43 UTC

[Pig Wiki] Update of "UDFManual" by OlgaN

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/UDFManual

------------------------------------------------------------------------------
  = User Defined Function Guide =
  
  Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig. This document describes how to use existing functions as well as how to write your own functions. 
- 
- '''Note''': The infomation presented here is for the latest version of Pig, currently available on the `types` branch.
  
  [[Anchor(Eval_Functions)]]
  == Eval Functions ==
@@ -72, +70 @@

  
  The actual function implementation is on lines 13-14 and is self-explanatory.
  
- Now that we have the function implemented, it needs to be compiled and included in a jar. You will need a `pig.jar` built from the `types` branch to compile your UDF. You can use the following set of commands to checkout the code from SVN repository and create pig.jar:
+ Now that we have the function implemented, it needs to be compiled and included in a jar. You will need to build `pig.jar` to compile your UDF. You can use the following set of commands to check out the code from the SVN repository and create `pig.jar`:
  
  {{{
- svn co http://svn.apache.org/repos/asf/hadoop/pig/branches/types
+ svn co http://svn.apache.org/repos/asf/hadoop/pig/trunk
- cd types
+ cd trunk
  ant
  }}}
  
@@ -107, +105 @@

  
  An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. We call these functions `algebraic`. `COUNT` is an example of an algebraic function because we can count the number of elements in a subset of the data and then sum the counts to produce a final output. In the Hadoop world, this means that the partial computations can be done by the map and combiner, and the final result can be computed by the reducer.
  
- It is very important for performance to make sure that aggregate functions that are algebraic are implemented as such. Let's look at the implementation of the COUNT function to see what this means. (Error handling and some other code is omitted to save space. The full code can be accessed [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/COUNT.java?view=markup here].)
+ It is very important for performance to make sure that aggregate functions that are algebraic are implemented as such. Let's look at the implementation of the COUNT function to see what this means. (Error handling and some other code is omitted to save space. The full code can be accessed [http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/COUNT.java?view=markup here].)
  
  {{{#!java
  public class COUNT extends EvalFunc<Long> implements Algebraic{
@@ -231, +229 @@

  || bag || !DataBag ||
  || map || Map<Object, Object> ||
  
- All Pig-specific classes are available [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/data/ here]
+ All Pig-specific classes are available [http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/data/ here].
  
  `Tuple` and `DataBag` are different in that they are not concrete classes but rather interfaces. This enables users to extend Pig with their own versions of tuples and bags. As a result, UDFs cannot directly instantiate bags or tuples; they need to go through factory classes: `TupleFactory` and `BagFactory`.
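  
  For example, here is a minimal sketch of how a UDF might build a tuple and a bag through these factories (the class name and field values are hypothetical, chosen only for illustration):
  
  {{{#!java
  import org.apache.pig.data.BagFactory;
  import org.apache.pig.data.DataBag;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.data.TupleFactory;
  
  public class FactoryExample {
      // The factories are singletons; getInstance() returns the configured implementation.
      private static final TupleFactory mTupleFactory = TupleFactory.getInstance();
      private static final BagFactory mBagFactory = BagFactory.getInstance();
  
      public static DataBag buildBag() throws Exception {
          Tuple t = mTupleFactory.newTuple(2); // a tuple with two fields
          t.set(0, "alice");                   // sample field values
          t.set(1, new Integer(20));
          DataBag bag = mBagFactory.newDefaultBag();
          bag.add(t);                          // a bag containing the single tuple
          return bag;
      }
  }
  }}}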
  
@@ -607, +605 @@

  
  [[Anchor(Load_Functions)]]
  === Load Functions ===
- Every load function needs to implement the `LoadFunc` interface. An abbreviated version is shown below. The full definition can be seen [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/LoadFunc.java?view=markup here].
+ Every load function needs to implement the `LoadFunc` interface. An abbreviated version is shown below. The full definition can be seen [http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup here].
  
  {{{#!java
  public interface LoadFunc {
@@ -641, +639 @@

  
  In this query, only `age` needs to be converted to its actual type (`int`) right away. `name` only needs to be converted in the next step of processing, where the data is likely to be much smaller. `gpa` is not used at all and will never need to be converted.
  
- This is the main reason for Pig to separate the reading of the data (which can happen immediately) from the converting of the data (to the right type, which can happen later). For ASCII data, Pig provides `Utf8StorageConverter` that your loader class can extend and will take care of all the conversion routines. The code for it can be found [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup here].
+ This is the main reason Pig separates reading the data (which can happen immediately) from converting it to the right type (which can happen later). For ASCII data, Pig provides `Utf8StorageConverter`, which your loader class can extend to take care of all the conversion routines. The code for it can be found [http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup here].
  
  Note that conversion routines should return null values for data that can't be converted to the specified type.
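  
  As a sketch of this convention, an integer conversion routine might look like the following (the `bytesToInteger` signature is assumed here to match the loader interface, and the surrounding class is omitted):
  
  {{{#!java
  // Assumed signature; returns null instead of throwing when the bytes do not parse.
  public Integer bytesToInteger(byte[] b) throws IOException {
      if (b == null) return null;
      try {
          return Integer.valueOf(new String(b).trim());
      } catch (NumberFormatException e) {
          // Data that cannot be converted to the requested type becomes null.
          return null;
      }
  }
  }}}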
  
@@ -675, +673 @@

  
  Note that this approach assumes that the data has a uniform schema. The function needs to make sure that the data it produces conforms to the schema returned by `determineSchema`, otherwise the processing will fail. This means producing the right number of fields in the tuple (dropping fields or emitting null values if needed) and producing fields of the right type (again emitting null values as needed).
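  
  As an illustration, a loader could use a small helper like the sketch below (the helper is hypothetical and not part of the Pig API) to pad or drop fields so that every tuple matches the expected width:
  
  {{{#!java
  import org.apache.pig.backend.executionengine.ExecException;
  import org.apache.pig.data.Tuple;
  import org.apache.pig.data.TupleFactory;
  
  // Hypothetical helper: force a tuple to a fixed number of fields by padding
  // with nulls or dropping trailing fields so it conforms to the declared schema.
  public class SchemaConformer {
      private static final TupleFactory mTupleFactory = TupleFactory.getInstance();
  
      public static Tuple conform(Tuple t, int expectedFields) throws ExecException {
          Tuple result = mTupleFactory.newTuple(expectedFields);
          for (int i = 0; i < expectedFields; i++) {
              result.set(i, i < t.size() ? t.get(i) : null);
          }
          return result;
      }
  }
  }}}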
  
- For complete examples, see [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/BinStorage.java?view=markup BinStroage] and [http://svn.apache.org/viewvc/hadoop/pig/branches/types/src/org/apache/pig/builtin/PigStorage.java?view=markup PigStorage].
+ For complete examples, see [http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/BinStorage.java?view=markup BinStorage] and [http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/PigStorage.java?view=markup PigStorage].
  
  [[Anchor(Store_Functions)]]
  === Store Functions ===