Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2007/11/03 00:02:47 UTC

[Pig Wiki] Update of "UserDefinedOrdering" by AlanGates

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/UserDefinedOrdering

New page:
= Adding User Specified Ordering to Pig =

== Introduction ==
This document proposes changing Pig to allow users to provide and specify a
function for doing ordering in an ORDER BY statement.  In cases where users
have data that is not scalar or needs to be ordered in a special way, this will
allow them to accomplish that ordering in Pig.  In the short term it will also
allow users to do numeric and descending ordering.

== Syntax Changes ==
A class that contains an ordering function provided by the user will need to be part of a jar that
is registered with Pig using REGISTER, in the same way that evaluation functions currently
are.  The ORDER BY clause in Pig will then change to allow a user to specify
the class containing the function ('''not the function itself'''):

{{{
ORDER BY keys [ USING class ];
}}}

Here, class is the fully qualified class name (package plus class name).  For example:

{{{
a = LOAD 'myfile' as name, address, zipcode;
b = ORDER a BY zipcode USING com.mycompany.myproject.MyOrderByClass;
}}}
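
The proposal does not say how Pig would turn the class name given in USING into a comparator object. One possible approach, sketched here purely as an assumption, is to load the class reflectively and call its no-argument constructor. The names `ComparatorLoaderExample`, `ByLength` and `load` are hypothetical, and `String` stands in for `Tuple` so that the sketch depends only on the JDK.

```java
import java.util.Arrays;
import java.util.Comparator;

public class ComparatorLoaderExample {

    // Hypothetical user-supplied ordering class. It needs a public
    // no-argument constructor so that it can be created reflectively.
    public static class ByLength implements Comparator<String> {
        public ByLength() {}

        public int compare(String a, String b) {
            return Integer.compare(a.length(), b.length());
        }
    }

    // Loads the named class, checks that it is a Comparator, and instantiates it.
    @SuppressWarnings("unchecked")
    static <T> Comparator<T> load(String className) throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className);
        if (!Comparator.class.isAssignableFrom(cls)) {
            throw new IllegalArgumentException(
                className + " does not implement java.util.Comparator");
        }
        return (Comparator<T>) cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // In Pig the name would come from the USING clause; here it is taken
        // from the nested class so that the example is self-contained.
        Comparator<String> cmp = load(ByLength.class.getName());
        String[] words = {"ccc", "a", "bb"};
        Arrays.sort(words, cmp);
        System.out.println(Arrays.toString(words)); // prints [a, bb, ccc]
    }
}
```

Checking `isAssignableFrom` before instantiating lets a misnamed class be reported when the query is parsed rather than when the sort runs.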

== Specification for the Ordering Class ==
The ordering class provided by the user will need to implement the interface
`java.util.Comparator` for `Tuple`.

{{{
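import java.util.Comparator;

class MyOrderByClass implements Comparator<Tuple> {

    // Returns a negative, zero, or positive value as t1 sorts before, equal to, or after t2.
    public int compare(Tuple t1, Tuple t2) { ... }

}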
}}}
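
As an illustration of the numeric and descending ordering mentioned in the introduction, the following is a minimal sketch of such a comparator. The nested `Tuple` class is a hypothetical stand-in that holds a list of string fields; it is not Pig's `Tuple`, whose API is not described in this document. The names `NumericDescendingExample`, `NumericDescending` and `sortKeys` are likewise assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class NumericDescendingExample {

    // Hypothetical stand-in for Pig's Tuple: an ordered list of string fields.
    static class Tuple {
        private final List<String> fields;

        Tuple(String... fields) { this.fields = Arrays.asList(fields); }

        String getField(int i) { return fields.get(i); }
    }

    // Compares the first field of each tuple as a number, largest first.
    static class NumericDescending implements Comparator<Tuple> {
        public int compare(Tuple t1, Tuple t2) {
            long v1 = Long.parseLong(t1.getField(0));
            long v2 = Long.parseLong(t2.getField(0));
            return Long.compare(v2, v1); // arguments swapped to get descending order
        }
    }

    // Wraps each key in a one-field tuple, sorts, and returns the keys in sorted order.
    static List<String> sortKeys(String... keys) {
        List<Tuple> tuples = new ArrayList<>();
        for (String k : keys) {
            tuples.add(new Tuple(k));
        }
        Collections.sort(tuples, new NumericDescending());
        List<String> out = new ArrayList<>();
        for (Tuple t : tuples) {
            out.add(t.getField(0));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortKeys("9", "100", "25")); // prints [100, 25, 9]
    }
}
```

A plain string sort in descending order would instead yield 9, 25, 100, which is the case the proposal is meant to address.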

== Logical and Physical Plan Changes ==
When an ORDER BY clause is encountered in a query, a `ProjectSpec` is created
and passed to the `SortDistinctSpec` that controls how the sort is done.
`ProjectSpec` is a subclass of `EvalSpec`, which includes a function
`getComparator` that returns a `java.util.Comparator` object.  Currently
`EvalSpec` hard-wires the comparator that this function returns.

A new subclass of `ProjectSpec`, `SortProjectSpec`, will be created.  This class
will look like:

{{{
class SortProjectSpec extends ProjectSpec {

    // User-supplied comparator that is applied to the projected sort keys.
    private Comparator<Tuple> mComparator;

    SortProjectSpec(List<Integer> cols, Comparator<Tuple> comparator)
    {
        super(cols);
        mComparator = comparator;
    }

    @Override
    public Comparator<Tuple> getComparator()
    {
        return new Comparator<Tuple>() {
            public int compare(Tuple t1, Tuple t2) {
                // Project each tuple down to its sort keys, then delegate.
                return mComparator.compare(simpleEval(t1), simpleEval(t2));
            }
        };
    }
}
}}}

If I understand things correctly, this will handle both hooking the comparator
into the pipeline and making sure that the sort keys are passed to the
comparator (as they will be what is in the projection list for the
`ProjectSpec`).  Is this correct, Utkarsh?
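
The intended behaviour (project each tuple down to its sort keys, then hand only those keys to the user's comparator) can be sketched outside of Pig as follows. This is only an illustration under stated assumptions: `Tuple`, `project`, `onSortKeys` and `NUMERIC_FIRST_KEY` are hypothetical stand-ins, with `project` playing the role that `simpleEval` plays in the class above. None of them are Pig's actual classes or methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ProjectThenCompareExample {

    // Hypothetical stand-in for Pig's Tuple.
    static class Tuple {
        final List<String> fields;

        Tuple(List<String> fields) { this.fields = fields; }

        Tuple(String... fields) { this(Arrays.asList(fields)); }
    }

    // Stand-in for the projection step: keep only the listed columns.
    static Tuple project(Tuple t, List<Integer> cols) {
        List<String> keys = new ArrayList<>();
        for (int c : cols) {
            keys.add(t.fields.get(c));
        }
        return new Tuple(keys);
    }

    // Wraps the user's comparator so that it sees only the sort keys.
    static Comparator<Tuple> onSortKeys(final List<Integer> cols,
                                        final Comparator<Tuple> user) {
        return new Comparator<Tuple>() {
            public int compare(Tuple t1, Tuple t2) {
                return user.compare(project(t1, cols), project(t2, cols));
            }
        };
    }

    // Example user comparator: numeric order on the first field it is given.
    static final Comparator<Tuple> NUMERIC_FIRST_KEY = new Comparator<Tuple>() {
        public int compare(Tuple a, Tuple b) {
            return Integer.compare(Integer.parseInt(a.fields.get(0)),
                                   Integer.parseInt(b.fields.get(0)));
        }
    };

    public static void main(String[] args) {
        List<Tuple> rows = new ArrayList<>();
        rows.add(new Tuple("bob", "1 Main St", "94301"));
        rows.add(new Tuple("amy", "2 Oak Ave", "10001"));
        // Column 2 (zipcode) is the only sort key, so the user comparator
        // receives one-field tuples and never sees name or address.
        rows.sort(onSortKeys(Arrays.asList(2), NUMERIC_FIRST_KEY));
        System.out.println(rows.get(0).fields.get(0)); // prints amy
    }
}
```

With this shape the user's comparator can always address its keys starting at position 0, regardless of where those columns sit in the full tuple.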