Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/06/28 21:31:57 UTC

[Hadoop Wiki] Update of "Hive/GenericUDAFCaseStudy" by MayankLahiri

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GenericUDAFCaseStudy" page has been changed by MayankLahiri.
The comment on this change is: initial version of GenericUDAF tutorial.
http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy

--------------------------------------------------

New page:
= Writing GenericUDAFs: A Tutorial =

User-Defined Aggregation Functions (UDAFs) are an excellent way to integrate advanced data processing into Hive. Hive allows two varieties of UDAFs: simple and generic. Simple UDAFs, as the name implies, are rather simple to write, but they incur performance penalties because of their use of [[http://java.sun.com/docs/books/tutorial/reflect/index.html | Java Reflection]], and they do not allow features such as variable-length argument lists. Generic UDAFs support all of these features, but are perhaps not quite as intuitive to write as Simple UDAFs.

This tutorial walks through the development of the `histogram()` UDAF, which computes a histogram with a fixed, user-specified number of bins, using a constant amount of memory and time linear in the input size. It demonstrates a number of features of Generic UDAFs, such as a complex return type (an array of structures) and type checking on the input. The assumption is that the reader wants to write a UDAF for eventual submission to the Hive open-source project, so steps such as modifying the function registry in Hive and writing `.q` tests are also included. If you just want to write, debug, and deploy a UDAF locally, see [[http://wiki.apache.org/hadoop/Hive/HivePlugins | this page]].

'''NOTE:''' In this tutorial, we walk through the creation of a `histogram()` function. As of July 2010, this function is expected to appear in future releases of Hive as the built-in function `histogram_numeric()`.
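
To make the "fixed number of bins, constant memory, linear time" property described above concrete, here is a minimal stand-alone sketch in plain Java. It is not the code of `GenericUDAFHistogramNumeric.java`; it assumes a simple strategy of inserting each value as its own bin and then merging the two closest bins whenever the limit is exceeded. The class name `StreamingHistogramSketch` and all of its methods are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-alone sketch; it does not use or reproduce any Hive class. */
final class StreamingHistogramSketch {
    /** One bin: a representative x position and the number of values it has absorbed. */
    static final class Bin {
        double x;
        long count;
        Bin(double x, long count) { this.x = x; this.count = count; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) throw new IllegalArgumentException("maxBins must be >= 1");
        this.maxBins = maxBins;
    }

    /** Adds one value. Each call does O(maxBins) work, so total time is linear in the input size. */
    void add(double v) { insert(v, 1); }

    /** Folds another partial histogram into this one, bin by bin. */
    void merge(StreamingHistogramSketch other) {
        for (Bin b : other.bins) insert(b.x, b.count);
    }

    private void insert(double x, long count) {
        int i = 0;
        while (i < bins.size() && bins.get(i).x < x) i++;
        if (i < bins.size() && bins.get(i).x == x) {
            bins.get(i).count += count;      // exact match: reuse the existing bin
        } else {
            bins.add(i, new Bin(x, count));  // otherwise insert a new bin in sorted order
        }
        if (bins.size() > maxBins) mergeClosestPair(); // never hold more than maxBins + 1 bins
    }

    /** Replaces the two adjacent bins with the smallest gap by their count-weighted average. */
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        Bin a = bins.get(best);
        Bin b = bins.remove(best + 1);
        long total = a.count + b.count;
        a.x = (a.x * a.count + b.x * b.count) / total;
        a.count = total;
    }

    List<Bin> bins() { return bins; }

    long totalCount() {
        long n = 0;
        for (Bin b : bins) n += b.count;
        return n;
    }

    public static void main(String[] args) {
        // Stand-in for the tutorial's `normal` table: standard normal samples.
        StreamingHistogramSketch h = new StreamingHistogramSketch(10);
        Random rng = new Random(42);
        for (int i = 0; i < 100_000; i++) h.add(rng.nextGaussian());
        for (Bin b : h.bins()) System.out.printf("x=%.3f count=%d%n", b.x, b.count);
    }
}
```

The `merge` method is included because a distributed aggregate has to combine partial results computed on different parts of the input. A real Generic UDAF would additionally need Hive's evaluator and object-inspector plumbing, which this sketch deliberately leaves out.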

<<TableOfContents(2)>>

== Preliminaries ==

Make sure you have the latest Hive trunk by running `svn up` in your Hive directory. More detailed instructions on downloading and setting up Hive can be found at [[http://wiki.apache.org/hadoop/Hive/GettingStarted | Getting Started]]. Your local copy of Hive should work by running `build/dist/bin/hive` from the Hive root directory, and you should have some tables of data loaded into your local instance for testing whatever UDAF you have in mind. For this example, assume that a table called `normal` exists with a single `double` column called `val`, containing a large number of random numbers drawn from the standard normal distribution.

The files we will be editing or creating are as follows, relative to the Hive root:

|| `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java` |||| the main source file, to be created by you.||
|| `ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java` |||| the function registry source file, to be edited by you to register our new `histogram()` UDAF into Hive's built-in function list.||
|| `ql/src/test/queries/clientpositive/udaf_histogram.q` |||| a file of sample queries for testing `histogram()` on sample data, to be created by you.||
|| `ql/src/test/results/clientpositive/udaf_histogram.q.out` |||| the expected output from your sample queries, to be created by `ant` in a later step. ||
|| `ql/src/test/results/clientpositive/show_functions.q.out` |||| the expected output from the SHOW FUNCTIONS Hive query. Since we're adding a new `histogram()` function, this expected output will change to reflect the new function. This file will be modified by `ant` in a later step. ||

== Writing the source ==

As stated above, create a new file called `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogram.java`, relative to the Hive root directory. Please see `ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java` for a detailed example of a UDAF.
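
Before reading the real source file, it may help to see the overall shape of an aggregation in isolation. The following plain-Java sketch mimics the lifecycle that a Hive evaluator typically goes through (`iterate`, `terminatePartial`, `merge`, `terminate`) using a simple mean instead of a histogram. It uses no Hive classes at all: `MeanEvaluatorSketch`, its `Buffer` class, and the method signatures shown are hypothetical simplifications, and the real `GenericUDAFEvaluator` methods take object inspectors and `Object` parameters instead.

```java
/** Hypothetical plain-Java stand-in for the evaluator lifecycle; uses no Hive classes. */
final class MeanEvaluatorSketch {
    /** Partial aggregation state, playing the role of an aggregation buffer. */
    static final class Buffer {
        double sum;
        long count;
    }

    Buffer getNewAggregationBuffer() { return new Buffer(); }

    /** Folds one raw input value into the buffer, skipping NULLs. */
    void iterate(Buffer buf, Double value) {
        if (value == null) return;
        buf.sum += value;
        buf.count++;
    }

    /** Emits the partial state so it can be shipped to another task. */
    double[] terminatePartial(Buffer buf) {
        return new double[] { buf.sum, buf.count };
    }

    /** Folds a partial state produced by terminatePartial into the buffer. */
    void merge(Buffer buf, double[] partial) {
        if (partial == null) return;
        buf.sum += partial[0];
        buf.count += (long) partial[1];
    }

    /** Produces the final result, or null when no non-NULL rows were seen. */
    Double terminate(Buffer buf) {
        return buf.count == 0 ? null : buf.sum / buf.count;
    }

    public static void main(String[] args) {
        MeanEvaluatorSketch eval = new MeanEvaluatorSketch();

        // Two independent "map-side" buffers, each seeing part of the input.
        Buffer m1 = eval.getNewAggregationBuffer();
        eval.iterate(m1, 1.0);
        eval.iterate(m1, 2.0);
        Buffer m2 = eval.getNewAggregationBuffer();
        eval.iterate(m2, 3.0);
        eval.iterate(m2, null);
        eval.iterate(m2, 6.0);

        // One "reduce-side" buffer that merges the partial results.
        Buffer r = eval.getNewAggregationBuffer();
        eval.merge(r, eval.terminatePartial(m1));
        eval.merge(r, eval.terminatePartial(m2));
        System.out.println("mean = " + eval.terminate(r));
    }
}
```

The design point to carry over is that the partial state must be mergeable: whatever `terminatePartial` emits has to contain enough information for `merge` to combine it with other partial results without revisiting the raw rows.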

== Modifying the function registry ==

== Creating the tests ==

== Compiling, testing ==

= Checklist for open source submission =

 * Create an account on the [[https://issues.apache.org/jira/browse/HIVE | Hive JIRA]], then create an issue for your new patch under the `Query Processor` component. Solicit discussion and incorporate feedback.
 * Create your UDAF and integrate it into your local Hive copy.
 * Run `ant package` from the Hive root to compile Hive and your new UDAF.
 * Create `.q` tests and their corresponding `.q.out` output.
 * Modify the function registry if adding a new function.
 * Run `ant checkstyle` and ensure that your source files conform to the coding convention.
 * Run `ant test` and ensure that the tests pass.
 * Run `svn up` and ensure that there are no conflicts with the main repository.
 * Run `svn add` for whatever new files you have created.
 * Ensure that you have added `.q` and `.q.out` tests.
 * Ensure that you have run the `.q` tests for all new functionality.
 * If adding a new UDAF, ensure that `show_functions.q.out` has been updated.
 * Run `svn diff > HIVE-NNNN.1.patch` from the Hive root directory, where NNNN is the number that JIRA assigned to your issue.
 * Attach your patch file to the JIRA issue and describe your patch in the comments section.
 * Ask for a code review in the comments.
 * Click '''Submit patch''' on your issue after you have completed the steps above.
 * It is also advisable to '''watch''' your issue to monitor new comments.