Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/08/19 20:57:26 UTC

[Hadoop Wiki] Update of "Hive/GenericUDAFCaseStudy" by MayankLahiri

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GenericUDAFCaseStudy" page has been changed by MayankLahiri.
The comment on this change is: finished the GenericUDAF writing case study.
http://wiki.apache.org/hadoop/Hive/GenericUDAFCaseStudy?action=diff&rev1=3&rev2=4

--------------------------------------------------

    } 
  }}}
  
+ What do all these functions do? The following is a brief summary of each function, in (roughly) chronological order of being called. It's '''very''' important to remember that the computation of your aggregation must be arbitrarily divisible over the data. Think of it like writing a divide-and-conquer algorithm where the partitioning of the data is completely out of your control and handled by Hive. More formally, given any subset of the input rows, you should be able to compute a partial result, and also be able to merge any pair of partial results into another partial result. This naturally makes it difficult to port over many existing algorithms, but should guarantee researchers jobs for quite some time.
+ 
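+ For example, a mean cannot be merged correctly from two partial means alone; each partial result has to carry enough state (a sum and a count) for an exact merge. The following toy Java sketch -- plain Java, not the Hive API, with all names illustrative -- shows the shape of such a decomposable computation:
+ 
+ {{{
+ // Toy sketch: a mean decomposed into mergeable partial results.
+ class PartialMean {
+   private double sum = 0.0;  // running sum over the rows seen so far
+   private long count = 0;    // number of rows seen so far
+ 
+   // Fold in one row (plays the role of iterate()).
+   void add(double value) { sum += value; count++; }
+ 
+   // Combine with a partial computed on a disjoint subset of rows (like merge()).
+   void merge(PartialMean other) { sum += other.sum; count += other.count; }
+ 
+   // Produce the final answer (like terminate()).
+   double result() { return count == 0 ? Double.NaN : sum / count; }
+ }
+ }}}
+ 
+ Merging two means directly, without the counts, would give the wrong answer whenever the two subsets differ in size -- which is exactly why the partial result is usually richer than the final one.
+ 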
+ || '''Function''' || '''Purpose''' ||
+ || init || Called by Hive to initialize an instance of your UDAF evaluator class. ||
+ || getNewAggregationBuffer || Return an object that will be used to store temporary aggregation results. ||
+ || iterate || Process a new row of data into the aggregation buffer. ||
+ || terminatePartial || Return the contents of the current aggregation in a serializable way. ||
+ || merge || Merge a partial aggregation returned by '''terminatePartial''' into the current aggregation. ||
+ || terminate || Return the final result of the aggregation to Hive. ||
+ 
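+ In code, these methods are overridden on a subclass of `GenericUDAFEvaluator`. The following skeleton is a sketch only -- the class and buffer names are illustrative, and the type handling is elided to comments -- but it shows where each method fits. Note that the base class also requires a `reset()` method, used to clear a buffer for reuse:
+ 
+ {{{
+ import org.apache.hadoop.hive.ql.metadata.HiveException;
+ import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+ import org.apache.hadoop.hive.serde2.io.DoubleWritable;
+ import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+ import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+ 
+ public class MyEvaluator extends GenericUDAFEvaluator {
+ 
+   // Illustrative buffer: whatever state your aggregation needs.
+   static class MyBuffer implements AggregationBuffer {
+     double sum;
+   }
+ 
+   @Override
+   public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
+     super.init(m, parameters);
+     // Inspect the input/partial ObjectInspectors here and save them;
+     // return the ObjectInspector describing this evaluator's output.
+     return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
+   }
+ 
+   @Override
+   public AggregationBuffer getNewAggregationBuffer() throws HiveException {
+     return new MyBuffer();
+   }
+ 
+   @Override
+   public void reset(AggregationBuffer agg) throws HiveException {
+     ((MyBuffer) agg).sum = 0;
+   }
+ 
+   @Override
+   public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
+     // Parse parameters[] with the ObjectInspectors saved in init(),
+     // then fold the row into the buffer.
+   }
+ 
+   @Override
+   public Object terminatePartial(AggregationBuffer agg) throws HiveException {
+     // Return the buffer's contents in a serializable form.
+     return new DoubleWritable(((MyBuffer) agg).sum);
+   }
+ 
+   @Override
+   public void merge(AggregationBuffer agg, Object partial) throws HiveException {
+     // Deserialize partial (the output of another terminatePartial())
+     // and fold it into the buffer.
+   }
+ 
+   @Override
+   public Object terminate(AggregationBuffer agg) throws HiveException {
+     return new DoubleWritable(((MyBuffer) agg).sum);  // final result
+   }
+ }
+ }}}
+ 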
+ For writing the `histogram_numeric()` function, the following strategy was adopted.
+ 
+ ==== getNewAggregationBuffer ====
+ 
+ The aggregation buffer for a histogram is a list of (x,y) pairs that represent the histogram's bin centers and heights. In addition, the aggregation buffer stores two integers: the maximum number of bins (a user-specified parameter) and the number of bins currently in use. The aggregation buffer is initialized to a 'not ready' state with the number of bins set to 0. This is because Hive makes no distinction between a constant parameter supplied to a UDAF and a column from a table; thus, we have no way of knowing how many bins the user wants in their histogram until the first call to `iterate()`.
+ 
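+ A sketch of what this buffer might look like as an inner class of the evaluator (the field and class names here are illustrative, assuming `java.util.ArrayList` is imported; the actual source keeps this state in a helper histogram class):
+ 
+ {{{
+ // Illustrative aggregation buffer for the histogram (not the actual code).
+ static class HistogramBuffer implements AggregationBuffer {
+   int nBins;       // max bins requested by the user; 0 means 'not ready'
+   int nUsedBins;   // number of bins currently in use
+   ArrayList<double[]> bins = new ArrayList<double[]>();  // (x,y): centers and heights
+ }
+ }}}
+ 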
+ ==== iterate ====
+ 
+ The first thing we do in `iterate()` is to check whether the histogram object in our aggregation buffer is initialized. If it is not, we parse out the second argument to `iterate()`, which is the number of histogram bins requested by the user. We do this exactly once and initialize the histogram object. Note that error checking is performed here -- if the user supplies zero or a negative number for the number of histogram bins, a `HiveException` is thrown at this point and computation terminates.
+ 
+ Next, we parse out the actual input data item (a number) and add it to our histogram estimation in the aggregation buffer. See the `GenericUDAFHistogramNumeric.java` file for details on the heuristic used to construct a histogram.
+ 
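+ A hedged sketch of this lazy-initialization pattern, reusing the illustrative `HistogramBuffer` above. Here `nBinsOI` and `valueOI` are assumed to be `PrimitiveObjectInspector` fields saved in `init()`, and `addToHistogram()` is a hypothetical stand-in for the estimation heuristic in the actual source:
+ 
+ {{{
+ @Override
+ public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
+   if (parameters[0] == null) {
+     return;  // skip null rows
+   }
+   HistogramBuffer buf = (HistogramBuffer) agg;
+   if (buf.nBins == 0) {
+     // First call: parse the user's requested bin count from the second argument.
+     int nBins = PrimitiveObjectInspectorUtils.getInt(parameters[1], nBinsOI);
+     if (nBins < 1) {
+       throw new HiveException("histogram_numeric() needs a positive number of bins");
+     }
+     buf.nBins = nBins;
+   }
+   // Parse the data point and fold it into the running histogram estimate.
+   double value = PrimitiveObjectInspectorUtils.getDouble(parameters[0], valueOI);
+   addToHistogram(buf, value);  // hypothetical helper; see the source for the real heuristic
+ }
+ }}}
+ 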
+ ==== terminatePartial ====
+ 
+ The current histogram approximation is serialized as a list of `DoubleWritable` objects. The first two doubles in the list indicate the maximum number of histogram bins specified by the user and the number of bins currently in use. The remaining entries are (x,y) pairs from the current histogram approximation.
+ 
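+ Continuing the illustrative sketch, the serialization could look like this (a hedged version, not the actual source):
+ 
+ {{{
+ @Override
+ public Object terminatePartial(AggregationBuffer agg) throws HiveException {
+   HistogramBuffer buf = (HistogramBuffer) agg;
+   ArrayList<DoubleWritable> partial = new ArrayList<DoubleWritable>();
+   partial.add(new DoubleWritable(buf.nBins));      // [0]: max bins requested
+   partial.add(new DoubleWritable(buf.nUsedBins));  // [1]: bins currently in use
+   for (double[] bin : buf.bins) {
+     partial.add(new DoubleWritable(bin[0]));       // x: bin center
+     partial.add(new DoubleWritable(bin[1]));       // y: bin height
+   }
+   return partial;
+ }
+ }}}
+ 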
+ ==== merge ====
+ 
+ At this point, we have a (possibly uninitialized) histogram estimate, and have been asked to merge it with another estimate computed on a separate subset of the rows. If '''N''' is the number of histogram bins specified by the user, the current heuristic first builds a histogram with all '''2N''' bins from both estimates, and then iteratively merges the closest pair of bins until only '''N''' bins remain.
+ 
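+ The bin-trimming step of that heuristic is easy to sketch. This is a simplified, hedged version using the illustrative buffer above (one natural way to merge a pair is the height-weighted mean shown here; the real implementation lives in `GenericUDAFHistogramNumeric.java`):
+ 
+ {{{
+ // After folding all 2N bins from both partials into one list sorted by
+ // center, repeatedly merge the two closest neighbors until N bins remain.
+ private void trimToSize(HistogramBuffer buf) {
+   while (buf.nUsedBins > buf.nBins) {
+     // Find the two adjacent bins with the closest centers.
+     int closest = 0;
+     double smallestGap = Double.MAX_VALUE;
+     for (int i = 0; i < buf.nUsedBins - 1; i++) {
+       double gap = buf.bins.get(i + 1)[0] - buf.bins.get(i)[0];
+       if (gap < smallestGap) {
+         smallestGap = gap;
+         closest = i;
+       }
+     }
+     // Replace the pair with a single bin at their height-weighted mean.
+     double[] a = buf.bins.get(closest);
+     double[] b = buf.bins.get(closest + 1);
+     double height = a[1] + b[1];
+     a[0] = (a[0] * a[1] + b[0] * b[1]) / height;
+     a[1] = height;
+     buf.bins.remove(closest + 1);
+     buf.nUsedBins--;
+   }
+ }
+ }}}
+ 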
+ ==== terminate ====
+ 
+ The final return type from the `histogram_numeric()` function is an array of (x,y) pairs representing histogram bin centers and heights. These can be `explode()`ed into a separate table, or parsed using a script and passed to Gnuplot (for example) to visualize the histogram.
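+ 
+ A rough sketch of that final step, again with the illustrative buffer (the real code builds each (x,y) pair as a struct using ObjectInspectors set up in `init()`; this only shows the shape of the result):
+ 
+ {{{
+ @Override
+ public Object terminate(AggregationBuffer agg) throws HiveException {
+   HistogramBuffer buf = (HistogramBuffer) agg;
+   if (buf.nUsedBins == 0) {
+     return null;  // no rows were aggregated
+   }
+   // One (x,y) entry per histogram bin.
+   ArrayList<Object[]> result = new ArrayList<Object[]>();
+   for (double[] bin : buf.bins) {
+     result.add(new Object[] {
+         new DoubleWritable(bin[0]),    // x: bin center
+         new DoubleWritable(bin[1]) }); // y: bin height
+   }
+   return result;
+ }
+ }}}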
+ 
  == Modifying the function registry ==
  
+ Once the code for the UDAF has been written and the source file placed in `ql/src/java/org/apache/hadoop/hive/ql/udf/generic`, it's time to modify the function registry and incorporate the new function into Hive's list of functions. This simply involves editing `ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java` to import your UDAF class and register its name.
+ 
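+ For example, `histogram_numeric()` is registered with a single import plus one line alongside the other generic UDAFs (the pattern as of this writing; check the file itself for the exact current form):
+ 
+ {{{
+ import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFHistogramNumeric;
+ 
+ // ... inside the static registration block, next to the other UDAFs:
+ registerGenericUDAF("histogram_numeric", new GenericUDAFHistogramNumeric());
+ }}}
+ 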
+ Please note that you will have to run the following command to update the output of the `show functions` Hive call:
+ 
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=show_functions.q -Doverwrite=true
+ }}}
+ 
+ == Compiling and running ==
+ 
+ {{{
+ ant package
+ build/dist/bin/hive
+ }}}
+ 
  == Creating the tests ==
  
- == Compiling, testing ==
+ System-level tests consist of writing some sample queries that operate on sample data, generating the expected output from the queries, and making sure that future changes do not alter the expected output. Note that the expected output is passed through `diff` against the actual output from Hive, so nondeterministic algorithms will have to compute some sort of statistic and then keep only the most significant digits (for example).
+ 
+ These are the simple steps needed for creating test cases for your new UDAF/UDF:
+ 
+  1. Create a file in `ql/src/test/queries/clientpositive/udaf_XXXXX.q` where `XXXXX` is your UDAF's name.
+  2. Put some queries in the `.q` file -- hopefully enough to cover the full range of functionality and special cases.
+  3. For sample data, put your own in `hive/data/files` and load it using `LOAD DATA LOCAL INPATH...`, or reuse one of the files already there (grep for LOAD in the queries directory to see table names).
+  4. `touch ql/src/test/results/clientpositive/udaf_XXXXX.q.out`
+  5. Run the following command to generate the output into the `.q.out` result file.
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=udaf_XXXXX.q -Doverwrite=true
+ }}}
+  6. Run the following command to make sure your test runs fine.
+ {{{
+ ant test -Dtestcase=TestCliDriver -Dqfile=udaf_XXXXX.q
+ }}}
  
  = Checklist for open source submission =
  
@@ -178, +240 @@

   * Run `ant package` from the Hive root to compile Hive and your new UDAF.
   * Create `.q` tests and their corresponding `.q.out` output.
   * Modify the function registry if adding a new function.
-  * Run `ant checkstyle`, ensure that your source files conform to the coding convention.
+  * Run `ant checkstyle` and examine `build/checkstyle/checkstyle-errors.html`, ensure that your source files conform to the Sun Java coding convention (with the 100-character line length exception).
   * Run `ant test`, ensure that tests pass.
   * Run `svn up`, ensure no conflicts with the main repository.
   * Run `svn add` for whatever new files you have created.
   * Ensure that you have added `.q` and `.q.out` tests.
   * Ensure that you have run the `.q` tests for all new functionality.
-  * If adding a new UDAF, ensure that `show_functions.q.out` has been updated.
+  * If adding a new UDAF, ensure that `show_functions.q.out` has been updated. Run `ant test -Dtestcase=TestCliDriver -Dqfile=show_functions.q -Doverwrite=true` to do this.
   * Run `svn diff > HIVE-NNNN.1.patch` from the Hive root directory, where NNNN is the issue number the JIRA has assigned to you.
   * Attach your file to the JIRA issue, describe your patch in the comments section.
   * Ask for a code review in the comments.
   * Click '''Submit patch''' on your issue after you have completed the steps above.
   * It is also advisable to '''watch''' your issue to monitor new comments.
  
+ = Tips, Tricks, Best Practices =
+ 
+  * Hive can sometimes behave unexpectedly; if you're seeing something weird, ranging from unexplained exceptions to strings being incorrectly double-quoted, it's best to first run `ant clean`.
+  * When serializing the aggregation buffer in a `terminatePartial()` call, if your UDAF only uses a few variables to represent the buffer (such as average), consider serializing them into a list of doubles, for example, instead of complicated named structures.
+  * Use strong typing with generics wherever you can, rather than raw types or unchecked casts.
+  * Abstract core functionality from multiple UDAFs into its own class. Examples are `histogram_numeric()` and `percentile_approx()`, which both use the same core histogram estimation functionality.
+  * If you're stuck looking for an algorithm to adapt to the terminatePartial/merge paradigm, divide-and-conquer and parallel algorithms are predictably good places to start.
+  * Remember that the tests do a `diff` on the expected and actual output, and fail if there is any difference at all. An example of where this can fail horribly is a UDAF like `ngrams()`, where the output is a list of sorted (word,count) pairs. In some cases, different sort implementations might place words with the same count at different positions in the output. Even though the output is correct, the test will fail. In these cases, it's better to output (for example) only the counts, or some appropriate statistic on the counts, like the sum.
+