Posted to dev@hive.apache.org by "Jonathan Chang (JIRA)" <ji...@apache.org> on 2011/06/30 00:20:32 UTC

[jira] [Updated] (HIVE-1545) Add a bunch of UDFs and UDAFs

     [ https://issues.apache.org/jira/browse/HIVE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Chang updated HIVE-1545:
---------------------------------

    Attachment: ext.tar.gz
                core.tar.gz

Some UDFs for tomorrow's contributor meeting.  Summary of contents:


Core:

Basic functionality - CAST, HEX2DEC, MAP_GET
Date/time functions - DAY_OF_WEEK, DST_OFFSET
Multiple row manipulations - EXPLODE_INDEX, EXPLODE_MAP, REPEAT_ROWS
Extensions of existing aggregations - COUNT_WHERE, SUM_WHERE,
WEIGHTED_AVG, WEIGHTED_PERCENTILE
Aggregations for collecting - COLLECT, COLLECT_MAP, COLLECT_WHERE,
HISTOGRAM, UNION_MAP, UNION_SET
Basic mathematical operations - ARG_MIN, ARG_MAX, BUCKET, IS_FINITE, PMAX,
PMIN, PSUM
Generally useful aggregations - ALL, ANY, CHOOSE_ONE, TOP, TOP_N
JSON functionality - JSON_AS_ARRAY, JSON_AS_MAP, MAKE_JSON_ARRAY,
MAKE_JSON_OBJ
Generally useful array ops - ARRAY_CONCAT, ARRAY_INTERSECT, ARRAY_JOIN,
ARRAY_SORT, ARRAY_SUBSET, ARRAY_UNION

Ext:


Maintaining state across rows - CUMPROD, CUMSUM, FILL, NUMBER_ROWS, PREV
Probability (narrowly focused) - CHOOSE, ENTROPY, KMEANS, LDA,
MAP_ENTROPY, PPOIS, RPOIS, SAMPLE, LINEAR_REGRESSION
Narrowly focused string ops - MD5, LEVENSHTEIN, LONGEST,
NORMALIZE_UNICODE, UNESCAPE, URL_QUOTE, GROUP_LONGEST, TITLECASE,
REGEXP_EXTRACT_ALL
More esoteric array/map ops - ARRAY_AGGREGATE, ARRAY_COUNT_OVERLAP,
ARRAY_EXCLUDE, ARRAY_SLICE, FIND_SEQUENCE_IN_ARRAY, MAP_EXCLUDE


> Add a bunch of UDFs and UDAFs
> -----------------------------
>
>                 Key: HIVE-1545
>                 URL: https://issues.apache.org/jira/browse/HIVE-1545
>             Project: Hive
>          Issue Type: New Feature
>          Components: UDF
>            Reporter: Jonathan Chang
>            Assignee: Jonathan Chang
>            Priority: Minor
>         Attachments: core.tar.gz, ext.tar.gz, udfs.tar.gz, udfs.tar.gz
>
>
> Here are some UD(A)Fs which can be incorporated into the Hive distribution:
> UDFArgMax - Find the 0-indexed index of the largest argument. e.g., ARGMAX(4, 5, 3) returns 1.
> UDFBucket - Find the bucket in which the first argument belongs. e.g., BUCKET(x, b_1, b_2, b_3, ...) returns the smallest i such that b_{i} < x <= b_{i+1}. Returns 0 if x is smaller than all the buckets.
> UDFFindInArray - Finds the 1-based index of the first occurrence of the first argument in the array given as the second argument. Returns 0 if not found. Returns NULL if either argument is NULL. E.g., FIND_IN_ARRAY(5, array(1,2,5)) will return 3. FIND_IN_ARRAY(5, array(1,2,3)) will return 0.
> UDFGreatCircleDist - Finds the great circle distance (in km) between two lat/long coordinates (in degrees).
> UDFLDA - Performs LDA inference on a vector given fixed topics.
> UDFNumberRows - Number successive rows starting from 1. Counter resets to 1 whenever any of its parameters changes.
> UDFPmax - Finds the maximum of a set of columns. e.g., PMAX(4, 5, 3) returns 5.
> UDFRegexpExtractAll - Like REGEXP_EXTRACT except that it returns all matches in an array.
> UDFUnescape - Returns the string unescaped (using C/Java style unescaping).
> UDFWhich - Given a boolean array, return the indices which are TRUE.
> UDFJaccard
> UDAFCollect - Takes all the values associated with a row and converts them into a list. Make sure to have: set hive.map.aggr = false;
> UDAFCollectMap - Like collect except that it takes tuples and generates a map.
> UDAFEntropy - Compute the entropy of a column.
> UDAFPearson (BROKEN!!!) - Computes the Pearson correlation between two columns.
> UDAFTop - TOP(KEY, VAL) - returns the KEY associated with the largest value of VAL.
> UDAFTopN (BROKEN!!!) - Like TOP except returns a list of the keys associated with the N (passed as the third parameter) largest values of VAL.
> UDAFHistogram
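
To make the semantics described above concrete, here is a rough Python sketch of a few of the listed functions (ARGMAX, BUCKET, FIND_IN_ARRAY, PMAX, TOP, NUMBER_ROWS). It is not the attached Java code; the Python function names are made up for illustration. Tie-breaking, the behaviour of BUCKET when x exceeds every boundary or equals the first one, and the assumption that boundaries are sorted ascending are guesses that the description does not settle.

```python
from bisect import bisect_left


def arg_max(*args):
    """0-based index of the largest argument.

    Assumption: on ties the first occurrence wins."""
    return max(range(len(args)), key=lambda i: args[i])


def bucket(x, *bounds):
    """Smallest i with b_i < x <= b_{i+1}, using 1-based boundaries.

    Assumptions: bounds are sorted ascending; x above every boundary
    returns len(bounds); x equal to the first boundary returns 0."""
    return bisect_left(bounds, x)  # number of boundaries strictly below x


def find_in_array(needle, haystack):
    """1-based position of the first match, 0 if absent, None if either input is None."""
    if needle is None or haystack is None:
        return None
    for position, item in enumerate(haystack, start=1):
        if item == needle:
            return position
    return 0


def pmax(*args):
    """Maximum across the arguments of a single row."""
    return max(args)


def top(pairs):
    """Key paired with the largest value in an iterable of (key, val) tuples."""
    best_key, _ = max(pairs, key=lambda kv: kv[1])
    return best_key


def number_rows(rows):
    """Yield a counter per row, restarting at 1 whenever the row's values change."""
    previous = object()  # sentinel that never equals a real row
    count = 0
    for row in rows:
        count = count + 1 if row == previous else 1
        previous = row
        yield count
```

For example, arg_max(4, 5, 3) gives 1 and find_in_array(5, [1, 2, 5]) gives 3, matching the examples in the description. number_rows keeps state across rows, so like the real UDF it depends on the order in which rows arrive.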

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira