Posted to dev@hive.apache.org by "Jonathan Chang (JIRA)" <ji...@apache.org> on 2010/08/17 00:39:18 UTC
[jira] Created: (HIVE-1545) Add a bunch of UDFs and UDAFs
Add a bunch of UDFs and UDAFs
-----------------------------
Key: HIVE-1545
URL: https://issues.apache.org/jira/browse/HIVE-1545
Project: Hadoop Hive
Issue Type: New Feature
Reporter: Jonathan Chang
Assignee: Jonathan Chang
Priority: Minor
Here are some UD(A)Fs that could be incorporated into the Hive distribution:
UDFArgMax - Finds the 0-based index of the largest argument. E.g., ARGMAX(4, 5, 3) returns 1.
UDFBucket - Finds the bucket to which the first argument belongs. E.g., BUCKET(x, b_1, b_2, b_3, ...) returns the smallest i such that x > b_{i} and x <= b_{i+1}. Returns 0 if x is smaller than all the bucket boundaries.
UDFFindInArray - Finds the 1-based index of the first argument within the array given as the second argument. Returns 0 if not found. Returns NULL if either argument is NULL. E.g., FIND_IN_ARRAY(5, array(1,2,5)) returns 3 and FIND_IN_ARRAY(5, array(1,2,3)) returns 0.
UDFGreatCircleDist - Finds the great circle distance (in km) between two lat/long coordinates (in degrees).
UDFLDA - Performs LDA inference on a vector given fixed topics.
UDFNumberRows - Number successive rows starting from 1. Counter resets to 1 whenever any of its parameters changes.
UDFPmax - Finds the maximum of a set of columns. e.g., PMAX(4, 5, 3) returns 5.
UDFRegexpExtractAll - Like REGEXP_EXTRACT except that it returns all matches in an array.
UDFUnescape - Returns the string unescaped (using C/Java style unescaping).
UDFWhich - Given a boolean array, returns the indices of the elements that are TRUE.
UDFJaccard
UDAFCollect - Takes all the values associated with a row and converts them into a list. Make sure to have: set hive.map.aggr = false;
UDAFCollectMap - Like collect except that it takes tuples and generates a map.
UDAFEntropy - Compute the entropy of a column.
UDAFPearson (BROKEN!!!) - Computes the Pearson correlation between two columns.
UDAFTop - TOP(KEY, VAL) - returns the KEY associated with the largest value of VAL.
UDAFTopN (BROKEN!!!) - Like TOP except returns a list of the keys associated with the N (passed as the third parameter) largest values of VAL.
UDAFHistogram
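The semantics of a few of the scalar functions listed above can be sketched in plain Java. This is only an illustration of the described behavior, not the code in the tarball. The class name UdfSketch and its method names are hypothetical. Several details are assumptions that the descriptions do not specify: argMax resolves ties to the first occurrence, bucket returns the number of boundaries when x exceeds all of them (boundaries are taken to be sorted ascending), and the distance uses the haversine formula with a mean Earth radius of 6371 km.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class: plain-Java sketches of the documented semantics,
// not the Hive UDF classes shipped in udfs.tar.gz.
class UdfSketch {

    // ARGMAX: 0-based index of the largest argument.
    // Assumption: ties resolve to the first occurrence.
    static int argMax(double... xs) {
        if (xs.length == 0) throw new IllegalArgumentException("no arguments");
        int best = 0;
        for (int i = 1; i < xs.length; i++) {
            if (xs[i] > xs[best]) best = i;
        }
        return best;
    }

    // PMAX: the largest of the arguments.
    static double pmax(double... xs) {
        return xs[argMax(xs)];
    }

    // BUCKET: counts the boundaries strictly below x, so that
    // b_i < x <= b_{i+1} yields i and x <= b_1 yields 0.
    // Assumptions: boundaries are ascending; x above all of them yields
    // the number of boundaries.
    static int bucket(double x, double... bounds) {
        int i = 0;
        while (i < bounds.length && x > bounds[i]) i++;
        return i;
    }

    // FIND_IN_ARRAY: 1-based position of needle, 0 if absent,
    // null if either argument is null.
    static Integer findInArray(Integer needle, List<Integer> haystack) {
        if (needle == null || haystack == null) return null;
        return haystack.indexOf(needle) + 1; // indexOf returns -1 when absent
    }

    // Great circle distance in km between two lat/long points in degrees.
    // Assumption: haversine formula, mean Earth radius of 6371 km.
    static double greatCircleDistKm(double lat1, double lon1,
                                    double lat2, double lon2) {
        final double radiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * radiusKm * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        System.out.println(argMax(4, 5, 3));                        // 1
        System.out.println(pmax(4, 5, 3));                          // 5.0
        System.out.println(bucket(5, 1, 4, 10));                    // 2
        System.out.println(findInArray(5, Arrays.asList(1, 2, 5))); // 3
        System.out.println(findInArray(5, Arrays.asList(1, 2, 3))); // 0
        System.out.println(greatCircleDistKm(0, 0, 0, 180));
    }
}
```

A real Hive UDF would additionally need to handle Hive's argument types and NULL inputs for every function; the sketch only covers the NULL case that the description states explicitly.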
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HIVE-1545) Add a bunch of UDFs and UDAFs
Posted by "Jonathan Chang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Chang updated HIVE-1545:
---------------------------------
Attachment: udfs.tar.gz
[jira] Updated: (HIVE-1545) Add a bunch of UDFs and UDAFs
Posted by "Jonathan Chang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Chang updated HIVE-1545:
---------------------------------
Attachment: udfs.tar.gz
Here is a tarball of the UDFs, which are poorly documented and tested.
[jira] Updated: (HIVE-1545) Add a bunch of UDFs and UDAFs
Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Steinbach updated HIVE-1545:
---------------------------------
Component/s: UDF
[jira] Commented: (HIVE-1545) Add a bunch of UDFs and UDAFs
Posted by "Terje Marthinussen (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917546#action_12917546 ]
Terje Marthinussen commented on HIVE-1545:
------------------------------------------
I was just taking a quick look at this and noticed the following:
grep lib com/facebook/hive/udf/*java
com/facebook/hive/udf/UDAFHistogram.java:import com.facebook.hive.udf.lib.Counter;
com/facebook/hive/udf/UDFJaccard.java:import com.facebook.hive.udf.lib.SetOps;
However, the tarball does not include com.facebook.hive.udf.lib.
[jira] Commented: (HIVE-1545) Add a bunch of UDFs and UDAFs
Posted by "Jonathan Chang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918094#action_12918094 ]
Jonathan Chang commented on HIVE-1545:
--------------------------------------
Sorry about that. I've uploaded a new tarball which should contain the lib directory along with some new UDFs. UDAFPearson has also been removed since it's been obsoleted by CORR.
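For readers unfamiliar with the quantity UDAFPearson was meant to compute, here is a minimal plain-Java sketch of the Pearson correlation between two columns. It is neither the removed UDAF nor Hive's CORR implementation. The class name PearsonSketch and the two-pass approach (means first, then centered sums) are assumptions made for illustration.

```java
// Hypothetical helper class illustrating Pearson correlation.
class PearsonSketch {

    // Returns the Pearson correlation of x and y, which must have the same
    // non-zero length. The result is NaN if either column is constant.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("columns must be non-empty and equal in length");
        }
        int n = x.length;

        // First pass: column means.
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // Second pass: centered sums of products and squares.
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3};
        System.out.println(pearson(x, new double[] {2, 4, 6})); // 1.0
        System.out.println(pearson(x, new double[] {6, 4, 2})); // -1.0
    }
}
```

A distributed aggregate cannot make two passes over the data, so an actual UDAF would have to keep mergeable partial sums instead; the sketch only shows the quantity being computed.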