Posted to dev@hive.apache.org by "Harish Butani (JIRA)" <ji...@apache.org> on 2014/09/02 21:57:20 UTC

[jira] [Created] (HIVE-7940) Expose Machine Learning functions and Model application in Hive

Harish Butani created HIVE-7940:
-----------------------------------

             Summary: Expose Machine Learning functions and Model application in Hive
                 Key: HIVE-7940
                 URL: https://issues.apache.org/jira/browse/HIVE-7940
             Project: Hive
          Issue Type: New Feature
            Reporter: Harish Butani


*Machine Learning functions*
# [HiveMall|https://github.com/myui/hivemall] has demonstrated how to do machine learning in Hive. It has an extensive set of functions, and it shows how UDTFs and the Amplify technique can be used to do iterative computations. There is a lot of interest in the Hive user community in using HiveMall.
# Other possible ways to expose machine learning functionality:
#* Via the Script Operator (or Table Functions), calling out to a Machine Learning service like [0xdata|https://github.com/0xdata/h2o]. In this scheme the service's nodes would communicate outside of Hive, process the data in multiple iterations, and then return the result to the Hive pipeline.
#* At the language level, provide an iteration mechanism in Hive. This has more general applications: it could also be used to express Recursive CTEs and Graph Algorithms.
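
To make the Script Operator route concrete, here is a minimal sketch of a streaming script that Hive's TRANSFORM clause could invoke. Hive passes rows to the script as tab-separated lines on stdin and reads tab-separated lines back from stdout. The function {{score_row}} is a hypothetical local placeholder for the call to an external ML service, and the row layout (a key followed by numeric features) is an assumption made for illustration.

```python
import sys


def score_row(features):
    """Hypothetical stand-in for a call to an external ML service.

    It returns the mean of the features so the sketch runs without
    any service being available.
    """
    return sum(features) / len(features) if features else 0.0


def transform(lines):
    """Yield one tab-separated output line per tab-separated input line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumed layout: first column is a key, the rest are numeric features.
        key, *raw = line.split("\t")
        features = [float(v) for v in raw]
        yield f"{key}\t{score_row(features)}"


def main():
    """Read rows from stdin and write scored rows to stdout."""
    for out in transform(sys.stdin):
        print(out)
```

Saved as a script that ends with a call to {{main()}} and registered with ADD FILE, it could be invoked along the lines of {{SELECT TRANSFORM (id, f1, f2) USING 'python score.py' AS (id, score) FROM samples}}, where the script, table and column names are all hypothetical.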

*Model Application*
Even when Regression/Classification models are built in other tools, we should provide a way to evaluate these models against the entire dataset residing in Hive. These can be exposed as UDFs in Hive. A possible route could be a generic PMML-based module, e.g. [JPMML-Hive|https://github.com/jpmml/jpmml-hive]. Alternatively, we should provide integration for specific libraries: Spark MLlib, R and Python (SciPy/NumPy) seem to be the most popular toolkits.
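
As a rough illustration of what model application means here, the sketch below scores a single row against a logistic regression model whose coefficients were produced elsewhere. It is not the JPMML-Hive API, and the model layout, feature names and coefficient values are all hypothetical. A Hive UDF would wrap equivalent logic and be called once per row.

```python
import math

# Hypothetical coefficients, standing in for a model trained in another tool.
MODEL = {"intercept": -1.0, "weights": {"age": 0.5, "income": 0.25}}


def predict_proba(model, row):
    """Return P(y=1 | row) under a logistic regression model.

    Features missing from the row are treated as 0.0 (an assumption).
    """
    z = model["intercept"]
    for name, weight in model["weights"].items():
        z += weight * row.get(name, 0.0)
    return 1.0 / (1.0 + math.exp(-z))
```

For example, {{predict_proba(MODEL, {"age": 2.0, "income": 0.0})}} gives a linear score of zero and therefore a probability of 0.5.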


The *goal* would be to provide Machine Learning functionality as a feature of Hive, like [MADlib|http://madlib.net/] on Postgres, Pivotal, Impala, etc.
Capturing this high-level requirement in this JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)