Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/08/07 19:55:00 UTC

[jira] [Created] (MADLIB-1266) General fit function for PL/Python

Frank McQuillan created MADLIB-1266:
---------------------------------------

             Summary: General fit function for PL/Python
                 Key: MADLIB-1266
                 URL: https://issues.apache.org/jira/browse/MADLIB-1266
             Project: Apache MADlib
          Issue Type: New Feature
          Components: Module: Utilities
            Reporter: Frank McQuillan


Story

`As a data scientist`
I want to call a generic PL/Python UDF from SQL to fit a model
`so that`
I can use any code I write, or any Python library, for model building.


Interface

{code}
fit(
    source_table,               -- source table
    model_table,                -- model output table
    list_of_columns,            -- columns you want in GD, could be '*'
    list_of_columns_to_exclude, -- columns to explicitly exclude
    fit_udf,                    -- plpython UDF to fit model
    fit_udf_parameters,         -- parameters for UDF, if any
    grouping_cols               -- groups to build separate models for (source table distributed by this grouping)
);
{code}
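
To make the intended behaviour of the signature above more concrete, here is a minimal in-memory sketch in plain Python. It is not the proposed implementation: `fit_by_group`, `mean_model`, the list-of-dicts row format, and the pickled `model` column are all assumptions made for illustration. It only shows the idea of calling a user-supplied fit UDF once per group and emitting one model row per group.

```python
import pickle
from collections import defaultdict


def fit_by_group(rows, fit_udf, grouping_cols=None, fit_udf_parameters=None):
    """Hypothetical in-memory stand-in for the proposed fit().

    rows: list of dicts, one per source-table row (assumed shape).
    Returns a list of dicts, one per group, mimicking model_table rows.
    """
    grouping_cols = grouping_cols or []
    groups = defaultdict(list)
    for row in rows:
        # Rows with identical values in grouping_cols share one model.
        key = tuple(row[c] for c in grouping_cols)
        groups[key].append(row)

    model_rows = []
    for key, group_rows in sorted(groups.items()):
        model = fit_udf(group_rows, fit_udf_parameters)
        out = dict(zip(grouping_cols, key))
        # Serialize so the model can be stored in a single column.
        out["model"] = pickle.dumps(model)
        model_rows.append(out)
    return model_rows


def mean_model(group_rows, params):
    """Toy fit UDF (illustrative only): stores the mean of column 'y'."""
    ys = [r["y"] for r in group_rows]
    return {"mean": sum(ys) / len(ys)}


rows = [
    {"region": "east", "y": 1.0},
    {"region": "east", "y": 3.0},
    {"region": "west", "y": 10.0},
]
models = fit_by_group(rows, mean_model, grouping_cols=["region"])
```

With no `grouping_cols`, every row falls into a single group and one model row is produced. A real fit UDF would replace `mean_model` with a call into scikit-learn, Keras, or XGBoost.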

Arguments
{code}
source_table
TEXT. Name of the table containing the data to load.

model_table
TEXT. Name of the table containing the model(s), with one row per group.

list_of_columns
TEXT. Comma-separated string of column names or expressions to load.
Can also be '*', meaning all columns are loaded (except those listed in the
next argument, which names exclusions). The types of the columns can be mixed.
Array columns can also be included in the list and will be loaded as is (i.e., not flattened). (???)

list_of_columns_to_exclude
TEXT. Comma-separated string of column names to exclude from load. 
Typically used when 'list_of_columns' is set to '*'.

fit_udf
TEXT. PL/Python UDF that fits the model.

fit_udf_parameters (optional)
TEXT. Parameters for the UDF, if any.

grouping_cols (optional)
TEXT, default: NULL. Comma-separated list of column names to group the data by. 
This will produce multiple models, one for each group.
{code}
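
One way the interplay between 'list_of_columns' and 'list_of_columns_to_exclude' could be resolved is sketched below. The helper name `resolve_columns` and the `available` argument (the source table's column names, assumed to be looked up by the caller) are hypothetical. The naive comma split is also an assumption: it would break on expressions that themselves contain commas, so a real implementation would need a proper parser.

```python
def resolve_columns(available, list_of_columns, list_of_columns_to_exclude=None):
    """Hypothetical helper: expand list_of_columns into the final load list.

    available: column names of source_table, in table order (assumed to be
    supplied by the caller).
    """
    def split(csv):
        # Naive split; does not handle commas inside expressions.
        return [c.strip() for c in csv.split(",") if c.strip()] if csv else []

    excluded = set(split(list_of_columns_to_exclude))
    if list_of_columns.strip() == "*":
        requested = list(available)   # '*' means every column in the table
    else:
        requested = split(list_of_columns)
    # Exclusions apply in both cases, preserving the requested order.
    return [c for c in requested if c not in excluded]


table_columns = ["id", "x1", "x2", "y"]
all_but_id = resolve_columns(table_columns, "*", "id")
explicit = resolve_columns(table_columns, "x1, y")
```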


Open questions

1) Do we need separate fit functions for R and Python, or can we autodetect?
If we need separate ones, we could call this module `fit_plpythonu` and the R one `fit_plr`.
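
As a possible starting point for the autodetect option, the language of a UDF can be read from the standard PostgreSQL catalogs (`pg_proc.prolang` references `pg_language.oid`). The sketch below is an assumption about how a dispatcher might use that; `FIT_DISPATCH` and `choose_fit_function` are hypothetical names, and the query ignores schemas and overloaded function names, which a real lookup would have to handle.

```python
# Catalog query returning the language a UDF is written in.
# %s is a placeholder for the function name, to be bound by the caller.
UDF_LANGUAGE_QUERY = """
SELECT l.lanname
FROM pg_proc p
JOIN pg_language l ON l.oid = p.prolang
WHERE p.proname = %s
"""

# Hypothetical mapping from language name to a language-specific fit function.
FIT_DISPATCH = {
    "plpythonu": "fit_plpythonu",
    "plr": "fit_plr",
}


def choose_fit_function(lanname):
    """Return the name of the fit function to use for a UDF language."""
    try:
        return FIT_DISPATCH[lanname]
    except KeyError:
        raise ValueError("unsupported UDF language: %s" % lanname)
```

If this kind of lookup proves reliable, a single `fit` entry point could dispatch internally and the separate names would only be implementation details.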


Notes

1) Both Keras and scikit-learn use the term `fit`, which seems better than `train`.
(We will use the term `predict` for prediction in a separate story.)


Acceptance

1) Generate a model table for a sample data set with multiple groups using a scikit-learn model.
2) Repeat for Keras/TF.
3) Repeat for XGBoost.


