Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2018/08/07 19:57:00 UTC

[jira] [Created] (MADLIB-1267) General predict function for PL/Python

Frank McQuillan created MADLIB-1267:
---------------------------------------

             Summary: General predict function for PL/Python
                 Key: MADLIB-1267
                 URL: https://issues.apache.org/jira/browse/MADLIB-1267
             Project: Apache MADlib
          Issue Type: New Feature
          Components: Module: Utilities
            Reporter: Frank McQuillan
             Fix For: v2.0


Context

Follow-on from https://www.pivotaltracker.com/story/show/158990284


Story

`As a data scientist`
I want to call a generic PL/Python UDF from SQL to predict
`so that`
I can use any code I write, or any Python library, for prediction.


Interface

{code}
predict(
    model_table,                 -- model output table
    data_table,                  -- data table to predict
    list_of_columns,             -- columns you want in GD, could be '*'    needed???
    list_of_columns_to_exclude,  -- columns to explicitly exclude           needed???
    predict_udf,                 -- plpython UDF to predict
    predict_udf_parameters,      -- parameters for UDF, if any
    grouping_cols                -- groups to build separate models for (source table distributed by this grouping)    needed???
);
{code}
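
To make the `predict_udf` and `predict_udf_parameters` arguments more concrete, here is a minimal sketch of what a user-supplied predict function could look like. Everything in it is an assumption for illustration: the name `linear_predict_udf`, the idea that the model table stores a pickled object, the dict-shaped model, and the `scale` parameter are all hypothetical and are not part of MADlib or of this proposal.

```python
import pickle


def linear_predict_udf(model_bytes, rows, params=None):
    """Hypothetical predict UDF: unpickle a model and score each row.

    model_bytes -- serialized model, as it might be stored in model_table
    rows        -- list of feature vectors read from data_table
    params      -- optional dict standing in for predict_udf_parameters
    """
    model = pickle.loads(model_bytes)          # assumption: model is a pickled dict
    scale = (params or {}).get("scale", 1.0)   # hypothetical UDF parameter
    predictions = []
    for row in rows:
        # Plain linear model: intercept + dot(coef, row)
        y = model["intercept"] + sum(c * x for c, x in zip(model["coef"], row))
        predictions.append(scale * y)
    return predictions


# Stand-in for one row of the model table.
model_bytes = pickle.dumps({"coef": [2.0, -1.0], "intercept": 0.5})
predictions = linear_predict_udf(model_bytes, [[1.0, 1.0], [3.0, 0.0]])
```

A scikit-learn, Keras or XGBoost model would replace the dict here, with the function body calling that library's own prediction method instead of computing the dot product by hand.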

Arguments
{code}
data_table
TEXT. Name of the table containing the data to predict on.

model_table
TEXT. Name of the table containing the model(s), with one row per group.

list_of_columns
TEXT. Comma-separated string of column names or expressions to load.
Can also be '*', implying that all columns are to be loaded (except for the ones
listed in the next argument). The types of the columns can be mixed.
Array columns can also be included in the list and will be loaded as is (i.e., not flattened). (???)

list_of_columns_to_exclude
TEXT. Comma-separated string of column names to exclude from load. 
Typically used when 'list_of_columns' is set to '*'.

predict_udf
TEXT. Name of the PL/Python UDF to use for prediction.

predict_udf_parameters (optional)
TEXT. Parameters for the UDF, if any.

grouping_cols (optional)
TEXT, default: NULL. Comma-separated list of column names to group the data by. 
This will produce multiple models, one for each group.
{code}
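
The interaction between `list_of_columns` and `list_of_columns_to_exclude` can be sketched as follows. This is only an illustration under stated assumptions: the helper name `resolve_columns` is hypothetical, the table's column names are passed in rather than read from the catalog, and the naive comma split would break on expressions that themselves contain commas (for example function calls).

```python
def resolve_columns(table_columns, list_of_columns, list_of_columns_to_exclude=None):
    """Return the columns to load, in order, after applying exclusions.

    table_columns              -- all column names of the data table
    list_of_columns            -- '*' or a comma-separated string of names
    list_of_columns_to_exclude -- optional comma-separated string of names
    """
    excluded = set()
    if list_of_columns_to_exclude:
        excluded = {c.strip() for c in list_of_columns_to_exclude.split(",") if c.strip()}

    if list_of_columns.strip() == "*":
        requested = list(table_columns)   # '*' expands to every column
    else:
        requested = [c.strip() for c in list_of_columns.split(",") if c.strip()]

    # Exclusions apply to both the '*' case and an explicit list.
    return [c for c in requested if c not in excluded]


selected = resolve_columns(["id", "x1", "x2", "y"], "*", "id, y")
```

If open question 2 is resolved by reusing the training table's column list, a helper like this would not be needed in the predict function at all.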


Open questions

1) Do we need separate predict functions for R and Python, or can we autodetect?
If we need separate ones, we could call this module `predict_plpythonu` and the R one `predict_plr`.

2) Do we need `list_of_columns` and `list_of_columns_to_exclude`,
or can we assume they are the same as for the training table?

3) Scoring should be embarrassingly parallel, so do we need `grouping_cols` in the predict function?
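
As a rough illustration of question 3, the sketch below scores rows group by group, with each group looked up against its own model. The function name `score_by_group`, the dict-of-models layout, and the row format are all hypothetical. It suggests that `grouping_cols` may only be needed to match each row to the right model; the groups do not depend on one another, so they could be scored in any order or in parallel.

```python
from collections import defaultdict


def score_by_group(models, rows, group_key, predict_fn):
    """Score rows with the model that belongs to each row's group.

    models     -- dict mapping group value -> model (one row per group in model_table)
    rows       -- list of dicts, one per data row
    group_key  -- name of the grouping column
    predict_fn -- callable(model, rows_for_group) -> list of predictions
    """
    rows_by_group = defaultdict(list)
    for row in rows:
        rows_by_group[row[group_key]].append(row)

    # Each iteration is independent of the others, so this loop is the
    # part that could run in parallel across segments or workers.
    return {
        group: predict_fn(models[group], group_rows)
        for group, group_rows in rows_by_group.items()
    }


# Toy example: each "model" is just a multiplier applied to column x.
models = {"a": 1.0, "b": 10.0}
rows = [{"g": "a", "x": 1.0}, {"g": "b", "x": 2.0}, {"g": "a", "x": 3.0}]
scores = score_by_group(models, rows, "g", lambda m, rs: [m * r["x"] for r in rs])
```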


Notes

1) scikit-learn uses the term `predict` and Keras uses `evaluate`, but I think `predict` is better.


Acceptance

1) Generate a model table for a sample data set with multiple groups using a scikit-learn model.  Use this predict function to score some test data.
2) Repeat for Keras/TF.
3) Repeat for XGBoost.



