Posted to issues@spark.apache.org by "Peter Knight (JIRA)" <ji...@apache.org> on 2018/07/04 11:28:00 UTC
[jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries
[ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532576#comment-16532576 ]
Peter Knight edited comment on SPARK-19498 at 7/4/18 11:27 AM:
---------------------------------------------------------------
Just wanted to add to Lucas' post that I agree with it. I love the ML Pipeline concept, but creating a custom transformer is currently verbose. Here is an example of number 4 showing the amount of repeated/boilerplate code needed. You can see that the default values are being set 3 times for each parameter and each parameter name is being entered 10 times!
{code:python}
from pyspark import keyword_only
from pyspark.ml.param.shared import Param, Params, TypeConverters
from pyspark.ml import Transformer


class StraightLine(Transformer):

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, m=1.0, c=0.0):
        super(StraightLine, self).__init__()
        self._setDefault(inputCol=None, outputCol=None, m=1.0, c=0.0)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, m=1.0, c=0.0):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    # inputCol Param
    inputCol = Param(Params._dummy(), "inputCol",
                     "specify the input column name (your X). (string)",
                     typeConverter=TypeConverters.toString)

    def setInputCol(self, value):
        return self._set(inputCol=value)

    def getInputCol(self):
        return self.getOrDefault(self.inputCol)

    # outputCol Param
    outputCol = Param(Params._dummy(), "outputCol",
                      "specify the output column name (your Y). (string)",
                      typeConverter=TypeConverters.toString)

    def setOutputCol(self, value):
        return self._set(outputCol=value)

    def getOutputCol(self):
        return self.getOrDefault(self.outputCol)

    # m Param
    m = Param(Params._dummy(), "m",
              "specify m - the slope of the line. (float)",
              typeConverter=TypeConverters.toFloat)

    def setM(self, value):
        return self._set(m=value)

    def getM(self):
        return self.getOrDefault(self.m)

    # c Param
    c = Param(Params._dummy(), "c",
              "specify c - the y offset when x = 0. (float)",
              typeConverter=TypeConverters.toFloat)

    def setC(self, value):
        return self._set(c=value)

    def getC(self):
        return self.getOrDefault(self.c)

    # the actual transformation: y = m * x + c, applied via a SQL expression
    def _transform(self, dataset):
        # check that the required column names were supplied
        input_col = self.getInputCol()
        if not input_col:
            raise ValueError("inputCol not supplied")
        output_col = self.getOutputCol()
        if not output_col:
            raise ValueError("outputCol not supplied")
        return dataset.selectExpr(
            "*",
            str(self.getM()) + " * " + input_col + " + " + str(self.getC()) + " AS " + output_col)
{code}
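To make it concrete what {{_transform}} does, the SQL expression it passes to {{selectExpr}} can be built in plain Python with no Spark session; the values m=2.0, c=0.5 and column names "x"/"y" below are just example inputs, not anything from the transformer's defaults:

```python
# Build the same selectExpr string that _transform would generate
# for example parameter values (illustration only, no Spark needed).
m, c = 2.0, 0.5
input_col, output_col = "x", "y"
expr = str(m) + " * " + input_col + " + " + str(c) + " AS " + output_col
print(expr)  # 2.0 * x + 0.5 AS y
```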
My preference would be to have a function addParam(name, description, type, default_value, is_required), which would make the code look more like the code below (going from 50+ lines to <10). (is_required would default to true and would throw an error if that parameter were None.) If explainParams also showed the expected data type, then I wouldn't have to add it to the description myself each time. If getters and setters are not easy to generate, a generic getParam(name) would do.
{code:python}
class StraightLine(Transformer):
    addParam("inputCol", "specify the input column name (your X).", String, None)
    addParam("outputCol", "specify the output column name (your Y).", String, None)
    addParam("m", "specify m - the slope of the line.", Float, 1.0)
    addParam("c", "specify c - the y offset when x = 0.", Float, 0.0)

    def _transform(self, dataset):
        return dataset.selectExpr(
            "*",
            str(self.getM()) + " * " + self.getInputCol() + " + " + str(self.getC()) + " AS " + self.getOutputCol())
{code}
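For what it's worth, the getter/setter generation such an addParam could do is straightforward in plain Python. The sketch below is purely hypothetical (it does not use pyspark, and ParamSpec/add_params are invented names); it only illustrates, under those assumptions, how a declarative parameter list could generate the getX/setX methods and default/required handling that the verbose version spells out by hand:

```python
# Hypothetical sketch of an addParam-style helper in plain Python.
# ParamSpec and add_params are invented names, not pyspark APIs.

class ParamSpec(object):
    """Declarative description of one parameter (hypothetical)."""
    def __init__(self, name, description, type_, default, is_required=True):
        self.name = name
        self.description = description
        self.type = type_
        self.default = default
        self.is_required = is_required


def add_params(*specs):
    """Class decorator that generates getX/setX methods from ParamSpecs."""
    def decorate(cls):
        cls._param_specs = {s.name: s for s in specs}
        for spec in specs:
            def make_getter(name):
                def getter(self):
                    spec = cls._param_specs[name]
                    value = self._values.get(name, spec.default)
                    if value is None and spec.is_required:
                        raise ValueError(name + " not supplied")
                    return value
                return getter

            def make_setter(name):
                def setter(self, value):
                    self._values[name] = value
                    return self  # allow chaining, like pyspark setters
                return setter

            cap = spec.name[0].upper() + spec.name[1:]
            setattr(cls, "get" + cap, make_getter(spec.name))
            setattr(cls, "set" + cap, make_setter(spec.name))
        return cls
    return decorate


@add_params(
    ParamSpec("inputCol", "the input column name (your X).", str, None),
    ParamSpec("outputCol", "the output column name (your Y).", str, None),
    ParamSpec("m", "the slope of the line.", float, 1.0),
    ParamSpec("c", "the y offset when x = 0.", float, 0.0),
)
class StraightLine(object):
    def __init__(self, **kwargs):
        self._values = dict(kwargs)


line = StraightLine(inputCol="x", outputCol="y").setM(2.0)
print(line.getM())  # 2.0 (explicitly set)
print(line.getC())  # 0.0 (falls back to the declared default)
```

Each parameter name appears once here, each default once, and a missing required parameter raises at get time, which is roughly the behaviour the proposal above asks for.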
> Discussion: Making MLlib APIs extensible for 3rd party libraries
> ----------------------------------------------------------------
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
> Issue Type: Brainstorming
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Joseph K. Bradley
> Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we can make MLlib DataFrame-based APIs more extensible, especially for the purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough? Do they require changes before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or extensive enough?
> The easy answer is to make everything public, but that would be terrible of course in the long-term. Let's discuss what is needed and how we can present stable, sufficient, and easy-to-use APIs for 3rd-party developers.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org