Posted to issues@spark.apache.org by "Peter Knight (JIRA)" <ji...@apache.org> on 2018/07/04 11:28:00 UTC

[jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

    [ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16532576#comment-16532576 ] 

Peter Knight edited comment on SPARK-19498 at 7/4/18 11:27 AM:
---------------------------------------------------------------

Just wanted to add to Lucas' post that I agree. I love the ML Pipeline concept, but creating a custom transformer is currently very verbose. Here is an example of point 4, showing the amount of repeated/boilerplate code needed. You can see that the default values are set 3 times for each parameter, and each parameter name is entered 10 times!
{code:java}
from pyspark import keyword_only
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml import Transformer

class StraightLine(Transformer):
	@keyword_only 
	def __init__(self, inputCol=None, outputCol=None, m=1.0, c=0.0): 
		super(StraightLine, self).__init__() 
		self._setDefault(inputCol=None, outputCol=None, m=1.0, c=0.0)
		kwargs = self._input_kwargs 
		self.setParams(**kwargs) 
	@keyword_only 
	def setParams(self, inputCol=None, outputCol=None, m=1.0, c=0.0): 
		kwargs = self._input_kwargs 
		return self._set(**kwargs) 
		
	# inputCol Param	
	inputCol = Param(Params._dummy(), "inputCol", "specify the input column name (your X). (string)", typeConverter=TypeConverters.toString)
	def setInputCol(self, value):
		return self._set(inputCol=value)
	def getInputCol(self):
		return self.getOrDefault(self.inputCol)
		
	# outputCol Param	
	outputCol = Param(Params._dummy(), "outputCol", "specify the output column name (your Y). (string)", typeConverter=TypeConverters.toString)
	def setOutputCol(self, value):
		return self._set(outputCol=value)
	def getOutputCol(self):
		return self.getOrDefault(self.outputCol)

	# m Param	
	m = Param(Params._dummy(), "m", "specify m - the slope of the line. (float)", typeConverter=TypeConverters.toFloat)
	def setM(self, value):
		return self._set(m=value)
	def getM(self):
		return self.getOrDefault(self.m)		 

	# c Param	
	c = Param(Params._dummy(), "c", "specify c - the y offset when x = 0. (float)", typeConverter=TypeConverters.toFloat)
	def setC(self, value):
		return self._set(c=value)
	def getC(self):
		return self.getOrDefault(self.c)

	# Define the transformation: y = m * x + c
	def _transform(self, dataset):
		# get and validate the column names
		input_col = self.getInputCol()
		if not input_col:
			raise ValueError("inputCol not supplied")

		output_col = self.getOutputCol()
		if not output_col:
			raise ValueError("outputCol not supplied")

		# evaluate the line as a SQL expression and append it as the output column
		return dataset.selectExpr("*", str(self.getM()) + " * " + input_col + " + " + str(self.getC()) + " AS " + output_col)
{code}
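For reference, here is a minimal usage sketch of the transformer above (the SparkSession setup, sample data and column names are just illustrative assumptions, not part of the original example):
{code:python}
from pyspark.sql import SparkSession

# illustrative setup only; any existing SparkSession and DataFrame would do
spark = SparkSession.builder.appName("StraightLineDemo").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["x"])

# appends a new column "y" computed as y = 2.0 * x + 1.0
line = StraightLine(inputCol="x", outputCol="y", m=2.0, c=1.0)
line.transform(df).show()
{code}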
My preference would be to have a function addParam(name, description, type, default_value, is_required), which would make the code look more like the code below (going from 50+ lines to fewer than 10). is_required would default to true and would throw an error if that parameter were None. If explainParams also showed the expected data type, I wouldn't have to add it to the description myself each time. If adding getters and setters is not easy, a generic getParam(name) would do.
{code:java}
class StraightLine(Transformer):
	addParam("inputCol", "specify the input column name (your X).", String, None)
	addParam("outputCol", "specify the output column name (your Y).", String, None)
	addParam("m", "specify m - the slope of the line.", Float, 1.0)
	addParam("c", "specify c - the y offset when x = 0.", Float, 0.0)

	def _transform(self, dataset):
		return dataset.selectExpr("*", str(self.getM()) + " * " + self.getInputCol() + " + " + str(self.getC()) + " AS " + self.getOutputCol())
{code}
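To make the gap concrete, here is a hedged sketch of roughly how far the existing shared-param mixins (HasInputCol/HasOutputCol from pyspark.ml.param.shared) can already trim this today; the class name StraightLine2 is only for illustration. The standard params collapse to the mixin declaration, but the custom params (m, c) still need the full Param/getter boilerplate, which is exactly what an addParam-style helper would remove:
{code:python}
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol

class StraightLine2(Transformer, HasInputCol, HasOutputCol):
    # inputCol/outputCol Params and their getters come from the mixins;
    # m and c still need hand-written Params and getters
    m = Param(Params._dummy(), "m", "the slope of the line. (float)", typeConverter=TypeConverters.toFloat)
    c = Param(Params._dummy(), "c", "the y offset when x = 0. (float)", typeConverter=TypeConverters.toFloat)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, m=1.0, c=0.0):
        super(StraightLine2, self).__init__()
        self._setDefault(m=1.0, c=0.0)
        self._set(**self._input_kwargs)

    def getM(self):
        return self.getOrDefault(self.m)

    def getC(self):
        return self.getOrDefault(self.c)

    def _transform(self, dataset):
        # y = m * x + c appended as the output column
        return dataset.selectExpr(
            "*", "{0} * {1} + {2} AS {3}".format(
                self.getM(), self.getInputCol(), self.getC(), self.getOutputCol()))
{code}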



> Discussion: Making MLlib APIs extensible for 3rd party libraries
> ----------------------------------------------------------------
>
>                 Key: SPARK-19498
>                 URL: https://issues.apache.org/jira/browse/SPARK-19498
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we can make MLlib DataFrame-based APIs more extensible, especially for the purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or extensive enough?
> The easy answer is to make everything public, but that would be terrible of course in the long-term.  Let's discuss what is needed and how we can present stable, sufficient, and easy-to-use APIs for 3rd-party developers.


