Posted to issues@spark.apache.org by "Lucas Partridge (JIRA)" <ji...@apache.org> on 2018/06/29 09:13:01 UTC

[jira] [Comment Edited] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

    [ https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527351#comment-16527351 ] 

Lucas Partridge edited comment on SPARK-19498 at 6/29/18 9:12 AM:
------------------------------------------------------------------

Ok great. Here's my feedback after wrapping a large, complex Python algorithm for ML Pipelines on Spark 2.2.0. Several of these comments probably apply beyond pyspark too.
 # The inability to save and load custom pyspark models/pipelines/pipelinemodels is an absolute showstopper. Training models can take hours, so we need to be able to save and reload models. Pending the availability of https://issues.apache.org/jira/browse/SPARK-17025 I used a refinement of [https://stackoverflow.com/a/49195515/1843329] to work around this (see sketch 1 after this list). Had this not been solved, no further work would have been done.
 # Support for saving/loading more param types would be great. I had to use json.dumps to convert our algorithm's internal model into a string and then pretend it was a string param in order to save and load it with the rest of the transformer (sketch 2 below).
 # Given that pipelinemodels can be saved, we also need the ability to export them easily for deployment on other clusters. The cluster where you train the model may be different from the one where you deploy it for predictions. A hack workaround is to use hdfs commands to copy the relevant files and directories (sketch 3 below), but it would be great if we had simple single export/import commands in pyspark to move models/pipelines/pipelinemodels easily between clusters and to allow artifacts to be stored off-cluster.
 # Creating individual parameters with getters and setters is tedious and error-prone, especially if writing docs inline too. It would be great if as much of this boiler-plate as possible could be auto-generated from a simple parameter definition. I always groan when someone asks for an extra param at the moment!
 # The ML Pipeline API seems to assume all the params lie on the estimator and none on the transformer. In the algorithm I wrapped, the model/transformer has numerous params that are specific to it rather than to the estimator. PipelineModel needs a getStages() method (just as Pipeline has) to get at the model so you can parameterise it; I had to use the undocumented .stages member instead. And if you want to call transform() on a pipelinemodel immediately after fitting it, you also need some way to set the model/transformer params in advance. I got around this by defining a params class for the estimator-only params and another for the model-only params (sketch 4 below). I made the estimator inherit from both of these classes and the model inherit from only the model-params class. The estimator then just passes through any model-specific params to the model when it creates it at the end of its fit() method. But, to distinguish the model-only params from the estimator-only ones (e.g., when listing the params on the estimator), I had to prefix all the model-only params with a common value to identify them. This works but it's clunky and ugly.
 # The algorithm I ported works naturally with individually named column inputs, but the existing ML Pipeline library prefers DenseVectors. I ended up having to support both types of input: if the DenseVector input was None I would take the data directly from the individually named columns instead (sketch 5 below). If users want to use the algorithm by itself they can use the column-based input approach; if they want to combine it with algorithms from the built-in library (e.g., StandardScaler, Binarizer, etc.) they can use the DenseVector approach instead. Again this works but is clunky, because you're having to handle two different forms of input inside the same implementation. Also, DenseVectors are limited by their inability to handle missing values.
 # Similarly, I wanted to produce multiple separate columns for the outputs of the model's transform() method, whereas most built-in algorithms seem to use a single DenseVector output column. DataFrame's withColumn() method could do with a withColumns() equivalent to make it easy to add multiple columns to a DataFrame instead of just one at a time (sketch 6 below).
 # Documentation explaining how to create a custom estimator and transformer (preferably one with transformer-specific params) would be extremely useful for people. Most of what I learned I gleaned off StackOverflow and from looking at Spark's pipeline code.
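
To make the points above more concrete, here are some rough sketches. Sketch 1 (for point 1): roughly the kind of persistence that has to be bolted on for a custom transformer. The class and column names are made up, and the DefaultParamsReadable/DefaultParamsWritable mixins shown only appear in pyspark from Spark 2.3 onwards; on 2.2.0 an equivalent has to be hand-rolled.

{code:python}
# Hypothetical toy transformer; ColumnDoubler and the column names are illustrative.
# Assumes Spark 2.3+ for the DefaultParams* mixins.
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F

class ColumnDoubler(Transformer, HasInputCol, HasOutputCol,
                    DefaultParamsReadable, DefaultParamsWritable):
    """Writes inputCol * 2 into outputCol."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(ColumnDoubler, self).__init__()
        self._set(**self._input_kwargs)

    def _transform(self, df):
        return df.withColumn(self.getOutputCol(), F.col(self.getInputCol()) * 2)

# Round trip:
# ColumnDoubler(inputCol="x", outputCol="x2").write().overwrite().save("/tmp/doubler")
# reloaded = ColumnDoubler.load("/tmp/doubler")
{code}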
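
Sketch 2 (for point 2): the string-param trick, which also shows the per-param boilerplate complained about in point 4. The mixin and param name are hypothetical.

{code:python}
# Hypothetical param mixin; "modelJson" is an illustrative name.
import json
from pyspark.ml.param import Param, Params, TypeConverters

class HasModelJson(Params):
    """Smuggles an arbitrary JSON-able internal model through a string Param so the
    standard param-based persistence can save and load it."""
    modelJson = Param(Params._dummy(), "modelJson",
                      "internal model serialised as a JSON string",
                      typeConverter=TypeConverters.toString)

    def setModelJson(self, value):
        return self._set(modelJson=value)

    def getModelJson(self):
        return self.getOrDefault(self.modelJson)

# On save:   transformer.setModelJson(json.dumps(internal_model))  # internal_model: plain dicts/lists
# On reload: internal_model = json.loads(transformer.getModelJson())
{code}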
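
Sketch 3 (for point 3): the hack workaround spelled out. Paths are illustrative and it assumes the hdfs CLI is available where this runs.

{code:python}
# Illustrative only: copy a saved PipelineModel directory off the training cluster
# so it can be shipped to, and re-loaded on, a different cluster.
import subprocess

saved_model = "hdfs:///models/my_pipeline_model"    # hypothetical save location
export_dir = "/tmp/my_pipeline_model_export"

subprocess.check_call(["hdfs", "dfs", "-get", saved_model, export_dir])
# ...transfer export_dir to the target cluster by whatever means, then on that cluster:
# subprocess.check_call(["hdfs", "dfs", "-put", export_dir, "hdfs:///models/my_pipeline_model"])
# from pyspark.ml import PipelineModel
# model = PipelineModel.load("hdfs:///models/my_pipeline_model")
{code}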
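
Sketch 4 (for point 5): the shape of the estimator-params/model-params split described in point 5. Class and param names are invented.

{code:python}
# Hypothetical skeleton of the params-class split; maxIter/threshold are placeholders.
from pyspark.ml import Estimator, Model
from pyspark.ml.param import Param, Params, TypeConverters

class _HasFitParams(Params):
    """Estimator-only params."""
    maxIter = Param(Params._dummy(), "maxIter", "maximum training iterations",
                    typeConverter=TypeConverters.toInt)

class _HasModelParams(Params):
    """Model/transformer-only params."""
    threshold = Param(Params._dummy(), "threshold", "prediction threshold",
                      typeConverter=TypeConverters.toFloat)

class MyEstimator(Estimator, _HasFitParams, _HasModelParams):
    def __init__(self):
        super(MyEstimator, self).__init__()
        self._setDefault(maxIter=10, threshold=0.5)

    def _fit(self, df):
        # ... training would happen here, driven by self.getOrDefault(self.maxIter) ...
        model = MyModel()
        # pass the model-only params through so transform() can be called straight away
        model._set(threshold=self.getOrDefault(self.threshold))
        return model

class MyModel(Model, _HasModelParams):
    def _transform(self, df):
        # ... would apply self.getOrDefault(self.threshold) here ...
        return df
{code}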
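
Sketch 5 (for point 6): how the two input styles were reconciled, in outline. The helper and column names are illustrative.

{code:python}
# Illustrative helper: if a features vector column is configured, unpack it into the
# individually named columns the algorithm expects; otherwise assume those columns
# are already present in the DataFrame.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def unpack_features(df, features_col, column_names):
    if features_col is None:
        return df  # column-based input: nothing to do
    vec_element = F.udf(lambda v, i: float(v[i]), DoubleType())
    for i, name in enumerate(column_names):
        df = df.withColumn(name, vec_element(F.col(features_col), F.lit(i)))
    return df

# e.g. unpack_features(df, "features", ["age", "height", "weight"])   # DenseVector input
#      unpack_features(df, None, ["age", "height", "weight"])         # named-column input
{code}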
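
Sketch 6 (for point 7): a trivial stand-in for the missing withColumns(); names are illustrative.

{code:python}
# Illustrative helper to add several columns in one call instead of chaining withColumn().
from functools import reduce
from pyspark.sql import functions as F

def with_columns(df, new_cols):
    """new_cols: dict mapping column name -> Column expression."""
    return reduce(lambda acc, kv: acc.withColumn(kv[0], kv[1]), new_cols.items(), df)

# e.g. with_columns(df, {"high_score": F.col("score") > 0.5,
#                        "score_pct": F.col("score") * 100})
{code}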

Hope this list will be useful for improving ML Pipelines in future versions of Spark!



> Discussion: Making MLlib APIs extensible for 3rd party libraries
> ----------------------------------------------------------------
>
>                 Key: SPARK-19498
>                 URL: https://issues.apache.org/jira/browse/SPARK-19498
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we can make MLlib DataFrame-based APIs more extensible, especially for the purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or extensive enough?
> The easy answer is to make everything public, but that would be terrible of course in the long-term.  Let's discuss what is needed and how we can present stable, sufficient, and easy-to-use APIs for 3rd-party developers.


