Posted to issues@spark.apache.org by "Seth Hendrickson (JIRA)" <ji...@apache.org> on 2016/02/03 21:00:42 UTC

[jira] [Commented] (SPARK-13068) Extend pyspark ml paramtype conversion to support lists

    [ https://issues.apache.org/jira/browse/SPARK-13068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131028#comment-15131028 ] 

Seth Hendrickson commented on SPARK-13068:
------------------------------------------

Extending param type conversion to lists introduces several new complexities that reveal the need for a more flexible type checking/conversion mechanism. First, checking {{type(value) == expectedType}} fails when the value is a subclass of {{expectedType}}. Second, it does not work for lists or nested lists of arbitrary depth (nor for dictionaries). Third, this style of type checking/conversion does not lend itself well to converting between Spark Vectors, numpy arrays, and lists of arbitrary data types.
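For illustration (a minimal stdlib-only sketch, not Spark code), a strict equality check rejects values that are perfectly usable as the expected type:

{code:title=strictCheck.py|borderStyle=solid}
# A strict type check rejects subclasses, while isinstance accepts them.
class MyFloat(float):
    pass

value = MyFloat(1.5)
print(type(value) == float)      # False: the strict check rejects the subclass
print(isinstance(value, float))  # True: the value behaves exactly like a float
{code}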

For example, if a parameter requires a Python list of floats, a user should be able to pass a numpy ndarray of ints without receiving a type conversion error in the JVM.

One solution to this is to add an optional field to the parameter constructor that accepts a validation function. This function accepts the value as input, checks whether it conforms to the required type, and converts it if possible, otherwise raising a clear, informative exception. An example of a {{float}} validation function could be:

{code:title=floatValidator.py|borderStyle=solid}
def floatValidator(value):
    # Already a float (or a float subclass): return it unchanged.
    if isinstance(value, float):
        return value
    # Otherwise attempt a conversion and fail with an informative error.
    try:
        return float(value)
    except (TypeError, ValueError):
        raise TypeError("Could not convert %r to float" % (value,))
{code}
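
As a rough sketch of how the optional field could be wired in (the names and signature here are hypothetical, not the actual PySpark {{Param}} API):

{code:title=paramWithValidator.py|borderStyle=solid}
class Param(object):
    def __init__(self, parent, name, doc, validator=None):
        self.parent = parent
        self.name = name
        self.doc = doc
        # Optional callable that checks/converts values set for this param.
        self.validator = validator

    def validate(self, value):
        # No validator supplied: pass the value through unchanged.
        if self.validator is None:
            return value
        return self.validator(value)

# Hypothetical usage:
# threshold = Param(parent, "threshold", "threshold in binary classification",
#                   validator=floatValidator)
{code}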

For primitive types we could create standard validation functions to be re-used; further, these validation functions can often be inferred automatically without having to pass them explicitly to the constructor. However, if more complex types require non-standard validation, the interface is flexible enough to allow a custom validation function (a rough sketch of a re-usable list validator is shown below). I am working on a PR with this interface, but would really appreciate feedback on the viability of this approach, or on alternative approaches that might be better.
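
For example, a re-usable validator for a list of floats might look roughly like the following (an illustrative sketch only; {{toListFloat}} is a hypothetical name, and numpy handling assumes numpy is available):

{code:title=listFloatValidator.py|borderStyle=solid}
import numpy as np

def toListFloat(value):
    # Accept numpy arrays by converting them to plain Python lists first.
    if isinstance(value, np.ndarray):
        value = value.tolist()
    # Accept any list or tuple whose elements are convertible to float.
    if isinstance(value, (list, tuple)):
        try:
            return [float(v) for v in value]
        except (TypeError, ValueError):
            raise TypeError("Could not convert %r to a list of floats" % (value,))
    raise TypeError("Expected a list, tuple, or numpy array, got %s" % type(value))
{code}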

cc [~yanboliang] [~josephkb] [~holdenk] 

> Extend pyspark ml paramtype conversion to support lists
> -------------------------------------------------------
>
>                 Key: SPARK-13068
>                 URL: https://issues.apache.org/jira/browse/SPARK-13068
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>            Reporter: holdenk
>            Priority: Trivial
>
> In SPARK-7675 we added type conversion for PySpark ML params. We should follow up and support param type conversion for lists and nested structures as required. This currently blocks giving all PySpark ML params type information.


