
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/10/29 08:25:49 UTC

[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2995

    [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API

    Create several helper functions that call the MLlib Java API, converting the arguments to Java types and the return value to a Python object automatically. This greatly simplifies serialization in the MLlib Python API.
    
    After this change, the MLlib Python API no longer needs to deal with serialization details, which makes it easier to add new APIs.
    
    cc @mengxr
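
The convert-call-convert pattern described above can be illustrated with a minimal, self-contained sketch. It does not use Spark or Py4J: `BackendList`, `to_backend`, `from_backend`, `call_backend`, and `backend_scale` are hypothetical stand-ins invented for illustration, not names from the patch.

```python
class BackendList:
    """Hypothetical stand-in for a list that lives on the Java side."""

    def __init__(self, items):
        self.items = list(items)


def to_backend(obj):
    """Convert a Python argument into its backend form (assumed rules)."""
    if isinstance(obj, (list, tuple)):
        return BackendList(to_backend(x) for x in obj)
    return obj  # numbers and strings pass through unchanged


def from_backend(obj):
    """Convert a backend return value into a plain Python object."""
    if isinstance(obj, BackendList):
        return [from_backend(x) for x in obj.items]
    return obj


def call_backend(func, *args):
    """Convert every argument, call the function, convert the result."""
    return from_backend(func(*[to_backend(a) for a in args]))


def backend_scale(values, factor):
    """Hypothetical backend function: only understands BackendList."""
    return BackendList(v * factor for v in values.items)


scaled = call_backend(backend_scale, [1.0, 2.0, 3.0], 2)
```

With a single entry point like `call_backend`, a wrapper for a new backend function needs no conversion code of its own, which is the simplification the description refers to.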

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark cleanup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2995.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2995
    
----
commit 731331fdafe9ce6e4bf24dc1e6667942e1e59587
Author: Davies Liu <da...@databricks.com>
Date:   2014-10-29T07:19:33Z

    simplify serialization in MLlib Python API

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-60953150
  
      [Test build #22454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22454/consoleFull) for   PR 2995 at commit [`43743e5`](https://github.com/apache/spark/commit/43743e59690f4fbce66c0dee8fa788d2bdce4a22).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2995#discussion_r19641421
  
    --- Diff: python/pyspark/mllib/common.py ---
    @@ -0,0 +1,148 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +import py4j.protocol
    +from py4j.protocol import Py4JJavaError
    +from py4j.java_gateway import JavaObject
    +from py4j.java_collections import MapConverter, ListConverter, JavaArray, JavaList
    +
    +from pyspark import RDD, SparkContext
    +from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
    +
    +
    +# Hack for support float('inf') in Py4j
    +_old_smart_decode = py4j.protocol.smart_decode
    +
    +_float_str_mapping = {
    +    'nan': 'NaN',
    +    'inf': 'Infinity',
    +    '-inf': '-Infinity',
    +}
    +
    +
    +def _new_smart_decode(obj):
    +    if isinstance(obj, float):
    +        s = unicode(obj)
    +        return _float_str_mapping.get(s, s)
    +    return _old_smart_decode(obj)
    +
    +py4j.protocol.smart_decode = _new_smart_decode
    +
    +
    +_picklable_classes = [
    +    'LinkedList',
    +    'SparseVector',
    +    'DenseVector',
    +    'DenseMatrix',
    +    'Rating',
    +    'LabeledPoint',
    +]
    +
    +
    +# this will call the MLlib version of pythonToJava()
    +def _to_java_object_rdd(rdd, cache=False):
    +    """ Return an JavaRDD of Object by unpickling
    +
    +    It will convert each Python object into Java object by Pyrolite, whenever the
    +    RDD is serialized in batch or not.
    +    """
    +    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    +    if cache:
    +        rdd.cache()
    +    return rdd.ctx._jvm.SerDe.pythonToJava(rdd._jrdd, True)
    +
    +
    +def _py2java(sc, obj, cache=False):
    +    """ Convert Python object into Java """
    +    if isinstance(obj, RDD):
    +        obj = _to_java_object_rdd(obj, cache)
    +    elif isinstance(obj, SparkContext):
    +        obj = obj._jsc
    +    elif isinstance(obj, dict):
    +        obj = MapConverter().convert(obj, sc._gateway._gateway_client)
    +    elif isinstance(obj, (list, tuple)):
    +        obj = ListConverter().convert(obj, sc._gateway._gateway_client)
    +    elif isinstance(obj, JavaObject):
    +        pass
    +    elif isinstance(obj, (int, long, float, bool, basestring)):
    +        pass
    +    else:
    +        bytes = bytearray(PickleSerializer().dumps(obj))
    +        obj = sc._jvm.SerDe.loads(bytes)
    +    return obj
    +
    +
    +def _java2py(sc, r):
    +    if isinstance(r, JavaObject):
    +        clsName = r.getClass().getSimpleName()
    +        # convert RDD into JavaRDD
    +        if clsName != 'JavaRDD' and clsName.endswith("RDD"):
    +            r = r.toJavaRDD()
    +            clsName = 'JavaRDD'
    +
    +        if clsName == 'JavaRDD':
    +            jrdd = sc._jvm.SerDe.javaToPython(r)
    +            return RDD(jrdd, sc, AutoBatchedSerializer(PickleSerializer()))
    +
    +        elif isinstance(r, (JavaArray, JavaList)) or clsName in _picklable_classes:
    +            r = sc._jvm.SerDe.dumps(r)
    +
    +    if isinstance(r, bytearray):
    +        r = PickleSerializer().loads(str(r))
    +    return r
    +
    +
    +def callJavaFunc(sc, func, *args):
    +    """ Call Java Function """
    +    args = [_py2java(sc, a) for a in args]
    +    return _java2py(sc, func(*args))
    +
    +
    +def callAPI(name, *args):
    +    """ Call API in PythonMLLibAPI """
    +    sc = SparkContext._active_spark_context
    +    api = getattr(sc._jvm.PythonMLLibAPI(), name)
    +    return callJavaFunc(sc, api, *args)
    +
    +
    +def callJavaFuncWithCache(sc, func, *args):
    --- End diff --
    
    I feel the logic here is a little confusing because it goes through the args and tries to cache them where possible. It may be clearer to provide a `_cache_serialized` method and use it within each method wrapper, e.g.:
    
    ~~~
    model = callAPI("trainKMeansModel", _cache_serialized(rdd.map(_convert_to_vector)), k,
        maxIterations, runs, initializationMode)
    ~~~
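
A rough sketch of what such an explicit helper could look like follows. It is written against a hypothetical `FakeRDD` stand-in rather than the real `pyspark.RDD`, and the serializer name and `call_api` function are assumptions made only so the example runs on its own.

```python
class FakeRDD:
    """Hypothetical stand-in for pyspark.RDD; it only records state."""

    def __init__(self, data, serializer="default", cached=False):
        self.data = list(data)
        self.serializer = serializer
        self.cached = cached

    def map(self, f):
        return FakeRDD((f(x) for x in self.data), self.serializer)

    def reserialize(self, serializer):
        return FakeRDD(self.data, serializer)

    def cache(self):
        self.cached = True
        return self


def _cache_serialized(rdd, serializer="auto-batched-pickle"):
    """Reserialize first, then cache, so the cached copy is already in
    the form handed to the JVM (assumed behavior for this sketch)."""
    return rdd.reserialize(serializer).cache()


def call_api(name, *args):
    """Hypothetical stand-in for callAPI: it never caches anything and
    only reports the cached flag of each argument it received."""
    return name, [getattr(a, "cached", None) for a in args]


points = FakeRDD([[1.0, 2.0], [3.0, 4.0]])
prepared = _cache_serialized(points.map(tuple))
result = call_api("trainKMeansModel", prepared, 2, 10)
```

Here the caching decision is visible at the call site of each wrapper, instead of being made implicitly while the generic call helper walks over its arguments.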



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-60966673
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22454/
    Test PASSed.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-61184818
  
      [Test build #22570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22570/consoleFull) for   PR 2995 at commit [`8fa6ec6`](https://github.com/apache/spark/commit/8fa6ec6bfe06e77fc41633873f20193318ccf114).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-61198260
  
    **[Test build #22570 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22570/consoleFull)**     for PR 2995 at commit [`8fa6ec6`](https://github.com/apache/spark/commit/8fa6ec6bfe06e77fc41633873f20193318ccf114)     after a configured wait of `120m`.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-61198265
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22570/
    Test FAILed.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-60888200
  
      [Test build #22439 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22439/consoleFull) for   PR 2995 at commit [`731331f`](https://github.com/apache/spark/commit/731331fdafe9ce6e4bf24dc1e6667942e1e59587).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class JavaModelWrapper(object):`
      * `class JavaVectorTransformer(JavaModelWrapper, VectorTransformer):`
      * `class StandardScalerModel(JavaVectorTransformer):`
      * `class IDFModel(JavaVectorTransformer):`
      * `class Word2VecModel(JavaVectorTransformer):`
      * `class MatrixFactorizationModel(JavaModelWrapper):`
      * `class MultivariateStatisticalSummary(JavaModelWrapper):`
      * `class DecisionTreeModel(JavaModelWrapper):`




[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2995#discussion_r19641414
  
    --- Diff: python/pyspark/mllib/common.py ---
    @@ -0,0 +1,148 @@
    +def callAPI(name, *args):
    --- End diff --
    
    `callAPI` could be more specific, e.g., `callMLlibFunc`.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2995#discussion_r19641400
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
    @@ -260,6 +253,21 @@ class PythonMLLibAPI extends Serializable {
       }
     
       /**
    +   * A Wrapper of MatrixFactorizationModel to provide helpfer method for Python
    +   */
    +  private[python] class MatrixFactorizationModelWrapper(model: MatrixFactorizationModel)
    +    extends MatrixFactorizationModel(model.rank, model.userFeatures, model.productFeatures) {
    +
    +    def predict(usersProducts: JavaRDD[Array[Any]]): RDD[Rating] =
    --- End diff --
    
    `usersProducts` -> `userAndProducts`



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-60966668
  
      [Test build #22454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22454/consoleFull) for   PR 2995 at commit [`43743e5`](https://github.com/apache/spark/commit/43743e59690f4fbce66c0dee8fa788d2bdce4a22).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class JavaModelWrapper(object):`
      * `class JavaVectorTransformer(JavaModelWrapper, VectorTransformer):`
      * `class StandardScalerModel(JavaVectorTransformer):`
      * `class IDFModel(JavaVectorTransformer):`
      * `class Word2VecModel(JavaVectorTransformer):`
      * `class MatrixFactorizationModel(JavaModelWrapper):`
      * `class MultivariateStatisticalSummary(JavaModelWrapper):`
      * `class DecisionTreeModel(JavaModelWrapper):`




[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-61182511
  
    Great! This simplifies things a lot. Apart from a few minor inline comments, it looks good to me.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-61198663
  
      [Test build #496 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/496/consoleFull) for   PR 2995 at commit [`8fa6ec6`](https://github.com/apache/spark/commit/8fa6ec6bfe06e77fc41633873f20193318ccf114).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2995#discussion_r19641709
  
    --- Diff: python/pyspark/mllib/common.py ---
    @@ -0,0 +1,148 @@
    +def callJavaFuncWithCache(sc, func, *args):
    --- End diff --
    
    Good idea!



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-60883555
  
      [Test build #22439 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22439/consoleFull) for   PR 2995 at commit [`731331f`](https://github.com/apache/spark/commit/731331fdafe9ce6e4bf24dc1e6667942e1e59587).
     * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2995



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2995#discussion_r19642462
  
    --- Diff: python/pyspark/mllib/common.py ---
    @@ -0,0 +1,148 @@
    +#
    --- End diff --
    
    Ok.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-60888203
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22439/



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-61205470
  
      [Test build #496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/496/consoleFull) for PR 2995 at commit [`8fa6ec6`](https://github.com/apache/spark/commit/8fa6ec6bfe06e77fc41633873f20193318ccf114).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2995#issuecomment-61183970
  
    @mengxr I have addressed your comments and am waiting for the tests.



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2995#discussion_r19641405
  
    --- Diff: python/pyspark/mllib/common.py ---
    @@ -0,0 +1,148 @@
    +#
    --- End diff --
    
    We used to have `_common.py`, which served the same purpose. Shall we add a leading underscore to the file name?



[GitHub] spark pull request: [SPARK-4124] [MLlib] [PySpark] simplify serial...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2995#discussion_r19641886
  
    --- Diff: python/pyspark/mllib/common.py ---
    @@ -0,0 +1,148 @@
    +#
    --- End diff --
    
    It's not common to use a file name that starts with an underscore unless there is a naming conflict, so I'd like to keep `common`; it makes the imports look better.

