Posted to issues@spark.apache.org by "Xusen Yin (JIRA)" <ji...@apache.org> on 2016/01/15 15:09:39 UTC

[jira] [Updated] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

     [ https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xusen Yin updated SPARK-12834:
------------------------------
    Description: 
According to the Ser/De code on the Python side:

{code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
def _java2py(sc, r, encoding="bytes"):
    if isinstance(r, JavaObject):
        clsName = r.getClass().getSimpleName()
        # convert RDD into JavaRDD
        if clsName != 'JavaRDD' and clsName.endswith("RDD"):
            r = r.toJavaRDD()
            clsName = 'JavaRDD'

        if clsName == 'JavaRDD':
            jrdd = sc._jvm.SerDe.javaToPython(r)
            return RDD(jrdd, sc)

        if clsName == 'DataFrame':
            return DataFrame(r, SQLContext.getOrCreate(sc))

        if clsName in _picklable_classes:
            r = sc._jvm.SerDe.dumps(r)
        elif isinstance(r, (JavaArray, JavaList)):
            try:
                r = sc._jvm.SerDe.dumps(r)
            except Py4JJavaError:
                pass  # not pickable

    if isinstance(r, (bytearray, bytes)):
        r = PickleSerializer().loads(bytes(r), encoding=encoding)
    return r
{code}

We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, then deserialize them with PickleSerializer on the Python side. However, there is no need for this inefficient round trip. Instead, we can convert them directly with type conversion, e.g. list(JavaArray) or list(JavaList). In addition, there is a problem with the Ser/De of Scala Array, as described in https://issues.apache.org/jira/browse/SPARK-12780
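
A minimal sketch of the type-conversion idea, not the actual patch. The helper name java_collection_to_py and the FakeJavaArray stand-in are hypothetical; in PySpark the collection types would be JavaArray and JavaList from py4j.java_collections, which support Python iteration. The recursion into nested collections is an assumption of this sketch, not something the description asks for.

```python
from collections.abc import Sequence


def java_collection_to_py(r, collection_types):
    """Convert a sequence-like Java wrapper to a Python list by iteration.

    In PySpark, collection_types would be (JavaArray, JavaList); it is a
    parameter here (an assumption) so the sketch runs without a JVM.
    """
    if isinstance(r, collection_types):
        # Plain iteration replaces the SerDe.dumps / PickleSerializer round trip.
        # Recursing into nested collections is an assumption of this sketch.
        return [java_collection_to_py(x, collection_types) for x in r]
    return r


class FakeJavaArray(Sequence):
    """Hypothetical stand-in that only mimics the sequence protocol."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        return self._items[index]

    def __len__(self):
        return len(self._items)


labels = FakeJavaArray(["a", "b", FakeJavaArray(["c", "d"])])
converted = java_collection_to_py(labels, (FakeJavaArray,))
print(converted)
```

Values that are not one of the given collection types pass through unchanged, so a branch like this could sit where the JavaArray/JavaList case of _java2py currently calls SerDe.dumps.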

  was:
According to the Ser/De code in Python side:

{code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
  def _java2py(sc, r, encoding="bytes"):
    if isinstance(r, JavaObject):
        clsName = r.getClass().getSimpleName()
        # convert RDD into JavaRDD
        if clsName != 'JavaRDD' and clsName.endswith("RDD"):
            r = r.toJavaRDD()
            clsName = 'JavaRDD'

        if clsName == 'JavaRDD':
            jrdd = sc._jvm.SerDe.javaToPython(r)
            return RDD(jrdd, sc)

        if clsName == 'DataFrame':
            return DataFrame(r, SQLContext.getOrCreate(sc))

        if clsName in _picklable_classes:
            r = sc._jvm.SerDe.dumps(r)
        elif isinstance(r, (JavaArray, JavaList)):
            try:
                r = sc._jvm.SerDe.dumps(r)
            except Py4JJavaError:
                pass  # not pickable

    if isinstance(r, (bytearray, bytes)):
        r = PickleSerializer().loads(bytes(r), encoding=encoding)
    return r
{code}

We use SerDe.sumps to serialize JavaArray and JavaList in PythonMLLibAPI, then deserialize them with PickleSerializer in Python side. However, there is no need to transform them in such an inefficient way. Instead of it, we can use type conversion to convert them, e.g. list(JavaArray) or list(JavaList). What's more, there is an issue to Ser/De Scala Array as I said in https://issues.apache.org/jira/browse/SPARK-12780


> Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-12834
>                 URL: https://issues.apache.org/jira/browse/SPARK-12834
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Xusen Yin


