Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2015/04/07 00:52:12 UTC

[jira] [Created] (SPARK-6728) Improve performance of py4j for large bytearray

Davies Liu created SPARK-6728:
---------------------------------

             Summary: Improve performance of py4j for large bytearray
                 Key: SPARK-6728
                 URL: https://issues.apache.org/jira/browse/SPARK-6728
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
            Reporter: Davies Liu


PySpark relies on py4j to transfer function arguments and return values between Python and the JVM. Passing a large bytearray (larger than 10 MB) this way is very slow.

In MLlib, a Vector can be larger than 100 MB; transferring it will need a few GB of memory and may cause a crash.

The reason is that py4j uses a text protocol: it encodes the bytearray as base64 and performs multiple string concatenations.
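
To give a rough sense of the encoding cost alone, here is a minimal sketch that measures how much base64 inflates a bytearray. It does not reproduce py4j's internals; the helper name base64_overhead is hypothetical and only the standard library is used. The extra copies made by repeated string concatenation would come on top of this.

```python
import base64


def base64_overhead(num_bytes):
    """Return (raw_size, encoded_size) for a zero-filled payload.

    Hypothetical helper for illustration only; it is not part of py4j.
    """
    payload = bytearray(num_bytes)
    # b64encode emits 4 output characters for every 3 input bytes
    # (padding the last group), so the text form is about 33% larger.
    encoded = base64.b64encode(payload)
    return len(payload), len(encoded)


if __name__ == "__main__":
    raw, enc = base64_overhead(3 * 1024 * 1024)
    print("raw bytes:", raw)
    print("base64 bytes:", enc)
    print("ratio:", enc / float(raw))
```

The encoded text is a separate object from the original payload, so both are held in memory at once, before any further copies are made while the message is assembled.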

A binary protocol would help a lot. I created an issue for py4j: https://github.com/bartdag/py4j/issues/159



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
