Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2015/04/07 00:52:12 UTC
[jira] [Created] (SPARK-6728) Improve performance of py4j for large bytearray
Davies Liu created SPARK-6728:
---------------------------------
Summary: Improve performance of py4j for large bytearray
Key: SPARK-6728
URL: https://issues.apache.org/jira/browse/SPARK-6728
Project: Spark
Issue Type: Improvement
Components: PySpark
Reporter: Davies Liu
PySpark relies on py4j to transfer function arguments and return values between Python and the JVM, and it is very slow to pass a large bytearray (larger than 10M).
In MLlib, it's possible to have a Vector with more than 100M bytes, which will need a few GB of memory and may crash.
The reason is that py4j uses a text protocol: it encodes the bytearray as base64 and does multiple string concatenations.
A binary protocol would help a lot; created an issue for py4j: https://github.com/bartdag/py4j/issues/159
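As an illustration of the overhead described above (a sketch, not Spark or py4j code): base64-encoding a large bytearray, as py4j's text protocol does, inflates it by about one third and produces an extra full copy of the data, before any string concatenation even starts.

```python
import base64

# Hypothetical payload standing in for a large serialized MLlib Vector.
payload = bytearray(10 * 1024 * 1024)  # 10 MB of zero bytes

# base64 output is 4 bytes for every 3 input bytes (rounded up),
# so the encoded copy is ~33% larger than the original.
encoded = base64.b64encode(bytes(payload))

print(len(payload))   # 10485760
print(len(encoded))   # 13981016, roughly 4/3 of the original
```

A binary protocol would avoid both the size inflation and the intermediate copies, which is why the linked py4j issue proposes one.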
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org