Posted to issues@spark.apache.org by "Hossein Falaki (JIRA)" <ji...@apache.org> on 2016/10/05 21:00:24 UTC
[jira] [Created] (SPARK-17790) Support for parallelizing/creating DataFrame on data larger than 2GB
Hossein Falaki created SPARK-17790:
--------------------------------------
Summary: Support for parallelizing/creating DataFrame on data larger than 2GB
Key: SPARK-17790
URL: https://issues.apache.org/jira/browse/SPARK-17790
Project: Spark
Issue Type: Story
Components: SparkR
Affects Versions: 2.0.1
Reporter: Hossein Falaki
This issue is a more specific version of SPARK-17762.
Supporting arguments larger than 2GB is more general and arguably harder, because the limit exists both in R and in the JVM (we receive the data as a byte array, and a JVM array is capped at 2GB). However, to support parallelizing R data.frames larger than 2GB, we can do what PySpark does.
PySpark uses files to transfer bulk data between Python and the JVM. This has worked well for the large community of Spark Python users.
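To illustrate the pattern being proposed, here is a minimal Python sketch of a file-based handoff: the sender serializes records to a temp file in length-prefixed chunks, so no single in-memory byte array ever has to hold the whole dataset, and the receiver (the JVM side, in Spark's case) streams the chunks back. The function names and framing format are hypothetical, not Spark's actual protocol.

```python
import os
import pickle
import tempfile

def write_partitions(records, path, chunk_size=1000):
    # Serialize records to a file in chunks; each chunk is framed
    # with a 4-byte big-endian length prefix. No single byte array
    # ever holds the full dataset, sidestepping the 2GB array limit.
    with open(path, "wb") as f:
        for i in range(0, len(records), chunk_size):
            payload = pickle.dumps(records[i:i + chunk_size])
            f.write(len(payload).to_bytes(4, "big"))
            f.write(payload)

def read_partitions(path):
    # The receiving side streams the file back one framed chunk
    # at a time, again without materializing one giant buffer.
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            out.extend(pickle.loads(f.read(int.from_bytes(header, "big"))))
    return out

if __name__ == "__main__":
    data = list(range(10))
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "payload.bin")
        write_partitions(data, path, chunk_size=3)
        print(read_partitions(path) == data)
```

The same idea applies to SparkR: the R process would write the serialized data.frame to a local file and pass only the file path to the JVM, which reads it in bounded-size pieces.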
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org