Posted to issues@spark.apache.org by "Hossein Falaki (JIRA)" <ji...@apache.org> on 2016/10/05 21:00:24 UTC

[jira] [Created] (SPARK-17790) Support for parallelizing/creating DataFrame on data larger than 2GB

Hossein Falaki created SPARK-17790:
--------------------------------------

             Summary: Support for parallelizing/creating DataFrame on data larger than 2GB
                 Key: SPARK-17790
                 URL: https://issues.apache.org/jira/browse/SPARK-17790
             Project: Spark
          Issue Type: Story
          Components: SparkR
    Affects Versions: 2.0.1
            Reporter: Hossein Falaki


This issue is a more specific version of SPARK-17762.
Supporting arguments larger than 2GB in general is arguably harder, because the limit exists on both the R and JVM sides: data is received as a single ByteArray, and Java arrays are indexed by 32-bit signed integers, which caps them at 2^31 - 1 bytes (about 2GB). However, to support parallelizing R data.frames that are larger than 2GB, we can do what PySpark does.

PySpark uses files to transfer bulk data between Python and the JVM, and this approach has worked well for the large community of Spark Python users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org