Posted to issues@spark.apache.org by "Fabien (Jira)" <ji...@apache.org> on 2022/02/04 17:17:00 UTC
[jira] [Updated] (SPARK-38111) Retrieve a Spark dataframe as Arrow batches

     [ https://issues.apache.org/jira/browse/SPARK-38111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabien updated SPARK-38111:
---------------------------
    Description: 
Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow batches?

I have a fairly large dataset on my cluster, so I cannot collect it with [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--], which downloads everything at once and saturates my JVM memory.

Since Arrow is becoming a standard for transferring large datasets and Spark already uses Arrow extensively, is there a way to transfer my Spark dataframe as Arrow batches?

This would be ideal for processing the data batch by batch without saturating memory.
 

I am looking for an API like this (in Java):

 
{code:java}
// Hypothetical API sketch: collectAsArrowStream() is the method being asked for
var stream = dataframe.collectAsArrowStream();
while (stream.hasNextBatch()) {
    var batch = stream.getNextBatch();
    // process the Arrow batch
}
{code}

It would be even better if I could split the dataframe into several streams, so that I can download and process them in parallel.
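
For illustration only, the batch-by-batch consumption pattern described above can be sketched in plain Java, without Spark or Arrow. Everything in this sketch is a hypothetical stand-in rather than an existing Spark API: the BatchIterator class, the batch size of 3, and the in-memory list that plays the role of the dataframe's rows.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical helper (not part of Spark or Arrow): groups the elements of
 * any row iterator into fixed-size batches, so that the consumer only has
 * to hold one batch at a time.
 */
final class BatchIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> rows;
    private final int batchSize;

    BatchIterator(Iterator<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rows = rows;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        return rows.hasNext();
    }

    @Override
    public List<T> next() {
        if (!rows.hasNext()) {
            throw new NoSuchElementException();
        }
        // Pull at most batchSize rows from the underlying iterator.
        List<T> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && rows.hasNext()) {
            batch.add(rows.next());
        }
        return batch;
    }
}

final class BatchDemo {
    public static void main(String[] args) {
        // In-memory stand-in for a dataframe's rows (assumption for the demo).
        List<Integer> rows = List.of(1, 2, 3, 4, 5, 6, 7);
        Iterator<List<Integer>> stream = new BatchIterator<>(rows.iterator(), 3);
        while (stream.hasNext()) {
            List<Integer> batch = stream.next();
            // Process one batch, then drop the reference before fetching the next.
            System.out.println(batch);
        }
    }
}
```

In Spark, a row iterator of this kind might come from Dataset.toLocalIterator(), which according to its documentation consumes as much memory as the largest partition. That path is row-based rather than Arrow-based, so it only illustrates the consumption pattern and is not the API requested here.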

  was:
Using the Java API, is there a way to efficiently retrieve a dataframe as Arrow batches ?

I have a pretty large dataset on my cluster so I cannot collect it using [collectAsList|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#collectAsList--] which download every thing at once and saturate the my JVM memory

Seeing that Arrow is becoming a standard to transfer large datasets and that Spark uses a lot Arrow, is there a way to transfer my Spark dataframe with Arrow batches ?

This would be ideal to process the data batch per batch and avoid saturating the memory.
 

I am looking for an API like this (in Java)

 
{code:java}
var stream = dataframe.collectAsArrowStream()
while (stream.hasNextBatch()) {
    var batch = stream.getNextBatch()
    // do some stuff with the arrow batch
}
{code}

It would be even better if I can split the dataframe into several streams so I can download and process it in parallel


> Retrieve a Spark dataframe as Arrow batches
> -------------------------------------------
>
>                 Key: SPARK-38111
>                 URL: https://issues.apache.org/jira/browse/SPARK-38111
>             Project: Spark
>          Issue Type: Question
>          Components: Java API
>    Affects Versions: 3.2.0
>         Environment: Java 11
> Spark 3
>            Reporter: Fabien
>            Priority: Minor



