
Posted to issues@spark.apache.org by "Shockang (Jira)" <ji...@apache.org> on 2021/09/26 03:54:00 UTC

[jira] [Comment Edited] (SPARK-36843) Add an iterator method to Dataset

    [ https://issues.apache.org/jira/browse/SPARK-36843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17420199#comment-17420199 ] 

Shockang edited comment on SPARK-36843 at 9/26/21, 3:53 AM:
------------------------------------------------------------

[~lxian2] Do you mean that a single job collects all the data and returns an iterator of byte arrays?


was (Author: shockang):
[~lxian2] You mean that a job collects all data and returns an iterator of byte array.

> Add an iterator method to Dataset
> ---------------------------------
>
>                 Key: SPARK-36843
>                 URL: https://issues.apache.org/jira/browse/SPARK-36843
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Li Xian
>            Priority: Minor
>
> The current org.apache.spark.sql.Dataset#toLocalIterator submits multiple jobs when the Dataset has multiple partitions.
> In my case, I would like to collect all partitions at once to save the job scheduling cost, and also get an iterator to save memory during deserialization (instead of deserializing all rows at once, I want only one row to be deserialized at a time during iteration).
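
For illustration only, the idea in the description can be sketched outside Spark. The snippet below is plain Python, not Spark code: `collect_serialized` and `iterate_rows` are hypothetical names, and `pickle` stands in for whatever row serialization an actual implementation would use. It assumes that one "job" returns every row still in serialized form and that rows are then decoded one at a time as the caller iterates.

```python
import pickle
from typing import Any, Iterable, Iterator, List


def collect_serialized(partitions: Iterable[Iterable[Any]]) -> List[List[bytes]]:
    """Hypothetical stand-in for a single job that fetches all partitions at once.

    Every row stays serialized (one bytes object per row), so nothing is
    deserialized at collection time.
    """
    return [[pickle.dumps(row) for row in part] for part in partitions]


def iterate_rows(blobs: Iterable[Iterable[bytes]]) -> Iterator[Any]:
    """Lazily deserialize rows: only one row is decoded per iteration step."""
    for part in blobs:
        for raw in part:
            yield pickle.loads(raw)


if __name__ == "__main__":
    # Three "partitions" collected in one pass, then decoded row by row.
    collected = collect_serialized([[("a", 1), ("b", 2)], [], [("c", 3)]])
    for row in iterate_rows(collected):
        print(row)
```

The trade-off in this sketch is that all serialized bytes are held in memory at once, while at most one deserialized row exists at a time during iteration.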


