Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2019/02/01 07:37:18 UTC

Avoiding collect but use foreach

Hi,

This:


to_list = [list(row) for row in df.collect()]


Gives:


[[5, 1, 1, 1, 2, 1, 3, 1, 1, 0], [5, 4, 4, 5, 7, 10, 3, 2, 1, 0], [3, 1, 1,
1, 2, 2, 3, 1, 1, 0], [6, 8, 8, 1, 3, 4, 3, 7, 1, 0], [4, 1, 1, 3, 2, 1, 3,
1, 1, 0]]


I want to avoid the collect operation, but still convert the DataFrame to a
Python list of lists, just as above, for downstream operations.


Is there a way to do this, perhaps with code that performs better than
collect?


Thanks,

Aakash.

Re: Avoiding collect but use foreach

Posted by 刘虓 <ip...@gmail.com>.

Hi,
I think you can turn your Python code into a UDF and call that UDF in
foreachPartition.
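
A rough sketch of what that could look like, assuming the downstream step can run once per partition on the executors instead of on the driver. The names rows_to_lists, make_partition_handler and my_downstream are hypothetical and only for illustration; df is the DataFrame from the original post. In this sketch a plain Python function is enough, so nothing is registered as a Spark UDF. Note that foreachPartition returns nothing to the driver, so it does not build one list of lists there; each partition's rows are converted and consumed where they live.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def rows_to_lists(rows: Iterable[Sequence]) -> Iterator[List]:
    """Lazily turn each row into a plain list (a pyspark Row is a tuple subclass)."""
    for row in rows:
        yield list(row)


def make_partition_handler(
    downstream: Callable[[List[List]], None],
) -> Callable[[Iterable[Sequence]], None]:
    """Build a function that can be passed to DataFrame.foreachPartition."""

    def handle_partition(rows: Iterable[Sequence]) -> None:
        # Runs on an executor; only this partition's rows are materialised.
        batch = list(rows_to_lists(rows))
        if batch:  # skip empty partitions
            downstream(batch)

    return handle_partition


# Hypothetical usage, assuming `df` is an existing DataFrame and
# `my_downstream` is your own function that accepts a list of lists:
#
#     df.foreachPartition(make_partition_handler(my_downstream))
```

On a real cluster the downstream function has to be picklable, and any side effects happen on the executors, so appending to a list that lives on the driver will not work. If the full list really is needed on the driver, the data has to be brought there one way or another.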
