Posted to dev@spark.apache.org by "Driesprong, Fokko" <fo...@driesprong.frl> on 2019/12/01 11:22:12 UTC

Re: override collect_list

Hi Abhinav,

This sounds to me like a bad design, since it isn't scalable. Would it be
possible to store all the data in a database like HBase/Bigtable/Cassandra?
That would allow you to write the data from all the workers to the
database in parallel.
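
For example, something along these lines (a minimal sketch, assuming the
DataStax spark-cassandra-connector is on the classpath and that the
keyspace "ks" and table "events" already exist; those names are
placeholders):

  // Write the rows straight from the executors, in parallel, instead of
  // first aggregating them into one per-id list on a single node.
  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("table" -> "events", "keyspace" -> "ks"))
    .mode("append")
    .save()

Each Spark task then writes its own slice of the data directly to the
database, so no executor ever has to hold a whole group in memory.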

Cheers, Fokko

On Wed, Nov 27, 2019 at 06:58 Ranjan, Abhinav <
abhinav.ranjan001@gmail.com> wrote:

> Hi all,
>
> I want to collect some rows into a list by using Spark's collect_list
> function.
>
> However, the number of rows going into the list is overflowing memory. Is
> there any way to force the collected rows onto disk rather than keeping
> them in memory, or else, instead of collecting them as one list, collect
> them as a list of lists so that the whole thing never has to be held in
> memory at once?
>
> ex: df as:
>
> id    col1    col2
> 1     as      sd
> 1     df      fg
> 1     gh      jk
> 2     rt      ty
>
> df.groupBy(id).agg(collect_list(struct(col1, col2)) as col3)
>
> id    col3
> 1     [(as,sd),(df,fg),(gh,jk)]
> 2     [(rt,ty)]
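>
> A runnable version of the example above (a minimal sketch, assuming a
> spark-shell session, where the spark variable and its implicits are
> already in scope):
>
> import org.apache.spark.sql.functions.{collect_list, struct}
> import spark.implicits._
>
> val df = Seq(
>   (1, "as", "sd"),
>   (1, "df", "fg"),
>   (1, "gh", "jk"),
>   (2, "rt", "ty")
> ).toDF("id", "col1", "col2")
>
> // Build one array of (col1, col2) structs per id.
> df.groupBy("id")
>   .agg(collect_list(struct($"col1", $"col2")).as("col3"))
>   .show(false)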
>
>
> So if id=1 has too many rows, the list will overflow. How can I
> avoid this scenario?
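>
> For instance, something like this sketch is what I am after (the output
> path is a placeholder): each id's rows would land in their own directory
> on disk instead of being materialised as one in-memory list:
>
> // Group rows by id on disk rather than in memory: partitionBy("id")
> // writes a separate directory per id, streamed out by the executors.
> df.repartition($"id")
>   .sortWithinPartitions($"col1")
>   .write
>   .partitionBy("id")
>   .parquet("/tmp/grouped_by_id")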
>
>
> Thanks,
>
> Abhinav
>