Posted to dev@spark.apache.org by "Ranjan, Abhinav" <ab...@gmail.com> on 2019/11/27 05:57:49 UTC
override collect_list
Hi all,
I want to collect some rows into a list using Spark's collect_list
function.
However, the number of rows going into the list overflows memory.
Is there any way to force the collected rows onto disk rather than
keeping them in memory, or, instead of collecting them as one list,
to collect them as a list of lists so the whole group is never held
in memory at once?
ex: df as:
id col1 col2
1 as sd
1 df fg
1 gh jk
2 rt ty
df.groupBy(id).agg(collect_list(struct(col1, col2)) as col3)
id col3
1 [(as,sd),(df,fg),(gh,jk)]
2 [(rt,ty)]
So if id=1 has too many rows, the list will overflow. How can I
avoid this scenario?
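The "list of lists" idea described above is usually done in Spark by salting: add a bucket column derived from a hash, collect per (id, bucket) so no single task materialises the whole group, then group by id again over the already-collected chunks. Since a Spark session cannot be assumed here, the sketch below shows the same two-level grouping in plain Python; the row layout and `n_buckets` are illustrative assumptions, not Spark API:

```python
from collections import defaultdict

def bucketed_collect(rows, n_buckets=2):
    """Two-level collect: group by (id, bucket) first, then regroup by id.

    Each inner list is bounded by roughly group_size / n_buckets, which is
    the point of salting a skewed key before collect_list in Spark.
    """
    # First level: the "salted" groupBy on (id, hash % n_buckets).
    salted = defaultdict(list)
    for key, col1, col2 in rows:
        bucket = hash((col1, col2)) % n_buckets
        salted[(key, bucket)].append((col1, col2))

    # Second level: groupBy(id) over the already-collected buckets,
    # yielding a list of lists instead of one huge list per id.
    result = defaultdict(list)
    for (key, _bucket), chunk in salted.items():
        result[key].append(chunk)
    return dict(result)

rows = [(1, "as", "sd"), (1, "df", "fg"), (1, "gh", "jk"), (2, "rt", "ty")]
grouped = bucketed_collect(rows)
```

Note that this only bounds the size of each collected chunk; if the final consumer still needs all chunks for one id on a single node, the memory problem moves rather than disappears.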
Thanks,
Abhnav
Re: override collect_list
Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Hi Abhnav,
This sounds to me like a bad design, since it isn't scalable. Would it be
possible to store the data in a database like HBase/Bigtable/Cassandra?
That would allow you to write the data from all the workers to the
database in parallel.
Cheers, Fokko
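The suggestion above amounts to Spark's foreachPartition pattern: each worker opens its own client and writes its partition's rows directly, so nothing is ever collected onto one node. A plain-Python simulation of that pattern follows; the `FakeStore` client and the partition layout are illustrative stand-ins (in real Spark this body would run inside df.foreachPartition with an HBase or Cassandra client):

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class FakeStore:
    """Stand-in for a database client; thread-safe batched appends."""
    def __init__(self):
        self._lock = Lock()
        self.rows = []

    def write_batch(self, batch):
        with self._lock:
            self.rows.extend(batch)

def write_partition(store, partition):
    # In Spark, this is the function passed to df.foreachPartition:
    # open a connection per partition, write its rows, close.
    store.write_batch(list(partition))

store = FakeStore()
partitions = [
    [(1, "as", "sd"), (1, "df", "fg")],
    [(1, "gh", "jk"), (2, "rt", "ty")],
]
# Each "worker" writes its partition concurrently, nothing is collected.
with ThreadPoolExecutor(max_workers=2) as pool:
    for part in partitions:
        pool.submit(write_partition, store, part)
```

The key property is that memory use per writer is bounded by the partition size, not the group size, so a skewed id no longer matters.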
On Wed, 27 Nov 2019 at 06:58, Ranjan, Abhinav <
abhinav.ranjan001@gmail.com> wrote:
> Hi all,
>
> I want to collect some rows in a list by using the spark's collect_list
> function.
>
> However, the no. of rows getting in the list is overflowing the memory. Is
> there any way to force the collection of rows onto the disk rather than in
> memory, or else instead of collecting it as a list, collect it as a list of
> list so as to avoid collecting it whole into the memory.
>
> *ex: df as:*
>
> *id col1 col2*
>
> 1 as sd
>
> 1 df fg
>
> 1 gh jk
>
> 2 rt ty
>
> *df.groupBy(id).agg(collect_list(struct(col1, col2) as col3)))*
>
> *id col3*
>
> 1 [(as,sd),(df,fg),(gh,jk)]
>
> 2 [(rt,ty)]
>
>
> so if id=1 is having too much rows than the list will overflow. How to
> avoid this scenario?
>
>
> Thanks,
>
> Abhnav