Posted to dev@spark.apache.org by "Driesprong, Fokko" <fo...@driesprong.frl> on 2019/12/01 11:22:12 UTC
Re: override collect_list
Hi Abhinav,
this sounds to me like a bad design, since it isn't scalable. Would it be
possible to store the data in a database such as HBase/Bigtable/Cassandra?
That would allow you to write the data from all the workers to the
database in parallel.
Cheers, Fokko
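The parallel-write idea can be illustrated with a plain-Python analogy (the `write_partition` helper and the two-way split are hypothetical stand-ins for what Spark would do per executor, e.g. inside `foreachPartition`; this is a sketch, not the Spark API):

```python
import csv
import os
import tempfile

# Sample rows shaped like the (id, col1, col2) table in the question.
rows = [("1", "as", "sd"), ("1", "df", "fg"), ("1", "gh", "jk"), ("2", "rt", "ty")]

outdir = tempfile.mkdtemp()

def write_partition(partition_id, partition_rows):
    # Hypothetical per-worker sink: each partition streams its rows
    # straight to its own output (here a file, in practice a database),
    # so no group is ever materialized fully in memory.
    path = os.path.join(outdir, f"part-{partition_id}.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(partition_rows)
    return path

# Pretend the framework split the rows into two partitions.
paths = [write_partition(i, part) for i, part in enumerate([rows[:2], rows[2:]])]
```

The point of the design is that each worker writes its own slice independently, so total memory use stays bounded regardless of how many rows share one id.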
Op wo 27 nov. 2019 om 06:58 schreef Ranjan, Abhinav <
abhinav.ranjan001@gmail.com>:
> Hi all,
>
> I want to collect some rows in a list by using the spark's collect_list
> function.
>
> However, the number of rows going into the list is overflowing memory. Is
> there any way to force the collection of rows onto disk rather than into
> memory, or, instead of collecting a single list, collect a list of lists
> so as to avoid holding the whole thing in memory at once?
>
> *ex: df as:*
>
> *id col1 col2*
>
> 1 as sd
>
> 1 df fg
>
> 1 gh jk
>
> 2 rt ty
>
> *df.groupBy("id").agg(collect_list(struct("col1", "col2")).as("col3"))*
>
> *id col3*
>
> 1 [(as,sd),(df,fg),(gh,jk)]
>
> 2 [(rt,ty)]
>
>
> so if id=1 has too many rows, the list will overflow. How can I avoid
> this scenario?
>
>
> Thanks,
>
> Abhinav
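For reference, the grouping in the quoted example can be simulated in plain Python with `itertools.groupby` (a local sketch of what `collect_list(struct(col1, col2))` produces; the memory issue in the question is that each key's full list must fit in one executor's memory):

```python
from itertools import groupby
from operator import itemgetter

# Rows shaped like the (id, col1, col2) table in the question.
rows = [(1, "as", "sd"), (1, "df", "fg"), (1, "gh", "jk"), (2, "rt", "ty")]

# groupby requires its input to be sorted by the grouping key.
rows.sort(key=itemgetter(0))

# One list per id, mirroring collect_list's output.
collected = {
    key: [(c1, c2) for _, c1, c2 in group]
    for key, group in groupby(rows, key=itemgetter(0))
}
# collected == {1: [("as", "sd"), ("df", "fg"), ("gh", "jk")],
#               2: [("rt", "ty")]}
```

Just as here the dict value for key 1 holds every matching row at once, Spark's `collect_list` builds the entire per-key array in memory, which is why a heavily skewed key can blow up a single executor.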