Posted to user@spark.apache.org by Rishi Shah <ri...@gmail.com> on 2019/10/20 04:56:41 UTC

[pyspark 2.4.3] nested window function performance

Hi All,

I have a use case where I need to apply nested window functions on a
data frame to get the final set of columns. Example:

from pyspark.sql import Window, functions as F

# total of val per col1 partition
w1 = Window.partitionBy('col1')
df = df.withColumn('sum1', F.sum('val').over(w1))

# total of val per (col1, col2) partition
w2 = Window.partitionBy('col1', 'col2')
df = df.withColumn('sum2', F.sum('val').over(w2))

# total of val per (col1, col2, col3) partition
w3 = Window.partitionBy('col1', 'col2', 'col3')
df = df.withColumn('sum3', F.sum('val').over(w3))

None of these three partitions is particularly large, but the overall dataset
is about 2 TB of snappy-compressed Parquet, and the job throws a lot of
out-of-memory errors.

I would like some advice on whether nested window functions are a good idea
in PySpark. I wanted to avoid multiple filter + join steps to get to the
final state, as joins can create a huge amount of shuffle.

Any suggestions would be appreciated!

-- 
Regards,

Rishi Shah

Re: [pyspark 2.4.3] nested window function performance

Posted by Georg Heiler <ge...@gmail.com>.
No, because you shuffle the data again each time (you always partition by a
different window).
Instead: could you choose a single window (w3, with more columns = finer
grained) and then filter out records to achieve the same result?
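
For illustration, here is one rough reading of that idea (a sketch, not
Georg's exact code; the join/filter back to row level at the end is an
assumption): aggregate at the finest grain first so the full 2 TB is shuffled
only once, then layer the coarser sums on the much smaller aggregated frame.

from pyspark.sql import Window, functions as F

# Shuffle the full data set only once: aggregate to the finest grain.
agg = df.groupBy('col1', 'col2', 'col3').agg(F.sum('val').alias('sum3'))

# The coarser totals are then window sums over the much smaller aggregated frame.
w2 = Window.partitionBy('col1', 'col2')
w1 = Window.partitionBy('col1')
agg = (agg
       .withColumn('sum2', F.sum('sum3').over(w2))
       .withColumn('sum1', F.sum('sum3').over(w1)))

# If per-row output is needed, join agg back on ['col1', 'col2', 'col3']
# (or filter here instead), depending on the use case.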

Or instead, something like:

df.groupBy('a', 'b', 'c').agg(F.sort_array(F.collect_list(F.struct('foo', 'bar', 'baz'))))

and then a UDF which performs your desired aggregation on the collected list.
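
For illustration, a rough sketch of that pattern (the group and struct
columns are Georg's placeholders, and the logic inside the UDF is only an
assumption):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# The UDF receives the sorted list of structs for each (a, b, c) group.
@F.udf(returnType=DoubleType())
def my_agg(rows):
    # placeholder: replace with the actual nested aggregation over the rows
    return float(sum(r['baz'] for r in rows))

result = (df
    .groupBy('a', 'b', 'c')
    .agg(F.sort_array(F.collect_list(F.struct('foo', 'bar', 'baz'))).alias('rows'))
    .withColumn('agg_val', my_agg('rows')))

This collapses everything into a single shuffle, but a Python UDF over this
much data can itself become the bottleneck, so it is worth benchmarking both
options.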

Best,
Georg


Re: [pyspark 2.4.3] nested window function performance

Posted by Rishi Shah <ri...@gmail.com>.
Hi All,

Any suggestions?

Thanks,
-Rishi


-- 
Regards,

Rishi Shah