Posted to user@spark.apache.org by Subash Prabakar <su...@gmail.com> on 2019/06/12 17:05:52 UTC

Spark Dataframe NTILE function

Hi,

I am running the Spark DataFrame NTILE window function over a large
dataset - it spills a lot of data while sorting and eventually fails.

The data is roughly 80 million records, about 4G in size (not sure
whether that is serialized or deserialized) - I am calculating NTILE(10)
over all these records, ordered by one metric.
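
For reference, a minimal sketch of roughly what I am running - the
DataFrame and the "metric" column name below are placeholders, not my
real schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile, rand}

val spark = SparkSession.builder.appName("ntile-test").getOrCreate()

// Stand-in for my real input: one metric column (names are placeholders).
val df = spark.range(0, 1000000).withColumn("metric", rand())

// No partitionBy on the window, so Spark shuffles everything into a
// single partition to build one global ordering for NTILE(10).
val w = Window.orderBy(col("metric"))
val withDecile = df.withColumn("decile", ntile(10).over(w))
withDecile.show(5)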

A few stats are below. I need help finding alternatives - or has anyone
benchmarked the highest load this function can handle?

The snapshot below shows the NTILE calculation for two columns
separately - each runs, and the final single partition is where the
complete data ends up. In other words, the window function moves
everything into one final partition to compute NTILE - which is 80M
records in my case.

[image: Screen Shot 2019-06-13 at 12.51.56 AM.jpg]

Executor memory is 8G, with shuffle.storageMemory of 0.8, so about 5.5G
is usable.

In the stage-level metrics I can see the full 80M records landing in
that one task, as shown below:

*Shuffle read size / records*
[image: Screen Shot 2019-06-13 at 12.51.33 AM.jpg]

Is there any alternative, or is it simply not feasible to perform this
operation with Spark SQL functions?
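
In case it helps frame the question: one direction I was wondering
about (not sure whether an approximate result is acceptable for us) is
computing decile boundaries with approxQuantile and then bucketing
against them, so there is no single-partition global sort. A rough
sketch, reusing the placeholder df/metric names from above and a
guessed relative error of 0.001:

import org.apache.spark.ml.feature.Bucketizer

// Approximate alternative: compute the nine inner decile cut points
// with approxQuantile (runs as an aggregation, no single huge
// partition), then assign each row to a bucket.
val probs = (1 to 9).map(_ / 10.0).toArray
val cuts = df.stat.approxQuantile("metric", probs, 0.001).distinct.sorted
val splits = Double.NegativeInfinity +: cuts :+ Double.PositiveInfinity

val bucketizer = new Bucketizer()
  .setInputCol("metric")
  .setOutputCol("decile_approx")   // 0-based bucket index, 0 to 9
  .setSplits(splits)

val approxDeciles = bucketizer.transform(df)
approxDeciles.show(5)

Not sure if that is considered a reasonable substitute for NTILE, so
any guidance is appreciated.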

Thanks,
Subash