Posted to user@spark.apache.org by Subash Prabakar <su...@gmail.com> on 2019/06/12 17:05:52 UTC
Spark Dataframe NTILE function
Hi,
I am running the Spark DataFrame NTILE function over a huge dataset - it
spills a lot of data while sorting and eventually fails.
The data is roughly 80 million records, about 4 GB in size (not sure
whether that is serialized or deserialized) - I am calculating NTILE(10) for
all of these records, ordered by one metric.
A few stats are below. I need help finding alternatives - or has anyone
benchmarked the highest load this function can handle?
The snapshot below shows the NTILE calculation for two columns separately -
each runs, and the complete data ends up in one final partition; that is,
the window function moves everything to a single partition to compute NTILE
- which is 80M records in my case.
[image: Screen Shot 2019-06-13 at 12.51.56 AM.jpg]
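For context (not from the original thread): NTILE(n) needs a total ordering of
all rows, which is why an unpartitioned window collapses everything into one
partition. A minimal plain-Python sketch of the bucketing arithmetic, assuming
standard SQL NTILE semantics (the first N mod n buckets each get one extra row):

```python
def ntile(n, total_rows):
    """Assign NTILE(n) bucket numbers (1..n) to rows 0..total_rows-1,
    assuming the rows are already globally sorted by the ordering metric.
    The first (total_rows % n) buckets hold one extra row, matching SQL NTILE."""
    base, extra = divmod(total_rows, n)
    buckets = []
    bucket, count = 1, 0
    size = base + (1 if bucket <= extra else 0)
    for _ in range(total_rows):
        buckets.append(bucket)
        count += 1
        if count == size and bucket < n:
            bucket += 1
            count = 0
            size = base + (1 if bucket <= extra else 0)
    return buckets
```

Because each row's bucket depends on its global rank, the computation cannot be
split across partitions without first establishing that global order - which is
exactly the single-partition sort described above.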
Executor memory is 8 GB, with a shuffle/storage memory fraction of 0.8, so
roughly 5.5 GB is usable.
The stage-level metrics for those 80M records show the following:
*Shuffle read size / records*
[image: Screen Shot 2019-06-13 at 12.51.33 AM.jpg]
Is there an alternative, or is it simply not feasible to perform this
operation with Spark SQL functions?
Thanks,
Subash