Posted to user@spark.apache.org by snjv <sn...@gmail.com> on 2018/04/03 05:44:45 UTC
[Spark-sql]: DF parquet read write multiple tasks
Spark: 2.2
Number of cores: 128 (all allocated to Spark)
Filesystem: Alluxio 1.6
Block size on Alluxio: 32MB
Input1 size: 586MB (150M records, a single int column)
Input2 size: 50MB (10M records, a single int column)
Input1 is spread across 20 parquet files; each file is 29MB (one Alluxio
block per file).
Input2 is also spread across 20 parquet files; each file is 2.2MB (one
Alluxio block per file).
Operation: read the parquet files as a DataFrame.
For Input1: 120 tasks are created.
For Input2: 20 tasks are created.
How is the number of tasks calculated in each case?
Secondly, in the task details UI, some tasks show an "Input Size" of only a
few bytes while others show megabytes. On closer inspection, exactly 20
tasks have an input size of around 29MB, while the remaining 100 tasks read
only a few bytes each.
We are using parquet-cpp to generate the parquet files and then reading
them in Spark.
We want to understand why around 120 tasks are generated (we expected 20),
as it is hurting our core utilization.
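For context, here is a rough Python sketch of how Spark 2.x's file-based data
sources decide read splits. It mirrors the logic in FileSourceScanExec using
the default values of spark.sql.files.maxPartitionBytes (128MB) and
spark.sql.files.openCostInBytes (4MB); the function name num_partitions is
made up for illustration, and this is an approximation of the real code, not
a definitive implementation. With 128 cores, the per-core byte budget for the
586MB input drops the split size below the 29MB file size, so each file is
cut into multiple splits, which would explain a count near 120:

```python
# Approximate sketch (assumption, not Spark source) of Spark 2.x's
# file-splitting logic for parquet/file scans, with the default values of
# spark.sql.files.maxPartitionBytes (128MB) and
# spark.sql.files.openCostInBytes (4MB).

MB = 1024 * 1024

def num_partitions(file_sizes, parallelism,
                   max_partition_bytes=128 * MB,
                   open_cost=4 * MB):
    """Estimate how many read tasks Spark creates for a file scan."""
    # Each file is charged an "open cost" on top of its size.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // parallelism
    # Target split size: capped by maxPartitionBytes, floored by openCost.
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))

    # Cut every file into chunks of at most max_split bytes.
    splits = []
    for size in file_sizes:
        for offset in range(0, size, max_split):
            splits.append(min(max_split, size - offset))

    # Bin-pack the splits (largest first) into partitions.
    splits.sort(reverse=True)
    partitions, current = 0, 0
    for length in splits:
        if current > 0 and current + length > max_split:
            partitions += 1
            current = 0
        current += length + open_cost
    if current > 0:
        partitions += 1
    return partitions

# Input1: 20 files of ~29MB each on 128 cores -> 120 tasks
print(num_partitions([29 * MB] * 20, 128))        # 120
# Input2: 20 files of ~2.2MB each on 128 cores -> 20 tasks
print(num_partitions([int(2.2 * MB)] * 20, 128))  # 20
```

Under these defaults each 29MB file is cut into six ~5MB splits, but a
parquet file can only be read at row-group boundaries, which would also be
consistent with only 20 of the 120 tasks showing real input bytes.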
Thanks
Regards
Sanjeev