Posted to user@spark.apache.org by snjv <sn...@gmail.com> on 2018/04/03 05:44:45 UTC

[Spark-sql]: DF parquet read write multiple tasks

Spark : 2.2
Number of cores : 128 (all allocated to Spark)
Filesystem : Alluxio 1.6
Block size on Alluxio : 32MB

Input1 size : 586MB (150M records, a single int column)
Input2 size : 50MB (10M records, a single int column)
 
Input1 is spread across 20 parquet files; each file is 29MB (1 Alluxio block per file).
Input2 is also spread across 20 parquet files; each file is 2.2MB (1 Alluxio block per file).

Operation : Read parquet as DF
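Roughly like this (the Alluxio paths below are placeholders, not our real ones):

    val input1 = spark.read.parquet("alluxio://master:19998/data/input1")   // 20 files, ~29MB each
    val input2 = spark.read.parquet("alluxio://master:19998/data/input2")   // 20 files, ~2.2MB each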

For Input1 : number of tasks created is 120
For Input2 : number of tasks created is 20

How is the number of tasks calculated in each case?
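Our rough understanding of how Spark 2.2 sizes file splits (via spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes) is sketched below; please correct us if we have this wrong:

    // Defaults in Spark 2.2: maxPartitionBytes = 128MB, openCostInBytes = 4MB
    val defaultMaxSplitBytes = 128L * 1024 * 1024
    val openCostInBytes      = 4L * 1024 * 1024
    val defaultParallelism   = 128L                                   // our allocated cores
    // Input1: 20 files of ~29MB; each file also "costs" openCostInBytes
    val totalBytes    = 20L * 29 * 1024 * 1024 + 20 * openCostInBytes
    val bytesPerCore  = totalBytes / defaultParallelism               // ~5.2MB
    val maxSplitBytes = math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
    // => each 29MB file would be cut into ceil(29MB / ~5.2MB) = 6 splits, 20 * 6 = 120 tasks
    // For Input2, bytesPerCore is below 4MB, so maxSplitBytes = 4MB and each 2.2MB file stays one split => 20 tasks

Is that the calculation that is actually happening here?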

Secondly, if I look at the task details in the UI, some tasks show an "Input Size" of just a few bytes while for others it is in MB.
Looking closer, exactly 20 tasks have an input size of around 29MB, while the remaining 100 tasks read only a few bytes each.

We are using parquet-cpp to generate the parquet files and then reading them in Spark.

We want to understand why around 120 tasks are generated when we expected 20, as it is hurting our core utilization.
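If this is just the split-size calculation at work, would the right fix be to raise spark.sql.files.openCostInBytes (so these ~29MB files are not split) or simply to coalesce after the read? A sketch of what we have in mind (the value and path are guesses/placeholders):

    // Option 1: raise the per-file open cost so ~29MB files are not split (value is a guess)
    spark.conf.set("spark.sql.files.openCostInBytes", 32L * 1024 * 1024)
    // Option 2: collapse the extra partitions after reading
    val df = spark.read.parquet("alluxio://master:19998/data/input1").coalesce(20)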

Thanks

Regards
Sanjeev 



