Posted to user@spark.apache.org by KhajaAsmath Mohammed <md...@gmail.com> on 2017/05/25 21:57:04 UTC

shuffle write is very slow

Hi,

I am converting a Hive job to a Spark job. I have tested on a small data
set, and the logic produces the same results in Hive and Spark.

When I started testing on large data, Spark is very slow compared to
Hive.

The shuffle write is taking a long time. Any suggestions?

I am registering a temporary table in Spark and then overwriting the
partitioned Hive table from that temporary table:

    // Register the transposed DataFrame as a temporary table, then
    // insert-overwrite the partitioned Hive table from it.
    dataframe_transposed.registerTempTable(srcTable)
    import sqlContext._
    import sqlContext.implicits._
    val query = s"INSERT OVERWRITE TABLE ${destTable} SELECT * FROM ${srcTable}"
    println(query)
    logger.info(s"Executing Query ${query}")
    sqlContext.sql(query)
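
In case it helps to see what I mean, this is the kind of variation I was
considering but have not tried yet: repartitioning the DataFrame by the
Hive partition column before the insert, and tuning the shuffle
parallelism. This is only a minimal sketch; "part_col" is a placeholder
for the table's real partition column, and 400 is an arbitrary value for
spark.sql.shuffle.partitions (the default is 200), not something I have
validated.

    import org.apache.spark.sql.functions.col

    // Assumption: "part_col" stands in for whatever column the Hive
    // table is partitioned by; replace it with the real column name.
    // Raising spark.sql.shuffle.partitions spreads the shuffle write
    // across more, smaller tasks; 400 here is just an illustrative value.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Repartition by the partition column so rows destined for the same
    // Hive partition land in the same task before the overwrite.
    val repartitioned = dataframe_transposed.repartition(col("part_col"))
    repartitioned.registerTempTable(srcTable)
    sqlContext.sql(s"INSERT OVERWRITE TABLE ${destTable} SELECT * FROM ${srcTable}")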

The total size of the DataFrame is around 190 GB, and in this case the
Spark job runs forever, while the Hive job completes in about 4 hours.

Thanks,
Asmath.