Posted to issues@spark.apache.org by "Arman Yazdani (JIRA)" <ji...@apache.org> on 2018/02/25 08:36:00 UTC

[jira] [Commented] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

    [ https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375995#comment-16375995 ] 

Arman Yazdani commented on SPARK-21177:
---------------------------------------

I configured Spark with Hive. In my case, when I save a partitioned dataset to Hive, Spark waits about 10 minutes for the Hive metastore, and the metastore process uses 100% of one CPU thread. I changed the metastore log level to debug, and the metastore stalls right after logging the getMTable call in ObjectStore. During this 10-minute wait, Spark has no job running and is simply blocked on the Hive metastore. The wait grows as the number of partitions grows.

> df.saveAsTable slows down linearly, with number of appends
> ----------------------------------------------------------
>
>                 Key: SPARK-21177
>                 URL: https://issues.apache.org/jira/browse/SPARK-21177
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Prashant Sharma
>            Priority: Major
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>       /_/
>          
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
>     val start = System.nanoTime()
>     f()
>     val end = System.nanoTime()
>     val timetaken = end - start
>     import scala.concurrent.duration._
>     println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
> printTimeTaken: (str: String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 100000) {printTimeTaken("time to append to hive:", () => { Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> ....
> {code}
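> The timing helper from the transcript can also be written as a self-contained Scala sketch, runnable outside the REPL (the Spark `saveAsTable` call is replaced here by a cheap `Thread.sleep` stand-in, since the point is only to show the measurement pattern; the helper now returns the elapsed milliseconds so callers can inspect them):
> {code:java}
> import scala.concurrent.duration._
>
> object TimingDemo {
>   // Same idea as printTimeTaken in the shell transcript: run the thunk,
>   // measure wall-clock time with System.nanoTime, and report milliseconds.
>   def printTimeTaken(str: String, f: () => Unit): Long = {
>     val start = System.nanoTime()
>     f()
>     val end = System.nanoTime()
>     val millis = (end - start).nanos.toMillis
>     println(s"Time taken for $str is $millis")
>     millis
>   }
>
>   def main(args: Array[String]): Unit = {
>     // Stand-in for the saveAsTable append, just to exercise the helper.
>     val elapsed = printTimeTaken("sleeping 50 ms", () => Thread.sleep(50))
>     assert(elapsed >= 50)
>   }
> }
> {code}
> In the real reproducer the thunk would be the append itself, e.g. {{() => Seq(1, 2).toDF().write.mode("append").saveAsTable("t1")}}, and the returned millisecond values are what grow linearly with the number of prior appends.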
> Why does it matter?
> In a streaming job, it is not practical to append to Hive with this DataFrame operation, because each append takes longer than the last.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org