Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/07/24 12:43:03 UTC
[jira] [Reopened] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends
[ https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon reopened SPARK-21177:
----------------------------------
I am reopening this as I can reproduce:
{code}
def printTimeTaken(str: String, f: () => Unit): Unit = {
  val start = System.nanoTime()
  f()
  val end = System.nanoTime()
  val timetaken = end - start
  import scala.concurrent.duration._
  println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
}

for (i <- 1 to 1000) {
  printTimeTaken("time to append to hive:", () => Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"))
}
{code}
{code}
...
Time taken for time to append to hive: is 166
Time taken for time to append to hive: is 155
Time taken for time to append to hive: is 164
...
Time taken for time to append to hive: is 374
Time taken for time to append to hive: is 360
Time taken for time to append to hive: is 377
{code}
{code}
scala> sql("SELECT count(*) from t1").show()
+--------+
|count(1)|
+--------+
| 2000|
+--------+
{code}
I then ran the same loop, but writing with {{format("hive")}} into a table pre-created in Hive:
{code}
hive> create table t1 (value bigint);
{code}
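For reference, the corresponding write call would look like the following. This is a sketch, not taken verbatim from the session above; it assumes a spark-shell session built with Hive support (so the implicits for {{toDF()}}) and reuses the {{printTimeTaken}} helper defined earlier:
{code}
// Sketch (assumes spark-shell with Hive support and printTimeTaken in scope).
// Same append loop as before, but forcing the Hive SerDe path via format("hive")
// into the pre-created table t1.
for (i <- 1 to 1000) {
  printTimeTaken("time to append to hive:",
    () => Seq(1, 2).toDF().write.format("hive").mode("append").saveAsTable("t1"))
}
{code}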
Output:
{code}
...
Time taken for time to append to hive: is 593
Time taken for time to append to hive: is 587
Time taken for time to append to hive: is 580
...
Time taken for time to append to hive: is 506
Time taken for time to append to hive: is 511
Time taken for time to append to hive: is 507
{code}
{code}
scala> sql("SELECT count(*) from t1").show()
+--------+
|count(1)|
+--------+
| 2000|
+--------+
{code}
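As an aside on why per-append time could grow linearly: if each {{saveAsTable}} append re-reads or rewrites metadata for every previously written file, the work done by the n-th append is proportional to n. The cost model below is an assumption for illustration, not taken from Spark's internals; it is plain Scala with no Spark dependency:

```scala
// Toy model (assumption, not Spark internals): each append pays one unit of
// work per file already present, so per-append cost grows linearly with appends.
var files = List.empty[String]

def appendCost(i: Int): Int = {
  files = s"part-$i" :: files
  files.size // cost proportional to the number of files after this append
}

val costs = (1 to 5).map(appendCost)
println(costs.mkString(", ")) // 1, 2, 3, 4, 5
```

Under this model the total work after n appends is O(n^2), which matches the shape of the timings reported above.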
> df.saveAsTable slows down linearly, with number of appends
> ----------------------------------------------------------
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Prashant Sharma
>
> In short, the following shell transcript reproduces the issue.
> {code:java}
> Welcome to
> ____ __
> / __/__ ___ _____/ /__
> _\ \/ _ \/ _ `/ __/ '_/
> /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT
> /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
> }
> | | | | | | | printTimeTaken: (str: String, f: () => Unit)Unit
> scala>
> for(i <- 1 to 100000) {printTimeTaken("time to append to hive:", () => { Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> ...
> Time taken for time to append to hive: is 3055
> ...
> Time taken for time to append to hive: is 22425
> ....
> {code}
> Why does it matter?
> In a streaming job, which appends continuously, this slowdown makes it impractical to keep appending to Hive with this DataFrame operation.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)