Posted to issues@spark.apache.org by "Bo Meng (JIRA)" <ji...@apache.org> on 2016/01/08 00:12:39 UTC

[jira] [Commented] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

    [ https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088354#comment-15088354 ] 

Bo Meng commented on SPARK-12691:
---------------------------------

I do not believe this is a bug. "unionAll" is equivalent to "UNION ALL" in SQL; its meaning is described here:
http://www.w3schools.com/sql/sql_union.asp

So in your code, "dd" grows with each iteration because it keeps duplicates. That is why the time keeps growing: "dd" is used again in the next union with another data frame.

If you do not want duplicates, you can call "distinct" on "dd".
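
To make the difference between the two semantics concrete, here is a minimal sketch that uses plain Scala collections as a stand-in for DataFrame rows. It does not call the Spark API; the object name "UnionAllSketch" and the helpers "unionAll" and "union" are hypothetical names chosen only for this illustration, and it says nothing about timing.

```scala
// Plain Scala collections stand in for DataFrame rows (an assumption made
// for illustration only; no Spark classes are used here).
object UnionAllSketch {
  // Eleven (A, B) rows, mirroring (0 to 10).zipWithIndex from the report.
  val rows: Seq[(Int, Int)] = (0 to 10).zipWithIndex

  // UNION ALL semantics: concatenate and keep duplicates.
  def unionAll[T](a: Seq[T], b: Seq[T]): Seq[T] = a ++ b

  // UNION semantics: concatenate, then drop duplicates.
  def union[T](a: Seq[T], b: Seq[T]): Seq[T] = (a ++ b).distinct

  def main(args: Array[String]): Unit = {
    // 101 identical inputs, matching the `0 to 100` range in the report.
    val inputs = Seq.fill(101)(rows)

    val all = inputs.reduce((a, b) => unionAll(a, b))
    val deduped = inputs.reduce((a, b) => union(a, b))

    println(s"unionAll rows: ${all.size}")   // 101 * 11 = 1111
    println(s"union rows: ${deduped.size}")  // 11
  }
}
```

With duplicates kept, the accumulated result grows by eleven rows per round; with duplicates dropped after each step it stays at eleven rows.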

> Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12691
>                 URL: https://issues.apache.org/jira/browse/SPARK-12691
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
>         Environment: Tested in Spark 1.3 and 1.4.
>            Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> It printed the output below. Each unionAll appears to redo all the previous unionAll calls, so its time is its own cost plus that of all earlier rounds, which, loosely speaking, makes each unionAll look like a "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union the underlying RDDs of the Dataframes.
> ----- output start ----
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll took 6 ms
> Round 29 unionAll took 6 ms
> Round 30 unionAll took 6 ms
> Round 31 unionAll took 6 ms
> Round 32 unionAll took 7 ms
> Round 33 unionAll took 7 ms
> Round 34 unionAll took 7 ms
> Round 35 unionAll took 7 ms
> Round 36 unionAll took 7 ms
> Round 37 unionAll took 8 ms
> Round 38 unionAll took 8 ms
> Round 39 unionAll took 8 ms
> Round 40 unionAll took 8 ms
> Round 41 unionAll took 9 ms
> Round 42 unionAll took 9 ms
> Round 43 unionAll took 9 ms
> Round 44 unionAll took 9 ms
> Round 45 unionAll took 9 ms
> Round 46 unionAll took 9 ms
> Round 47 unionAll took 9 ms
> Round 48 unionAll took 9 ms
> Round 49 unionAll took 10 ms
> Round 50 unionAll took 10 ms
> Round 51 unionAll took 10 ms
> Round 52 unionAll took 10 ms
> Round 53 unionAll took 11 ms
> Round 54 unionAll took 11 ms
> Round 55 unionAll took 11 ms
> Round 56 unionAll took 12 ms
> Round 57 unionAll took 12 ms
> Round 58 unionAll took 12 ms
> Round 59 unionAll took 12 ms
> Round 60 unionAll took 12 ms
> Round 61 unionAll took 12 ms
> Round 62 unionAll took 13 ms
> Round 63 unionAll took 13 ms
> Round 64 unionAll took 13 ms
> Round 65 unionAll took 13 ms
> Round 66 unionAll took 14 ms
> Round 67 unionAll took 14 ms
> Round 68 unionAll took 14 ms
> Round 69 unionAll took 14 ms
> Round 70 unionAll took 14 ms
> Round 71 unionAll took 14 ms
> Round 72 unionAll took 14 ms
> Round 73 unionAll took 14 ms
> Round 74 unionAll took 15 ms
> Round 75 unionAll took 15 ms
> Round 76 unionAll took 15 ms
> Round 77 unionAll took 15 ms
> Round 78 unionAll took 16 ms
> Round 79 unionAll took 16 ms
> Round 80 unionAll took 16 ms
> Round 81 unionAll took 16 ms
> Round 82 unionAll took 17 ms
> Round 83 unionAll took 17 ms
> Round 84 unionAll took 17 ms
> Round 85 unionAll took 17 ms
> Round 86 unionAll took 17 ms
> Round 87 unionAll took 18 ms
> Round 88 unionAll took 17 ms
> Round 89 unionAll took 18 ms
> Round 90 unionAll took 18 ms
> Round 91 unionAll took 18 ms
> Round 92 unionAll took 18 ms
> Round 93 unionAll took 18 ms
> Round 94 unionAll took 19 ms
> Round 95 unionAll took 19 ms
> Round 96 unionAll took 20 ms
> Round 97 unionAll took 20 ms
> Round 98 unionAll took 20 ms
> Round 99 unionAll took 20 ms
> Round 100 unionAll took 20 ms
> 100 unionAll took totally 1337 ms
> ----- output end ----


