You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Allen Liang (JIRA)" <ji...@apache.org> on 2016/01/10 06:48:39 UTC

[jira] [Issue Comment Deleted] (SPARK-12691) Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner

     [ https://issues.apache.org/jira/browse/SPARK-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Liang updated SPARK-12691:
--------------------------------
    Comment: was deleted

(was: Hi Bo Meng,

I understand your point, but I don't think this has something to do with size of dataframe. Or how do you explain this behavior doesn't happen when we do the same thing to RDDs. If you union all the RDDs in dataframes in above sample code, you'll find each round of RDD union takes a relatively constant time (NOT growing at all), which is expected.
The code attached is a simple sample to reproduce this issue and the time costed may not seem to be terrible here. However, in our real case, where we we have 202 dataframes (which all happen to be empty dataframe (meaning size is zero)) to unionAll, and it took around over 20 minutes to complete, which obviously is not acceptable.

To workaround this issue we actually directly unioned all the RDDs in those 202 dataframes and convert back the final RDD to dataframe in the end. And that whole workaround took around only 20+ seconds to complete. Compared to 20+ minutes when we unionAll 202 empty dataframes, this already is a huge improvement.

I think there has to be something wrong in the multiple dataframe unionAll or let's say there has to be something we can improve here.)

> Multiple unionAll on Dataframe seems to cause repeated calculations in a "Fibonacci" manner
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12691
>                 URL: https://issues.apache.org/jira/browse/SPARK-12691
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1
>         Environment: Tested in Spark 1.3 and 1.4.
>            Reporter: Allen Liang
>
> Multiple unionAll on Dataframe seems to cause repeated calculations. Here is the sample code to reproduce this issue.
> val dfs = for (i<-0 to 100) yield {
>   val df = sc.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
>   df
> }
> var i = 1
> val s1 = System.currentTimeMillis()
> dfs.reduce{(a,b)=>{
>   val t1 = System.currentTimeMillis()
>   val dd = a unionAll b
>   val t2 = System.currentTimeMillis()
>   println("Round " + i + " unionAll took " + (t2 - t1) + " ms")
>   i = i + 1
>   dd
>   }
> }
> val s2 = System.currentTimeMillis()
> println((i - 1) + " unionAll took totally " + (s2 - s1) + " ms")
> And it printed as follows. And as you can see, it looks like each unionAll seems to redo all the previous unionAll and therefore took self time plus all previous time, which, not precisely speaking, makes each unionAll look like a "Fibonacci" action.
> BTW, this behaviour doesn't happen if I directly union all the RDDs in Dataframes.
> ----- output start ----
> Round 1 unionAll took 1 ms
> Round 2 unionAll took 1 ms
> Round 3 unionAll took 1 ms
> Round 4 unionAll took 1 ms
> Round 5 unionAll took 1 ms
> Round 6 unionAll took 1 ms
> Round 7 unionAll took 1 ms
> Round 8 unionAll took 2 ms
> Round 9 unionAll took 2 ms
> Round 10 unionAll took 2 ms
> Round 11 unionAll took 3 ms
> Round 12 unionAll took 3 ms
> Round 13 unionAll took 3 ms
> Round 14 unionAll took 3 ms
> Round 15 unionAll took 3 ms
> Round 16 unionAll took 4 ms
> Round 17 unionAll took 4 ms
> Round 18 unionAll took 4 ms
> Round 19 unionAll took 4 ms
> Round 20 unionAll took 4 ms
> Round 21 unionAll took 5 ms
> Round 22 unionAll took 5 ms
> Round 23 unionAll took 5 ms
> Round 24 unionAll took 5 ms
> Round 25 unionAll took 5 ms
> Round 26 unionAll took 6 ms
> Round 27 unionAll took 6 ms
> Round 28 unionAll took 6 ms
> Round 29 unionAll took 6 ms
> Round 30 unionAll took 6 ms
> Round 31 unionAll took 6 ms
> Round 32 unionAll took 7 ms
> Round 33 unionAll took 7 ms
> Round 34 unionAll took 7 ms
> Round 35 unionAll took 7 ms
> Round 36 unionAll took 7 ms
> Round 37 unionAll took 8 ms
> Round 38 unionAll took 8 ms
> Round 39 unionAll took 8 ms
> Round 40 unionAll took 8 ms
> Round 41 unionAll took 9 ms
> Round 42 unionAll took 9 ms
> Round 43 unionAll took 9 ms
> Round 44 unionAll took 9 ms
> Round 45 unionAll took 9 ms
> Round 46 unionAll took 9 ms
> Round 47 unionAll took 9 ms
> Round 48 unionAll took 9 ms
> Round 49 unionAll took 10 ms
> Round 50 unionAll took 10 ms
> Round 51 unionAll took 10 ms
> Round 52 unionAll took 10 ms
> Round 53 unionAll took 11 ms
> Round 54 unionAll took 11 ms
> Round 55 unionAll took 11 ms
> Round 56 unionAll took 12 ms
> Round 57 unionAll took 12 ms
> Round 58 unionAll took 12 ms
> Round 59 unionAll took 12 ms
> Round 60 unionAll took 12 ms
> Round 61 unionAll took 12 ms
> Round 62 unionAll took 13 ms
> Round 63 unionAll took 13 ms
> Round 64 unionAll took 13 ms
> Round 65 unionAll took 13 ms
> Round 66 unionAll took 14 ms
> Round 67 unionAll took 14 ms
> Round 68 unionAll took 14 ms
> Round 69 unionAll took 14 ms
> Round 70 unionAll took 14 ms
> Round 71 unionAll took 14 ms
> Round 72 unionAll took 14 ms
> Round 73 unionAll took 14 ms
> Round 74 unionAll took 15 ms
> Round 75 unionAll took 15 ms
> Round 76 unionAll took 15 ms
> Round 77 unionAll took 15 ms
> Round 78 unionAll took 16 ms
> Round 79 unionAll took 16 ms
> Round 80 unionAll took 16 ms
> Round 81 unionAll took 16 ms
> Round 82 unionAll took 17 ms
> Round 83 unionAll took 17 ms
> Round 84 unionAll took 17 ms
> Round 85 unionAll took 17 ms
> Round 86 unionAll took 17 ms
> Round 87 unionAll took 18 ms
> Round 88 unionAll took 17 ms
> Round 89 unionAll took 18 ms
> Round 90 unionAll took 18 ms
> Round 91 unionAll took 18 ms
> Round 92 unionAll took 18 ms
> Round 93 unionAll took 18 ms
> Round 94 unionAll took 19 ms
> Round 95 unionAll took 19 ms
> Round 96 unionAll took 20 ms
> Round 97 unionAll took 20 ms
> Round 98 unionAll took 20 ms
> Round 99 unionAll took 20 ms
> Round 100 unionAll took 20 ms
> 100 unionAll took totally 1337 ms
> ----- output end ----



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org