Posted to user@spark.apache.org by pnpritchard <ni...@falkonry.com> on 2015/07/14 22:30:21 UTC

DataFrame.withColumn() recomputes columns even after cache()

Hi!

I am seeing some unexpected behavior with regard to cache() on DataFrames.
Here goes:

In my Scala application, I have created a DataFrame that I run multiple
operations on. The DataFrame is expensive to recompute, so I call cache()
right after creating it.

I notice that the cache works as expected for some operations (e.g. count and
filter). However, after I add a column with withColumn(), operations on the
resulting DataFrame recompute the original (cached) columns.

Is this the expected behavior? Is there a workaround for this?

Thanks,
Nick


P.S. Here is an example program to highlight this:
```
    // Assumes the usual spark-shell sc and sqlContext are in scope
    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._

    // Example UDFs that println when called
    val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
    val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

    // Initial dataset
    val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")

    // Add a column by applying the twice udf
    val df2 = df1.withColumn("twice", twice($"value"))
    df2.cache()
    df2.count() // prints Computed: twice(1)

    // Add a column by applying the triple udf
    val df3 = df2.withColumn("triple", triple($"value"))
    df3.cache()
    df3.count() // prints Computed: twice(1) and Computed: triple(1),
                // i.e. twice is recomputed even though df2 is cached
```
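
One way to sanity-check this (a diagnostic sketch, not part of the original
program) is to print the physical plan before triggering the count. If the
cached df2 were being used, I'd expect to see an in-memory scan node in the
plan (e.g. InMemoryColumnarTableScan; exact node names vary by Spark version)
rather than the twice() projection:
```
    // Diagnostic sketch: inspect df3's physical plan. When a cached parent is
    // actually picked up, the plan should contain an in-memory scan node
    // instead of re-evaluating the twice() projection.
    df3.explain()
```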






Re: DataFrame.withColumn() recomputes columns even after cache()

Posted by pnpritchard <ni...@falkonry.com>.
I was able to work around this by converting the DataFrame to an RDD and then
back to a DataFrame. This seems very odd to me, so any insight would be much
appreciated!

Thanks,
Nick


P.S. Here's the updated code with the workaround:
```
    // Assumes the usual spark-shell sc and sqlContext are in scope
    import org.apache.spark.sql.functions.udf
    import sqlContext.implicits._

    // Example UDFs that println when called
    val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 }
    val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 }

    // Initial dataset
    val df1 = sc.parallelize(Seq(("a", 1))).toDF("id", "value")

    // Add a column by applying the twice udf, then rebuild the DataFrame
    // from its RDD and schema (the workaround)
    val df2 = {
      val tmp = df1.withColumn("twice", twice($"value"))
      sqlContext.createDataFrame(tmp.rdd, tmp.schema)
    }
    df2.cache()
    df2.count() // prints Computed: twice(1)

    // Add a column by applying the triple udf
    val df3 = df2.withColumn("triple", triple($"value"))
    df3.cache()
    df3.count() // prints Computed: triple(1) only; twice is not recomputed
```
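
My guess at why the round-trip helps (speculation on my part):
createDataFrame(tmp.rdd, tmp.schema) builds df2 directly from the
already-computed rows, so the twice() call is no longer part of df2's lineage
when df3 is defined on top of it. A small helper along those lines (the name
materialize is my own, not a Spark API):
```
    import org.apache.spark.sql.DataFrame

    // Hypothetical helper (my own name, not a Spark API): rebuild a DataFrame
    // from its current rows and schema, so later transformations no longer
    // reference the projection that contains the earlier UDF call.
    def materialize(df: DataFrame): DataFrame =
      df.sqlContext.createDataFrame(df.rdd, df.schema)

    // Usage, equivalent to the df2 block above:
    // val df2 = materialize(df1.withColumn("twice", twice($"value")))
```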


