You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@spark.apache.org by gu...@apache.org on 2019/04/02 05:13:06 UTC

[spark] branch master updated: [SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new 0b150f8  [SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn
0b150f8 is described below

commit 0b150f833c226b3d7111d48e24013e4bf5af2900
Author: Marco Gaido <ma...@gmail.com>
AuthorDate: Tue Apr 2 14:12:47 2019 +0900

    [SPARK-26224][SQL] Advice the user when creating many project on subsequent calls to withColumn
    
    ## What changes were proposed in this pull request?
    
    We have seen many cases when users make several subsequent calls to `withColumn` on a Dataset. This leads now to the generation of a lot of `Project` nodes on the top of the plan, with serious problem which can lead also to `StackOverflowException`s.
    
    The PR improves the doc of `withColumn`, in order to advise the user to avoid this pattern and do something different, ie. a single select with all the column he/she needs.
    
    ## How was this patch tested?
    
    NA
    
    Closes #23285 from mgaido91/SPARK-26224.
    
    Authored-by: Marco Gaido <ma...@gmail.com>
    Signed-off-by: Hyukjin Kwon <gu...@apache.org>
---
 sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
index 69c2f61..477e873 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -2151,6 +2151,11 @@ class Dataset[T] private[sql](
    * `column`'s expression must only refer to attributes supplied by this Dataset. It is an
    * error to add a column that refers to some other Dataset.
    *
+   * Please notice that this method introduces a `Project`. This means that using it in loops in
+   * order to add several columns can generate very big plans which can cause huge performance
+   * issues and even `StackOverflowException`s. A much better alternative use `select` with the
+   * list of columns to add.
+   *
    * @group untypedrel
    * @since 2.0.0
    */


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org