Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/04 03:32:39 UTC

[GitHub] [spark] Yikun opened a new pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Yikun opened a new pull request #32431:
URL: https://github.com/apache/spark/pull/32431


   ### What changes were proposed in this pull request?
   This PR adds support for adding multiple columns in one call to the Spark Scala/Java/Python APIs.
   - Expose `withColumns` with Map input as public API in Scala/Java
   - Add `withColumns` in PySpark
   
   Adding multiple columns was also discussed in past JIRA tickets ([SPARK-12225](https://issues.apache.org/jira/browse/SPARK-12225), [SPARK-26224](https://issues.apache.org/jira/browse/SPARK-26224)) and on the [mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Multiple-columns-adding-replacing-support-in-PySpark-DataFrame-API-td31164.html).
   
   ### Why are the changes needed?
   There is a private method `withColumns` that can add columns in one pass [1]:
   https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
   
   However, it is not exposed as a public API in Scala/Java, and PySpark users can only use `withColumn` to add one column or to replace one existing column that has the same name.
   
   For example, a PySpark user who wants to add multiple columns has to call `withColumn` repeatedly:
   ```Python
   df.withColumn("key1", col("key1")).withColumn("key2", col("key2")).withColumn("key3", col("key3"))
   ```
   After this patch, the user can call `withColumns` with a map of column names to columns and add all of them in one pass:
   ```Python
   df.withColumns({"key1": col("key1"), "key2": col("key2"), "key3": col("key3")})
   ```
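To make the difference between the two call styles concrete, here is a minimal pure-Python sketch. It models a DataFrame as a list of dict rows and a column expression as a function of a row. The names `with_column` and `with_columns` are hypothetical stand-ins, not the Spark API. The only behavior assumed is what the proposed doc comment states: each new expression may refer only to columns of the input Dataset, and a column with an existing name is replaced.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Expr = Callable[[Row], object]  # toy stand-in for a Spark Column expression


def with_column(rows: List[Row], name: str, expr: Expr) -> List[Row]:
    """Toy model of one withColumn call: one projection over the rows."""
    return [{**row, name: expr(row)} for row in rows]


def with_columns(rows: List[Row], cols_map: Dict[str, Expr]) -> List[Row]:
    """Toy model of withColumns: a single projection.

    Every expression is evaluated against the ORIGINAL row, so expressions
    cannot see columns added by the same call.
    """
    return [
        {**row, **{name: expr(row) for name, expr in cols_map.items()}}
        for row in rows
    ]


rows = [{"id": 1, "key1": 10}, {"id": 2, "key1": 20}]

# Chained style: three separate projections in this toy model, one per column.
chained = with_column(rows, "key1", lambda r: r["key1"] + 1)
chained = with_column(chained, "key2", lambda r: r["id"] * 2)
chained = with_column(chained, "key3", lambda r: r["id"] * 3)

# Map style: one projection that replaces key1 and appends key2 and key3.
single = with_columns(
    rows,
    {
        "key1": lambda r: r["key1"] + 1,
        "key2": lambda r: r["id"] * 2,
        "key3": lambda r: r["id"] * 3,
    },
)
```

The two results agree here because no expression depends on a column created by another one. In the chained form a later call can read a column added by an earlier call, which the single-pass form does not allow.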
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, this PR exposes `withColumns` as a public API in Scala/Java and adds a `withColumns` API in PySpark.
   
   
   ### How was this patch tested?
   - Added new tests for adding multiple columns; they pass.
   - Existing tests pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833539205


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42730/
   



[GitHub] [spark] SaurabhChawla100 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SaurabhChawla100 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r625585169



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def with_columns(self, col_names, cols):

Review comment:
       Similar to this change in Python, I think the same change is needed in DataFrame.R to expose with_columns in R as well.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833401254


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138206/
   



[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800325976



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":

Review comment:
       I admit that this is not very pretty in a Python context.





[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627872861



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2395,6 +2395,36 @@ class Dataset[T] private[sql](
    */
   def withColumn(colName: String, col: Column): DataFrame = withColumns(Seq(colName), Seq(col))
 
+  /**
+   * (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns
+   * that has the same names.
+   *
+   * `colsMap` is a map of column name and column, the column must only refer to attributes
+   * supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
+   *
+   * @group untypedrel
+   * @since 3.2.0
+   */
+  def withColumns(colsMap: Map[String, Column]): DataFrame = {
+    val colNames = colsMap.flatMap{ case (colName, _) => Seq(colName) }.toSeq

Review comment:
       colsMap.keys.toSeq?
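The reviewer's point is that mapping each entry to a one-element sequence, only to flatten it again, is equivalent to taking the keys directly. As a Python analogue (a sketch, not the PR's code; `split_cols_map` is a hypothetical helper), a dict can be split into parallel name and column lists, relying on `keys()` and `values()` iterating in the same order:

```python
from typing import Dict, List, Tuple, TypeVar

V = TypeVar("V")


def split_cols_map(cols_map: Dict[str, V]) -> Tuple[List[str], List[V]]:
    """Hypothetical helper: split a name -> column map into parallel lists."""
    return list(cols_map.keys()), list(cols_map.values())


cols_map = {"key1": "expr1", "key2": "expr2", "key3": "expr3"}

# Analogue of the flatMap in the diff: wrap each key in a list, then flatten.
roundabout_names = [n for name, _ in cols_map.items() for n in [name]]

# Analogue of the suggested colsMap.keys.toSeq.
direct_names = list(cols_map.keys())

names, cols = split_cols_map(cols_map)
```

Both forms produce the same names in the same order; the direct form just says so more plainly.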





[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833968881


   I'm okay. cc @ueshin @viirya @BryanCutler @zero323 FYI



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832363556


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42668/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832363556


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42668/
   



[GitHub] [spark] github-actions[bot] commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-899132820


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!



[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029599807


   @HyukjinKwon Sure, I will reopen and rebase it soon.



[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1031091002


   Hm, leveraging keyword arguments is actually interesting. Still, I think I prefer `withColumns`, because we should also think about the Scala-side API. Maybe we can get this API in first and think about a Pythonic variant afterwards.
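
One way to read this plan: keep the dict-based `withColumns` as the API shared with Scala, and layer a keyword-argument form on top of it later. A minimal sketch of such a layer follows. `with_columns_kwargs` and the `RecordingFrame` stub are hypothetical names used only for illustration; the stub stands in for a DataFrame so that the snippet runs without Spark.

```python
from typing import Any, Dict, List


def with_columns_kwargs(df: Any, **named_cols: Any) -> Any:
    """Hypothetical Pythonic wrapper: forward keyword arguments as one dict."""
    return df.withColumns(dict(named_cols))


class RecordingFrame:
    """Stub standing in for a DataFrame; it only records what it was given."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def withColumns(self, cols_map: Dict[str, Any]) -> "RecordingFrame":
        self.calls.append(cols_map)
        return self


frame = RecordingFrame()
with_columns_kwargs(frame, plus_one="id + 1", times_two="id * 2")

# Names that are not valid identifiers cannot be written as literal keywords,
# but they can still be passed by unpacking a dict.
with_columns_kwargs(frame, **{"my col": "id"})
```

The second call shows the main limitation of a keyword-only form: column names with spaces or other non-identifier characters need the dict anyway, which is one argument for keeping the map-based method as the base API.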



[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805231319



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same names.
+
+        The colsMap is a map of column name and column, the column must only refer to attributes
+        supplied by this Dataset. It is an error to add columns that refer to some other Dataset.

Review comment:
       Sure, I will add a note to the `Parameters` section for `colsMap`:
   ```
   Currently, only single map is supported.
   ```
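
With the `*colsMap` signature from the diff above, the method accepts any number of positional arguments, so the single-map restriction has to be checked at run time. Below is a sketch of such a check, written as a standalone function. `unpack_cols_map` is a hypothetical name, and the exception types are an assumption, not taken from the PR.

```python
from typing import Any, Dict


def unpack_cols_map(*colsMap: Dict[str, Any]) -> Dict[str, Any]:
    """Return the single map passed as varargs, or raise if the call is invalid."""
    if len(colsMap) != 1:
        raise ValueError(
            f"expected exactly one map of column names to columns, got {len(colsMap)}"
        )
    cols_map = colsMap[0]
    if not isinstance(cols_map, dict):
        raise TypeError(f"colsMap must be a dict, got {type(cols_map).__name__}")
    return cols_map


# A single dict is accepted and returned unchanged.
accepted = unpack_cols_map({"key1": "expr1", "key2": "expr2"})
```

Keeping the varargs signature leaves room to accept several maps later without changing how existing single-map calls are written.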





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834426502


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138244/
   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833545790


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42730/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833738968


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138208/
   



[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832327840


   **[Test build #138147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138147/testReport)** for PR 32431 at commit [`ba7d4e0`](https://github.com/apache/spark/commit/ba7d4e0c1e16af44ee59bab1eeaabf150bcece72).



[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r625680973



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def with_columns(self, col_names, cols):

Review comment:
       Let's keep it consistent with the Scala side for now. Note that camelCase doesn't violate PEP8 (e.g., `threading` in Python).





[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627250235



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def with_columns(self, col_names, cols):

Review comment:
       @SaurabhChawla100 Thanks for the reminder. I am not very familiar with R, so I am going to submit it in a separate PR.





[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565


   Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the docstring.



[GitHub] [spark] github-actions[bot] closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431


   



[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029850928


   Since we are revisiting this, I have a counter-proposal: instead of exposing a new API, let's just improve the UX of what we already have. For example, if we tweak `select` to support keyword arguments like this:
   
   ```patch
   diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
   index ee68865c98..00a7a4543e 100644
   --- a/python/pyspark/sql/dataframe.py
   +++ b/python/pyspark/sql/dataframe.py
   @@ -1941,14 +1941,18 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
            return Column(jc)
    
        @overload
   -    def select(self, *cols: "ColumnOrName") -> "DataFrame":
   +    def select(self, *cols: "ColumnOrName", **acols: Column) -> "DataFrame":
            ...
    
        @overload
   -    def select(self, __cols: Union[List[Column], List[str]]) -> "DataFrame":
   +    def select(self, __cols: Union[List[Column], List[str]], **acols: Column) -> "DataFrame":
            ...
    
   -    def select(self, *cols: "ColumnOrName") -> "DataFrame":  # type: ignore[misc]
   +    def select(  # type: ignore[misc]
   +        self,
   +        *cols: "ColumnOrName",
   +        **namedCols: Column,
   +    ) -> "DataFrame":
            """Projects a set of expressions and returns a new :class:`DataFrame`.
    
            .. versionadded:: 1.3.0
   @@ -1959,6 +1963,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
                column names (string) or expressions (:class:`Column`).
                If one of the column names is '*', that column is expanded to include all columns
                in the current :class:`DataFrame`.
   +        namedCols : :class:`Column`
   +            expressions to select under the given aliases.
    
            Examples
            --------
   @@ -1968,8 +1974,17 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
            [Row(name='Alice', age=2), Row(name='Bob', age=5)]
            >>> df.select(df.name, (df.age + 10).alias('age')).collect()
            [Row(name='Alice', age=12), Row(name='Bob', age=15)]
   -        """
   -        jdf = self._jdf.select(self._jcols(*cols))
   +        >>> df.select(
   +        ...     "age",
   +        ...     name_lower=lower("name"),
   +        ...     name_upper=upper("name"),
   +        ...     age_plus_one=col("age") + 1,
   +        ... ).limit(1).collect()
   +        [Row(age=2, name_lower='alice', name_upper='ALICE', age_plus_one=3)]
   +        """
   +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
   +            cols = cols[0]  # type: ignore[assignment]
   +        jdf = self._jdf.select(self._jcols(*cols, *[c.alias(a) for a, c in namedCols.items()]))
            return DataFrame(jdf, self.sql_ctx)
    
        @overload
   
   ```
   
   we will be able to use it like this:
   
   ```python
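   # Assumes the patched `select` above; `spark` is an active SparkSession.
   from pyspark.sql.functions import col, log, rand

   df = spark.range(10).select(rand(42).alias("id"))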
   
   df.select(
       "id", plus_one=col("id") + 1, times_two=col("id") * 2, log_id=log("id")
   ).show()
   ```
   
   making it similar to `dplyr::mutate`.
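
The core of the patch is its last changed line: positional columns are passed through unchanged, and each keyword argument becomes its value aliased to the keyword name. Here is a minimal sketch of that ordering and aliasing logic. It uses a hypothetical `FakeColumn` stub in place of `Column`, and a hypothetical free function `select_args` in place of the method, so that it runs without Spark.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class FakeColumn:
    """Stub for a Column: an expression string plus an optional alias."""

    expr: str
    name: str = ""

    def alias(self, name: str) -> "FakeColumn":
        return FakeColumn(self.expr, name)


ColumnOrName = Union[str, FakeColumn]


def select_args(*cols: ColumnOrName, **namedCols: FakeColumn) -> List[ColumnOrName]:
    """Mirror the argument handling in the patch: unwrap a single list, then
    append every keyword argument aliased to its keyword name."""
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    return [*cols, *[c.alias(a) for a, c in namedCols.items()]]


projected = select_args(
    "id",
    plus_one=FakeColumn("id + 1"),
    times_two=FakeColumn("id * 2"),
)
```

Python 3.6 and later preserve keyword-argument order (PEP 468), so the aliased columns appear in the order they were written. Unlike `withColumns`, a `select` built this way keeps only the columns that are listed, so existing columns have to be named explicitly or selected with `"*"`.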



[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834443022


   I still think it is a bit redundant, but I'm fine with it, if others find this useful.



[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627874876



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap):
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same name.

Review comment:
       name -> names





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800326456



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":

Review comment:
       Or could we at least accept `colsMap` as keyword arguments?
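
As a rough illustration of what the keyword-argument form could look like, here is a Spark-free sketch. `with_columns_map` and `with_columns_kw` are hypothetical names, not PySpark APIs, and integers stand in for `Column` objects.

```python
from typing import Any, Dict


def with_columns_map(colsMap: Dict[str, Any]) -> Dict[str, Any]:
    """Dict-based form: one positional map of column name -> column."""
    return dict(colsMap)


def with_columns_kw(**colsMap: Any) -> Dict[str, Any]:
    """Keyword form: each keyword becomes a column name."""
    return dict(colsMap)


# Both call styles carry the same name -> column pairs, in the same order.
as_map = with_columns_map({"key1": 1, "key2": 2})
as_kw = with_columns_kw(key1=1, key2=2)
print(as_map == as_kw)  # True

# A name that is not a valid identifier cannot be written as a keyword,
# but it can still be passed by unpacking a dict.
spaced = with_columns_kw(**{"key 3": 3})
print(spaced)  # {'key 3': 3}
```

The trade-off this hints at: keywords read well for simple names, while column names containing spaces or dots would still need the dict form.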





[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565


   Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the Python docstring.



[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029584075


   Seems like there is community support for this API. Probably we could try to reopen and proceed, @Yikun.



[GitHub] [spark] github-actions[bot] closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431


   



[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833432212


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42728/
   



[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1030987526


   @Yikun would you mind rebasing this to update the CI results?



[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029850928


   Since we are revisiting it, I have a counter-proposal: instead of exposing a new API, let's just improve the UX of what we already have. For example, if we tweak `select` to support keyword arguments like this:
   
   ```patch
   diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
   index ee68865c98..00a7a4543e 100644
   --- a/python/pyspark/sql/dataframe.py
   +++ b/python/pyspark/sql/dataframe.py
   @@ -1941,14 +1941,18 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
            return Column(jc)
    
        @overload
   -    def select(self, *cols: "ColumnOrName") -> "DataFrame":
   +    def select(self, *cols: "ColumnOrName", **acols: Column) -> "DataFrame":
            ...
    
        @overload
   -    def select(self, __cols: Union[List[Column], List[str]]) -> "DataFrame":
   +    def select(self, __cols: Union[List[Column], List[str]], **acols: Column) -> "DataFrame":
            ...
    
   -    def select(self, *cols: "ColumnOrName") -> "DataFrame":  # type: ignore[misc]
   +    def select(  # type: ignore[misc]
   +        self,
   +        *cols: "ColumnOrName",
   +        **namedCols: Column,
   +    ) -> "DataFrame":
            """Projects a set of expressions and returns a new :class:`DataFrame`.
    
            .. versionadded:: 1.3.0
   @@ -1959,6 +1963,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
                column names (string) or expressions (:class:`Column`).
                If one of the column names is '*', that column is expanded to include all columns
                in the current :class:`DataFrame`.
   +        namedCols : :class:`Column`
   +            expressions selected under the given aliases.
    
            Examples
            --------
   @@ -1968,8 +1974,17 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
            [Row(name='Alice', age=2), Row(name='Bob', age=5)]
            >>> df.select(df.name, (df.age + 10).alias('age')).collect()
            [Row(name='Alice', age=12), Row(name='Bob', age=15)]
   -        """
   -        jdf = self._jdf.select(self._jcols(*cols))
   +        >>> df.select(
   +        ...     "age",
   +        ...     name_lower=lower("name"),
   +        ...     name_upper=upper("name"),
   +        ...     age_plus_one=col("age") + 1,
   +        ... ).limit(1).collect()
   +        [Row(age=2, name_lower='alice', name_upper='ALICE', age_plus_one=3)]
   +        """
   +        if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
   +            cols = cols[0]  # type: ignore[assignment]
   +        jdf = self._jdf.select(self._jcols(*cols, *[c.alias(a) for a, c in namedCols.items()]))
            return DataFrame(jdf, self.sql_ctx)
    
        @overload
   
   ```
   
   we will be able to use it like this:
   
   ```python
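   from pyspark.sql.functions import col, log, rand

   df = spark.range(10).select(rand(42).alias("id"))
   
   # Relies on the keyword-argument support proposed in the patch above.
   df.select(
       "id", plus_one=col("id") + 1, times_two=col("id") * 2, log_id=log("id")
   ).show()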
   ```
   
   making it similar to `dplyr::mutate`.
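
For readers without a patched build, the argument handling in the patch can be sketched in plain Python. This is only an illustration: `Expr` and `resolve_select_args` are hypothetical stand-ins, not PySpark APIs, and a plain string stands in for a column name.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for a PySpark Column; not a real Spark class."""

    sql: str
    name: Optional[str] = None

    def alias(self, name: str) -> "Expr":
        # Return a copy of the expression that carries the requested output name.
        return replace(self, name=name)


def resolve_select_args(*cols: Union[str, Expr, list, tuple], **named_cols: Expr) -> list:
    """Mirror the patch: unwrap a single list/tuple, then append aliased keyword columns."""
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    return [*cols, *[c.alias(a) for a, c in named_cols.items()]]


resolved = resolve_select_args("id", plus_one=Expr("id + 1"), times_two=Expr("id * 2"))
names = [c if isinstance(c, str) else c.name for c in resolved]
print(names)  # ['id', 'plus_one', 'times_two']
```

Keyword arguments keep their call order in Python 3.7+, so the aliased columns come out after the positional ones in the order they were written.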



[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029864676


   @drernie
   
   > Which doesn't even seem to be documented for Python:
   
   That's nothing more than standard Python argument unpacking. It can be done with any Python function, and with ones supporting variadic arguments (`*cols`) in particular. It could be changed to:
   
   ```python
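   # `F` (pyspark.sql.functions) and `windowval` are assumed to be defined
   # as in the snippet being discussed.
   df.select(["*"] + [
       F.sum(col).over(windowval).alias(col_name)
       for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
   ])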
   ```
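
To make the unpacking point concrete, here is a small Spark-free sketch. `select_names` is a hypothetical function that only mimics how a variadic `*cols` signature can accept either separate arguments or a single list; it is not the PySpark implementation.

```python
def select_names(*cols):
    """Hypothetical variadic function: accepts f(a, b, c) or f([a, b, c])."""
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    return list(cols)


new_cols = [f"cum{c}" for c in ["A", "B", "C"]]

# Star-unpacking spreads the list into separate positional arguments ...
unpacked = select_names("*", *new_cols)
# ... while list concatenation passes one list, as in the snippet above.
concatenated = select_names(["*"] + new_cols)

print(unpacked == concatenated)  # True
```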



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834426502


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138244/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833432751


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42728/
   



[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833399775


   **[Test build #138206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138206/testReport)** for PR 32431 at commit [`b527346`](https://github.com/apache/spark/commit/b527346aaa6f42c7425abce4df50434b158c4bb4).



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833545790


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42730/
   



[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-974026612


   > Hey @Yikun could we reopen this PR so we'd review this and add that multiple cols support?
   
   Yes, for sure. I'm fine with continuing this work, but we still need a maintainer to reopen the PR if we really need it.
   
   BTW, you could also share why you need it. Thanks.



[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832412104


   **[Test build #138147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138147/testReport)** for PR 32431 at commit [`ba7d4e0`](https://github.com/apache/spark/commit/ba7d4e0c1e16af44ee59bab1eeaabf150bcece72).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834388955


   **[Test build #138244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138244/testReport)** for PR 32431 at commit [`3f5102d`](https://github.com/apache/spark/commit/3f5102d5be8240053b7092b329ba71f67220770c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833502424


   **[Test build #138208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138208/testReport)** for PR 32431 at commit [`cf77411`](https://github.com/apache/spark/commit/cf77411d1fd5dece718e857a0fc294d42f6d568e).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833738968


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138208/
   



[GitHub] [spark] Yikun edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029599807


   @HyukjinKwon Sure, I will reopen and rebase it soon. Hmm, it looks like I have no permission to reopen the PR. Would you mind helping to reopen it? Otherwise I can just submit a new PR.



[GitHub] [spark] zero323 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805150373



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same names.
+
+        The colsMap is a map of column name and column, the column must only refer to attributes
+        supplied by this Dataset. It is an error to add columns that refer to some other Dataset.

Review comment:
       Shall we add that only one map is supported?
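
One way to enforce that, sketched without Spark: the helper below is hypothetical (its name and error messages are assumptions, not the PR's code) and only shows the kind of check a `*colsMap` signature would need.

```python
from typing import Any, Dict


def single_cols_map(*colsMap: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: accept exactly one dict of column name -> column."""
    if len(colsMap) != 1:
        raise ValueError(f"expected exactly one map, got {len(colsMap)}")
    (cols_map,) = colsMap
    if not isinstance(cols_map, dict):
        raise TypeError(f"colsMap must be a dict, got {type(cols_map).__name__}")
    return cols_map


checked = single_cols_map({"key1": 1})
print(checked)  # {'key1': 1}
```

Failing fast on `withColumns(a, b)` seems preferable to silently ignoring the second map, though documenting the restriction in the docstring would still be needed.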





[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832327840


   **[Test build #138147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138147/testReport)** for PR 32431 at commit [`ba7d4e0`](https://github.com/apache/spark/commit/ba7d4e0c1e16af44ee59bab1eeaabf150bcece72).



[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627873450



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2395,6 +2395,36 @@ class Dataset[T] private[sql](
    */
   def withColumn(colName: String, col: Column): DataFrame = withColumns(Seq(colName), Seq(col))
 
+  /**
+   * (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns
+   * that has the same names.
+   *
+   * `colsMap` is a map of column name and column, the column must only refer to attributes
+   * supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
+   *
+   * @group untypedrel
+   * @since 3.2.0
+   */
+  def withColumns(colsMap: Map[String, Column]): DataFrame = {
+    val colNames = colsMap.flatMap{ case (colName, _) => Seq(colName) }.toSeq

Review comment:
       Oh, it would be better to do `val (colNames, newCols) = colsMap.toSeq.unzip`.
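
For comparison, the same split of a name-to-column map into two parallel sequences can be sketched in Python, the language used for the other examples in this thread. The `cols_map` contents are made up for illustration, and plain strings stand in for column expressions.

```python
# Hypothetical map of output column name -> expression (strings stand in for Columns).
cols_map = {"key1": "a + 1", "key2": "b * 2"}

# zip(*pairs) is the usual Python counterpart of Scala's `unzip`: it turns a
# sequence of (name, column) pairs into one tuple of names and one of columns.
col_names, new_cols = zip(*cols_map.items())

print(col_names)  # ('key1', 'key2')
print(new_cols)   # ('a + 1', 'b * 2')
```

Note that this Python idiom raises on an empty map, because there is nothing to unpack into the two names, so an empty input would need separate handling.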





[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832355419


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42668/
   



[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627874943



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap):
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same name.
+
+        The colsMap is a map of column name and column, the column must only refer to attribute

Review comment:
       attribute -> attributes





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833432751


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42728/
   



[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627874698



##########
File path: python/pyspark/sql/dataframe.pyi
##########
@@ -250,6 +250,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         self, cols: Union[List[str], Tuple[str]], support: Optional[float] = ...
     ) -> DataFrame: ...
     def withColumn(self, colName: str, col: Column) -> DataFrame: ...
+    def withColumns(self, colsMap: Dict[str, Column] ) -> DataFrame: ...

Review comment:
       `colsMap: Dict[str, Column] )` -> `colsMap: Dict[str, Column])`?





[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627251519



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2398,8 +2398,14 @@ class Dataset[T] private[sql](
   /**
    * Returns a new Dataset by adding columns or replacing the existing columns that has
    * the same names.
+   *
+   * `column`'s expression in `cols` must only refer to attributes supplied by this Dataset.
+   * It is an error to add columns that refers to some other Dataset.
+   *
+   * @group untypedrel
+   * @since 3.2.0
    */
-  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+  def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {

Review comment:
       I'd like to add a Map parameter like `colsMap`, which is more readable. I will do it in the next PR.





[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833705927


   **[Test build #138208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138208/testReport)** for PR 32431 at commit [`cf77411`](https://github.com/apache/spark/commit/cf77411d1fd5dece718e857a0fc294d42f6d568e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1039722541


   Merged to master.



[GitHub] [spark] miltad commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

miltad commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-974012235


   Hey @Yikun, could we reopen this PR so we can review it and add the multiple-columns support?



[GitHub] [spark] github-actions[bot] closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

github-actions[bot] closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431


   



[GitHub] [spark] zero323 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805150373



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same names.
+
+        The colsMap is a map of column name and column, the column must only refer to attributes
+        supplied by this Dataset. It is an error to add columns that refer to some other Dataset.

Review comment:
       Shall we add that only one map is supported?





[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037055281


   @ueshin @viirya @BryanCutler @zero323 It would be good if you could take a look, thanks!



[GitHub] [spark] HyukjinKwon closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431


   



[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834192100


   **[Test build #138244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138244/testReport)** for PR 32431 at commit [`3f5102d`](https://github.com/apache/spark/commit/3f5102d5be8240053b7092b329ba71f67220770c).



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833401254


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138206/
   



[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r628004164



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2395,6 +2395,36 @@ class Dataset[T] private[sql](
    */
   def withColumn(colName: String, col: Column): DataFrame = withColumns(Seq(colName), Seq(col))
 
+  /**
+   * (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns
+   * that has the same names.
+   *
+   * `colsMap` is a map of column name and column, the column must only refer to attributes
+   * supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
+   *
+   * @group untypedrel
+   * @since 3.2.0
+   */
+  def withColumns(colsMap: Map[String, Column]): DataFrame = {
+    val colNames = colsMap.flatMap{ case (colName, _) => Seq(colName) }.toSeq

Review comment:
       done, thanks for your suggestion!

##########
File path: python/pyspark/sql/dataframe.pyi
##########
@@ -250,6 +250,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
         self, cols: Union[List[str], Tuple[str]], support: Optional[float] = ...
     ) -> DataFrame: ...
     def withColumn(self, colName: str, col: Column) -> DataFrame: ...
+    def withColumns(self, colsMap: Dict[str, Column] ) -> DataFrame: ...

Review comment:
       done

##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap):
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same name.
+
+        The colsMap is a map of column name and column, the column must only refer to attribute
+        supplied by this Dataset. It is an error to add columns that refers to some other Dataset.

Review comment:
       done

##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap):
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same name.

Review comment:
       done

##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap):
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same name.
+
+        The colsMap is a map of column name and column, the column must only refer to attribute

Review comment:
       done





[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832412930


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138147/
   



[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800587645



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":

Review comment:
       Yep, that means for now, we only allow:
   ```python
   withColumns({"col1": col1, "col2": col2})
   ```
   In the future, we can also enable kwargs to allow:
   ```python
   # With args and kwargs
   withColumns({"col1": col1, "col2": col2}, col3=col3)
   # With only kwargs
   withColumns(col4=col4)
   ```
   
   If there is no objection, I will change the parameter to `*colsMap` in this PR.
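
A minimal sketch of the calling conventions discussed in this comment, in plain Python with no Spark dependency. The helper name `merge_column_specs` is hypothetical, plain strings stand in for `Column` objects, and letting keyword columns win on a name clash is an assumption of this sketch, not something the thread decides.

```python
from typing import Any, Dict


def merge_column_specs(*cols_map: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical helper: combine at most one positional dict with keyword columns."""
    if len(cols_map) > 1:
        raise TypeError("at most one positional map is accepted")
    merged: Dict[str, Any] = {}
    if cols_map:
        if not isinstance(cols_map[0], dict):
            raise TypeError("the positional argument must be a dict")
        merged.update(cols_map[0])
    # Assumption: keyword columns are applied last, so they win on a name clash.
    merged.update(kwargs)
    return merged


# Plain strings stand in for Column objects here.
only_map = merge_column_specs({"col1": "expr1", "col2": "expr2"})
mixed = merge_column_specs({"col1": "expr1", "col2": "expr2"}, col3="expr3")
only_kwargs = merge_column_specs(col4="expr4")
```

Accepting `*colsMap` now leaves room to add `**kwargs` later without breaking callers that pass a single dict.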





[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805231319



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same names.
+
+        The colsMap is a map of column name and column, the column must only refer to attributes
+        supplied by this Dataset. It is an error to add columns that refer to some other Dataset.

Review comment:
       Sure, I will add a note to the `Parameters` section for `colsMap`:
   ```
   Currently, only single map is supported.
   ```
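
One way the "only single map" restriction could be enforced when the parameter is declared as `*colsMap` is sketched below. This is an illustration in plain Python, not the code from the PR: the helper name `unpack_single_map` and the error messages are assumptions, and strings stand in for `Column` objects.

```python
from typing import Any, Dict


def unpack_single_map(*cols_map: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: accept exactly one dict passed positionally."""
    if len(cols_map) != 1:
        raise ValueError(f"expected exactly one map, got {len(cols_map)}")
    (single,) = cols_map
    if not isinstance(single, dict):
        raise TypeError("colsMap must be a dict of column name to column")
    return single


# A single map is accepted and returned unchanged.
result = unpack_single_map({"age2": "age + 2"})

# Two maps are rejected, which is the behaviour the docstring note describes.
try:
    unpack_single_map({"a": "x"}, {"b": "y"})
    rejected = False
except ValueError:
    rejected = True
```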






[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833399775


   **[Test build #138206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138206/testReport)** for PR 32431 at commit [`b527346`](https://github.com/apache/spark/commit/b527346aaa6f42c7425abce4df50434b158c4bb4).



[GitHub] [spark] xkrogen commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

xkrogen commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r626731636



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2398,8 +2398,14 @@ class Dataset[T] private[sql](
   /**
    * Returns a new Dataset by adding columns or replacing the existing columns that has
    * the same names.
+   *
+   * `column`'s expression in `cols` must only refer to attributes supplied by this Dataset.
+   * It is an error to add columns that refers to some other Dataset.
+   *
+   * @group untypedrel
+   * @since 3.2.0
    */
-  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+  def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {

Review comment:
       +1, a map is more intuitive and makes the associations between names and definitions easier to see:
   ```
   withColumns(Map(
     "col1" -> col(...),
     "col2" -> col(...)
   ))
   ```
   vs.
   ```
   withColumns(Seq("col1", "col2"), Seq(col(...), col(...)))
   ```
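
The same comparison can be sketched in Python, where a map-style argument reduces to the two parallel sequences that a `Seq`-based signature takes. The helper name `split_cols_map` is hypothetical and strings stand in for `Column` objects; the sketch relies on Python dicts preserving insertion order (guaranteed since Python 3.7).

```python
from typing import Any, Dict, List, Tuple


def split_cols_map(cols_map: Dict[str, Any]) -> Tuple[List[str], List[Any]]:
    """Hypothetical helper: turn a name-to-column map into two parallel lists."""
    names = list(cols_map.keys())
    cols = list(cols_map.values())
    return names, cols


# Each name sits next to its definition at the call site...
names, cols = split_cols_map({"col1": "expr1", "col2": "expr2"})
# ...while the parallel lists keep the same pairing by position.
```

With the map form, a name and its definition cannot drift out of alignment, which is the failure mode two separate sequences allow.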





[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r625681341



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2398,8 +2398,14 @@ class Dataset[T] private[sql](
   /**
    * Returns a new Dataset by adding columns or replacing the existing columns that has
    * the same names.
+   *
+   * `column`'s expression in `cols` must only refer to attributes supplied by this Dataset.
+   * It is an error to add columns that refers to some other Dataset.
+   *
+   * @group untypedrel
+   * @since 3.2.0
    */
-  private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+  def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {

Review comment:
       If we're adding them, I think we should change the signature to take either a map or a list of tuples.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833502424


   **[Test build #138208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138208/testReport)** for PR 32431 at commit [`cf77411`](https://github.com/apache/spark/commit/cf77411d1fd5dece718e857a0fc294d42f6d568e).



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834231467


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42766/
   



[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627248178



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def with_columns(self, col_names, cols):

Review comment:
       @HyukjinKwon OK, I will change the name to `withColumns`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834231467


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42766/
   



[GitHub] [spark] drernie commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

drernie commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029392644


   My experience (and that of others) suggests that repeatedly calling `withColumn` is highly inefficient:
   
   https://stackoverflow.com/questions/41400504/spark-scala-repeated-calls-to-withcolumn-using-the-same-function-on-multiple-c/41400588#41400588
   
   The suggested alternative is using select in a very non-obvious way:
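   ```python
   # Assumes `from pyspark.sql import functions as F`, plus an existing
   # DataFrame `df` and a window spec `windowval` defined elsewhere.
   df.select(
       "*",  # selects all existing columns
       *[
           F.sum(col).over(windowval).alias(col_name)
           for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
       ],
   )
   ```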
   This usage doesn't even seem to be documented for Python:
   https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.select.html
   
   I would greatly appreciate this API being made available, as it would improve the performance and reliability of my notebooks.




[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1030998850


   @HyukjinKwon Done.
   
   As mentioned by @zero323, the same effect can be achieved by extending the kwargs of `select` to support multiple columns.
   
   Personally, I think `withColumns` would be the better fit and more readable, provided there are no negative effects. I guess that is also why the `withColumn` API was introduced in the first place?
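
To illustrate the equivalence discussed above, here is a minimal pure-Python sketch (not Spark code) of the projection that a single `select` would need in order to behave like `withColumns`. The function name `with_columns_as_projection` and the use of plain strings for column expressions are hypothetical. The sketch assumes that replaced columns keep their position, that new columns are appended, and that names match exactly (case-sensitive), which may differ from how Spark resolves column names.

```python
from typing import Dict, List, Tuple


def with_columns_as_projection(
    existing: List[str], cols_map: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Build one (name, expression) projection list from a name -> expression map.

    Hypothetical helper for illustration only; expressions are plain strings.
    """
    # Existing columns keep their position; a matching name is replaced in place.
    projection = [(name, cols_map.get(name, name)) for name in existing]
    # Names that do not exist yet are appended in the order they were given.
    projection += [
        (name, expr) for name, expr in cols_map.items() if name not in existing
    ]
    return projection


result = with_columns_as_projection(
    ["id", "key1"], {"key1": "key1 + 1", "key2": "id * 2"}
)
```

Written out by hand for every call, such a projection is the part that a dedicated `withColumns` method would hide from the caller.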



[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565


   Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the docstring.



[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833429830


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42728/
   



[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833543010


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42730/
   



[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832412930


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138147/
   



[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834228386







[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834192100


   **[Test build #138244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138244/testReport)** for PR 32431 at commit [`3f5102d`](https://github.com/apache/spark/commit/3f5102d5be8240053b7092b329ba71f67220770c).



[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833401223


   **[Test build #138206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138206/testReport)** for PR 32431 at commit [`b527346`](https://github.com/apache/spark/commit/b527346aaa6f42c7425abce4df50434b158c4bb4).
    * This patch **fails Scala style tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627875037



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap):
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same name.
+
+        The colsMap is a map of column name and column, the column must only refer to attribute
+        supplied by this Dataset. It is an error to add columns that refers to some other Dataset.

Review comment:
       refers -> refer





[GitHub] [spark] zero323 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800524768



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
 
+    def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":

Review comment:
       We can switch to `*colsMap` now (in the future we can switch to positional only)
   
   ```python
   def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
       assert len(colsMap) == 1
       ...
   ```
   
   to cleanly enable `**kwargs` in the future, if there is enough support for such feature.
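
As a rough sketch of the suggestion above, the variadic signature could validate its input as follows. `FrameSketch` is a hypothetical stand-in for a DataFrame that only stores a name-to-expression mapping; it is not PySpark code, and the error messages are illustrative.

```python
from typing import Any, Dict


class FrameSketch:
    """Hypothetical stand-in for a DataFrame: a mapping of column name to expression."""

    def __init__(self, columns: Dict[str, Any]) -> None:
        self.columns = dict(columns)

    def withColumns(self, *colsMap: Dict[str, Any]) -> "FrameSketch":
        # Accept exactly one positional dict for now, as in the review comment.
        assert len(colsMap) == 1, "exactly one dict of columns is expected"
        (mapping,) = colsMap
        if not isinstance(mapping, dict):
            raise TypeError("colsMap must be a dict of column name to column")
        # Return a new object; the original frame is left unchanged.
        merged = dict(self.columns)
        merged.update(mapping)
        return FrameSketch(merged)


df = FrameSketch({"a": 1})
out = df.withColumns({"b": 2, "a": 3})
```

Because the dict is captured through `*colsMap`, no caller can pass it by keyword, so adding `**kwargs` later would not collide with an existing parameter name.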
   





[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support

Posted by GitBox <gi...@apache.org>.

zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565


   Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the Python docstring.

