Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/21 14:23:46 UTC

[GitHub] [spark] Yikun opened a new pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Yikun opened a new pull request #32276:
URL: https://github.com/apache/spark/pull/32276


   ### What changes were proposed in this pull request?
   This PR adds support for adding multiple columns through `pyspark.sql.DataFrame.withColumn`.
   
   ### Why are the changes needed?
   Currently, Spark's `withColumn` can add columns in one pass:
    https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396
   but a PySpark user can only use `withColumn` to add one column, or to replace an existing column that has the same name.
   
   For example, if a PySpark user wants to add multiple columns, they have to call `withColumn` again and again:
   ```Python
   self.df.withColumn("key1", col("key1")).withColumn("key2", col("key2")).withColumn("key3", col("key3"))
   ```
   After this patch, the user can pass lists of names and columns to `withColumn` to complete all the additions in one pass:
   ```Python
   self.df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), col("key3")])
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. The accepted input types of `withColumn` are extended, so PySpark users can use `withColumn` to add multiple columns directly.
   
   
   ### How was this patch tested?
   - Added a new test for adding multiple columns; it passed.
   - Existing tests passed.
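
   The patch's argument handling can be sketched as a standalone helper. The names `normalize_with_column_args` and the minimal `Column` stub below are mine, for illustration only; the real patch operates on `pyspark.sql.Column` inside `DataFrame.withColumn`:

```python
# Standalone sketch of the argument normalization performed by the patch.
# `Column` is a stub standing in for pyspark.sql.Column.
class Column:
    def __init__(self, name):
        self.name = name


def normalize_with_column_args(colName, col):
    """Return (list of names, list of columns), mirroring the patch's checks."""
    if not isinstance(colName, (str, list, tuple)):
        raise TypeError("colName must be string or list/tuple of column names.")
    if not isinstance(col, (Column, list, tuple)):
        raise TypeError("col must be a column or list/tuple of columns.")

    # Wrap scalar arguments into single-element lists.
    col_names = [colName] if isinstance(colName, str) else colName
    cols = [col] if isinstance(col, Column) else col

    # Normalize tuples (and lists) to plain lists.
    return list(col_names), list(cols)
```

   With both inputs normalized to lists, a single JVM call can then add every column in one pass.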


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824104046


   **[Test build #137738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137738/testReport)** for PR 32276 at commit [`b2dbe9b`](https://github.com/apache/spark/commit/b2dbe9b6c9225267b7a3160e59be13530c7b7f52).




[GitHub] [spark] SparkQA commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824121899


   **[Test build #137738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137738/testReport)** for PR 32276 at commit [`b2dbe9b`](https://github.com/apache/spark/commit/b2dbe9b6c9225267b7a3160e59be13530c7b7f52).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] SparkQA commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824178659


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42265/
   




[GitHub] [spark] Yikun commented on a change in pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32276:
URL: https://github.com/apache/spark/pull/32276#discussion_r617626255



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2451,10 +2454,26 @@ def withColumn(self, colName, col):
         --------
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
-
+        >>> df.withColumn(['age2', 'age3'], [df.age + 2, df.age + 3]).collect()
+        [Row(age=2, name='Alice', age2=4, age3=5), Row(age=5, name='Bob', age2=7, age3=8)]
         """
-        assert isinstance(col, Column), "col should be Column"
-        return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
+        if not isinstance(colName, (str, list, tuple)):
+            raise TypeError("colName must be string or list/tuple of column names.")
+        if not isinstance(col, (Column, list, tuple)):
+            raise TypeError("col must be a column or list/tuple of columns.")
+
+        # Convert the colName and col to list
+        col_names = [colName] if isinstance(colName, str) else colName
+        col = [col] if isinstance(col, Column) else col
+
+        # Convert tuple to list
+        col_names = list(col_names) if isinstance(col_names, tuple) else col_names
+        col = list(col) if isinstance(col, tuple) else col
+
+        return DataFrame(
+            self._jdf.withColumns(_to_seq(self._sc, col_names), self._jcols(col)),

Review comment:
       Notice that I use `withColumns` here, which is a [private method](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402) in Scala. If we used the public `withColumn`, it would raise a mismatch error, because py4j cannot recognize a matching method signature for these argument types.
   
   Additional note: I didn't expose the **private withColumns** API in PySpark; this just matches the ability of the Scala withColumn API. That is, the Scala [withColumn API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396) can now receive multiple columns, supported by calling the [internal withColumns API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396-L2402).
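
   The signature-mismatch problem can be illustrated with a rough pure-Python sketch. This is not py4j's actual implementation, and the signature table below is invented for illustration; it only models how a gateway picks a JVM method by matching runtime argument types:

```python
# Toy model of JVM-side overload resolution, as a py4j-style gateway might
# see it. The signature table is invented for illustration; it is not
# Spark's or py4j's real metadata.
OVERLOADS = {
    "withColumn": [("String", "Column")],             # public single-column API
    "withColumns": [("Seq[String]", "Seq[Column]")],  # private multi-column API
}


def resolve(method, arg_types):
    """Pick the JVM signature matching the runtime argument types, or fail."""
    for signature in OVERLOADS.get(method, []):
        if signature == tuple(arg_types):
            return signature
    raise TypeError(
        "Method %s(%s) does not exist" % (method, ", ".join(arg_types))
    )
```

   In this model, `resolve("withColumn", ["Seq[String]", "Seq[Column]"])` fails, which is why the patch calls the JVM-side `withColumns` directly when given lists.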






[GitHub] [spark] AmplabJenkins removed a comment on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824135229


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137738/
   




[GitHub] [spark] Yikun commented on a change in pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32276:
URL: https://github.com/apache/spark/pull/32276#discussion_r617626255



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2451,10 +2454,26 @@ def withColumn(self, colName, col):
         --------
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
-
+        >>> df.withColumn(['age2', 'age3'], [df.age + 2, df.age + 3]).collect()
+        [Row(age=2, name='Alice', age2=4, age3=5), Row(age=5, name='Bob', age2=7, age3=8)]
         """
-        assert isinstance(col, Column), "col should be Column"
-        return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
+        if not isinstance(colName, (str, list, tuple)):
+            raise TypeError("colName must be string or list/tuple of column names.")
+        if not isinstance(col, (Column, list, tuple)):
+            raise TypeError("col must be a column or list/tuple of columns.")
+
+        # Convert the colName and col to list
+        col_names = [colName] if isinstance(colName, str) else colName
+        col = [col] if isinstance(col, Column) else col
+
+        # Convert tuple to list
+        col_names = list(col_names) if isinstance(col_names, tuple) else col_names
+        col = list(col) if isinstance(col, tuple) else col
+
+        return DataFrame(
+            self._jdf.withColumns(_to_seq(self._sc, col_names), self._jcols(col)),

Review comment:
       Notice that I use `withColumns` here, which is a [private method](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402) in Scala. If we used the public `withColumn`, it would raise a mismatch error, because py4j cannot recognize a matching method signature for these argument types.






[GitHub] [spark] HyukjinKwon commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824475320


   I don't think `withColumn` can take multiple columns now. Would you mind crafting an example, please? The private `withColumns` can. We should probably expose `withColumns` in Scala and PySpark together, but that would need some discussion - it flips a decision that was made once.




[GitHub] [spark] Yikun closed pull request #32276: [WIP][SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun closed pull request #32276:
URL: https://github.com/apache/spark/pull/32276


   




[GitHub] [spark] Yikun commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824107872


   cc @HyukjinKwon @zero323




[GitHub] [spark] Yikun commented on pull request #32276: [WIP][SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824616155


   ML link: http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Multiple-columns-adding-replacing-support-in-PySpark-DataFrame-API-td31164.html




[GitHub] [spark] Yikun edited a comment on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun edited a comment on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824472080


   > As you already noticed these methods are private and, as such, not intended for end-user.
   
   ~@zero323 Maybe I caused some misunderstanding earlier. I didn't expose the **private withColumns** API in PySpark; this just matches the ability of the Scala withColumn API. That is, the Scala [withColumn API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396) can now receive multiple columns, supported by calling the [internal withColumns API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396-L2402).~
   
   >  we should think about having an API in Scala too to match. It was sort of rejected once at SPARK-12225
   
   @HyukjinKwon ~See the first reply about the misunderstanding; actually, SPARK-12225 already added multiple-column support to the Scala withColumn API in https://github.com/apache/spark/commit/3ca367083e196e6487207211e6c49d4bbfe31288 ([SPARK-22001](https://issues.apache.org/jira/browse/SPARK-22001))~ Yes, we only added multiple-column support to the Scala private withColumns in https://github.com/apache/spark/commit/3ca367083e196e6487207211e6c49d4bbfe31288 ([SPARK-22001](https://issues.apache.org/jira/browse/SPARK-22001))
   
   > Additionally, naming would be rather confusing.
   > we should discuss adding such functionality for all APIs and having less confusing name.
   
   Sure, I will send an email to the dev ML to describe these confusing naming issues.




[GitHub] [spark] Yikun commented on a change in pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32276:
URL: https://github.com/apache/spark/pull/32276#discussion_r617626255



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2451,10 +2454,26 @@ def withColumn(self, colName, col):
         --------
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
-
+        >>> df.withColumn(['age2', 'age3'], [df.age + 2, df.age + 3]).collect()
+        [Row(age=2, name='Alice', age2=4, age3=5), Row(age=5, name='Bob', age2=7, age3=8)]
         """
-        assert isinstance(col, Column), "col should be Column"
-        return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
+        if not isinstance(colName, (str, list, tuple)):
+            raise TypeError("colName must be string or list/tuple of column names.")
+        if not isinstance(col, (Column, list, tuple)):
+            raise TypeError("col must be a column or list/tuple of columns.")
+
+        # Convert the colName and col to list
+        col_names = [colName] if isinstance(colName, str) else colName
+        col = [col] if isinstance(col, Column) else col
+
+        # Convert tuple to list
+        col_names = list(col_names) if isinstance(col_names, tuple) else col_names
+        col = list(col) if isinstance(col, tuple) else col
+
+        return DataFrame(
+            self._jdf.withColumns(_to_seq(self._sc, col_names), self._jcols(col)),

Review comment:
       Notice that I use `withColumns` here, which is a [private method](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402) in Scala. The Scala private [withColumns API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396) can now receive multiple columns.






[GitHub] [spark] Yikun commented on a change in pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32276:
URL: https://github.com/apache/spark/pull/32276#discussion_r617626255



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2451,10 +2454,26 @@ def withColumn(self, colName, col):
         --------
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
-
+        >>> df.withColumn(['age2', 'age3'], [df.age + 2, df.age + 3]).collect()
+        [Row(age=2, name='Alice', age2=4, age3=5), Row(age=5, name='Bob', age2=7, age3=8)]
         """
-        assert isinstance(col, Column), "col should be Column"
-        return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
+        if not isinstance(colName, (str, list, tuple)):
+            raise TypeError("colName must be string or list/tuple of column names.")
+        if not isinstance(col, (Column, list, tuple)):
+            raise TypeError("col must be a column or list/tuple of columns.")
+
+        # Convert the colName and col to list
+        col_names = [colName] if isinstance(colName, str) else colName
+        col = [col] if isinstance(col, Column) else col
+
+        # Convert tuple to list
+        col_names = list(col_names) if isinstance(col_names, tuple) else col_names
+        col = list(col) if isinstance(col, tuple) else col
+
+        return DataFrame(
+            self._jdf.withColumns(_to_seq(self._sc, col_names), self._jcols(col)),

Review comment:
       Notice that I use `withColumns`, which is a [private method](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402) in Scala. If we used the public `withColumn`, it would raise a mismatch error, because py4j cannot recognize a matching method signature for these argument types.






[GitHub] [spark] AmplabJenkins commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824135229


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/137738/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824179646


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42265/
   




[GitHub] [spark] SparkQA commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824175123


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42265/
   




[GitHub] [spark] HyukjinKwon commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824461143


   Yeah, we should think about having an API in Scala too to match. It was sort of rejected once at SPARK-12225 but I think it's worthwhile discussing as @zero323 suggested. @Yikun you might want to send an email to dev mailing list to discuss further.




[GitHub] [spark] Yikun commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824481426


   @HyukjinKwon yep, we only added multiple-column support to the Scala private withColumns; I updated the previous comment.




[GitHub] [spark] Yikun edited a comment on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun edited a comment on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824472080


   >  we should think about having an API in Scala too to match. It was sort of rejected once at SPARK-12225
   
   @HyukjinKwon ~See the first reply about the misunderstanding; actually, SPARK-12225 already added multiple-column support to the Scala withColumn API in https://github.com/apache/spark/commit/3ca367083e196e6487207211e6c49d4bbfe31288 ([SPARK-22001](https://issues.apache.org/jira/browse/SPARK-22001))~ Yes, we only added multiple-column support to the Scala private withColumns in https://github.com/apache/spark/commit/3ca367083e196e6487207211e6c49d4bbfe31288 ([SPARK-22001](https://issues.apache.org/jira/browse/SPARK-22001))
   
   > Additionally, naming would be rather confusing.
   > we should discuss adding such functionality for all APIs and having less confusing name.
   
   Sure, I will send an email to the dev ML to describe these confusing naming issues.




[GitHub] [spark] Yikun commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824472080


   > As you already noticed these methods are private and, as such, not intended for end-user.
   
   @zero323 Maybe I caused some misunderstanding earlier. I didn't expose the **private withColumns** API in PySpark; this just matches the ability of the Scala withColumn API. That is, the Scala [withColumn API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396) can now receive multiple columns, supported by calling the [internal withColumns API](https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2396-L2402).
   
   >  we should think about having an API in Scala too to match. It was sort of rejected once at SPARK-12225
   
   @HyukjinKwon See the first reply about the misunderstanding; actually, SPARK-12225 already added multiple-column support to the Scala withColumn API in https://github.com/apache/spark/commit/3ca367083e196e6487207211e6c49d4bbfe31288 ([SPARK-22001](https://issues.apache.org/jira/browse/SPARK-22001))
   
   > Additionally, naming would be rather confusing.
   > we should discuss adding such functionality for all APIs and having less confusing name.
   
   Agreed; the Scala withColumn API receives some confusingly named input to support adding multiple columns. I will send an email to the dev ML to describe these confusing naming issues.




[GitHub] [spark] Yikun commented on pull request #32276: [WIP][SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-831736413


   See the new PR: https://github.com/apache/spark/pull/32431. It exposes `withColumns` in Scala/Java and adds `with_columns` in PySpark.




[GitHub] [spark] AmplabJenkins commented on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824179646


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42265/
   




[GitHub] [spark] SparkQA removed a comment on pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32276:
URL: https://github.com/apache/spark/pull/32276#issuecomment-824104046


   **[Test build #137738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/137738/testReport)** for PR 32276 at commit [`b2dbe9b`](https://github.com/apache/spark/commit/b2dbe9b6c9225267b7a3160e59be13530c7b7f52).




[GitHub] [spark] Yikun commented on a change in pull request #32276: [SPARK-35173][PYTHON][SQL] Add multiple columns adding support for PySpark

Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32276:
URL: https://github.com/apache/spark/pull/32276#discussion_r617620582



##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2451,10 +2454,26 @@ def withColumn(self, colName, col):
         --------
         >>> df.withColumn('age2', df.age + 2).collect()
         [Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
-
+        >>> df.withColumn(['age2', 'age3'], [df.age + 2, df.age + 3]).collect()
+        [Row(age=2, name='Alice', age2=4, age3=5), Row(age=5, name='Bob', age2=7, age3=8)]
         """
-        assert isinstance(col, Column), "col should be Column"
-        return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
+        if not isinstance(colName, (str, list, tuple)):
+            raise TypeError("colName must be string or list/tuple of column names.")

Review comment:
       Note that there is some incorrect usage of ValueError in some other methods; I filed an issue at https://issues.apache.org/jira/browse/SPARK-35176



