Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/04 03:32:39 UTC
[GitHub] [spark] Yikun opened a new pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Yikun opened a new pull request #32431:
URL: https://github.com/apache/spark/pull/32431
### What changes were proposed in this pull request?
This PR adds support for adding multiple columns in the Spark Scala/Java/Python APIs.
- Expose `withColumns` with Map input as public API in Scala/Java
- Add `withColumns` in PySpark
There was also some discussion about adding multiple columns in past JIRA issues ([SPARK-12225](https://issues.apache.org/jira/browse/SPARK-12225), [SPARK-26224](https://issues.apache.org/jira/browse/SPARK-26224)) and on the [mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Multiple-columns-adding-replacing-support-in-PySpark-DataFrame-API-td31164.html).
### Why are the changes needed?
There is a private method `withColumns` that can add columns in one pass [1]:
https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
However, it is not exposed as a public API in Scala/Java, and PySpark users can only use `withColumn` to add one column or replace an existing column that has the same name.
For example, if a PySpark user wants to add multiple columns, they have to call `withColumn` repeatedly:
```Python
df.withColumn("key1", col("key1")).withColumn("key2", col("key2")).withColumn("key3", col("key3"))
```
After this patch, users can call `withColumns` with a map of column names to columns to add all of them in one pass:
```Python
df.withColumns({"key1": col("key1"), "key2": col("key2"), "key3": col("key3")})
```
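The add-or-replace semantics described above can be sketched without Spark. This is a toy model under stated assumptions, not the Spark implementation: rows are plain dicts, `ColumnExpr` is a hypothetical stand-in for a Spark `Column`, and `with_columns` is an illustrative function name.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
ColumnExpr = Callable[[Row], Any]  # hypothetical stand-in for a Spark Column


def with_columns(rows: List[Row], cols_map: Dict[str, ColumnExpr]) -> List[Row]:
    """Toy model: add or replace several columns in one pass over the rows."""
    result = []
    for row in rows:
        new_row = dict(row)  # existing columns keep their position
        for name, expr in cols_map.items():
            # Each expression is evaluated against the original row, so in this
            # model columns added in the same call cannot refer to one another.
            new_row[name] = expr(row)
        result.append(new_row)
    return result


rows = [{"id": 1, "age": 2}, {"id": 2, "age": 5}]
out = with_columns(
    rows,
    {"age": lambda r: r["age"] + 10, "double_id": lambda r: r["id"] * 2},
)
print(out)
```

Here `age` is replaced in place because the name already exists, while `double_id` is appended as a new column.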
### Does this PR introduce _any_ user-facing change?
Yes, this PR exposes `withColumns` as a public API in Scala/Java and also adds a `withColumns` API in PySpark.
### How was this patch tested?
- Added a new test for adding multiple columns; it passes
- Existing tests pass
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833539205
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42730/
[GitHub] [spark] SaurabhChawla100 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SaurabhChawla100 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r625585169
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def with_columns(self, col_names, cols):
Review comment:
Similar to this change in Python, I think the same change is needed in DataFrame.R to expose `with_columns` in R as well.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833401254
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138206/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800325976
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":
Review comment:
I admit that this is not so pretty in a Python context.
[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627872861
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2395,6 +2395,36 @@ class Dataset[T] private[sql](
*/
def withColumn(colName: String, col: Column): DataFrame = withColumns(Seq(colName), Seq(col))
+ /**
+ * (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns
+ * that has the same names.
+ *
+ * `colsMap` is a map of column name and column, the column must only refer to attributes
+ * supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
+ *
+ * @group untypedrel
+ * @since 3.2.0
+ */
+ def withColumns(colsMap: Map[String, Column]): DataFrame = {
+ val colNames = colsMap.flatMap{ case (colName, _) => Seq(colName) }.toSeq
Review comment:
colsMap.keys.toSeq?
[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833968881
I'm okay. cc @ueshin @viirya @BryanCutler @zero323 FYI
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832363556
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42668/
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832363556
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42668/
[GitHub] [spark] github-actions[bot] commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-899132820
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029599807
@HyukjinKwon Sure, will reopen and rebase it soon.
[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1031091002
Hm, leveraging keyword arguments is actually interesting. Though I think I prefer `withColumns` because we should also think about the Scala-side API. Maybe we can push this API in first, and think about a Pythonic variant afterwards.
[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805231319
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same names.
+
+ The colsMap is a map of column name and column, the column must only refer to attributes
+ supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
Review comment:
Sure, I will add a note in the `Parameters` section for `colsMap`:
```
Currently, only single map is supported.
```
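One way a variadic `*colsMap` signature could enforce the "single map" restriction is sketched below. This is only an illustration under stated assumptions: `unwrap_single_map` is a hypothetical helper name, and the actual check in the PR is not shown in this thread.

```python
from typing import Any, Dict


def unwrap_single_map(*cols_map: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: accept exactly one name -> column mapping."""
    if len(cols_map) != 1:
        raise ValueError(f"expected exactly one map, got {len(cols_map)}")
    (only,) = cols_map
    if not isinstance(only, dict):
        raise TypeError(f"expected a dict, got {type(only).__name__}")
    return only


print(unwrap_single_map({"key1": "expr1", "key2": "expr2"}))
```

Keeping the parameter variadic leaves room to accept several maps later without changing the call signature, which appears to be the motivation for the note.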
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834426502
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138244/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833545790
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42730/
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833738968
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138208/
[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832327840
**[Test build #138147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138147/testReport)** for PR 32431 at commit [`ba7d4e0`](https://github.com/apache/spark/commit/ba7d4e0c1e16af44ee59bab1eeaabf150bcece72).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r625680973
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def with_columns(self, col_names, cols):
Review comment:
Let's keep it consistent with Scala side for now. Note that camelCase doesn't violate PEP8 (e.g., `threading` in Python)
[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627250235
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def with_columns(self, col_names, cols):
Review comment:
@SaurabhChawla100 Thanks for the reminder. I am not very familiar with R, so I am going to submit it in a separate PR.
[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565
Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the docstring.
[GitHub] [spark] github-actions[bot] closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431
[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029850928
Since we are revisiting this, I have a counter-proposal: instead of exposing a new API, let's just improve the UX of what we already have. For example, if we tweak `select` to support keyword arguments like this:
```patch
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index ee68865c98..00a7a4543e 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1941,14 +1941,18 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
return Column(jc)
@overload
- def select(self, *cols: "ColumnOrName") -> "DataFrame":
+ def select(self, *cols: "ColumnOrName", **acols: Column) -> "DataFrame":
...
@overload
- def select(self, __cols: Union[List[Column], List[str]]) -> "DataFrame":
+ def select(self, __cols: Union[List[Column], List[str]], **acols: Column) -> "DataFrame":
...
- def select(self, *cols: "ColumnOrName") -> "DataFrame": # type: ignore[misc]
+ def select( # type: ignore[misc]
+ self,
+ *cols: "ColumnOrName",
+ **namedCols: Column,
+ ) -> "DataFrame":
"""Projects a set of expressions and returns a new :class:`DataFrame`.
.. versionadded:: 1.3.0
@@ -1959,6 +1963,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
column names (string) or expressions (:class:`Column`).
If one of the column names is '*', that column is expanded to include all columns
in the current :class:`DataFrame`.
+ namedCols : :class:`Column`
+ expressions selected under the given alias.
Examples
--------
@@ -1968,8 +1974,17 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
[Row(name='Alice', age=2), Row(name='Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name='Alice', age=12), Row(name='Bob', age=15)]
- """
- jdf = self._jdf.select(self._jcols(*cols))
+ >>> df.select(
+ ... "age",
+ ... name_lower=lower("name"),
+ ... name_upper=upper("name"),
+ ... age_plus_one=col("age") + 1,
+ ... ).limit(1).collect()
+ [Row(age=2, name_lower='alice', name_upper='ALICE', age_plus_one=3)]
+ """
+ if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
+ cols = cols[0] # type: ignore[assignment]
+ jdf = self._jdf.select(self._jcols(*cols, *[c.alias(a) for a, c in namedCols.items()]))
return DataFrame(jdf, self.sql_ctx)
@overload
```
we will be able to use it like this:
```python
df = spark.range(10).select(rand(42).alias("id"))
df.select(
"id", plus_one=col("id") + 1, times_two=col("id") * 2, log_id=log("id")
).show()
```
making it similar to `dplyr::mutate`.
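The calling convention proposed in the patch can be tried out without Spark. The following is a minimal pure-Python sketch, assuming rows are dicts and named columns are plain callables standing in for `Column` expressions; this `select` is a toy function, not the PySpark method.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def select(rows: List[Row], *cols: Any, **named_cols: Callable[[Row], Any]) -> List[Row]:
    """Toy model: positional names are projected, keyword arguments are aliased expressions."""
    # Mirror the patch: a single list or tuple argument is unpacked.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    out = []
    for row in rows:
        new_row = {name: row[name] for name in cols}
        for alias, expr in named_cols.items():
            new_row[alias] = expr(row)  # the keyword name acts as the alias
        out.append(new_row)
    return out


people = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]
first = select(
    people,
    "age",
    name_lower=lambda r: r["name"].lower(),
    name_upper=lambda r: r["name"].upper(),
    age_plus_one=lambda r: r["age"] + 1,
)[0]
print(first)
```

The keyword names become output column names in call order, matching the docstring example in the patch.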
[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834443022
I still think it is a bit redundant, but I'm fine with it, if others find this useful.
[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627874876
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap):
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same name.
Review comment:
name -> names
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800326456
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":
Review comment:
Or we could at least accept this `colsMap` as keyword arguments?
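A keyword-argument variant might look like the sketch below. The helper name `collect_named_columns` and the string placeholders for column expressions are assumptions made purely for illustration.

```python
from typing import Any, Dict


def collect_named_columns(**named_cols: Any) -> Dict[str, Any]:
    """Hypothetical helper: return keyword arguments as a name -> expression mapping."""
    return dict(named_cols)


# Keyword form: names written this way must be valid Python identifiers.
as_kwargs = collect_named_columns(key1="expr1", key2="expr2")

# Dict unpacking still works for names that are not identifiers (CPython behavior).
as_dict = collect_named_columns(**{"key 3": "expr3"})

print(as_kwargs)
print(as_dict)
```

One trade-off is that column names containing spaces or dots can only be passed through dict unpacking, so a dict-based `colsMap` remains the more general form.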
[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565
Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the Python docstring.
[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029584075
Seems like there is community support for this API. Probably we could try to reopen and proceed, @Yikun.
[GitHub] [spark] github-actions[bot] closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833432212
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42728/
[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1030987526
@Yikun would you mind rebasing this to update the CI results?
[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029850928
Since we are revisiting this, I have a counter-proposal: instead of exposing a new API, let's just improve the UX of what we already have. For example, if we tweak `select` to support keyword arguments like this:
```patch
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index ee68865c98..00a7a4543e 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1941,14 +1941,18 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
return Column(jc)
@overload
- def select(self, *cols: "ColumnOrName") -> "DataFrame":
+ def select(self, *cols: "ColumnOrName", **acols: Column) -> "DataFrame":
...
@overload
- def select(self, __cols: Union[List[Column], List[str]]) -> "DataFrame":
+ def select(self, __cols: Union[List[Column], List[str]], **acols: Column) -> "DataFrame":
...
- def select(self, *cols: "ColumnOrName") -> "DataFrame": # type: ignore[misc]
+ def select( # type: ignore[misc]
+ self,
+ *cols: "ColumnOrName",
+ **namedCols: Column,
+ ) -> "DataFrame":
"""Projects a set of expressions and returns a new :class:`DataFrame`.
.. versionadded:: 1.3.0
@@ -1959,6 +1963,8 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
column names (string) or expressions (:class:`Column`).
If one of the column names is '*', that column is expanded to include all columns
in the current :class:`DataFrame`.
+ namedCols : :class:`Column`
+ expressions to select under the given aliases.
Examples
--------
@@ -1968,8 +1974,17 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
[Row(name='Alice', age=2), Row(name='Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name='Alice', age=12), Row(name='Bob', age=15)]
- """
- jdf = self._jdf.select(self._jcols(*cols))
+ >>> df.select(
+ ... "age",
+ ... name_lower=lower("name"),
+ ... name_upper=upper("name"),
+ ... age_plus_one=col("age") + 1,
+ ... ).limit(1).collect()
+ [Row(age=2, name_lower='alice', name_upper='ALICE', age_plus_one=3)]
+ """
+ if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
+ cols = cols[0] # type: ignore[assignment]
+ jdf = self._jdf.select(self._jcols(*cols, *[c.alias(a) for a, c in namedCols.items()]))
return DataFrame(jdf, self.sql_ctx)
@overload
```
we will be able to use it like this:
```python
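# Assumes an active SparkSession named `spark` and the patched `select` above.
from pyspark.sql.functions import col, log, rand

df = spark.range(10).select(rand(42).alias("id"))
df.select(
    "id", plus_one=col("id") + 1, times_two=col("id") * 2, log_id=log("id")
).show()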
```
making it similar to `dplyr::mutate`.
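A similar call style can be approximated today without touching `select`, using a small wrapper. The following is only a sketch under assumptions: `select_named` is a hypothetical helper name (it is not part of PySpark), and it relies only on the DataFrame exposing `select` and on each expression exposing `alias`, as PySpark's `DataFrame` and `Column` do.

```python
from typing import Any


def select_named(df: Any, *cols: Any, **named_cols: Any) -> Any:
    """Hypothetical helper: select `cols` plus each keyword expression under its alias.

    Mirrors the logic of the patch above without modifying `DataFrame.select`.
    """
    # Accept a single list/tuple of columns, as `select` itself does.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    # Keyword arguments keep their insertion order (Python 3.7+), so the aliased
    # columns appear in the order the caller wrote them.
    aliased = [expr.alias(name) for name, expr in named_cols.items()]
    return df.select(*cols, *aliased)
```

With PySpark this would be called as `select_named(df, "id", plus_one=col("id") + 1)`. One trade-off of any keyword-based form is that column names which are not valid Python identifiers have to be passed through `**{...}` unpacking.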
[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029864676
@drernie
> Which doesn't even seem to be documented for Python:
That's nothing more than standard Python argument unpacking. It works with any Python function, and in particular with ones that accept variadic arguments (`*cols`). It could also be written as:
```python
df.select(["*"] + [
    F.sum(col).over(windowval).alias(col_name)
    for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
])
```
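To illustrate the equivalence in isolation, here is a plain-Python sketch of a variadic function that, like `select`, accepts either separate arguments or a single list. `project` is a made-up stand-in used only for this example; it is not a Spark API.

```python
from typing import List, Sequence, Union


def project(*cols: Union[str, Sequence[str]]) -> List[str]:
    """Stand-in for a variadic API: returns the flat list of requested columns."""
    # A single list/tuple argument is treated as the full column list.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        return list(cols[0])
    return list(cols)


extra = ["cumA", "cumB", "cumC"]

# Unpacking spreads the list into separate positional arguments...
unpacked = project("*", *extra)
# ...which is equivalent to passing one concatenated list.
as_list = project(["*"] + extra)
```

Both calls yield the same flat column list, which is why the unpacking form needs no Spark-specific documentation.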
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834426502
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138244/
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833432751
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42728/
[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833399775
**[Test build #138206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138206/testReport)** for PR 32431 at commit [`b527346`](https://github.com/apache/spark/commit/b527346aaa6f42c7425abce4df50434b158c4bb4).
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833545790
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42730/
[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-974026612
> Hey @Yikun, could we reopen this PR so that we can review it and add the multiple columns support?
Yes, for sure. I'm happy to continue this work, but we still need the maintainers to reopen the PR if we really need it.
BTW, could you also share why you need it? Thanks.
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832412104
**[Test build #138147 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138147/testReport)** for PR 32431 at commit [`ba7d4e0`](https://github.com/apache/spark/commit/ba7d4e0c1e16af44ee59bab1eeaabf150bcece72).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834388955
**[Test build #138244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138244/testReport)** for PR 32431 at commit [`3f5102d`](https://github.com/apache/spark/commit/3f5102d5be8240053b7092b329ba71f67220770c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833502424
**[Test build #138208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138208/testReport)** for PR 32431 at commit [`cf77411`](https://github.com/apache/spark/commit/cf77411d1fd5dece718e857a0fc294d42f6d568e).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833738968
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138208/
[GitHub] [spark] Yikun edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029599807
@HyukjinKwon Sure, I will reopen and rebase it soon. Hmm, but it looks like I have no permission to reopen the PR. Would you mind helping to reopen it? Otherwise I can just submit a new PR.
[GitHub] [spark] zero323 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805150373
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same names.
+
+ The colsMap is a map of column name and column, the column must only refer to attributes
+ supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
Review comment:
Shall we add that only one map is supported?
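Beyond documenting it, one way to make the "only one map" restriction explicit is to validate the variadic argument up front. The following is a sketch only: `single_cols_map` is a hypothetical helper and the error messages are illustrative, not the PR's actual code.

```python
from typing import Any, Dict


def single_cols_map(*cols_map: Dict[str, Any]) -> Dict[str, Any]:
    """Return the one dict passed in, or raise if the call shape is wrong."""
    if len(cols_map) != 1:
        raise ValueError(
            "exactly one mapping of column names to columns is supported, "
            f"got {len(cols_map)}"
        )
    (mapping,) = cols_map
    if not isinstance(mapping, dict):
        raise TypeError(f"expected a dict, got {type(mapping).__name__}")
    return mapping
```

Keeping the `*colsMap` signature while rejecting anything but a single dict leaves room to relax the check later, for example if keyword arguments are added.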
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832327840
**[Test build #138147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138147/testReport)** for PR 32431 at commit [`ba7d4e0`](https://github.com/apache/spark/commit/ba7d4e0c1e16af44ee59bab1eeaabf150bcece72).
[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627873450
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2395,6 +2395,36 @@ class Dataset[T] private[sql](
*/
def withColumn(colName: String, col: Column): DataFrame = withColumns(Seq(colName), Seq(col))
+ /**
+ * (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns
+ * that has the same names.
+ *
+ * `colsMap` is a map of column name and column, the column must only refer to attributes
+ * supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
+ *
+ * @group untypedrel
+ * @since 3.2.0
+ */
+ def withColumns(colsMap: Map[String, Column]): DataFrame = {
+ val colNames = colsMap.flatMap{ case (colName, _) => Seq(colName) }.toSeq
Review comment:
Oh, it would be better to do `val (colNames, newCols) = colsMap.toSeq.unzip`.
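The suggested `unzip` splits the map into two sequences in a single pass, so names and columns cannot drift out of step. Expressed in Python purely for illustration (the code under review here is Scala, and `split_cols_map` is a hypothetical name), the same idea looks roughly like this:

```python
from typing import Any, Dict, List, Tuple


def split_cols_map(cols_map: Dict[str, Any]) -> Tuple[List[str], List[Any]]:
    """Split a name -> column mapping into two index-aligned lists."""
    if not cols_map:
        # zip(*[]) yields nothing to unpack, so handle the empty map explicitly.
        return [], []
    names, cols = zip(*cols_map.items())
    return list(names), list(cols)
```

Because both lists come from the same iteration over the map's items, the i-th name always belongs to the i-th column.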
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832355419
Kubernetes integration test unable to build dist.
exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42668/
[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627874943
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap):
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same name.
+
+ The colsMap is a map of column name and column, the column must only refer to attribute
Review comment:
attribute -> attributes
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833432751
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42728/
[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627874698
##########
File path: python/pyspark/sql/dataframe.pyi
##########
@@ -250,6 +250,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
self, cols: Union[List[str], Tuple[str]], support: Optional[float] = ...
) -> DataFrame: ...
def withColumn(self, colName: str, col: Column) -> DataFrame: ...
+ def withColumns(self, colsMap: Dict[str, Column] ) -> DataFrame: ...
Review comment:
`colsMap: Dict[str, Column] )` -> `colsMap: Dict[str, Column])`?
[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627251519
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2398,8 +2398,14 @@ class Dataset[T] private[sql](
/**
* Returns a new Dataset by adding columns or replacing the existing columns that has
* the same names.
+ *
+ * `column`'s expression in `cols` must only refer to attributes supplied by this Dataset.
+ * It is an error to add columns that refers to some other Dataset.
+ *
+ * @group untypedrel
+ * @since 3.2.0
*/
- private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+ def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
Review comment:
I'd like to add a Map parameter such as `colsMap`, which is more readable. I will do it in the next PR.
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833705927
**[Test build #138208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138208/testReport)** for PR 32431 at commit [`cf77411`](https://github.com/apache/spark/commit/cf77411d1fd5dece718e857a0fc294d42f6d568e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1039722541
Merged to master.
[GitHub] [spark] miltad commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
miltad commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-974012235
Hey @Yikun, could we reopen this PR so we can review it and add the multiple-column support?
[GitHub] [spark] github-actions[bot] closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431
[GitHub] [spark] zero323 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805150373
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same names.
+
+ The colsMap is a map of column name and column, the column must only refer to attributes
+ supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
Review comment:
Shall we add that only one map is supported?
[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037055281
@ueshin @viirya @BryanCutler @zero323 It would be good if you could take a look, thanks!
[GitHub] [spark] HyukjinKwon closed pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #32431:
URL: https://github.com/apache/spark/pull/32431
[GitHub] [spark] SparkQA removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834192100
**[Test build #138244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138244/testReport)** for PR 32431 at commit [`3f5102d`](https://github.com/apache/spark/commit/3f5102d5be8240053b7092b329ba71f67220770c).
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833401254
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138206/
[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r628004164
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2395,6 +2395,36 @@ class Dataset[T] private[sql](
*/
def withColumn(colName: String, col: Column): DataFrame = withColumns(Seq(colName), Seq(col))
+ /**
+ * (Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns
+ * that has the same names.
+ *
+ * `colsMap` is a map of column name and column, the column must only refer to attributes
+ * supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
+ *
+ * @group untypedrel
+ * @since 3.2.0
+ */
+ def withColumns(colsMap: Map[String, Column]): DataFrame = {
+ val colNames = colsMap.flatMap{ case (colName, _) => Seq(colName) }.toSeq
Review comment:
done, thanks for your suggestion!
##########
File path: python/pyspark/sql/dataframe.pyi
##########
@@ -250,6 +250,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
self, cols: Union[List[str], Tuple[str]], support: Optional[float] = ...
) -> DataFrame: ...
def withColumn(self, colName: str, col: Column) -> DataFrame: ...
+ def withColumns(self, colsMap: Dict[str, Column] ) -> DataFrame: ...
Review comment:
done
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap):
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same name.
+
+ The colsMap is a map of column name and column, the column must only refer to attribute
+ supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
Review comment:
done
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap):
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same name.
Review comment:
done
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap):
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same name.
+
+ The colsMap is a map of column name and column, the column must only refer to attribute
Review comment:
done
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832412930
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138147/
[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800587645
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":
Review comment:
Yep, that means for now, we only allow:
```python
withColumns({"col1": col1, "col2": col2})
```
In the future, we can also enable kwargs to allow:
```python
# With args and kwargs
withColumns({"col1": col1, "col2": col2}, col3=col3)
# With only kwargs
withColumns(col4=col4)
```
If there are no objections, I will update the signature to `*colsMap` in this PR.
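To make the kwargs idea above concrete, here is a minimal sketch of how positional maps and keyword arguments could be flattened into the parallel name and column lists that a JVM-side `withColumns(colNames, cols)` call expects. The helper `merge_column_specs` is hypothetical and is not part of this PR or of PySpark; plain strings stand in for `Column` objects so the snippet runs without a Spark session.

```python
from typing import Any, Dict, List, Tuple


def merge_column_specs(
    *cols_maps: Dict[str, Any], **kwargs: Any
) -> Tuple[List[str], List[Any]]:
    """Hypothetical helper: flatten positional maps and keyword arguments
    into parallel (names, columns) lists. Later entries override earlier
    ones that use the same column name."""
    merged: Dict[str, Any] = {}
    for cols_map in cols_maps:
        if not isinstance(cols_map, dict):
            raise TypeError(
                "each positional argument must be a dict of column name to column"
            )
        merged.update(cols_map)
    # Keyword arguments are applied last, so they win on a name clash.
    merged.update(kwargs)
    # dicts preserve insertion order, so names and columns stay aligned.
    return list(merged.keys()), list(merged.values())


# Strings stand in for Column expressions in this sketch.
names, cols = merge_column_specs({"col1": "expr1", "col2": "expr2"}, col3="expr3")
```

Accepting `*cols_maps` from the start is what would let such an extension land later without changing the call shape that existing users rely on.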
[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r805231319
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,41 @@ def freqItems(
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
+ """
+ Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+ existing columns that has the same names.
+
+ The colsMap is a map of column name and column, the column must only refer to attributes
+ supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
Review comment:
Sure, I will add a note in the `Parameters` section for `colsMap`:
```
Currently, only a single map is supported.
```
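As an illustration of what "only a single map is supported" could mean at runtime, the sketch below validates a variadic argument list and rejects anything other than exactly one dict. The function name `unpack_single_map` and the error messages are assumptions made for this example, not the code in the PR.

```python
from typing import Any, Dict, List, Tuple


def unpack_single_map(*cols_map: Dict[str, Any]) -> Tuple[List[str], List[Any]]:
    """Hypothetical check: accept exactly one dict and split it into
    parallel lists of column names and column expressions."""
    if len(cols_map) != 1:
        raise ValueError(f"expected exactly one map, got {len(cols_map)}")
    only_map = cols_map[0]
    if not isinstance(only_map, dict):
        raise TypeError("colsMap must be a dict of column name to column")
    return list(only_map.keys()), list(only_map.values())


# Integers stand in for Column expressions in this sketch.
col_names, col_exprs = unpack_single_map({"key1": 1, "key2": 2})
```

Passing two maps, or a non-dict such as a list of tuples, raises instead of silently ignoring the extra input.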
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833399775
**[Test build #138206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138206/testReport)** for PR 32431 at commit [`b527346`](https://github.com/apache/spark/commit/b527346aaa6f42c7425abce4df50434b158c4bb4).
[GitHub] [spark] xkrogen commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
xkrogen commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r626731636
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2398,8 +2398,14 @@ class Dataset[T] private[sql](
/**
* Returns a new Dataset by adding columns or replacing the existing columns that has
* the same names.
+ *
+ * `column`'s expression in `cols` must only refer to attributes supplied by this Dataset.
+ * It is an error to add columns that refers to some other Dataset.
+ *
+ * @group untypedrel
+ * @since 3.2.0
*/
- private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+ def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
Review comment:
+1, this is more intuitive and makes it easier to see the associations between names and definitions:
```scala
withColumns(Map(
  "col1" -> col(...),
  "col2" -> col(...)
))
```
vs.
```scala
withColumns(Seq("col1", "col2"), Seq(col(...), col(...)))
```
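A further practical difference between the two shapes, sketched here in Python with a hypothetical helper: two parallel sequences can disagree in length, so the caller or the API has to check for that, while a map cannot express a name without its definition. `pair_names_with_columns` is an illustrative name, and strings stand in for `Column` objects.

```python
from typing import Any, Dict, Sequence


def pair_names_with_columns(
    col_names: Sequence[str], cols: Sequence[Any]
) -> Dict[str, Any]:
    """Hypothetical helper: turn two parallel sequences into a map,
    refusing inputs whose lengths differ."""
    if len(col_names) != len(cols):
        raise ValueError(
            f"got {len(col_names)} column names but {len(cols)} columns"
        )
    return dict(zip(col_names, cols))


# Matching lengths pair up position by position.
paired = pair_names_with_columns(["col1", "col2"], ["expr1", "expr2"])
```

Without the explicit length check, `zip` would silently drop the unmatched tail, which is the kind of mistake the map-based signature rules out by construction.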
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r625681341
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2398,8 +2398,14 @@ class Dataset[T] private[sql](
/**
* Returns a new Dataset by adding columns or replacing the existing columns that has
* the same names.
+ *
+ * `column`'s expression in `cols` must only refer to attributes supplied by this Dataset.
+ * It is an error to add columns that refers to some other Dataset.
+ *
+ * @group untypedrel
+ * @since 3.2.0
*/
- private[spark] def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
+ def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame = {
Review comment:
If we're adding them, I think we should change the signature to either a map or a list of tuples.
--
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833502424
**[Test build #138208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138208/testReport)** for PR 32431 at commit [`cf77411`](https://github.com/apache/spark/commit/cf77411d1fd5dece718e857a0fc294d42f6d568e).
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834231467
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42766/
[GitHub] [spark] Yikun commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627248178
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,43 @@ def freqItems(self, cols, support=None):
support = 0.01
return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+ def with_columns(self, col_names, cols):
Review comment:
@HyukjinKwon OK, I will change the name to `withColumns`.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834231467
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42766/
[GitHub] [spark] drernie commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
drernie commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1029392644
My experience (and that of others) suggests that repeatedly calling `withColumn` is highly inefficient:
https://stackoverflow.com/questions/41400504/spark-scala-repeated-calls-to-withcolumn-using-the-same-function-on-multiple-c/41400588#41400588
The suggested alternative is using select in a very non-obvious way:
```python
df.select(
    "*",  # selects all existing columns
    *[
        F.sum(col).over(windowval).alias(col_name)
        for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
    ]
)
```
Which doesn't even seem to be documented for Python:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.select.html
I would greatly appreciate this API being made available, as it would improve the performance and reliability of my notebooks.
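The cost described above comes from each `withColumn` call introducing its own projection, so a loop of N calls stacks N projections on the plan. The toy model below is only a sketch of that effect: `Plan`, `with_column`, and `with_columns` are hypothetical stand-ins written for this example, not Spark classes, and the model ignores that Spark's optimizer may later collapse adjacent projections.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    """Toy logical plan node: one projection over an optional child plan."""

    columns: Tuple[str, ...]
    child: Optional["Plan"] = None

    def depth(self) -> int:
        # Number of stacked projection nodes, including this one.
        return 1 + (self.child.depth() if self.child else 0)

    def with_columns(self, names: Iterable[str]) -> "Plan":
        # One new projection that appends every column not already present.
        added = tuple(n for n in names if n not in self.columns)
        return Plan(self.columns + added, child=self)

    def with_column(self, name: str) -> "Plan":
        # Single-column variant: still one new projection per call.
        return self.with_columns([name])


base = Plan(("A", "B", "C"))
new_names = ["cumA", "cumB", "cumC"]

# Chained calls: one extra projection per added column.
chained = base
for name in new_names:
    chained = chained.with_column(name)

# Batched call: a single extra projection for all added columns.
batched = base.with_columns(new_names)
```

Both plans expose the same columns, but the chained one is two nodes deeper here, and the gap grows linearly with the number of columns added in a loop.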
[GitHub] [spark] Yikun commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
Yikun commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1030998850
@HyukjinKwon Done.
As mentioned by @zero323, the same effect can be achieved by extending the kwargs of `select` to support multiple columns.
Personally, I think `withColumns` might be a better fit and more readable, provided there are no negative effects. I guess that is also the reason why we introduced the `withColumn` API before?
[GitHub] [spark] zero323 commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565
Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the docstring.
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833429830
Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42728/
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833543010
Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42730/
--
[GitHub] [spark] AmplabJenkins commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-832412930
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138147/
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834228386
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-834192100
**[Test build #138244 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138244/testReport)** for PR 32431 at commit [`3f5102d`](https://github.com/apache/spark/commit/3f5102d5be8240053b7092b329ba71f67220770c).
[GitHub] [spark] SparkQA commented on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-833401223
**[Test build #138206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138206/testReport)** for PR 32431 at commit [`b527346`](https://github.com/apache/spark/commit/b527346aaa6f42c7425abce4df50434b158c4bb4).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] viirya commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r627875037
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2423,6 +2423,38 @@ def freqItems(self, cols, support=None):
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+    def withColumns(self, colsMap):
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same name.
+
+        The colsMap is a map of column name and column, the column must only refer to attribute
+        supplied by this Dataset. It is an error to add columns that refers to some other Dataset.
Review comment:
refers -> refer
[GitHub] [spark] zero323 commented on a change in pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 commented on a change in pull request #32431:
URL: https://github.com/apache/spark/pull/32431#discussion_r800524768
##########
File path: python/pyspark/sql/dataframe.py
##########
@@ -2911,6 +2911,37 @@ def freqItems(
             support = 0.01
         return DataFrame(self._jdf.stat().freqItems(_to_seq(self._sc, cols), support), self.sql_ctx)
+    def withColumns(self, colsMap: Dict[str, Column]) -> "DataFrame":
Review comment:
We can switch to `*colsMap` now (in the future we can switch to positional only)
```python
def withColumns(self, *colsMap: Dict[str, Column]) -> "DataFrame":
    assert len(colsMap) == 1
    ...
```
to cleanly enable `**kwargs` in the future, if there is enough support for such a feature.
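For illustration only, here is a minimal sketch of one reading of this suggestion. `FrameSketch` is a hypothetical stand-in that just stores column expressions as strings; it is not the PySpark `DataFrame`, and the merge logic is an assumption made so the example runs without Spark. The point it tries to show: with a variadic `*colsMap`, callers cannot pass `colsMap=` by keyword, so that name would stay free if `**kwargs` were added later.

```python
from typing import Any, Dict


class FrameSketch:
    """Hypothetical stand-in for a DataFrame; it only stores column expressions."""

    def __init__(self, columns: Dict[str, Any]) -> None:
        self.columns = dict(columns)

    def withColumns(self, *colsMap: Dict[str, Any]) -> "FrameSketch":
        # Accept exactly one positional dict for now, as in the suggestion above.
        assert len(colsMap) == 1, "colsMap must be passed as a single dict"
        (cols,) = colsMap
        assert isinstance(cols, dict), "colsMap must be a dict"
        # Entries in cols replace existing columns that have the same name.
        return FrameSketch({**self.columns, **cols})


df = FrameSketch({"id": "id"})
df2 = df.withColumns({"key1": "id + 1", "key2": "id * 2"})
print(sorted(df2.columns))  # ['id', 'key1', 'key2']

# Passing the dict by keyword is rejected by the variadic signature.
try:
    df.withColumns(colsMap={"key1": "id + 1"})
except TypeError as exc:
    print("rejected:", type(exc).__name__)
```

Because `colsMap=` is never a valid keyword here, a later signature such as `withColumns(self, *colsMap, **kwargs)` would not have to special-case a column that happens to be named `colsMap`.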
[GitHub] [spark] zero323 edited a comment on pull request #32431: [SPARK-35173][SQL][PYTHON] Add multiple columns adding support
Posted by GitBox <gi...@apache.org>.
zero323 edited a comment on pull request #32431:
URL: https://github.com/apache/spark/pull/32431#issuecomment-1037157565
Same as for the previous iteration ‒ I am neutral. Implementation looks OK, just minor comments for the Python docstring.