Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/24 20:46:10 UTC

[GitHub] [spark] amaliujia opened a new pull request, #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

amaliujia opened a new pull request, #38793:
URL: https://github.com/apache/spark/pull/38793

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   Implement DataFrame.withColumn(s).
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   API coverage
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   NO
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033010196


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   should we assert on the number of parts being present in the names list?
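A server-side guard of the kind suggested here could look like the following plain-Python sketch (the `single_name` helper and its error message are hypothetical illustrations, not actual Spark code):

```python
# Hypothetical validation sketch: withColumn(s) binds one column per entry, so
# the alias carried in each entry should have exactly one name part.
def single_name(name_parts):
    if len(name_parts) != 1:
        raise ValueError(
            f"expected exactly one name part, got {len(name_parts)}: {name_parts}"
        )
    return name_parts[0]

print(single_name(["age2"]))  # a scalar alias with one name part is accepted
```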





[GitHub] [spark] grundprinzip commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033010049


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -465,6 +466,29 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_with_columns(self):
+        # SPARK-41256: test withColumn(s).
+        self.assert_eq(

Review Comment:
   Does equality work like this for the pandas DFs? I'm wondering if this should be `pdf.equals(other)`?





[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328309352

   hmm well @grundprinzip maybe you should tell us why it uses repeated names :) https://github.com/apache/spark/commit/0f7eaeee6445aaf05310229bdec22f6953113f2e




[GitHub] [spark] grundprinzip commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327985075

   What is the reason for alias to have a multi-part identifier? Does this actually make sense? Maybe we need to fix this as well.




[GitHub] [spark] hvanhovell commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031915952


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that have the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply the corresponding expression on the column. If the
+  // column name exists in the input relation, the column is replaced; if it does
+  // not exist in the input relation, the column is added.
+  map<string, Expression> cols_map = 2;

Review Comment:
   What happens if I use the same name multiple times? I guess this is more for the client side to figure that out, but it would be good to mention this, and maybe state a preference.





[GitHub] [spark] hvanhovell commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031932105


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that have the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply the corresponding expression on the column. If the
+  // column name exists in the input relation, the column is replaced; if it does
+  // not exist in the input relation, the column is added.
+  map<string, Expression> cols_map = 2;

Review Comment:
   IMO we just need to define what will happen to duplicates and document this. We can either throw an error, or pick one (proto3 would pick the last value).
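The "pick the last value" option mentioned here can be sketched in plain Python (the helper name is illustrative, not part of Spark):

```python
# Last-value-wins resolution for a repeated (name, expression) list, mirroring
# how proto3 resolves duplicate keys when parsing a map field.
def resolve_last_wins(pairs):
    resolved = {}
    for name, expr in pairs:  # a later entry for the same name overwrites
        resolved[name] = expr
    return list(resolved.items())

pairs = [("a", "lit(1)"), ("b", "lit(2)"), ("a", "lit(3)")]
print(resolve_last_wins(pairs))  # [('a', 'lit(3)'), ('b', 'lit(2)')]
```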





[GitHub] [spark] grundprinzip commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1032061331


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that have the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply the corresponding expression on the column. If the
+  // column name exists in the input relation, the column is replaced; if it does
+  // not exist in the input relation, the column is added.
+  map<string, Expression> cols_map = 2;

Review Comment:
   +1 on using a repeated tuple here. The reason is that the behavior should be defined by the API, not by the language implementation. We only have consistent behavior in Spark today because Python converts to Scala before even calling the method, so there is only one client implementation.
   
   My suggestion would be to change this to:
   
   ```
   repeated Expression.Alias col_map
   ```
   
   The Alias has a reference to an arbitrary expression and a name, which is exactly what we want.
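As a rough illustration of that shape, here is a toy Python model (not the real generated proto classes):

```python
# Toy model of the proposed shape: each entry pairs an arbitrary expression
# with its alias name(s), and the list preserves insertion order.
from dataclasses import dataclass
from typing import List

@dataclass
class Alias:
    expr: str              # stand-in for an arbitrary Expression
    name_parts: List[str]  # alias name; a single entry for scalar columns

col_map = [Alias("age + 1", ["age2"]), Alias("upper(name)", ["name_uc"])]
print([a.name_parts[0] for a in col_map])  # ['age2', 'name_uc']
```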
   





[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1326828507

   @zhengruifeng @grundprinzip @cloud-fan @HyukjinKwon 




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034274139


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -503,6 +503,63 @@ def _show_string(
         assert pdf is not None
         return pdf["show_string"][0]
 
+    def withColumns(self, colsMap: Dict[str, Expression]) -> "DataFrame":

Review Comment:
   There is no such API in PySpark, so I think probably not?





[GitHub] [spark] grundprinzip commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328310814

   Good point! The reason for making it repeated was that a function you alias might produce multiple columns.
   
   So I think the repeated field is fine here, and we should probably update the documentation of the proto to leave a trace, so that we don't have to do a history search every time.
   
   The key part is that for scalar columns just one value is present.




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033010690


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -465,6 +466,29 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_with_columns(self):
+        # SPARK-41256: test withColumn(s).
+        self.assert_eq(

Review Comment:
   I believe the test assert is more powerful. If you encounter a false-equal case, this assert prints much more information to help you understand where the inequality is.
   
   `assert pdf.equals(other)` will just tell you the two are not equal, that's all. See https://github.com/apache/spark/pull/38670
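The difference in diagnostics can be seen with plain pandas (assuming pandas is available; the two frames here are made up for illustration):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

a = pd.DataFrame({"id": [1, 2], "v": [10, 20]})
b = pd.DataFrame({"id": [1, 2], "v": [10, 21]})

# .equals only yields a boolean: no hint about where the frames differ.
assert not a.equals(b)

# assert_frame_equal raises with a message pinpointing the differing column.
try:
    assert_frame_equal(a, b)
except AssertionError as e:
    print(e)  # names column "v" and shows the mismatching values
```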





[GitHub] [spark] cloud-fan commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034229052


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   I think it's better to have server-side validation if it's not hard. Or we can add a new proto message `MultiAlias` so that this is "type-safe".





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031932719


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that have the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply the corresponding expression on the column. If the
+  // column name exists in the input relation, the column is replaced; if it does
+  // not exist in the input relation, the column is added.
+  map<string, Expression> cols_map = 2;

Review Comment:
   I can change this to a named-expression list. I will wait a bit to see if there is an opposing opinion.





[GitHub] [spark] hvanhovell commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327966772

   @amaliujia why is using `Alias` not ideal for this case? It seems to fit the bill rather nicely.




[GitHub] [spark] zhengruifeng commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031959388


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -477,6 +477,63 @@ def _show_string(
         assert pdf is not None
         return pdf["show_string"][0]
 
+    def withColumns(self, colsMap: Dict[str, Expression]) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same names.
+
+        The colsMap is a map of column name and column, the column must only refer to attributes
+        supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        colsMap : dict
+            a dict of column name and :class:`Column`.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            DataFrame with new or replaced columns.
+        """
+        if not isinstance(colsMap, dict):
+            raise TypeError("colsMap must be dict of column name and column.")
+
+        return DataFrame.withPlan(
+            plan.WithColumns(self._plan, colsMap),
+            session=self._session,
+        )
+
+    def withColumn(self, colName: str, col: Expression) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding a column or replacing the
+        existing column that has the same name.
+
+        The column expression must be an expression over this :class:`DataFrame`; attempting to add
+        a column from some other :class:`DataFrame` will raise an error.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        colName : str
+            string, name of the new column.
+        col : :class:`Column`
+            a :class:`Column` expression for the new column.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            DataFrame with new or replaced column.
+        """
+        if not isinstance(col, Expression):
+            raise TypeError("col should be Column")

Review Comment:
   ```suggestion
               raise TypeError("col should be Expression")
   ```



##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,18 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val m = rel.getColsMap.asScala.toMap
+    val names = m.map { case (k, _) =>
+      k
+    }.toSeq
+    val cols = m.map { case (_, c) =>
+      Column(transformExpression(c))
+    }.toSeq
+

Review Comment:
   ```suggestion
       val (names, cols) = m.mapValues(c => Column(transformExpression(c))).toSeq.unzip
   ```
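In plain Python, the same transform-and-unzip shape looks like this (toy string stand-ins for `Column` and the expressions):

```python
# Transform each value of a name -> expression map and split the result into
# parallel (names, cols) sequences, analogous to mapValues(...).toSeq.unzip.
cols_map = {"age": "age + 1", "name2": "upper(name)"}
names, cols = zip(*((k, f"Column({v})") for k, v in cols_map.items()))
print(list(names))  # ['age', 'name2']
print(list(cols))   # ['Column(age + 1)', 'Column(upper(name))']
```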



##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -357,6 +358,29 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_with_columns(self):
+        # SPARK-41256: tes withColumn(s).

Review Comment:
   ```suggestion
           # SPARK-41256: test withColumn(s).
   ```





[GitHub] [spark] cloud-fan commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328558979

   My two cents:
   
   In general, I agree with @amaliujia that we should not reuse a proto message with a mismatched semantic. But we should define the semantic by the proto message itself, not the DataFrame API. For a `WithColumns` operator, I think using `Alias` matches its semantic (add/replace columns with the provided expressions according to the provided names). The problem I see here is `Alias` can contain a list of alias names. We should either document it clearly that `WithColumns` can only use `Alias` with a single alias name, or introduce `MultiAlias` like what catalyst does.




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034273829


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -506,6 +506,12 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(select(1), spark.sql("SELECT 1"))
   }
 
+  test("Test withColumns") {
+    comparePlans(
+      connectTestRelation.withColumns(Map("id" -> 1024, "col_not_exist" -> 2048)),

Review Comment:
   yep, we have one for the Connect DSL:
   ```
       implicit def intToLiteral(i: Int): Expression =
         Expression
           .newBuilder()
           .setLiteral(Expression.Literal.newBuilder().setI32(i))
           .build()
   ```
           





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033948922


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   Documented this into proto and clients implementation can take care of it.





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031931607


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that have the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply the corresponding expression on the column. If the
+  // column name exists in the input relation, the column is replaced; if it does
+  // not exist in the input relation, the column is added.
+  map<string, Expression> cols_map = 2;

Review Comment:
   I see the reason to move to a list of named expressions.
   
   I also want to point out another thing so we have a comprehensive discussion:
   
   A list of named expressions means there could be duplicates, while the current API contract does not allow duplicates. If an implementation submits a list of named expressions that contains duplicates, we will need to define a way to dedup. The best outcome would be a sorted unique map, but that may not be easy to achieve with proto.
    





[GitHub] [spark] cloud-fan commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327039829

   We can't change the existing Scala/Python APIs that take a map, but at least the proto message should use a list to keep the order. We can declare that we fail if the list elements have name conflicts.




[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328309986

   After reading the commit https://github.com/apache/spark/commit/0f7eaeee6445aaf05310229bdec22f6953113f2e, honestly I am not sure whether the Python side has a valid usage for Alias.
   
   @cloud-fan, do you recall whether we support multiple name parts for an alias?




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034274139


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -503,6 +503,63 @@ def _show_string(
         assert pdf is not None
         return pdf["show_string"][0]
 
+    def withColumns(self, colsMap: Dict[str, Expression]) -> "DataFrame":

Review Comment:
   There is no such API in PySpark, so I think probably not.
   
   Just FYI, I found that the PySpark DataFrame API has some signatures that differ from the Scala Dataset API.





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034277068


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   I would really rather not add more validation on the server side, partly because I feel it makes for a poor user experience (and partly because of how we throw exceptions on the client side).
   
   If we can eliminate the ambiguity entirely (and thus arrive at a good API design), then why not? That can be achieved by having two proto messages: one for a single alias and another for multiple aliases.
   
   Even including a `bool` to indicate whether this is a single alias or a multi-part alias is better than server-side validation. IMO, an API (even a proto one) that causes implementation ambiguity is a bad API.





[GitHub] [spark] grundprinzip commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1329841763

   @amaliujia It would be great if you could update the commit message with a bit more context on which proto fields were added and why it is implemented the way it is, as a summary of our discussion here. That way, when someone goes through the commit history, they can see the decision-making process.




[GitHub] [spark] grundprinzip commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1329840658

   Thanks for the good discussion and the clear commit message.




[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1329860082

   @grundprinzip the commit message eventually becomes the PR description, so I updated it there.
   
   Maybe in the future we need a developer doc to explain some of our current proto design decisions.




[GitHub] [spark] cloud-fan commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034229626


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -506,6 +506,12 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(select(1), spark.sql("SELECT 1"))
   }
 
+  test("Test withColumns") {
+    comparePlans(
+      connectTestRelation.withColumns(Map("id" -> 1024, "col_not_exist" -> 2048)),

Review Comment:
   Is there a Scala implicit that turns `1024` into a literal?





[GitHub] [spark] cloud-fan commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034230127


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -503,6 +503,63 @@ def _show_string(
         assert pdf is not None
         return pdf["show_string"][0]
 
+    def withColumns(self, colsMap: Dict[str, Expression]) -> "DataFrame":

Review Comment:
   shall we also add the `withColumns` overload that takes 2 lists?





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031958866


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   For the proto I guess we will keep Expression. TBH, I don't know why Spark chose `Column` to refer to expressions.
   
   Expressions, Relations, etc. are a sort of shared vocabulary for describing relational plans, and "column" usually refers to a column reference.





[GitHub] [spark] grundprinzip commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033010370


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,18 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name

Review Comment:
   ```suggestion
     // name exists in the input relation, then replacing the column. If the column name
   ```



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,18 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  //
+  // An exception is thrown when there are duplicated names.

Review Comment:
   ```suggestion
     // An exception is thrown when duplicated names are present in the mapping.
   ```



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,18 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column

Review Comment:
   ```suggestion
     // Given a column name, apply the corresponding expression on the column. If the column
   ```



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,18 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.

Review Comment:
   ```suggestion
     // does not exist in the input relation, then add it as a new column.
   ```





[GitHub] [spark] cloud-fan commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328658376

   > Maybe we can do Project[*, add columns]?
   
   The semantics are different. `df.withColumns` can replace columns if the given name(s) conflict with existing columns. `Project[*, added columns]` can't do that.
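   The semantic difference can be modeled at the schema level with a small sketch (plain Python over column-name lists, not Spark code; the helper names are hypothetical):
   
   ```python
   # Illustrative model of the schema-level difference between df.withColumns
   # and Project[*, added columns]; these helpers are hypothetical, not Spark APIs.
   
   def with_columns_schema(schema, new_cols):
       """withColumns: replace in place when the name exists, else append."""
       out = list(schema)
       for name in new_cols:
           if name not in out:
               out.append(name)  # new column is appended
           # an existing name keeps its position (the column is replaced in place)
       return out
   
   def project_star_plus_schema(schema, new_cols):
       """Project[*, added columns]: always append, so name conflicts duplicate."""
       return list(schema) + list(new_cols)
   
   print(with_columns_schema(["a", "b"], ["b", "c"]))       # ['a', 'b', 'c']
   print(project_star_plus_schema(["a", "b"], ["b", "c"]))  # ['a', 'b', 'b', 'c']
   ```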




[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1329610420

   Rebased and tried to address existing comments




[GitHub] [spark] hvanhovell commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031919374


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   An alternative way of modelling this would be to use named expressions instead of key value pairs.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031957413


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   +1 on using repeated instead of map here to keep the order.
   
   After the `Column` refactoring, will we use `Column` instead of `Expression` in proto messages?





[GitHub] [spark] grundprinzip commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033950701


##########
python/pyspark/sql/connect/proto/relations_pb2.pyi:
##########
@@ -1601,3 +1610,48 @@ class RenameColumnsByNameToNameMap(google.protobuf.message.Message):
     ) -> None: ...
 
 global___RenameColumnsByNameToNameMap = RenameColumnsByNameToNameMap
+
+class WithColumns(google.protobuf.message.Message):
+    """Adding columns or replacing the existing columns that has the same names."""

Review Comment:
   The comment is auto-generated from the comment in the proto.





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033010690


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -465,6 +466,29 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_with_columns(self):
+        # SPARK-41256: test withColumn(s).
+        self.assert_eq(

Review Comment:
   I believe this assert is more powerful. If you encounter a false-equal case, this assert prints much more information to help you understand where the inequality is.
   
   `assert pdf.equals(other)` will just report that the two are not equal, and that's all. See https://github.com/apache/spark/pull/38670





[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327980390

   I pushed a commit to use `Alias`, but now I know why I have had a sense that the semantics are different:
   
   ```
     message Alias {
       Expression expr = 1;
       repeated string name = 2;
       // Alias metadata expressed as a JSON map.
       optional string metadata = 3;
     }
    ```
    
    Alias has a list of name parts, while WithColumn assumes one name part (a single string). Right now I just assume there is exactly one name part passed in the Alias message. 
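    The single-name-part assumption can be sketched like this (illustrative Python, with a plain dict standing in for the Alias message; not the generated proto bindings):
    
    ```python
    # Hypothetical sketch: extract the one name part WithColumns expects from an
    # Alias-like record; the "name" key mirrors `repeated string name = 2;`.
    
    def single_alias_name(alias: dict) -> str:
        names = alias["name"]
        if len(names) != 1:
            raise ValueError(f"expected exactly one name part, got {len(names)}")
        return names[0]
    
    print(single_alias_name({"expr": "id + 1", "name": ["id2"]}))  # id2
    ```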




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034324598


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   Added server-side validation with a test case.





[GitHub] [spark] grundprinzip commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328586566

   I don't think anyone disputes that reusing messages in different contexts just because the fields are identical is a bad idea.
   
   One thing I'm wondering, and I'm sorry I didn't bring this up earlier, is whether we shouldn't implement WithColumn using a Projection. The main difference I see is that to make the projection work, we have to resolve the schema to be able to append columns.
   
   Maybe we can do Project[*, add columns]?




[GitHub] [spark] grundprinzip commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328671269

   > > Maybe we can do Project[*, add columns]?
   > 
   > The semantic is different. `df.withColumns` can replace columns if the given name(s) conflict with the existing columns. `Project[*, added columns]` can't do it.
   
   Yes, `Project[*, added columns]` can't do that, but you are able to replace the columns on the client side by resolving the schema first. What I'm trying to figure out is where we draw the line on adding another relation type, so that we avoid becoming a different Py4j.
   
   




[GitHub] [spark] cloud-fan commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1330260674

   thanks, merging to master!




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031921894


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   This is an interesting topic. Given the current withColumns API design, users already cannot maintain or control the order of schema fields (please correct me if I am wrong).
   
   The nice thing to have is that if a user calls withColumns with the same parameters on the same DataFrame twice, they see the same output schema through this proto (ordering not predictable, but at least consistent).
   
   However, I don't know if this is feasible. For Python/Scala we offer APIs like Dict/Map, and once those are used, value iteration is not deterministic, so we already cannot produce a stable ordering on the client side and thus cannot preserve it through the proto. Is this true?
   
   If we can ever stably produce an ordering on clients, we can of course maintain that ordering through the proto.
   
   cc @cloud-fan @HyukjinKwon 





[GitHub] [spark] AmplabJenkins commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327052842

   Can one of the admins verify this patch?




[GitHub] [spark] hvanhovell commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327971103

   @amaliujia the semantics of `Alias` and the `name` + `expression` pair are 100% the same. Both are named expressions, and at the end of the day `withColumns` will produce a `Project` using exactly those `Alias` expressions. The replacement/appending semantics of withColumn do not change anything here. Come to think of it, having metadata here would actually make sense.
   
   There is no way we are ever going to change `Alias`; it has been in databases for the last 50+ years. However, let's entertain the idea of changing `Alias`: you could solve those problems with proper versioning and a breaking-change process (we are going to need both at some point in the not-so-far-away future). I am assuming you mean changing `Alias` in the Connect protocol; in that case such a breaking change probably means you would introduce a new expression.
   




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031931607


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   I see the reason to move to a list of named expressions.
   
   I also want to point out another thing so we have a comprehensive discussion:
   
   A list of named expressions means there could be duplicates, while the current API contract does not support duplicates. If an implementation submits a list of named expressions that contains duplicates, we will need to define a way to dedup. The best would be a sorted unique map, but maybe that is not easy to achieve with proto.
    





[GitHub] [spark] hvanhovell commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1329420695

   @grundprinzip IMO we should go for a server-side implementation of `withColumns` for the following reasons:
   - Connect in its current form is lazy. Going the project route would violate that, because we would need to execute an RPC to get the schema. This in itself is not a problem; however, when the schema of the table changes you might end up with missing columns or errors.
   - Code reuse. As @amaliujia mentioned, we are likely to have `withColumns` in multiple clients (Scala will definitely have this as well). I want to drive home the point that withColumns has some non-trivial name resolution logic (name parsing and a bunch of other things); I think it is better to do it once and have a consistent user experience.
   - Customers love to stack withColumns on top of each other (e.g. doing analysis of all columns in a dataframe in a loop). I would love to move this into the analyzer at some point; that should make it quite a bit faster.
   
   Taking a step back, I think we need some decision framework on where to implement what. In this case the logic is non-trivial and I think it belongs on the server.




[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033950708


##########
python/pyspark/sql/connect/proto/relations_pb2.pyi:
##########
@@ -1601,3 +1610,48 @@ class RenameColumnsByNameToNameMap(google.protobuf.message.Message):
     ) -> None: ...
 
 global___RenameColumnsByNameToNameMap = RenameColumnsByNameToNameMap
+
+class WithColumns(google.protobuf.message.Message):
+    """Adding columns or replacing the existing columns that has the same names."""

Review Comment:
   I changed it in the source, which is `relations.proto`.





[GitHub] [spark] bjornjorgensen commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033963796


##########
python/pyspark/sql/connect/proto/relations_pb2.pyi:
##########
@@ -1601,3 +1610,48 @@ class RenameColumnsByNameToNameMap(google.protobuf.message.Message):
     ) -> None: ...
 
 global___RenameColumnsByNameToNameMap = RenameColumnsByNameToNameMap
+
+class WithColumns(google.protobuf.message.Message):
+    """Adding columns or replacing the existing columns that has the same names."""

Review Comment:
   I paste the text into a new Gmail draft to find these errors.





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034311728


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   I will close this comment by adding validation on the server side.
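   A minimal sketch of the kind of duplicate-name validation being discussed (plain Python for illustration; the helper name is hypothetical and this is not the actual Scala planner code):
   
   ```python
   def validate_no_duplicate_columns(name_expr_list):
       """Reject a WithColumns-style payload whose names contain duplicates,
       since the withColumns API contract does not allow them."""
       seen = set()
       for name, _expr in name_expr_list:
           if name in seen:
               raise ValueError(f"Duplicate column name in withColumns: {name!r}")
           seen.add(name)
   
   # A list with unique names passes; a list with a repeated name is rejected.
   validate_no_duplicate_columns([("a", "expr_a"), ("b", "expr_b")])
   try:
       validate_no_duplicate_columns([("a", "e1"), ("a", "e2")])
   except ValueError as e:
       print(e)
   ```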





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031955877


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   In addition, Python dictionaries keep insertion order by default (guaranteed since Python 3.7).
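   A small illustration of this (plain Python, not tied to the Connect client code):
   
   ```python
   # Since Python 3.7, dict preserves insertion order, so a withColumns-style
   # API that takes a dict can observe a stable, user-controlled column order.
   cols = {"b": "expr_b", "a": "expr_a", "c": "expr_c"}
   
   # Iteration follows insertion order, not alphabetical order.
   assert list(cols.keys()) == ["b", "a", "c"]
   
   # Re-assigning an existing key updates the value but keeps its position.
   cols["b"] = "expr_b2"
   assert list(cols.keys()) == ["b", "a", "c"]
   ```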





[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327968820

   @hvanhovell 
   
   Well, the simple answer is that the WithColumn API's parameters are not named `Alias`.
   
   But my longer thought is:
   
   Alias has its own defined semantics (including those fields). Those semantics do not match what this PR looks for; Alias just happens to have two fields that can help this PR (and Alias has one more field, metadata, which belongs to its own semantics).
   
   I think it is good to avoid reusing a message only because the field types fit while the semantics do not match. In principle, message semantics can evolve, so over-using a message will cause trouble. A contrived example: one day we decide that Alias should only accept an alias whose type is float. The places that truly deal with Alias can adapt, which makes sense, but our WithColumns would break because we happened to use Alias for its matching field types rather than its semantics.
   
   So we should reuse a message because the semantics match. If the semantics match but the message lacks some capability, we can extend it by adding more fields to help. But it does not work the other way around.




[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327972232

   @hvanhovell 
   
   I see your point better now.
   
   When I say the semantics are different, I mean at the DataFrame API surface (which is withColumn). But you are right that in the implementation those parameters are converted to Alias in the end, and at that point the semantics match.
   
   I don't have a strong opinion here. I will probably just change it to use `Alias`.




[GitHub] [spark] grundprinzip commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033014590


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -465,6 +466,29 @@ def test_fill_na(self):
             self.spark.sql(query).na.fill({"a": True, "b": 2}).toPandas(),
         )
 
+    def test_with_columns(self):
+        # SPARK-41256: test withColumn(s).
+        self.assert_eq(

Review Comment:
   SG





[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328658539

   @cloud-fan thanks. I will stay with `Alias` and probably try to document it clearly for `WithColumn`.




[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1328675019

   Hmmmm, it's a trade-off between the number of proto messages and implementation complexity. I personally choose the proto route because there is more than one client; the implementation complexity would otherwise be multiplied by N.




[GitHub] [spark] bjornjorgensen commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033300215


##########
python/pyspark/sql/connect/proto/relations_pb2.pyi:
##########
@@ -1601,3 +1610,48 @@ class RenameColumnsByNameToNameMap(google.protobuf.message.Message):
     ) -> None: ...
 
 global___RenameColumnsByNameToNameMap = RenameColumnsByNameToNameMap
+
+class WithColumns(google.protobuf.message.Message):
+    """Adding columns or replacing the existing columns that has the same names."""

Review Comment:
   has -> have





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033949229


##########
python/pyspark/sql/connect/proto/relations_pb2.pyi:
##########
@@ -1601,3 +1610,48 @@ class RenameColumnsByNameToNameMap(google.protobuf.message.Message):
     ) -> None: ...
 
 global___RenameColumnsByNameToNameMap = RenameColumnsByNameToNameMap
+
+class WithColumns(google.protobuf.message.Message):
+    """Adding columns or replacing the existing columns that has the same names."""

Review Comment:
   This is in fact auto-generated.



##########
python/pyspark/sql/connect/proto/relations_pb2.pyi:
##########
@@ -1601,3 +1610,48 @@ class RenameColumnsByNameToNameMap(google.protobuf.message.Message):
     ) -> None: ...
 
 global___RenameColumnsByNameToNameMap = RenameColumnsByNameToNameMap
+
+class WithColumns(google.protobuf.message.Message):
+    """Adding columns or replacing the existing columns that has the same names."""
+
+    DESCRIPTOR: google.protobuf.descriptor.Descriptor
+
+    INPUT_FIELD_NUMBER: builtins.int
+    NAME_EXPR_LIST_FIELD_NUMBER: builtins.int
+    @property
+    def input(self) -> global___Relation:
+        """(Required) The input relation."""
+    @property
+    def name_expr_list(
+        self,
+    ) -> google.protobuf.internal.containers.RepeatedCompositeFieldContainer[
+        pyspark.sql.connect.proto.expressions_pb2.Expression.Alias
+    ]:
+        """(Required)
+
+        Given a column name, apply corresponding expression on the column. If column

Review Comment:
   Ditto.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031955610


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   In fact, we can keep the order in `withColumns` by using `scala.collection.immutable.ListMap`, e.g.:
   
   ```scala
   spark.range(1).withColumns(scala.collection.immutable.ListMap("id" -> $"id"))
   ```
   
   So, it's up to users to decide whether they care about the order or not.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031955784


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   Internally we convert them to a list of pairs, so let's use a list to keep the order from the map.
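   A sketch of what the list-based shape discussed here could look like, consistent with the generated Python stub quoted elsewhere in this thread (illustrative only; the field number and exact comments are assumptions, not the merged definition):
   
   ```proto
   // Alternative to `map<string, Expression> cols_map`: a repeated field
   // preserves the order in which entries were written, whereas protobuf
   // map fields do not guarantee iteration order.
   message WithColumns {
     // (Required) The input relation.
     Relation input = 1;
   
     // (Required) Ordered list of (name, expression) pairs. Duplicate names
     // should be rejected by the server, since the withColumns API contract
     // does not allow them.
     repeated Expression.Alias name_expr_list = 2;
   }
   ```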





[GitHub] [spark] hvanhovell commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031915952


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   What happens if I use the same name multiple times? I guess this is more for the client side to figure out, but it would be good to mention this, and maybe state a preference.





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031958866


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   For the proto I guess we will keep Expression. TBH, I don't know why Spark chose `Column` to refer to expressions...





[GitHub] [spark] grundprinzip commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034295175


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   Let's not use the current crude way of doing error handling as an excuse to skip server-side validation.





[GitHub] [spark] hvanhovell commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031918373


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   Can we use something here that guarantees an order? I don't think the current map based approach to withColumns was particularly well thought through (unordered maps will produce unstable schema's) and I don't necessarily want to tie myself to that.





[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1034305923


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -242,6 +243,17 @@ class SparkConnectPlanner(session: SparkSession) {
       .logicalPlan
   }
 
+  private def transformWithColumns(rel: proto.WithColumns): LogicalPlan = {
+    val (names, cols) =
+      rel.getNameExprListList.asScala
+        .map(e => (e.getName(0), Column(transformExpression(e.getExpr))))

Review Comment:
   The current API design is not good because it carries ambiguity and thus requires the server to validate. This happens because we put two things in the same proto field, so the server side needs to either branch on the cases or validate one of them. This is the primary point. The current approach to error handling only makes it look even worse.






[GitHub] [spark] amaliujia commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1032721067


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -477,6 +477,63 @@ def _show_string(
         assert pdf is not None
         return pdf["show_string"][0]
 
+    def withColumns(self, colsMap: Dict[str, Expression]) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same names.
+
+        The colsMap is a map of column name and column, the column must only refer to attributes
+        supplied by this Dataset. It is an error to add columns that refer to some other Dataset.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        colsMap : dict
+            a dict of column name and :class:`Column`.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            DataFrame with new or replaced columns.
+        """
+        if not isinstance(colsMap, dict):
+            raise TypeError("colsMap must be dict of column name and column.")
+
+        return DataFrame.withPlan(
+            plan.WithColumns(self._plan, colsMap),
+            session=self._session,
+        )
+
+    def withColumn(self, colName: str, col: Expression) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding a column or replacing the
+        existing column that has the same name.
+
+        The column expression must be an expression over this :class:`DataFrame`; attempting to add
+        a column from some other :class:`DataFrame` will raise an error.
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        colName : str
+            string, name of the new column.
+        col : :class:`Column`
+            a :class:`Column` expression for the new column.
+
+        Returns
+        -------
+        :class:`DataFrame`
+            DataFrame with new or replaced column.
+        """
+        if not isinstance(col, Expression):
+            raise TypeError("col should be Column")

Review Comment:
   I think we want to add `Column` to the top-level API anyway: https://github.com/apache/spark/pull/38806





[GitHub] [spark] amaliujia commented on pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38793:
URL: https://github.com/apache/spark/pull/38793#issuecomment-1327951036

   I ended up with:
   
   1. Changed the WithColumns proto to use a list of name/column pairs.
   2. I chose not to use `Alias`, which is not ideal for this case; instead I use a small message to wrap each name/expression pair.
   3. I followed the current API and just throw an exception when there are duplicated names.
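   The list-of-pairs design from point 1 can be sketched in pure Python. This is an illustrative stand-in only, not the actual Connect implementation: the function name, signature, and schema-as-list-of-names representation are all simplifications.
   
   ```python
   # Toy model of withColumns over an ordered list of (name, expression)
   # pairs: duplicates raise, existing names are replaced in place, and
   # new names are appended in the order the caller wrote them.
   
   def with_columns(schema, name_expr_list):
       names = [name for name, _ in name_expr_list]
       if len(names) != len(set(names)):
           raise ValueError("Found duplicate column name(s)")
       result = list(schema)
       for name, _expr in name_expr_list:
           if name not in result:
               result.append(name)  # new column -> appended in list order
           # else: existing column -> replaced in place, position unchanged
       return result
   
   # Replacing "a" keeps its position; "d" and "e" are appended in order.
   print(with_columns(["a", "b", "c"], [("a", "a + 1"), ("d", "b * 2"), ("e", "c")]))
   # -> ['a', 'b', 'c', 'd', 'e']
   ```
   
   Because the pairs are an ordered list rather than a map, the resulting schema is the same no matter which platform serializes or deserializes the plan.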




[GitHub] [spark] hvanhovell commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
hvanhovell commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1031929419


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -457,3 +458,16 @@ message RenameColumnsByNameToNameMap {
   // duplicated B are not allowed.
   map<string, string> rename_columns_map = 2;
 }
+
+// Adding columns or replacing the existing columns that has the same names.
+message WithColumns {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required)
+  //
+  // Given a column name, apply corresponding expression on the column. If column
+  // name exists in the input relation, then replacing the column. if column name
+  // does not exist in the input relation, then adding the column.
+  map<string, Expression> cols_map = 2;

Review Comment:
   There is nothing we can do about the current APIs. I just think there is a very real chance that we will end up with a version of withColumns that takes an ordered collection of columns. A downside of the current APIs is that when you change platform versions you might end up with a different ordering.
   
   One of the weirder things here is that the order you see on the client side (by iterating over the map) does not have to be the same order that is used by the server. This is because the map implementations used on both ends are different (e.g. they use different hashes). This will be confusing to end users. I think that alone is a very strong reason to change this to a list of name-expression pairs.
   
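   The ordering hazard described above can be simulated in a few lines of pure Python. This is a toy stand-in: real proto maps are hash-based, and here sorting merely plays the role of an arbitrary, different server-side iteration order.
   
   ```python
   # The client writes columns in one order; a map-based wire format lets
   # the server see them in another, so the resulting schema can differ
   # from what the user wrote. Column names below are illustrative.
   
   client_pairs = [("b_new", "expr1"), ("a_new", "expr2")]  # order the user wrote
   server_map_view = dict(sorted(client_pairs))             # stand-in for a hash map's arbitrary order
   
   client_schema = [name for name, _ in client_pairs]
   server_schema = list(server_map_view)
   
   print(client_schema)  # ['b_new', 'a_new']
   print(server_schema)  # ['a_new', 'b_new'] -- a different, surprising order
   ```
   
   A repeated field of name-expression pairs avoids this entirely, since repeated fields preserve element order across serialization.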
   





[GitHub] [spark] bjornjorgensen commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033304280


##########
python/pyspark/sql/connect/proto/relations_pb2.pyi:
##########
@@ -1601,3 +1610,48 @@ class RenameColumnsByNameToNameMap(google.protobuf.message.Message):
     ) -> None: ...
 
 global___RenameColumnsByNameToNameMap = RenameColumnsByNameToNameMap
+
+class WithColumns(google.protobuf.message.Message):
+    """Adding columns or replacing the existing columns that has the same names."""
+
+    DESCRIPTOR: google.protobuf.descriptor.Descriptor
+
+    INPUT_FIELD_NUMBER: builtins.int
+    NAME_EXPR_LIST_FIELD_NUMBER: builtins.int
+    @property
+    def input(self) -> global___Relation:
+        """(Required) The input relation."""
+    @property
+    def name_expr_list(
+        self,
+    ) -> google.protobuf.internal.containers.RepeatedCompositeFieldContainer[
+        pyspark.sql.connect.proto.expressions_pb2.Expression.Alias
+    ]:
+        """(Required)
+
+        Given a column name, apply corresponding expression on the column. If column

Review Comment:
   apply corresponding -> apply the corresponding
   replacing -> replace
   adding -> adds 





[GitHub] [spark] bjornjorgensen commented on a diff in pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on code in PR #38793:
URL: https://github.com/apache/spark/pull/38793#discussion_r1033306110


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -479,6 +479,63 @@ def _show_string(
         assert pdf is not None
         return pdf["show_string"][0]
 
+    def withColumns(self, colsMap: Dict[str, Expression]) -> "DataFrame":
+        """
+        Returns a new :class:`DataFrame` by adding multiple columns or replacing the
+        existing columns that has the same names.

Review Comment:
   has -> have





[GitHub] [spark] cloud-fan closed pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #38793: [SPARK-41256][CONNECT] Implement DataFrame.withColumn(s)
URL: https://github.com/apache/spark/pull/38793

