You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/11 21:05:15 UTC

[GitHub] [spark] grundprinzip opened a new pull request, #38631: [SPARK-40809] [CONNECT] [FOLLOW]

grundprinzip opened a new pull request, #38631:
URL: https://github.com/apache/spark/pull/38631

   ### What changes were proposed in this pull request?
   This extends the implementation of column aliases in Spark Connect with supporting lists of column names and providing the appropriate implementation for the Python side.
   
   ### Why are the changes needed?
   Compatibility
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   UT in Python and Scala


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021211335


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;
+    // Alias metadata expressed as a JSON map.
+    optional string metadata = 3;

Review Comment:
   nit: previously @amaliujia added
   ```
     message DataSource {
       // Required. Supported formats include: parquet, orc, text, json, parquet, csv, avro.
       string format = 1;
       // Optional. If not set, Spark will infer the schema.
       string schema = 2;
       // The key is case insensitive.
       map<string, string> options = 3;
     }
   ```
   
   Shall we also have a guideline for adding string-string map field? Do we prefer proto map field or just JSON string?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021263016


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021212933


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -348,7 +350,16 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      val md = if (alias.hasMetadata()) {
+        Some(Metadata.fromJson(alias.getMetadata))
+      } else {
+        None
+      }
+      Alias(transformExpression(alias.getExpr), alias.getName(0))(explicitMetadata = md)
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   I'm adding this right now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022297572


##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,73 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "ColumnAlias":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        alias : str
+            desired column names (collects all positional arguments passed)
+
+        Other Parameters
+        ----------------
+        metadata: dict
+            a dict of information to be stored in ``metadata`` attribute of the
+            corresponding :class:`StructField <pyspark.sql.types.StructField>` (optional, keyword
+            only argument)
+
+        Returns
+        -------
+        :class:`Column`
+            Column representing whether each element of Column is aliased with new name or names.
+
+        Examples
+        --------
+        >>> df = spark.createDataFrame(
+        ...      [(2, "Alice"), (5, "Bob")], ["age", "name"])
+        >>> df.select(df.age.alias("age2")).collect()
+        [Row(age2=2), Row(age2=5)]
+        >>> df.select(df.age.alias("age3", metadata={'max': 99})).schema['age3'].metadata['max']
+        99
+        """
+        metadata = kwargs.pop("metadata", None)
+        assert not kwargs, "Unexpected kwargs where passed: %s" % kwargs
+        return ColumnAlias(self, list(alias), metadata)
+
+
+class ColumnAlias(Expression):
+    def __init__(self, parent: Expression, alias: list[str], metadata: Any):
+
+        self._alias = alias
+        self._metadata = metadata
+        self._parent = parent
+
+    def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
+        if len(self._alias) == 1:
+            exp = proto.Expression()
+            exp.alias.name.append(self._alias[0])
+            exp.alias.expr.CopyFrom(self._parent.to_plan(session))
+
+            if self._metadata:
+                import json

Review Comment:
   Nit but I think we can just import this on the top (since this is builtin anyway).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022495692


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -248,6 +248,20 @@ def test_simple_datasource_read(self) -> None:
             actualResult = pandasResult.values.tolist()
             self.assertEqual(len(expectResult), len(actualResult))
 
+    def test_alias(self) -> None:
+        """Testing supported and unsupported alias"""
+        col0 = (
+            self.connect.range(1, 10)
+            .select(col("id").alias("name", metadata={"max": 99}))
+            .schema()
+            .names[0]
+        )
+        self.assertEqual("name", col0)
+
+        with self.assertRaises(Exception) as exc:

Review Comment:
   ```suggestion
           with self.assertRaises(grpc.RpcError) as exc:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020564580


##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "Expression":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 1.3.0

Review Comment:
   Fixed.



##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "Expression":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 1.3.0
+
+        Parameters
+        ----------
+        alias : str
+            desired column names (collects all positional arguments passed)
+
+        Other Parameters
+        ----------------
+        metadata: dict
+            a dict of information to be stored in ``metadata`` attribute of the
+            corresponding :class:`StructField <pyspark.sql.types.StructField>` (optional, keyword
+            only argument)
+
+            .. versionchanged:: 2.2.0
+               Added optional ``metadata`` argument.

Review Comment:
   removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020563219


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -334,7 +334,11 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      Alias(transformExpression(alias.getExpr), alias.getName(0))()
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   I actually expect Scala client will re-use Connect DSL. 
   
   if you take a closer look, Connect DSL is DataFrame like internal API that offers `toProto` logic (equivalent to `plan.py` for python). 
   
   Anything adding to the DSL will not longer help testing coverage but also help Scala client.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022503590


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -44,7 +44,7 @@
     from pyspark.sql.connect.typing import ColumnOrString, ExpressionOrString
     from pyspark.sql.connect.client import RemoteSparkSession
 
-ColumnOrName = Union[Column, str]
+ColumnOrName = Union[Expression, str]

Review Comment:
   I think this fell out of a different PR I had for transform. The reason we need it eventually is because right now from a type perspective select can only use columns but should be able to refer to any expression.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022298464


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -248,6 +248,20 @@ def test_simple_datasource_read(self) -> None:
             actualResult = pandasResult.values.tolist()
             self.assertEqual(len(expectResult), len(actualResult))
 
+    def test_alias(self) -> None:
+        """Testing supported and unsupported alias"""
+        col0 = (
+            self.connect.range(1, 10)
+            .select(col("id").alias("name", metadata={"max": 99}))
+            .schema()
+            .names[0]
+        )
+        self.assertEqual("name", col0)
+
+        with self.assertRaises(Exception) as exc:

Review Comment:
   nit but I would catch narrower exception than `Exception`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022492699


##########
python/pyspark/sql/tests/connect/test_connect_column_expressions.py:
##########
@@ -134,6 +134,16 @@ def test_list_to_literal(self):
         lit_list_plan = fun.lit([fun.lit(10), fun.lit("str")]).to_plan(None)
         self.assertIsNotNone(lit_list_plan)
 
+    def test_column_alias(self) -> None:
+        # SPARK-40809 - Support for Column Aliases
+        col0 = fun.col("a").alias("martin")
+        self.assertEqual("Alias(Column(a), (martin))", str(col0))
+
+        col0 = fun.col("a").alias("martin", metadata={"pii": True})
+        plan = col0.to_plan(self.session)
+        self.assertIsNotNone(plan)
+        self.assertEqual(plan.alias.metadata, '{"pii": true}')

Review Comment:
   See the test above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022651011


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -44,7 +44,7 @@
     from pyspark.sql.connect.typing import ColumnOrString, ExpressionOrString
     from pyspark.sql.connect.client import RemoteSparkSession
 
-ColumnOrName = Union[Column, str]
+ColumnOrName = Union[Expression, str]

Review Comment:
   tarting mypy annotations test...
   annotations failed mypy checks:
   python/pyspark/sql/tests/connect/test_connect_basic.py:308: error: Argument 1 to "select" of "DataFrame" has incompatible type "ColumnAlias"; expected "Union[Column, str]"  [arg-type]
   python/pyspark/sql/tests/connect/test_connect_basic.py:315: error: Argument 1 to "select" of "DataFrame" has incompatible type "ColumnAlias"; expected "Union[Column, str]"  [arg-type]
   F



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020565080


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -334,7 +334,11 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      Alias(transformExpression(alias.getExpr), alias.getName(0))()
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   I understand the concepts of the DSL and the testing (I was there when it was written), but thanks a lot for the detailed explanation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021059774


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;

Review Comment:
   In catalyst, we have a `MultiAlias` expression for this case. Shall we match catalyst here? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022317274


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -44,7 +44,7 @@
     from pyspark.sql.connect.typing import ColumnOrString, ExpressionOrString
     from pyspark.sql.connect.client import RemoteSparkSession
 
-ColumnOrName = Union[Column, str]
+ColumnOrName = Union[Expression, str]

Review Comment:
   is this change needed? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020561328


##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "Expression":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 1.3.0

Review Comment:
   version 3.4.0



##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "Expression":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 1.3.0
+
+        Parameters
+        ----------
+        alias : str
+            desired column names (collects all positional arguments passed)
+
+        Other Parameters
+        ----------------
+        metadata: dict
+            a dict of information to be stored in ``metadata`` attribute of the
+            corresponding :class:`StructField <pyspark.sql.types.StructField>` (optional, keyword
+            only argument)
+
+            .. versionchanged:: 2.2.0
+               Added optional ``metadata`` argument.

Review Comment:
   we don't need this: 
   
   Connect is new API and everything will start from 3.4.0



##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "Expression":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 1.3.0
+
+        Parameters
+        ----------
+        alias : str
+            desired column names (collects all positional arguments passed)
+
+        Other Parameters
+        ----------------
+        metadata: dict
+            a dict of information to be stored in ``metadata`` attribute of the
+            corresponding :class:`StructField <pyspark.sql.types.StructField>` (optional, keyword
+            only argument)
+
+            .. versionchanged:: 2.2.0
+               Added optional ``metadata`` argument.
+
+        Returns
+        -------
+        :class:`Column`
+            Column representing whether each element of Column is aliased with new name or names.
+
+        Examples
+        --------
+        >>> df = spark.createDataFrame(
+        ...      [(2, "Alice"), (5, "Bob")], ["age", "name"])
+        >>> df.select(df.age.alias("age2")).collect()
+        [Row(age2=2), Row(age2=5)]
+        >>> df.select(df.age.alias("age3", metadata={'max': 99})).schema['age3'].metadata['max']
+        99
+        """
+        metadata = kwargs.pop("metadata", None)
+        assert not kwargs, "Unexpected kwargs where passed: %s" % kwargs
+        return ColumnAlias(self, list(alias), metadata)
+
+
+class ColumnAlias(Expression):
+    def __init__(self, parent: Expression, alias: list[str], metadata: Any):
+
+        self._alias = alias
+        self._metadata = metadata
+        self._parent = parent
+
+    def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
+        if len(self._alias) == 1:
+            if self._metadata:
+                raise ValueError("Creating aliases with metadata is not supported.")
+            else:
+                exp = proto.Expression()
+                exp.alias.name.append(self._alias[0])
+                exp.alias.expr.CopyFrom(self._parent.to_plan(session))
+                return exp
+        else:
+            if self._metadata:
+                raise ValueError("metadata can only be provided for a single column")

Review Comment:
   Wait literally metadata is not supported yet..
   
   Maybe just one `raise ValueError`



##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -334,7 +334,11 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      Alias(transformExpression(alias.getExpr), alias.getName(0))()
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   Can you make a change in Connect DSL and add a server side test in `SparkConnectProtoSuite`? 
   
   Should be straightforward. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020561852


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -334,7 +334,11 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      Alias(transformExpression(alias.getExpr), alias.getName(0))()
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   Can you make add this in Connect DSL and add a server side test in `SparkConnectProtoSuite`? 
   
   Should be straightforward. 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021107215


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;

Review Comment:
   Since we're not trying to match all of the Catalyst plans but be compatible on the logical level, I find it weird to use `MultiAlias` here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021214764


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;
+    // Alias metadata expressed as a JSON map.
+    optional string metadata = 3;

Review Comment:
   That's a very good question. I think in this case it's not as easy possible because it actually depends on the type and is not a plain `map<string, string>`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020577947


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -334,7 +334,11 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      Alias(transformExpression(alias.getExpr), alias.getName(0))()
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   I added a new test relation with a `map` type, added the type converters, test agg function and now it covers this branch as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38631:
URL: https://github.com/apache/spark/pull/38631#issuecomment-1316659453

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022498105


##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,73 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "ColumnAlias":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        alias : str
+            desired column names (collects all positional arguments passed)
+
+        Other Parameters
+        ----------------
+        metadata: dict
+            a dict of information to be stored in ``metadata`` attribute of the
+            corresponding :class:`StructField <pyspark.sql.types.StructField>` (optional, keyword
+            only argument)
+
+        Returns
+        -------
+        :class:`Column`
+            Column representing whether each element of Column is aliased with new name or names.
+
+        Examples
+        --------
+        >>> df = spark.createDataFrame(
+        ...      [(2, "Alice"), (5, "Bob")], ["age", "name"])
+        >>> df.select(df.age.alias("age2")).collect()
+        [Row(age2=2), Row(age2=5)]
+        >>> df.select(df.age.alias("age3", metadata={'max': 99})).schema['age3'].metadata['max']
+        99
+        """
+        metadata = kwargs.pop("metadata", None)
+        assert not kwargs, "Unexpected kwargs where passed: %s" % kwargs
+        return ColumnAlias(self, list(alias), metadata)
+
+
+class ColumnAlias(Expression):
+    def __init__(self, parent: Expression, alias: list[str], metadata: Any):
+
+        self._alias = alias
+        self._metadata = metadata
+        self._parent = parent
+
+    def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
+        if len(self._alias) == 1:
+            exp = proto.Expression()
+            exp.alias.name.append(self._alias[0])
+            exp.alias.expr.CopyFrom(self._parent.to_plan(session))
+
+            if self._metadata:
+                import json

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1028933832


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -301,6 +301,20 @@ def test_simple_datasource_read(self) -> None:
             actualResult = pandasResult.values.tolist()
             self.assertEqual(len(expectResult), len(actualResult))
 
+    def test_alias(self) -> None:
+        """Testing supported and unsupported alias"""
+        col0 = (
+            self.connect.range(1, 10)
+            .select(col("id").alias("name", metadata={"max": 99}))
+            .schema()
+            .names[0]
+        )
+        self.assertEqual("name", col0)
+
+        with self.assertRaises(grpc.RpcError) as exc:
+            self.connect.range(1, 10).select(col("id").alias("this", "is", "not")).collect()
+        self.assertIn("Buffer(this, is, not)", str(exc.exception))

Review Comment:
   I made a followup at https://github.com/apache/spark/pull/38752 :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022503915


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -44,7 +44,7 @@
     from pyspark.sql.connect.typing import ColumnOrString, ExpressionOrString
     from pyspark.sql.connect.client import RemoteSparkSession
 
-ColumnOrName = Union[Column, str]
+ColumnOrName = Union[Expression, str]

Review Comment:
   ```suggestion
   ColumnOrName = Union[Column, str]
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38631:
URL: https://github.com/apache/spark/pull/38631#issuecomment-1313590270

   @HyukjinKwon can you review the python side?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022299234


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -141,10 +149,35 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan2, sparkPlan2)
   }
 
-  test("column alias") {
+  test("[SPARK-40809] column alias") {

Review Comment:
   nit but:
   ```suggestion
     test("SPARK-40809: column alias") {
   ```
   per https://spark.apache.org/contributing.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client
URL: https://github.com/apache/spark/pull/38631


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38631:
URL: https://github.com/apache/spark/pull/38631#issuecomment-1312631820

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021212753


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -348,7 +350,16 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      val md = if (alias.hasMetadata()) {
+        Some(Metadata.fromJson(alias.getMetadata))
+      } else {
+        None
+      }
+      Alias(transformExpression(alias.getExpr), alias.getName(0))(explicitMetadata = md)
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   shall we fail here if `alias.hasMetadata` is true?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022317152


##########
python/pyspark/sql/tests/connect/test_connect_column_expressions.py:
##########
@@ -134,6 +134,16 @@ def test_list_to_literal(self):
         lit_list_plan = fun.lit([fun.lit(10), fun.lit("str")]).to_plan(None)
         self.assertIsNotNone(lit_list_plan)
 
+    def test_column_alias(self) -> None:
+        # SPARK-40809 - Support for Column Aliases
+        col0 = fun.col("a").alias("martin")
+        self.assertEqual("Alias(Column(a), (martin))", str(col0))
+
+        col0 = fun.col("a").alias("martin", metadata={"pii": True})
+        plan = col0.to_plan(self.session)
+        self.assertIsNotNone(plan)
+        self.assertEqual(plan.alias.metadata, '{"pii": true}')

Review Comment:
   what about also adding a test for `MultiAlias` in python?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022297572


##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,73 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "ColumnAlias":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 3.4.0
+
+        Parameters
+        ----------
+        alias : str
+            desired column names (collects all positional arguments passed)
+
+        Other Parameters
+        ----------------
+        metadata: dict
+            a dict of information to be stored in ``metadata`` attribute of the
+            corresponding :class:`StructField <pyspark.sql.types.StructField>` (optional, keyword
+            only argument)
+
+        Returns
+        -------
+        :class:`Column`
+            Column representing whether each element of Column is aliased with new name or names.
+
+        Examples
+        --------
+        >>> df = spark.createDataFrame(
+        ...      [(2, "Alice"), (5, "Bob")], ["age", "name"])
+        >>> df.select(df.age.alias("age2")).collect()
+        [Row(age2=2), Row(age2=5)]
+        >>> df.select(df.age.alias("age3", metadata={'max': 99})).schema['age3'].metadata['max']
+        99
+        """
+        metadata = kwargs.pop("metadata", None)
+        assert not kwargs, "Unexpected kwargs where passed: %s" % kwargs
+        return ColumnAlias(self, list(alias), metadata)
+
+
+class ColumnAlias(Expression):
+    def __init__(self, parent: Expression, alias: list[str], metadata: Any):
+
+        self._alias = alias
+        self._metadata = metadata
+        self._parent = parent
+
+    def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
+        if len(self._alias) == 1:
+            exp = proto.Expression()
+            exp.alias.name.append(self._alias[0])
+            exp.alias.expr.CopyFrom(self._parent.to_plan(session))
+
+            if self._metadata:
+                import json

Review Comment:
   Nit but I think we can just import this on the top.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] LuciferYang commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
LuciferYang commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1028908773


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -301,6 +301,20 @@ def test_simple_datasource_read(self) -> None:
             actualResult = pandasResult.values.tolist()
             self.assertEqual(len(expectResult), len(actualResult))
 
+    def test_alias(self) -> None:
+        """Testing supported and unsupported alias"""
+        col0 = (
+            self.connect.range(1, 10)
+            .select(col("id").alias("name", metadata={"max": 99}))
+            .schema()
+            .names[0]
+        )
+        self.assertEqual("name", col0)
+
+        with self.assertRaises(grpc.RpcError) as exc:
+            self.connect.range(1, 10).select(col("id").alias("this", "is", "not")).collect()
+        self.assertIn("Buffer(this, is, not)", str(exc.exception))

Review Comment:
   @grundprinzip  The Scala 2.13 daily test has failed in recent times, which seems to be related to this assert:
   ```
   ======================================================================
   FAIL [0.742s]: test_alias (pyspark.sql.tests.connect.test_connect_basic.SparkConnectTests)
   Testing supported and unsupported alias
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/__w/spark/spark/python/pyspark/sql/tests/connect/test_connect_basic.py", line 337, in test_alias
       self.assertIn("Buffer(this, is, not)", str(exc.exception))
   AssertionError: 'Buffer(this, is, not)' not found in '<_MultiThreadedRendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNKNOWN\n\tdetails = "[INTERNAL_ERROR] Found the unresolved operator: \'Project [id#24L AS List(this, is, not)]"\n\tdebug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:15002 {created_time:"2022-11-21T19:25:59.902777537+00:00", grpc_status:2, grpc_message:"[INTERNAL_ERROR] Found the unresolved operator: \\\'Project [id#24L AS List(this, is, not)]"}"\n>'
   ```
   
   https://github.com/apache/spark/actions/runs/3517345801/jobs/5895043640
   
   This seems to be related to the difference of Scala version. Can we simplify the assertion? such as `self.assertIn("(this, is, not)", str(exc.exception))`
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020563219


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -334,7 +334,11 @@ class SparkConnectPlanner(session: SparkSession) {
   }
 
   private def transformAlias(alias: proto.Expression.Alias): NamedExpression = {
-    Alias(transformExpression(alias.getExpr), alias.getName)()
+    if (alias.getNameCount == 1) {
+      Alias(transformExpression(alias.getExpr), alias.getName(0))()
+    } else {
+      MultiAlias(transformExpression(alias.getExpr), alias.getNameList.asScala.toSeq)

Review Comment:
   I actually expect Scala client will re-use Connect DSL. 
   
   if you take a closer look, Connect DSL is DataFrame like internal API that offers `toProto` logic (equivalent to `plan.py` for python). 
   
   Anything adding to the DSL will not only help testing coverage but also help Scala client potentially.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38631:
URL: https://github.com/apache/spark/pull/38631#issuecomment-1312192934

   @amaliujia @cloud-fan @HyukjinKwon @hvanhovell 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1020586736


##########
python/pyspark/sql/connect/column.py:
##########
@@ -82,6 +82,74 @@ def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
     def __str__(self) -> str:
         ...
 
+    def alias(self, *alias: str, **kwargs: Any) -> "Expression":
+        """
+        Returns this column aliased with a new name or names (in the case of expressions that
+        return more than one column, such as explode).
+
+        .. versionadded:: 1.3.0
+
+        Parameters
+        ----------
+        alias : str
+            desired column names (collects all positional arguments passed)
+
+        Other Parameters
+        ----------------
+        metadata: dict
+            a dict of information to be stored in ``metadata`` attribute of the
+            corresponding :class:`StructField <pyspark.sql.types.StructField>` (optional, keyword
+            only argument)
+
+            .. versionchanged:: 2.2.0
+               Added optional ``metadata`` argument.
+
+        Returns
+        -------
+        :class:`Column`
+            Column representing whether each element of Column is aliased with new name or names.
+
+        Examples
+        --------
+        >>> df = spark.createDataFrame(
+        ...      [(2, "Alice"), (5, "Bob")], ["age", "name"])
+        >>> df.select(df.age.alias("age2")).collect()
+        [Row(age2=2), Row(age2=5)]
+        >>> df.select(df.age.alias("age3", metadata={'max': 99})).schema['age3'].metadata['max']
+        99
+        """
+        metadata = kwargs.pop("metadata", None)
+        assert not kwargs, "Unexpected kwargs where passed: %s" % kwargs
+        return ColumnAlias(self, list(alias), metadata)
+
+
+class ColumnAlias(Expression):
+    def __init__(self, parent: Expression, alias: list[str], metadata: Any):
+
+        self._alias = alias
+        self._metadata = metadata
+        self._parent = parent
+
+    def to_plan(self, session: "RemoteSparkSession") -> "proto.Expression":
+        if len(self._alias) == 1:
+            if self._metadata:
+                raise ValueError("Creating aliases with metadata is not supported.")
+            else:
+                exp = proto.Expression()
+                exp.alias.name.append(self._alias[0])
+                exp.alias.expr.CopyFrom(self._parent.to_plan(session))
+                return exp
+        else:
+            if self._metadata:
+                raise ValueError("metadata can only be provided for a single column")

Review Comment:
   I just pushed the implementation of metadata as well. However, now the schema does not return the metadata, but at least the input is consistent with Spark.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021120668


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;

Review Comment:
   OK, then shall we fail if it's a multi alias but the metadata fied is specified?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021203690


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;

Review Comment:
   makes sense, let me add a server side check to the client side one



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1021219879


##########
connector/connect/src/main/protobuf/spark/connect/expressions.proto:
##########
@@ -170,6 +170,8 @@ message Expression {
 
   message Alias {
     Expression expr = 1;
-    string name = 2;
+    repeated string name = 2;
+    // Alias metadata expressed as a JSON map.
+    optional string metadata = 3;

Review Comment:
   ah the metadata value can be any type, then JSON LGTM. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022651386


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -44,7 +44,7 @@
     from pyspark.sql.connect.typing import ColumnOrString, ExpressionOrString
     from pyspark.sql.connect.client import RemoteSparkSession
 
-ColumnOrName = Union[Column, str]
+ColumnOrName = Union[Expression, str]

Review Comment:
   ColumnAlias is an expression and not an unresolved attribtue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38631: [SPARK-40809] [CONNECT] [FOLLOW] Support `alias()` in Python client

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38631:
URL: https://github.com/apache/spark/pull/38631#discussion_r1022298812


##########
python/pyspark/sql/tests/connect/test_connect_column_expressions.py:
##########
@@ -134,6 +134,16 @@ def test_list_to_literal(self):
         lit_list_plan = fun.lit([fun.lit(10), fun.lit("str")]).to_plan(None)
         self.assertIsNotNone(lit_list_plan)
 
+    def test_column_alias(self) -> None:
+        # SPARK-40809 - Support for Column Aliases

Review Comment:
   nit:
   ```suggestion
           # SPARK-40809: Support for Column Aliases
   ```
   per https://spark.apache.org/contributing.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org