Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/25 21:06:24 UTC

[GitHub] [spark] amaliujia opened a new pull request, #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

amaliujia opened a new pull request, #38806:
URL: https://github.com/apache/spark/pull/38806

   
   ### What changes were proposed in this pull request?
   In the past, the expression system was quite different between Connect and PySpark: `Column` is the first-class API for all expressions in PySpark, while `Expression` is the top-level API in Connect. To maintain API compatibility, we need to promote `Column` in Connect. Connect also has the special need to maintain a series of `to_proto` methods that convert expressions to their proto representation.
   
   This PR:
   1. Promotes `Column` to be the first-class citizen.
   2. Wraps `Expression` inside `Column` and does not expose it through `DataFrame` (a minimal sketch follows).
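
   A minimal sketch of the wrapping, using this PR's `Column`/`Expression` names (the session parameter type is elided for brevity; `pyspark.sql.connect.proto` is the module referenced in the quoted diffs below):

   ```python
   import pyspark.sql.connect.proto as proto

   class Expression:
       """Internal: knows how to serialize itself to proto."""
       def to_plan(self, session) -> proto.Expression:
           ...

   class Column:
       """Public: wraps an Expression and mirrors PySpark's Column API."""
       def __init__(self, expr: Expression) -> None:
           self._expr = expr

       def to_plan(self, session) -> proto.Expression:
           # delegate serialization to the wrapped (hidden) Expression
           return self._expr.to_plan(session)
   ```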
   
   
   ### Why are the changes needed?
   API compatibility.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.

   ### How was this patch tested?
   Existing unit tests.




[GitHub] [spark] HyukjinKwon closed pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility
URL: https://github.com/apache/spark/pull/38806




[GitHub] [spark] cloud-fan commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034747919


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   OK cool, now this PR is unblocked. @amaliujia can you resolve the code conflicts, please?





[GitHub] [spark] amaliujia commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328305392

   
   
   
   > To make the review of this less painful, can you please outline not just what the result is going to be but how you achieved it?
   > 
   > In addition, it would be great if you can outline more what you think the contract between the different layers is. When the DF deals with a `Column` and passes it to the `LogicalPlan` instance, should it pass its nested expression or the `Column`? There are reasons for both.
   > 
   > Not sure you're going to address this in this PR, but today there is no clear guidance on whether a plan object takes a string for a column or an expression. I think we should have an opinion here.
   > 
   > My suggestion would be to first lay out the plan more in the PR description before doing lots of refactoring where folks are going to have strong opinions.
   > 
   > Thanks
   
   I believe these are implementation details behind the public API? Can we focus on discussing `dataframe.py` and `column.py` in this PR? More specifically, focus on deciding whether:
   1. `Column` should wrap expressions, or
   2. `Column` should be the base class of the expressions,
   3. and what the caveats are if we choose either option.
   
   This is the root question behind a series of follow-up design questions.




[GitHub] [spark] amaliujia commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034048528


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,70 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):
+    """
+    A column in a DataFrame. Column can refer to different things based on the
+    wrapped expression. Some common examples include attribute references, functions,
+    literals, etc.
+
+    .. versionadded:: 3.4.0
+    """
+
+    def __init__(self, expr: Expression) -> None:
+        self._expr = expr
+
+    __gt__ = _bin_op(">")
+    __lt__ = _bin_op("<")
+    __add__ = _bin_op("+")
+    __sub__ = _bin_op("-")
+    __mul__ = _bin_op("*")
+    __div__ = _bin_op("/")
+    __truediv__ = _bin_op("/")
+    __mod__ = _bin_op("%")
+    __radd__ = _bin_op("+", reverse=True)
+    __rsub__ = _bin_op("-", reverse=True)
+    __rmul__ = _bin_op("*", reverse=True)
+    __rdiv__ = _bin_op("/", reverse=True)
+    __rtruediv__ = _bin_op("/", reverse=True)
+    __pow__ = _bin_op("pow")
+    __rpow__ = _bin_op("pow", reverse=True)
+    __ge__ = _bin_op(">=")
+    __le__ = _bin_op("<=")
+    # __eq__ = _bin_op("==")  # ignore [assignment]
+
+    def __eq__(self, other: Any) -> "Column":  # type: ignore[override]
+        """Returns a binary expression with the current column as the left
+        side and the other expression as the right side.
+        """
+        from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
+
+        if isinstance(other, get_args(PrimitiveType)):
+            other = lit(other)
+        return scalar_function("==", self, other)
+
+    def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
+        return self._expr.to_plan(session)

Review Comment:
   yea... `Expression` is the bridge to proto, so using `expression` in the proto also makes sense.





[GitHub] [spark] bjornjorgensen commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1033977637


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -148,23 +148,23 @@ def selectExpr(self, *expr: Union[str, List[str]]) -> "DataFrame":
             expr = expr[0]  # type: ignore[assignment]
         for element in expr:
             if isinstance(element, str):
-                sql_expr.append(SQLExpression(element))
+                sql_expr.append(sql_expression(element))
             else:
-                sql_expr.extend([SQLExpression(e) for e in element])
+                sql_expr.extend([sql_expression(e) for e in element])
 
         return DataFrame.withPlan(plan.Project(self._plan, *sql_expr), session=self._session)
 
-    def agg(self, *exprs: Union[Expression, Dict[str, str]]) -> "DataFrame":
+    def agg(self, *exprs: Union[Column, Dict[str, str]]) -> "DataFrame":
         if not exprs:
             raise ValueError("Argument 'exprs' must not be empty")

Review Comment:
   must not-> can not





[GitHub] [spark] grundprinzip commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034329584


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,70 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):
+    """
+    A column in a DataFrame. Column can refer to different things based on the
+    wrapped expression. Some common examples include attribute references, functions,
+    literals, etc.
+
+    .. versionadded:: 3.4.0
+    """
+
+    def __init__(self, expr: Expression) -> None:
+        self._expr = expr
+
+    __gt__ = _bin_op(">")
+    __lt__ = _bin_op("<")
+    __add__ = _bin_op("+")
+    __sub__ = _bin_op("-")
+    __mul__ = _bin_op("*")
+    __div__ = _bin_op("/")
+    __truediv__ = _bin_op("/")
+    __mod__ = _bin_op("%")
+    __radd__ = _bin_op("+", reverse=True)
+    __rsub__ = _bin_op("-", reverse=True)
+    __rmul__ = _bin_op("*", reverse=True)
+    __rdiv__ = _bin_op("/", reverse=True)
+    __rtruediv__ = _bin_op("/", reverse=True)
+    __pow__ = _bin_op("pow")
+    __rpow__ = _bin_op("pow", reverse=True)
+    __ge__ = _bin_op(">=")
+    __le__ = _bin_op("<=")
+    # __eq__ = _bin_op("==")  # ignore [assignment]
+
+    def __eq__(self, other: Any) -> "Column":  # type: ignore[override]
+        """Returns a binary expression with the current column as the left
+        side and the other expression as the right side.
+        """
+        from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
+
+        if isinstance(other, get_args(PrimitiveType)):
+            other = lit(other)
+        return scalar_function("==", self, other)
+
+    def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
+        return self._expr.to_plan(session)
+
+    def alias(self, *alias: str, **kwargs: Any) -> "Column":
+        return Column(self._expr.alias(*alias, **kwargs))
+
+    def desc(self) -> "Column":
+        return Column(self._expr.desc())
+
+    def asc(self) -> "Column":
+        return Column(self._expr.asc())
+
+    def ascending(self) -> bool:
+        return self._expr.ascending()
+
+    def nullsLast(self) -> bool:

Review Comment:
   I think what @HyukjinKwon meant was that this doesn't match the definition in `Column` in PySpark:
   
   ```
   asc = _unary_op("asc", _asc_doc)
   asc_nulls_first = _unary_op("asc_nulls_first", _asc_nulls_first_doc)
   asc_nulls_last = _unary_op("asc_nulls_last", _asc_nulls_last_doc)
   desc = _unary_op("desc", _desc_doc)
   desc_nulls_first = _unary_op("desc_nulls_first", _desc_nulls_first_doc)
   desc_nulls_last = _unary_op("desc_nulls_last", _desc_nulls_last_doc)
   ```
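
   (For illustration, a hedged way Connect's `Column` could expose the same names while still delegating to the wrapped expression; `_sort_op` is a hypothetical helper, not code from this PR:)

   ```python
   def _sort_op(name: str):
       # delegate to the identically named method on the wrapped Expression
       def _(self: "Column") -> "Column":
           return Column(getattr(self._expr, name)())
       return _

   class Column:
       def __init__(self, expr) -> None:
           self._expr = expr

       asc = _sort_op("asc")
       asc_nulls_first = _sort_op("asc_nulls_first")
       desc = _sort_op("desc")
       desc_nulls_last = _sort_op("desc_nulls_last")
   ```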





[GitHub] [spark] cloud-fan commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034218435


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   In Catalyst, we have to wrap `Expression` with `Column`, because `Expression` is an internal class with many internal APIs and we can't expose it to end users directly.
   
   In Spark Connect, I think we can skip the wrapping if `Expression` doesn't have any internal APIs. But end users will know we have different concrete `Column` subclasses, which is kind of an internal detail. If this doesn't matter, I think `Column` and `Expression` can be merged. cc @grundprinzip 
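
   A hedged sketch of the two options under discussion (class names here are illustrative, not this PR's code):

   ```python
   # Option A (this PR): Column wraps Expression; Expression stays internal.
   class ColumnA:
       def __init__(self, expr: "Expression") -> None:
           self._expr = expr  # users never see the Expression directly

   # Option B (merge): Column is the base class and concrete expressions
   # subclass it, so users can observe subclasses such as a literal.
   class ColumnB:
       pass

   class LiteralExpressionB(ColumnB):
       def __init__(self, value: object) -> None:
           self.value = value
   ```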





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1033090021


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,70 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):
+    """
+    A column in a DataFrame. Column can refer to different things based on the
+    wrapped expression. Some common examples include attribute references, functions,
+    literals, etc.
+
+    .. versionadded:: 3.4.0
+    """
+
+    def __init__(self, expr: Expression) -> None:
+        self._expr = expr
+
+    __gt__ = _bin_op(">")
+    __lt__ = _bin_op("<")
+    __add__ = _bin_op("+")
+    __sub__ = _bin_op("-")
+    __mul__ = _bin_op("*")
+    __div__ = _bin_op("/")
+    __truediv__ = _bin_op("/")
+    __mod__ = _bin_op("%")
+    __radd__ = _bin_op("+", reverse=True)
+    __rsub__ = _bin_op("-", reverse=True)
+    __rmul__ = _bin_op("*", reverse=True)
+    __rdiv__ = _bin_op("/", reverse=True)
+    __rtruediv__ = _bin_op("/", reverse=True)
+    __pow__ = _bin_op("pow")
+    __rpow__ = _bin_op("pow", reverse=True)
+    __ge__ = _bin_op(">=")
+    __le__ = _bin_op("<=")
+    # __eq__ = _bin_op("==")  # ignore [assignment]
+
+    def __eq__(self, other: Any) -> "Column":  # type: ignore[override]
+        """Returns a binary expression with the current column as the left
+        side and the other expression as the right side.
+        """
+        from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
+
+        if isinstance(other, get_args(PrimitiveType)):
+            other = lit(other)
+        return scalar_function("==", self, other)
+
+    def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
+        return self._expr.to_plan(session)

Review Comment:
   now I think it is fine to just use expressions in proto messages.





[GitHub] [spark] cloud-fan commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034443578


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   @grundprinzip can you give your opinion clearly? It seems most reviewers here agree to have `Column` wrapping `Expression`, as it can hide the implementation details the most. It's also consistent with `DataFrame` wrapping `LogicalPlan` in Spark Connect today. Are you OK with moving forward with this design?





[GitHub] [spark] grundprinzip commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328114369

   I did a pass over the refactoring and I think it generally goes in the right direction. My main concern is that it seems weird to have `Column` wrap expressions. I don't think there is a good reason for that. `Column` should just be the base class, and then all other expressions derive from it.
   
   This would also reduce the need for functions like `sql_expression` that don't really seem fitting.
   
   I think this would allow the conceptual usage in the right way and match what users expect from the interface.




[GitHub] [spark] tomvanbussel commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
tomvanbussel commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1032838723


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   Renaming `Expression` to `Column` would break `type(expr) is Column`. This means that existing PySpark code has to be rewritten to `isinstance(expr, Column)`. Nesting the expression seems like the better option here, given that the goal of these changes is API compatibility.
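
   (Illustrating the distinction with hypothetical minimal classes:)

   ```python
   class Column: ...
   class LiteralExpression(Column): ...

   expr = LiteralExpression()
   print(type(expr) is Column)      # False: the exact-type check rejects subclasses
   print(isinstance(expr, Column))  # True: isinstance respects inheritance
   ```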





[GitHub] [spark] amaliujia commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328304378

   Wrapping the expression in `Column` follows https://github.com/apache/spark/blob/ed3775704bbdc9a9c479dc06565c8bf8c4d9640c/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L142
   
   I think the design principle is to expose the public API (`Column`) and hide the internal API (`Expression`).
   
   Let me update the PR description to include some examples of `Column` API usage.




[GitHub] [spark] zhengruifeng commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1033087883


##########
python/pyspark/sql/connect/column.py:
##########
@@ -30,53 +30,34 @@
 
 def _bin_op(
     name: str, doc: str = "binary function", reverse: bool = False
-) -> Callable[["Column", Any], "Expression"]:
-    def _(self: "Column", other: Any) -> "Expression":
+) -> Callable[["Column", Any], "Column"]:
+    def _(self: "Column", other: Any) -> "Column":
         from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
 
         if isinstance(other, get_args(PrimitiveType)):
-            other = LiteralExpression(other)
+            other = lit(other)
         if not reverse:
-            return ScalarFunctionExpression(name, self, other)
+            return scalar_function(name, self, other)
         else:
-            return ScalarFunctionExpression(name, other, self)
+            return scalar_function(name, other, self)
 
     return _
 
 
+def scalar_function(op: str, *args: "Column") -> "Column":
+    return Column(ScalarFunctionExpression(op, *args))
+
+
+def sql_expression(expr: str) -> "Column":
+    return Column(SQLExpression(expr))
+
+
 class Expression(object):

Review Comment:
   nit: maybe we can have a separate file `expressions.py` for Expression?





[GitHub] [spark] amaliujia commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1327886223

   This is a hard move, but it is required to
   1. achieve API compatibility in the future, and
   2. unblock the remaining column/function work.
   
   @HyukjinKwon @grundprinzip @zhengruifeng 
   cc @hvanhovell @xinrong-meng @cloud-fan  




[GitHub] [spark] grundprinzip commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328305174

   Please keep in mind that there is a difference between the Catalyst expressions and the client API. They are not the same thing.
   
   We should try to avoid adding layers in the client just because the Scala API has them, in particular because we're translating into the private API on the server.
   
   Let's make sure we're doing this for a valid reason and not just because it's the first thing that came to mind.




[GitHub] [spark] grundprinzip commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328698464

   From an API evolution perspective, I'm fine with wrapping the expressions in the Column. In an ideal case, I think it makes the most sense if the logical plan only deals with Expressions and not Columns, because the Column is the public-facing interface.
   
   Similarly with mapping `str`: this needs to be done before invoking the logical plan.
   
   WDYT?
   
   Btw, the following expression is not a thing in Python if UnresolvedFunction inherits from Column. There are no casts in Python; the casts are purely for type checking:
   
   ```
   return cast(Column, UnresolvedFunction("like",  self, LiteralExpression(str)))
   ```
   
   ```
   typing.cast(typ, val)
   Cast a value to a type.
   
   This returns the value unchanged. To the type checker this signals that the return value has the designated type, but at runtime we intentionally don’t check anything (we want this to be as fast as possible).
   ```
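
   (A quick demonstration that `typing.cast` is a runtime no-op:)

   ```python
   from typing import cast

   x = cast(int, "not an int")
   print(x, type(x))  # not an int <class 'str'> -- no conversion or check happened
   ```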




[GitHub] [spark] amaliujia commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1331345073

   I have rebased and removed the incompatible APIs from the `Column` interface.
   
   There are two follow-ups I am not addressing in this PR, given that these are implementation details:
   1. Whether we should pass only `Column` or `Expression` into the LogicalPlan.
   2. Whether we can completely remove `str` from the LogicalPlan surface.




[GitHub] [spark] grundprinzip commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1329807587

   > > From an API evolution perspective, I'm fine with wrapping the expressions in the Column. In an ideal case, I think it makes the most sense if the logical plan only deals with Expressions and not Columns, because the Column is the public-facing interface.
   > 
   > We can do it. I am not sure it brings you much convenience though. For example, we can pass the wrapped expression (`column._expr`) into the LogicalPlan, but that is a base class. In this case, using `Column` versus `Expression` does not make a big difference. We can pass `Expression` for simplification.
   
   The benefit of passing `Expression` is that you avoid circular dependencies.
   
   > 
   > I guess you are thinking we pass the right subclass of Expression to a LogicalPlan? Hmmm, we might still be able to do that. It sounds like we would need some isinstance checks for mypy. If you want to go with this approach I would prefer to track it as a follow-up.
   
   I don't think you need to pass the "right subclass" of Expression to the logical plan. For all intents and purposes the logical plan doesn't really care about the exact subtype of the expression.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034320960


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,70 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):
+    """
+    A column in a DataFrame. Column can refer to different things based on the
+    wrapped expression. Some common examples include attribute references, functions,
+    literals, etc.
+
+    .. versionadded:: 3.4.0
+    """
+
+    def __init__(self, expr: Expression) -> None:
+        self._expr = expr
+
+    __gt__ = _bin_op(">")
+    __lt__ = _bin_op("<")
+    __add__ = _bin_op("+")
+    __sub__ = _bin_op("-")
+    __mul__ = _bin_op("*")
+    __div__ = _bin_op("/")
+    __truediv__ = _bin_op("/")
+    __mod__ = _bin_op("%")
+    __radd__ = _bin_op("+", reverse=True)
+    __rsub__ = _bin_op("-", reverse=True)
+    __rmul__ = _bin_op("*", reverse=True)
+    __rdiv__ = _bin_op("/", reverse=True)
+    __rtruediv__ = _bin_op("/", reverse=True)
+    __pow__ = _bin_op("pow")
+    __rpow__ = _bin_op("pow", reverse=True)
+    __ge__ = _bin_op(">=")
+    __le__ = _bin_op("<=")
+    # __eq__ = _bin_op("==")  # ignore [assignment]
+
+    def __eq__(self, other: Any) -> "Column":  # type: ignore[override]
+        """Returns a binary expression with the current column as the left
+        side and the other expression as the right side.
+        """
+        from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
+
+        if isinstance(other, get_args(PrimitiveType)):
+            other = lit(other)
+        return scalar_function("==", self, other)
+
+    def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
+        return self._expr.to_plan(session)
+
+    def alias(self, *alias: str, **kwargs: Any) -> "Column":
+        return Column(self._expr.alias(*alias, **kwargs))
+
+    def desc(self) -> "Column":
+        return Column(self._expr.desc())
+
+    def asc(self) -> "Column":
+        return Column(self._expr.asc())
+
+    def ascending(self) -> bool:
+        return self._expr.ascending()
+
+    def nullsLast(self) -> bool:

Review Comment:
   Hm, PySpark's `Column` doesn't have this. In fact, `SortOrder` doesn't exist there. We should probably match that.





[GitHub] [spark] amaliujia commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034278581


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   This is my question, given that I am not familiar with PySpark:
   
   If we have a base class `Column` and then many subclasses inheriting from it to form the Expression family, does that count as leaking internal implementation to PySpark or Python users when they start to use the Spark Connect Python client?





[GitHub] [spark] amaliujia commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034323760


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,70 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):
+    """
+    A column in a DataFrame. Column can refer to different things based on the
+    wrapped expression. Some common examples include attribute references, functions,
+    literals, etc.
+
+    .. versionadded:: 3.4.0
+    """
+
+    def __init__(self, expr: Expression) -> None:
+        self._expr = expr
+
+    __gt__ = _bin_op(">")
+    __lt__ = _bin_op("<")
+    __add__ = _bin_op("+")
+    __sub__ = _bin_op("-")
+    __mul__ = _bin_op("*")
+    __div__ = _bin_op("/")
+    __truediv__ = _bin_op("/")
+    __mod__ = _bin_op("%")
+    __radd__ = _bin_op("+", reverse=True)
+    __rsub__ = _bin_op("-", reverse=True)
+    __rmul__ = _bin_op("*", reverse=True)
+    __rdiv__ = _bin_op("/", reverse=True)
+    __rtruediv__ = _bin_op("/", reverse=True)
+    __pow__ = _bin_op("pow")
+    __rpow__ = _bin_op("pow", reverse=True)
+    __ge__ = _bin_op(">=")
+    __le__ = _bin_op("<=")
+    # __eq__ = _bin_op("==")  # ignore [assignment]
+
+    def __eq__(self, other: Any) -> "Column":  # type: ignore[override]
+        """Returns a binary expression with the current column as the left
+        side and the other expression as the right side.
+        """
+        from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
+
+        if isinstance(other, get_args(PrimitiveType)):
+            other = lit(other)
+        return scalar_function("==", self, other)
+
+    def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
+        return self._expr.to_plan(session)
+
+    def alias(self, *alias: str, **kwargs: Any) -> "Column":
+        return Column(self._expr.alias(*alias, **kwargs))
+
+    def desc(self) -> "Column":
+        return Column(self._expr.desc())
+
+    def asc(self) -> "Column":
+        return Column(self._expr.asc())
+
+    def ascending(self) -> bool:
+        return self._expr.ascending()
+
+    def nullsLast(self) -> bool:

Review Comment:
   I feel like we cannot avoid this, even though there might be some Python techniques to generate these implicitly.
   
   This is due to polymorphism...





[GitHub] [spark] amaliujia commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1035166775


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   Thanks all for this discussion! I will rebase and resolve the conflicts.
   







[GitHub] [spark] cloud-fan commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328699228

   Is DataFrame a wrapper over LogicalPlan in Spark Connect? If not, why does Column need to be a wrapper over Expression?




[GitHub] [spark] amaliujia commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1329630889

   > From an API evolution perspective, I'm fine with wrapping the expressions in the Column. In an ideal case, I think it makes the most sense if the logical plan only deals with Expressions and not Columns, because the Column is the public-facing interface.
   
   We can do it. I am not sure it brings you much convenience though. For example, we can pass the wrapped expression (`column._expr`) into the LogicalPlan, but that is a base class. In this case, using `Column` versus `Expression` does not make a big difference. We can pass `Expression` for simplification.
   
   I guess you are thinking we pass the right subclass of Expression to a LogicalPlan? Hmmm, we might still be able to do that. It sounds like we would need some isinstance checks for mypy. If you want to go with this approach I would prefer to track it as a follow-up.
   
   > 
   > Similarly with mapping `str`: this needs to be done before invoking the logical plan.
   > 
   
   I agree with this. The DataFrame API needs to understand strings and wrap them accordingly; the LogicalPlan should only deal with Expressions/Columns, which have stronger semantics. A sketch of that boundary follows.
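
   (A hedged sketch of that boundary; `ColumnReference` and `Project` here are illustrative stand-ins, not this PR's exact classes:)

   ```python
   from typing import List, Union

   class Expression:                    # stub for the internal Expression base class
       pass

   class ColumnReference(Expression):   # hypothetical attribute-reference Expression
       def __init__(self, name: str) -> None:
           self.name = name

   class Column:                        # stub matching this PR's wrapper
       def __init__(self, expr: Expression) -> None:
           self._expr = expr

   def _to_expr(col: Union[str, Column]) -> Expression:
       # DataFrame-boundary normalization: wrap bare strings, unwrap Columns,
       # so the LogicalPlan layer only ever sees Expression instances
       return ColumnReference(col) if isinstance(col, str) else col._expr

   class Project:                       # simplified stand-in for a LogicalPlan node
       def __init__(self, *cols: Union[str, Column]) -> None:
           self._expressions: List[Expression] = [_to_expr(c) for c in cols]
   ```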
   
   
   
   




[GitHub] [spark] amaliujia commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1033199435


##########
python/pyspark/sql/connect/column.py:
##########
@@ -30,53 +30,34 @@
 
 def _bin_op(
     name: str, doc: str = "binary function", reverse: bool = False
-) -> Callable[["Column", Any], "Expression"]:
-    def _(self: "Column", other: Any) -> "Expression":
+) -> Callable[["Column", Any], "Column"]:
+    def _(self: "Column", other: Any) -> "Column":
         from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
 
         if isinstance(other, get_args(PrimitiveType)):
-            other = LiteralExpression(other)
+            other = lit(other)
         if not reverse:
-            return ScalarFunctionExpression(name, self, other)
+            return scalar_function(name, self, other)
         else:
-            return ScalarFunctionExpression(name, other, self)
+            return scalar_function(name, other, self)
 
     return _
 
 
+def scalar_function(op: str, *args: "Column") -> "Column":
+    return Column(ScalarFunctionExpression(op, *args))
+
+
+def sql_expression(expr: str) -> "Column":
+    return Column(SQLExpression(expr))
+
+
 class Expression(object):

Review Comment:
   I tried this but hit a circular import issue, so I stopped. Maybe that is solvable, but I didn't spend too much time on it.
   
   If you think this is very important I can go back to check what was wrong.
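
   (One common way to break such a cycle, and the pattern the diff above already uses inside `_bin_op`, is a deferred, function-local import; the `pyspark.sql.connect.expressions` module name is hypothetical:)

   ```python
   # column.py, assuming a hypothetical split where Expression subclasses
   # live in pyspark.sql.connect.expressions

   def lit(value):
       # importing at call time instead of module-import time breaks the
       # column.py <-> expressions.py circular dependency
       from pyspark.sql.connect.expressions import LiteralExpression  # hypothetical
       return Column(LiteralExpression(value))
   ```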





[GitHub] [spark] amaliujia commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1329445446

   @cloud-fan technically DataFrame is a wrapper over both the LogicalPlan and columns (the LogicalPlan is not exposed to users).




[GitHub] [spark] AmplabJenkins commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328140891

   Can one of the admins verify this patch?




[GitHub] [spark] amaliujia commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1032986029


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   @HyukjinKwon any thoughts on this?





[GitHub] [spark] grundprinzip commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1328125892

   To make the review of this less painful, can you please outline not just what the result is going to be but how you achieved it?
   
   In addition, it would be great if you can outline more what you think the contract between the different layers is. When the DF deals with a `Column` and passes it to the `LogicalPlan` instance, should it pass its nested expression or the `Column`? There are reasons for both.
   
   Not sure you're going to address this in this PR, but today there is no clear guidance on whether a plan object takes a string for a column or an expression. I think we should have an opinion here.
   
   My suggestion would be to first lay out the plan more in the PR description before doing lots of refactoring where folks are going to have strong opinions.
   
   Thanks

[GitHub] [spark] grundprinzip commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1032839603


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   I think that is an interesting observation; I'm just not sure how relevant this case is in practice, because `type` explicitly does not check inheritance and is usually much too fragile.
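   
   For example, `type`-based checks break as soon as subclassing is involved (a self-contained illustration, not code from this PR):
   
       class Column:
           pass
   
       class SortedColumn(Column):  # hypothetical subclass
           pass
   
       c = SortedColumn()
       print(type(c) is Column)      # False: exact-type checks ignore inheritance
       print(isinstance(c, Column))  # True: the usual, more robust check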
   
   Second, I'm not sure the choice was actually made because of this particularity of Python.

[GitHub] [spark] grundprinzip commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034373263


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   Sorry, but this comment is not helpful. 
   
   1) If we merged `Expression` and `Column`, `SortOrder` would still be a class derived from `Expression`, in the same way a literal is. The reason `SortOrder` does not exist in PySpark is that PySpark simply delegates to Java and doesn't know about it; cf. `def desc: Column = withExpr { SortOrder(expr, Descending) }`.
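   
   A rough Python analogue of that Scala pattern, keeping `SortOrder` as an `Expression` subclass hidden behind `Column` (a sketch, not this PR's exact code):
   
       class Expression:
           pass
   
       class SortOrder(Expression):
           def __init__(self, child, ascending=True):
               self._child = child
               self._ascending = ascending
   
       class Column:
           def __init__(self, expr):
               self._expr = expr
   
           def desc(self):
               # Ordering stays an Expression detail; callers only see Column.
               return Column(SortOrder(self._expr, ascending=False))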
   
   2) If `Column` hides `Expression`, it doesn't matter what `SortOrder` is or is not.
   
   
   Since we're doing design by review here (which is the worst way of doing design), going forward let's make sure to include proposed examples of what you want to change and why.

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034335214


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   If we're merging `Expression` into `Column`, we should remove all these `SortOrder`-style classes, etc., which do not exist in PySpark. For the current hierarchy, it looks fine to wrap `Expression` with `Column`, since the `Expression`s here look more internal anyway.

[GitHub] [spark] grundprinzip commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1032689579


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   What's the reason for nesting the expression inside `Column` instead of renaming `Expression` to `Column`?
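   
   For reference, the two options look roughly like this (a sketch to frame the question, not code from the PR):
   
       class Expression:  # stub for the existing base class
           def to_plan(self, session): ...
   
       # Option 1 -- nesting: Column wraps an Expression and forwards to it.
       class Column:
           def __init__(self, expr):
               self._expr = expr
   
           def to_plan(self, session):
               return self._expr.to_plan(session)
   
       # Option 2 -- renaming: Expression itself would simply become Column,
       # and LiteralExpression, ColumnReference, etc. would subclass Column.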

[GitHub] [spark] bjornjorgensen commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1033975997


##########
python/pyspark/sql/connect/column.py:
##########
@@ -232,19 +230,22 @@ def __str__(self) -> str:
         return f"Literal({self._value})"
 
 
-class Column(Expression):
+class ColumnReference(Expression):
     """Represents a column reference. There is no guarantee that this column
     actually exists. In the context of this project, we refer by its name and

Review Comment:
   refer by its -> refer to it by its

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034253147


##########
python/pyspark/sql/connect/column.py:
##########
@@ -30,53 +30,34 @@
 
 def _bin_op(
     name: str, doc: str = "binary function", reverse: bool = False
-) -> Callable[["Column", Any], "Expression"]:
-    def _(self: "Column", other: Any) -> "Expression":
+) -> Callable[["Column", Any], "Column"]:
+    def _(self: "Column", other: Any) -> "Column":
         from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
 
         if isinstance(other, get_args(PrimitiveType)):
-            other = LiteralExpression(other)
+            other = lit(other)
         if not reverse:
-            return ScalarFunctionExpression(name, self, other)
+            return scalar_function(name, self, other)
         else:
-            return ScalarFunctionExpression(name, other, self)
+            return scalar_function(name, other, self)
 
     return _
 
 
+def scalar_function(op: str, *args: "Column") -> "Column":
+    return Column(ScalarFunctionExpression(op, *args))
+
+
+def sql_expression(expr: str) -> "Column":
+    return Column(SQLExpression(expr))
+
+
 class Expression(object):

Review Comment:
   not a big deal, we can do it later.

[GitHub] [spark] grundprinzip commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1034458478


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,62 @@ def to_plan(self, session: "RemoteSparkSession") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):

Review Comment:
   I already stated that I'm fine with wrapping expressions. My previous comment was about the sort order.

[GitHub] [spark] amaliujia commented on a diff in pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38806:
URL: https://github.com/apache/spark/pull/38806#discussion_r1035303543


##########
python/pyspark/sql/connect/column.py:
##########
@@ -314,3 +323,70 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
 
     def __str__(self) -> str:
         return f"({self._op} ({', '.join([str(x) for x in self._args])}))"
+
+
+class Column(object):
+    """
+    A column in a DataFrame. Column can refer to different things based on the
+    wrapped expression. Some common examples include attribute references, functions,
+    literals, etc.
+
+    .. versionadded:: 3.4.0
+    """
+
+    def __init__(self, expr: Expression) -> None:
+        self._expr = expr
+
+    __gt__ = _bin_op(">")
+    __lt__ = _bin_op("<")
+    __add__ = _bin_op("+")
+    __sub__ = _bin_op("-")
+    __mul__ = _bin_op("*")
+    __div__ = _bin_op("/")
+    __truediv__ = _bin_op("/")
+    __mod__ = _bin_op("%")
+    __radd__ = _bin_op("+", reverse=True)
+    __rsub__ = _bin_op("-", reverse=True)
+    __rmul__ = _bin_op("*", reverse=True)
+    __rdiv__ = _bin_op("/", reverse=True)
+    __rtruediv__ = _bin_op("/", reverse=True)
+    __pow__ = _bin_op("pow")
+    __rpow__ = _bin_op("pow", reverse=True)
+    __ge__ = _bin_op(">=")
+    __le__ = _bin_op("<=")
+    # __eq__ = _bin_op("==")  # ignore [assignment]
+
+    def __eq__(self, other: Any) -> "Column":  # type: ignore[override]
+        """Returns a binary expression with the current column as the left
+        side and the other expression as the right side.
+        """
+        from pyspark.sql.connect._typing import PrimitiveType
+        from pyspark.sql.connect.functions import lit
+
+        if isinstance(other, get_args(PrimitiveType)):
+            other = lit(other)
+        return scalar_function("==", self, other)
+
+    def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
+        return self._expr.to_plan(session)
+
+    def alias(self, *alias: str, **kwargs: Any) -> "Column":
+        return Column(self._expr.alias(*alias, **kwargs))
+
+    def desc(self) -> "Column":
+        return Column(self._expr.desc())
+
+    def asc(self) -> "Column":
+        return Column(self._expr.asc())
+
+    def ascending(self) -> bool:
+        return self._expr.ascending()
+
+    def nullsLast(self) -> bool:

Review Comment:
   I removed the methods that don't exist in PySpark's `Column` API.
   
   They were there because PySpark does not need to deal with `SortOrder`, but Connect has to.
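   
   Roughly, the distinction looks like this (a sketch; `Expression` stands in for the base class in column.py, and the serialization body is elided):
   
       class Expression:  # stub for the base class defined in column.py
           pass
   
       class SortOrder(Expression):
           """Connect-only: exists to serialize ordering into the proto plan,
           not to be part of the public Column API."""
   
           def __init__(self, child, ascending):
               self._child = child
               self._ascending = ascending
   
           def to_plan(self, session):
               ...  # build a proto expression carrying child and direction
   
       # Users never construct SortOrder directly; e.g.
       # df.sort(col("a").desc()) builds one behind Column.desc().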

[GitHub] [spark] HyukjinKwon commented on pull request #38806: [SPARK-41268][CONNECT][PYTHON] Refactor "Column" for API Compatibility

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38806:
URL: https://github.com/apache/spark/pull/38806#issuecomment-1331507460

   Merged to master.