You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/29 20:56:08 UTC

[GitHub] [spark] grundprinzip opened a new pull request, #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

grundprinzip opened a new pull request, #39296:
URL: https://github.com/apache/spark/pull/39296

   ### What changes were proposed in this pull request?
   To reenable the doc tests for `Column`, this patch fixes the string representation of the `Column`, in particular this introduces a workaround for a binary unresolved function that produces an infix representation of the function string. 
   
   ### Why are the changes needed?
   Compatibility
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Reenabled doc tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059218456


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -420,6 +420,12 @@ def __repr__(self) -> str:
             return f"{self._name}({', '.join([str(arg) for arg in self._args])})"
 
 
+class UnresolvedBinaryFunction(UnresolvedFunction):

Review Comment:
   I think this issue is a bit complicated, since not all binary ops have this type of string repr, for example:
   
   ```
   In [11]: df.name + 1
   Out[11]: Column<'(name + 1)'>
   
   In [12]: df.name + df.age
   Out[12]: Column<'(name + age)'>
   
   In [13]: df.name.startswith("n")
   Out[13]: Column<'startswith(name, n)'>
   
   In [14]: df.name.startswith(df.age)
   Out[14]: Column<'startswith(name, age)'>
   ```
   
   we may also need to take unary ops into account.
   
   I guess we need to keep a list to control which function can be represented in this way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059649687


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -420,6 +420,12 @@ def __repr__(self) -> str:
             return f"{self._name}({', '.join([str(arg) for arg in self._args])})"
 
 
+class UnresolvedBinaryFunction(UnresolvedFunction):

Review Comment:
   I added the behavior for infix / prefix notation and fixed the alias behavior as well. For the remaining simple checks I did it looks good so far and should get us to a reasonable comparison.
   
   I don't think the goal should be compatibility on this non-api but similar readability. If we find cases that are unreasonable to fix, we should apply the change that @HyukjinKwon proposed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059233553


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -420,6 +420,12 @@ def __repr__(self) -> str:
             return f"{self._name}({', '.join([str(arg) for arg in self._args])})"
 
 
+class UnresolvedBinaryFunction(UnresolvedFunction):

Review Comment:
   Is our goal to make the Column string representation consistent between client and server? We don't have a general rule about how to generate Column string representation, I think a lot of manual efforts are needed to check the server-side implementation and copy it to the client.
   
   Shall we only take care of some common ones like attribute and literal?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class
URL: https://github.com/apache/spark/pull/39296


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059727323


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -414,6 +414,12 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
         return fun
 
     def __repr__(self) -> str:
+        # Special handling for certain infix operators that require slightly
+        # different printing.
+        if len(self._args) == 2 and len(self._name) == 1:

Review Comment:
   Actually, what about we just whitelist, and make it nicer? For example, there are only a fixed number of operators (20 ~ 30) that PySpark supports plug `eqNullSafe`, and very likely we won't add some more about this.
   
   Otherwise, I would just even remove these new logics, and just fix the doctest with `...`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #39296:
URL: https://github.com/apache/spark/pull/39296#issuecomment-1368178327

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059291258


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -420,6 +420,12 @@ def __repr__(self) -> str:
             return f"{self._name}({', '.join([str(arg) for arg in self._args])})"
 
 
+class UnresolvedBinaryFunction(UnresolvedFunction):

Review Comment:
   I had this first but it feels weird and the code looks a bit ugly. in particular because the only place where we use this is in the Column class for the bin_op function. But given that we add even more logic I will change it back. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059287998


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -420,6 +420,12 @@ def __repr__(self) -> str:
             return f"{self._name}({', '.join([str(arg) for arg in self._args])})"
 
 
+class UnresolvedBinaryFunction(UnresolvedFunction):

Review Comment:
   ok, I think it is fine to start with only handling `binary ops with single character operator -> infix`.
   
   I think we can simply modify the `__repr__` method of `UnresolvedFunction`, seems unnecessary to add a new class `UnresolvedBinaryFunction`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059248076


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -420,6 +420,12 @@ def __repr__(self) -> str:
             return f"{self._name}({', '.join([str(arg) for arg in self._args])})"
 
 
+class UnresolvedBinaryFunction(UnresolvedFunction):

Review Comment:
   Since there was probably not a spec to follow for the string representation of these objects, we can try to do our best to make the doc tests pass and be somewhat compatible since this is not a real API:
   
     * binary ops with single character operator -> infix
     * other binary ops -> prefix
     * for other expression types we directly resolve check with the server



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1059558393


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -420,6 +420,12 @@ def __repr__(self) -> str:
             return f"{self._name}({', '.join([str(arg) for arg in self._args])})"
 
 
+class UnresolvedBinaryFunction(UnresolvedFunction):

Review Comment:
   We could just remove this `UnresolvedBinaryFunction` and fix the doctests like:
   
   ```
       >>> df.age + 1
       Column<...>
       >>> 1 / df.age
       Column<...>
   ```
   
   for now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39296: [SPARK-41757][CONNECT] Fixing String representation for Column class

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39296:
URL: https://github.com/apache/spark/pull/39296#discussion_r1071659168


##########
python/pyspark/sql/connect/expressions.py:
##########
@@ -414,6 +414,12 @@ def to_plan(self, session: "SparkConnectClient") -> proto.Expression:
         return fun
 
     def __repr__(self) -> str:
+        # Special handling for certain infix operators that require slightly
+        # different printing.
+        if len(self._args) == 2 and len(self._name) == 1:

Review Comment:
   Let me just take it over at https://github.com/apache/spark/pull/39616 by doing a (very) simple version for now to make the tests pass.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org