Posted to reviews@spark.apache.org by "dtenedor (via GitHub)" <gi...@apache.org> on 2024/02/28 19:13:20 UTC

[PR] [SPARK-47214][Python] Create API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

dtenedor opened a new pull request, #45319:
URL: https://github.com/apache/spark/pull/45319

   ### What changes were proposed in this pull request?
   
   This PR creates an API for the UDTF 'analyze' method to differentiate constant NULL arguments from other types of arguments.
   
   ### Why are the changes needed?
   
   Before this PR, each `AnalyzeArgument` had a `value` field which was non-None if the argument was a constant-foldable scalar expression:
   
   ```
   from dataclasses import dataclass
   from typing import Any, Optional

   from pyspark.sql.types import DataType


   @dataclass(frozen=True)
   class AnalyzeArgument:
       """
       The argument for Python UDTF's analyze static method.
   
       Parameters
       ----------
       dataType : :class:`DataType`
           The argument's data type
       value : any, optional
           The calculated value if the argument is foldable; otherwise None
       isTable : bool
           If True, the argument is a table argument.
       """
   
       dataType: DataType
       value: Optional[Any]
       isTable: bool
   ```
   
   However, it was also `None` if the argument was a literal NULL, so a constant NULL argument could not be told apart from a non-constant argument.
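
   To make the ambiguity concrete, here is a standalone toy sketch (not the real PySpark class; the names and the plain-string data type are illustrative only):

   ```
   from dataclasses import dataclass
   from typing import Any, Optional

   # Toy model of the pre-change fields: value=None can mean either "the constant
   # folded to NULL" or "this argument is not a constant at all".
   @dataclass(frozen=True)
   class ToyAnalyzeArgument:
       dataType: str
       value: Optional[Any]
       isTable: bool

   null_literal = ToyAnalyzeArgument(dataType="double", value=None, isTable=False)  # e.g. NULL
   non_constant = ToyAnalyzeArgument(dataType="double", value=None, isTable=False)  # e.g. rand()
   assert null_literal == non_constant  # indistinguishable before this change
   ```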
   
   Now we add a new field, `isConstantExpression`, to tell the difference:
   
   ```
   isConstantExpression : bool
       If True, the argument is a constant-foldable scalar expression. Then the 'value' field
       contains None if the argument is a NULL literal, or a non-None value if the argument is a
       non-NULL literal. In this way, we can distinguish between a literal NULL argument and other
       types of arguments such as complex expression trees or table arguments where the 'value'
       field is always None.
   ```
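
   As a minimal usage sketch (not part of this PR's diff; the class name, column names, and the `describe_arg` registration below are illustrative assumptions), a UDTF's 'analyze' method can now branch on the new field:

   ```
   from pyspark.sql.functions import udtf
   from pyspark.sql.types import StringType, StructType
   from pyspark.sql.udtf import AnalyzeArgument, AnalyzeResult

   @udtf
   class DescribeArg:
       @staticmethod
       def analyze(arg: AnalyzeArgument) -> AnalyzeResult:
           if arg.isConstantExpression:
               # Constant-foldable scalar: 'value' is None only for a NULL literal.
               col = "null_literal" if arg.value is None else "constant"
           else:
               # Complex expression tree or table argument: 'value' is always None here.
               col = "non_constant"
           return AnalyzeResult(schema=StructType().add(col, StringType()))

       def eval(self, arg):
           yield (str(arg),)
   ```

   Assuming the function is registered as `describe_arg` via `spark.udtf.register`, `SELECT * FROM describe_arg(NULL)` resolves with a `null_literal` column, `describe_arg('x')` with a `constant` column, and `describe_arg(rand())` with a `non_constant` column.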
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes: `AnalyzeArgument` gains a new `isConstantExpression` field, as described above.
   
   ### How was this patch tested?
   
   This PR adds test coverage.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "amaliujia (via GitHub)" <gi...@apache.org>.
amaliujia commented on code in PR #45319:
URL: https://github.com/apache/spark/pull/45319#discussion_r1506632295


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala:
##########
@@ -207,14 +207,22 @@ class UserDefinedPythonTableFunctionAnalyzeRunner(
         case NamedArgumentExpression(k, v) => (Some(k), v)
         case _ => (None, expr)
       }
+      // Write the 'value' field to provide the argument's value if it is a foldable scalar\

Review Comment:
   nit: extra `\` after `scalar`?



Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45319:
URL: https://github.com/apache/spark/pull/45319#discussion_r1506689984


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -78,14 +78,16 @@ def read_arguments(infile: IO) -> Tuple[List[AnalyzeArgument], Dict[str, Analyze
     kwargs: Dict[str, AnalyzeArgument] = {}
     for _ in range(num_args):
         dt = _parse_datatype_json_string(utf8_deserializer.loads(infile))
-        if read_bool(infile):  # is foldable
+        is_constant_expression = read_bool(infile)
+        if is_constant_expression:
             value = pickleSer._read_with_length(infile)
             if dt.needConversion():
                 value = dt.fromInternal(value)
         else:
             value = None
-        is_table = read_bool(infile)  # is table argument
-        argument = AnalyzeArgument(dataType=dt, value=value, isTable=is_table)
+        is_table = read_bool(infile)
+        argument = AnalyzeArgument(
+            dataType=dt, value=value, isTable=is_table, isConstantExpression=is_constant_expression)

Review Comment:
   Seems to be making the linter angry. 



Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #45319:
URL: https://github.com/apache/spark/pull/45319#issuecomment-1970173842

   The failed test seems unrelated to this PR.


Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45319:
URL: https://github.com/apache/spark/pull/45319#issuecomment-1969949502

   Thanks @amaliujia @ueshin for your reviews! I applied the changes you suggested; please take another look.


Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #45319:
URL: https://github.com/apache/spark/pull/45319#discussion_r1506692938


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -78,14 +78,16 @@ def read_arguments(infile: IO) -> Tuple[List[AnalyzeArgument], Dict[str, Analyze
     kwargs: Dict[str, AnalyzeArgument] = {}
     for _ in range(num_args):
         dt = _parse_datatype_json_string(utf8_deserializer.loads(infile))
-        if read_bool(infile):  # is foldable
+        is_constant_expression = read_bool(infile)
+        if is_constant_expression:
             value = pickleSer._read_with_length(infile)
             if dt.needConversion():
                 value = dt.fromInternal(value)
         else:
             value = None
-        is_table = read_bool(infile)  # is table argument
-        argument = AnalyzeArgument(dataType=dt, value=value, isTable=is_table)
+        is_table = read_bool(infile)
+        argument = AnalyzeArgument(
+            dataType=dt, value=value, isTable=is_table, isConstantExpression=is_constant_expression)

Review Comment:
   👍 
   ```
   spark$ dev/reformat-python 
   reformatted spark/python/pyspark/sql/worker/analyze_udtf.py
   
   All done! ✨ 🍰 ✨
   1 file reformatted, 984 files left unchanged.
   ```



Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45319:
URL: https://github.com/apache/spark/pull/45319#discussion_r1506560801


##########
python/pyspark/sql/worker/analyze_udtf.py:
##########
@@ -84,8 +84,10 @@ def read_arguments(infile: IO) -> Tuple[List[AnalyzeArgument], Dict[str, Analyze
                 value = dt.fromInternal(value)
         else:
             value = None
-        is_table = read_bool(infile)  # is table argument
-        argument = AnalyzeArgument(dataType=dt, value=value, isTable=is_table)
+        is_table = read_bool(infile)
+        is_constant_expression = read_bool(infile)

Review Comment:
   Then receive `is_constant_expression` from `read_bool(infile)` at line 81?



Re: [PR] [SPARK-47214][Python] Create API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on PR #45319:
URL: https://github.com/apache/spark/pull/45319#issuecomment-1969675584

   cc @dongjoon-hyun @ueshin @allisonwang-db 


Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on code in PR #45319:
URL: https://github.com/apache/spark/pull/45319#discussion_r1506558129


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonFunction.scala:
##########
@@ -207,14 +207,22 @@ class UserDefinedPythonTableFunctionAnalyzeRunner(
         case NamedArgumentExpression(k, v) => (Some(k), v)
         case _ => (None, expr)
       }
+      // Write the 'value' field to provide the argument's value if it is a foldable scalar\
+      // expression.
       if (value.foldable) {
         dataOut.writeBoolean(true)

Review Comment:
   Can we reuse this? This `true` and the `false` at line 217 should be the same as `value.foldable`.



Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin closed pull request #45319: [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments
URL: https://github.com/apache/spark/pull/45319


Re: [PR] [SPARK-47214][Python] Create UDTF API for 'analyze' method to differentiate constant NULL arguments and other types of arguments [spark]

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #45319:
URL: https://github.com/apache/spark/pull/45319#issuecomment-1970174021

   Thanks! merging to master.

