You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by jo...@apache.org on 2022/01/04 19:05:28 UTC
[spark] branch branch-3.0 updated: [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()

This is an automated email from the ASF dual-hosted git repository.

joshrosen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 3aaf722  [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
3aaf722 is described below

commit 3aaf722a0d6552733504c794f59d390c349dfa80
Author: Josh Rosen <jo...@databricks.com>
AuthorDate: Tue Jan 4 10:59:53 2022 -0800

    [SPARK-37784][SQL] Correctly handle UDTs in CodeGenerator.addBufferedState()
    
    ### What changes were proposed in this pull request?
    
    This PR fixes a correctness issue in the CodeGenerator.addBufferedState() helper method (which is used by the SortMergeJoinExec operator).
    
    The addBufferedState() method generates code for buffering values that come from a row in an operator's input iterator, performing any necessary copying so that the buffered values remain correct after the input iterator advances to the next row.
    
    The current logic does not correctly handle UDTs: these fall through to the match statement's default branch, causing UDT values to be buffered without copying. This is problematic if the UDT's underlying SQL type is an array, map, struct, or string type (since those types require copying). Failing to copy values can lead to correctness issues or crashes.
    
    This patch's fix is simple: when the dataType is a UDT, use its underlying sqlType for determining whether values need to be copied. I used an existing helper function to perform this type unwrapping.
    
    ### Why are the changes needed?
    
    Fix a correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    I manually tested this change by re-running a workload which failed with a segfault prior to this patch. See JIRA for more details: https://issues.apache.org/jira/browse/SPARK-37784
    
    So far I have been unable to come up with a CI-runnable regression test which would have failed prior to this change (my only working reproduction runs in a pre-production environment and does not fail in my development environment).
    
    Closes #35066 from JoshRosen/SPARK-37784.
    
    Authored-by: Josh Rosen <jo...@databricks.com>
    Signed-off-by: Josh Rosen <jo...@databricks.com>
    (cherry picked from commit eeef48fac412a57382b02ba3f39456d96379b5f5)
    Signed-off-by: Josh Rosen <jo...@databricks.com>
---
 .../apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
index 74c6ea5..90ebe40 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
@@ -330,7 +330,7 @@ class CodegenContext extends Logging {
    */
   def addBufferedState(dataType: DataType, variableName: String, initCode: String): ExprCode = {
     val value = addMutableState(javaType(dataType), variableName)
-    val code = dataType match {
+    val code = UserDefinedType.sqlType(dataType) match {
       case StringType => code"$value = $initCode.clone();"
       case _: StructType | _: ArrayType | _: MapType => code"$value = $initCode.copy();"
       case _ => code"$value = $initCode;"

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org