Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/26 02:17:43 UTC

[GitHub] [spark] AngersZhuuuu commented on a change in pull request #29421: [SPARK-32388][SQL][test-hadoop2.7][test-hive1.2] TRANSFORM with schema-less mode should keep the same with hive

AngersZhuuuu commented on a change in pull request #29421:
URL: https://github.com/apache/spark/pull/29421#discussion_r511688786



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
##########
@@ -111,15 +111,14 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
               .zip(outputFieldWriters)
               .map { case (data, writer) => writer(data) })
       } else {
-        // In schema less mode, hive default serde will choose first two output column as output
-        // if output column size less then 2, it will throw ArrayIndexOutOfBoundsException.
-        // Here we change spark's behavior same as hive's default serde.
-        // But in hive, TRANSFORM with schema less behavior like origin spark, we will fix this
-        // to keep spark and hive behavior same in SPARK-32388
+        // In schema-less mode, Hive will choose the first two output columns as output.
+        // If there are fewer than 2 output columns, it will return NULL for the missing
+        // values. Here we split the row string and take the first two values; if fewer
+        // than 2 values exist, we pad with NULL to match Hive's behavior.
         val kvWriter = CatalystTypeConverters.createToCatalystConverter(StringType)
         prevLine: String =>
           new GenericInternalRow(
-            prevLine.split(outputRowFormat).slice(0, 2)
+            prevLine.split(outputRowFormat).slice(0, 2).padTo(2, null)

Review comment:
       > @AngersZhuuuu, can you add a configuration to control the legacy behaviour, and add a note to the migration guide since it's arguable to say this is a bug? With that, I think I can leave my sign-off and merge.
   
   This suggestion looks good to me, but note that we have already fixed a similar issue:  https://github.com/apache/spark/commit/c75a82794fc6a0f35697f8e1258562d43e860f68#diff-1bdf8d74f6b4f84d5acb1d1490503cde337fe64fe5d21339d7fedb06bd7cabc0
   
   How about merging this first? I will follow up with a PR that adds a configuration to control the legacy behavior. Or would it be better to change these two points together?
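   For reference, the padding change in the diff above can be illustrated with a minimal standalone sketch. The object name, `main` driver, and tab delimiter are illustrative only; in the actual patch the delimiter comes from `outputRowFormat` and the result feeds a `GenericInternalRow`:

   ```scala
   object SchemaLessPadDemo {
     // Mimics the patched logic: take at most the first two fields and pad
     // with null so the row always carries exactly two values.
     def firstTwoPadded(line: String, delimiter: String): Array[String] =
       line.split(delimiter).slice(0, 2).padTo(2, null)

     def main(args: Array[String]): Unit = {
       // Three fields: only the first two are kept, matching Hive.
       println(firstTwoPadded("a\tb\tc", "\t").mkString(","))
       // One field: padded with null instead of throwing
       // ArrayIndexOutOfBoundsException, as the old comment described.
       println(firstTwoPadded("a", "\t").mkString(","))
     }
   }
   ```

   With this, a single-column input yields `(a, NULL)` rather than an exception, which is the Hive-compatible behavior the PR targets.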




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org