You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "rangadi (via GitHub)" <gi...@apache.org> on 2023/12/06 19:04:01 UTC

[PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

rangadi opened a new pull request, #44214:
URL: https://github.com/apache/spark/pull/44214

   
   
   ### What changes were proposed in this pull request?
   This updates the the behavior of `from_protobuf()` built function when underlying record fails to deserialize. 
   
     * **Current behvior**: 
       * By default, this would throw an error and the query fails. [This part is not changed in the PR] 
       * When `mode` is set to 'PERMISSIVE' it return a non-null struct with each of the inner fields set to null e.g. `{ "field_a": null, "field_b": null }`  etc.
          * This is not very convenient to the users. They don't know if this was due malformed record or if the input itself has null. It is very hard to check for each of the fields for null in SQL query.
         
     * **New behavior**
       * When `mode` is set to 'PERMISSIVE' it simply returns `null`.
   
   
   ### Why are the changes needed?
   This makes it easier for users to detect and handle malformed records.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, but this does not change the contract. In fact, it clarifies it. 
   
   ### How was this patch tested?
    - Unit tests are updated.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1419883619


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   Sorry, I just discovered that the positioning of this tick is a bug fix, so option A is also fine to me.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #44214:
URL: https://github.com/apache/spark/pull/44214#issuecomment-1843528714

   cc: @LuciferYang, @SandishKumarHN, @justaparth 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #44214:
URL: https://github.com/apache/spark/pull/44214#issuecomment-1847497554

   Thanks @HyukjinKwon for merging. I will create a backport.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #44214: [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails.
URL: https://github.com/apache/spark/pull/44214


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1420015427


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   Thanks. This existing from day 1. Updated 'affects version' to 3.4.0. 
   What branches should we commit? I am not sure how to specify that in the PR.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   Thanks. This exists from day 1. Updated 'affects version' to 3.4.0. 
   What branches should we commit? I am not sure how to specify that in the PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #44214:
URL: https://github.com/apache/spark/pull/44214#issuecomment-1844204708

   @LuciferYang, I was thinking about such an option too. Mostly it is not required. 
   The current behaviour this PR is more surprising to the users. We have see only a few customers try 'permissive' option. 
   One downside of adding the option is that we need keep all the old code. Not a bit deal, but let me know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1419486979


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   @LuciferYang, @justaparth moving the discussion about backward compatibility to this comment so that we have a nice thread. I agree with concerns. We have three options to handle this:
   
     * (A) Do not include option to have old behavior (this PR, **preferred** by author).
         * I think 99.99 % of the cases this is will be either not relevant (because of mode is not set), or does not matter.
         * For the remaining cases I think users will understand and update the query.
    * (B) Provide an option to switch to old behavior. The default is new behavior (**less preferred**).
         * I am ok to to do this option, though I don't think it will be used much.
         * We will have a deprecation warning and will remove this option and relevant code in next major release.
    * (C) Keep old behavior as the default and provide option for new behavior (**least preferred**).
        * I do not think this is a good option.
        * It  makes it hard for users to benefit from the PR.
         



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1419885633


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   @rangadi I remember that Spark 3.4 already had this feature, but Jira only mentioned 3.5, so does this bug not exist in 3.4? Or should we need to correct `Affects Version/s` and `Fix Version/s` in Jira?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1417863243


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {
-    case st: StructType =>
-      val resultRow = new SpecificInternalRow(st.map(_.dataType))
-      for (i <- 0 until st.length) {
-        resultRow.setNullAt(i)
-      }
-      resultRow
-
-    case _ =>
-      null
-  }
-
   private def handleException(e: Throwable): Any = {
     parseMode match {
-      case PermissiveMode =>
-        nullResultRow
+      case PermissiveMode => null

Review Comment:
   This is the main fix. Returns `null` rather than _struct with null fields_.



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -39,16 +39,8 @@ private[sql] case class ProtobufDataToCatalyst(
 
   override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)
 
-  override lazy val dataType: DataType = {
-    val dt = SchemaConverters.toSqlType(messageDescriptor, protobufOptions).dataType
-    parseMode match {
-      // With PermissiveMode, the output Catalyst row might contain columns of null values for
-      // corrupt records, even if some of the columns are not nullable in the user-provided schema.
-      // Therefore we force the schema to be all nullable here.
-      case PermissiveMode => dt.asNullable

Review Comment:
   This is not needed anymore since we return null rather than non-null struct with null fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1420024329


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   Make sense. Added `3.4.3` as as well. Thanks for updating. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44214:
URL: https://github.com/apache/spark/pull/44214#issuecomment-1846707121

   Merged to master and branch-3.5.
   
   It has some conflict in branch-3.4. Would you mind creating a backporting PR please @rangadi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44214:
URL: https://github.com/apache/spark/pull/44214#issuecomment-1844151075

   This is a change in the default behavior of the built-in function. Should we consider adding a config to restore the legacy behavior? Additionally, since this is a user-facing change, I think we should add a note in the sql-migration-guide.md file.
   
   also cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1420024879


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   @LuciferYang once you approve, I will ask @HyukjinKwon  to merge this to appropriate branches.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46275] Protobuf: Return null in permissive mode when deserialization fails. [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #44214:
URL: https://github.com/apache/spark/pull/44214#discussion_r1420020977


##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   > This exists from day 1. Updated 'affects version' to 3.4.0.
   
   I think the `Affects Version/s:` should be `3.4.2`, `3.5.0`, `4.0.0`
   
   then the `Fix Version/s` should be `3.4.3`, `3.5.1` and `4.0.0`?



##########
connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala:
##########
@@ -87,22 +79,9 @@ private[sql] case class ProtobufDataToCatalyst(
     mode
   }
 
-  @transient private lazy val nullResultRow: Any = dataType match {

Review Comment:
   > This exists from day 1. Updated 'affects version' to 3.4.0.
   
   I think the `Affects Version/s` should be `3.4.2`, `3.5.0`, `4.0.0`
   
   then the `Fix Version/s` should be `3.4.3`, `3.5.1` and `4.0.0`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org