You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/24 05:11:05 UTC

[GitHub] [spark] wzx140 opened a new pull request, #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

wzx140 opened a new pull request, #36973:
URL: https://github.com/apache/spark/pull/36973

   …oDeserializer
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer.
   
   
   ### Why are the changes needed?
   Fix ByteBuffer forget to rewind after get in AvroDeserializer.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   Add ut in AvroCatalystDataConversionSuite.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xkrogen commented on a diff in pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
xkrogen commented on code in PR #36973:
URL: https://github.com/apache/spark/pull/36973#discussion_r906200144


##########
connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala:
##########
@@ -360,4 +362,27 @@ class AvroCatalystDataConversionSuite extends SparkFunSuite
       None,
       new OrderedFilters(Seq(Not(EqualTo("Age", 39))), sqlSchema))
   }
+
+  test("AvroDeserializer with binary type") {
+    val jsonFormatSchema =
+      """
+        |{ "type": "record",
+        |  "name": "record",
+        |  "fields" : [{
+        |    "name": "a",
+        |    "type": {
+        |      "type": "bytes"
+        |    }

Review Comment:
   Why is there a nested type here instead of just `"type": "bytes"` at the top-level?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36973:
URL: https://github.com/apache/spark/pull/36973#discussion_r906168886


##########
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##########
@@ -195,6 +195,8 @@ private[sql] class AvroDeserializer(
           case b: ByteBuffer =>
             val bytes = new Array[Byte](b.remaining)
             b.get(bytes)
+            // Do not forget to reset the position
+            b.rewind()

Review Comment:
   I think you're right but can you explain why this is needed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzx140 commented on a diff in pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #36973:
URL: https://github.com/apache/spark/pull/36973#discussion_r906183583


##########
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##########
@@ -195,6 +195,8 @@ private[sql] class AvroDeserializer(
           case b: ByteBuffer =>
             val bytes = new Array[Byte](b.remaining)
             b.get(bytes)
+            // Do not forget to reset the position
+            b.rewind()

Review Comment:
   HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes. You can take a look at added unit test. The second call of deserializer will return an InternalRow with empty binary column.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzx140 commented on a diff in pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #36973:
URL: https://github.com/apache/spark/pull/36973#discussion_r906223396


##########
connector/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala:
##########
@@ -360,4 +362,27 @@ class AvroCatalystDataConversionSuite extends SparkFunSuite
       None,
       new OrderedFilters(Seq(Not(EqualTo("Age", 39))), sqlSchema))
   }
+
+  test("AvroDeserializer with binary type") {
+    val jsonFormatSchema =
+      """
+        |{ "type": "record",
+        |  "name": "record",
+        |  "fields" : [{
+        |    "name": "a",
+        |    "type": {
+        |      "type": "bytes"
+        |    }

Review Comment:
   This is not necessary and it has been removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzx140 closed pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
wzx140 closed pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…
URL: https://github.com/apache/spark/pull/36973


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36973:
URL: https://github.com/apache/spark/pull/36973#issuecomment-1166154260

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mridulm commented on pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
mridulm commented on PR #36973:
URL: https://github.com/apache/spark/pull/36973#issuecomment-1165358344

   +CC @xkrogen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen closed pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
srowen closed pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…
URL: https://github.com/apache/spark/pull/36973


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzx140 commented on pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
wzx140 commented on PR #36973:
URL: https://github.com/apache/spark/pull/36973#issuecomment-1165341994

   @gengliangwang Would you like to review this pr?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36973:
URL: https://github.com/apache/spark/pull/36973#issuecomment-1166530911

   Can you try rebasing and pushing? some doc build steps failed, which is unrelated to the change, but might resolve it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzx140 commented on pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
wzx140 commented on PR #36973:
URL: https://github.com/apache/spark/pull/36973#issuecomment-1165741695

   
   
   
   
   > Code changes LGTM. I checked the other place the deserializer uses a `ByteBuffer`, in the `(BYTES, _: DecimalType)` case, and it doesn't have the same problem because `decimalConversions.fromBytes()` duplicates the `ByteBuffer` before extracting from it.
   > 
   > Can you update the PR description to have more details on why this is needed, basically what you described in [your comment](https://github.com/apache/spark/pull/36973/files#r906183583)? It might also be helpful to update the summary to be more descriptive about the impact rather than the mechanics of the change, something like "Fix repeated deserialization of BYTES type in AvroDeserializer" (and then the body can describe the mechanical/technical aspects of the change).
   
   @xkrogen Already updated the PR description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on a diff in pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #36973:
URL: https://github.com/apache/spark/pull/36973#discussion_r906186558


##########
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##########
@@ -195,6 +195,8 @@ private[sql] class AvroDeserializer(
           case b: ByteBuffer =>
             val bytes = new Array[Byte](b.remaining)
             b.get(bytes)
+            // Do not forget to reset the position
+            b.rewind()

Review Comment:
   Sounds good. I wonder why this never surfaced before? seems like it would mean any binary cols in Avro don't work?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzx140 commented on a diff in pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
wzx140 commented on code in PR #36973:
URL: https://github.com/apache/spark/pull/36973#discussion_r906192030


##########
connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala:
##########
@@ -195,6 +195,8 @@ private[sql] class AvroDeserializer(
           case b: ByteBuffer =>
             val bytes = new Array[Byte](b.remaining)
             b.get(bytes)
+            // Do not forget to reset the position
+            b.rewind()

Review Comment:
   Maybe this is not common to call this twice to deserialize one avro data object



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wzx140 commented on pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
wzx140 commented on PR #36973:
URL: https://github.com/apache/spark/pull/36973#issuecomment-1166733628

   @srowen Thanks for your help. I have rebased and pushed. Now github Action is all successful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srowen commented on pull request #36973: [SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr…

Posted by GitBox <gi...@apache.org>.
srowen commented on PR #36973:
URL: https://github.com/apache/spark/pull/36973#issuecomment-1166746551

   Merged to master/3.3/3.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org