Posted to reviews@spark.apache.org by "justaparth (via GitHub)" <gi...@apache.org> on 2023/05/09 17:12:56 UTC

[GitHub] [spark] justaparth opened a new pull request, #41108: [SPARK-43427] spark protobuf: modify deserialization behavior of unsigned integer types

justaparth opened a new pull request, #41108:
URL: https://github.com/apache/spark/pull/41108

   https://issues.apache.org/jira/browse/SPARK-43427
   
   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   
   **Explanation**
   Protobuf supports unsigned integer types, including `uint32` and `uint64`. When deserializing protobuf values with fields of these types, uint32 is converted to `IntegerType` and uint64 is converted to `LongType` in the resulting spark struct. `IntegerType` and `LongType` are [signed](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is above 2^31 or a uint64 value is above 2^63, their representation in binary will contain a 1 in the highest bit, which when interpreted as a signed integer will come out as negative (I.e. overflow).
   
   I propose that we deserialize unsigned integer types into a type that can contain them correctly, e.g.
   uint32 => `LongType`
   uint64 => `Decimal(20, 0)`
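   
   To make the overflow concrete, here's a quick Python sketch (illustration only, not Spark or spark-protobuf code; the helper names are mine): reinterpreting the raw bits of a large unsigned value in a same-width signed slot flips it negative, while a wider type holds the value exactly.
   
```python
import struct

def as_signed32(u):
    # Reinterpret the bits of an unsigned 32-bit value as a signed int32,
    # i.e. what lands in a Spark IntegerType column today.
    return struct.unpack("<i", struct.pack("<I", u))[0]

def as_signed64(u):
    # Reinterpret the bits of an unsigned 64-bit value as a signed int64 (LongType).
    return struct.unpack("<q", struct.pack("<Q", u))[0]

u32 = 2**31 + 5   # a uint32 value above 2^31
u64 = 2**63 + 1   # a uint64 value above 2^63

print(as_signed32(u32))  # -2147483643: overflows in a signed 32-bit slot
print(as_signed64(u64))  # -9223372036854775807: overflows in a signed 64-bit slot

# Widening sidesteps the overflow: a 64-bit signed slot (LongType) holds any
# uint32 value, and a Decimal(20, 0) holds any uint64 value.
```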
   
   **Backwards Compatibility / Default Behavior**
   **Should we maintain backwards compatibility and add an option that allows deserializing these types differently? Or should we change the default behavior (with an option to go back to the old way)?**
   
   I think by default it makes more sense to deserialize them as the larger types so that it's semantically more correct. However, there may be existing users of this library that would be affected by this behavior change. That said, maybe we can justify the change since the function is tagged as `Experimental` (and Spark 3.4.0 was only released very recently).
   
   **Precedent**
   I believe that unsigned integer types in parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation natively fits. https://issues.apache.org/jira/browse/SPARK-34817 and https://github.com/apache/spark/pull/31921
   
   ### Why are the changes needed?
   Improve unsigned integer deserialization behavior.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, as written it would change the deserialization behavior of unsigned integer field types.
   
   
   ### How was this patch tested?
   Unit Testing
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] justaparth commented on pull request #41108: [SPARK-43427] spark protobuf: modify deserialization behavior of unsigned integer types

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1543602788

   also cc @HyukjinKwon as you reviewed https://github.com/apache/spark/pull/31921 and have reviewed many proto changes too 😅 




[GitHub] [spark] justaparth commented on a diff in pull request #41108: [SPARK-43427] spark protobuf: modify deserialization behavior of unsigned integer types

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41108:
URL: https://github.com/apache/spark/pull/41108#discussion_r1189463819


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1531,6 +1531,47 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("test unsigned integer types") {
+    // The Java types for uint32 and uint64 are signed int and long, respectively.
+    // Let's check that we're converting correctly.
+    val sample = spark.range(1).select(
+      lit(
+        SimpleMessage
+          .newBuilder()
+          .setUint32Value(Integer.MIN_VALUE)
+          .setUint64Value(Long.MinValue)
+          .build()
+          .toByteArray
+      ).as("raw_proto"))
+
+    val expected = spark.range(1).select(
+      lit(Integer.toUnsignedLong(Integer.MIN_VALUE).longValue).as("uint32_value"),

Review Comment:
   using `MIN_VALUE` because its highest bit is set in binary, so it's the canonical overflow case: interpreted as "unsigned" it becomes 2^31, which doesn't fit in a signed int





[GitHub] [spark] justaparth commented on a diff in pull request #41108: [SPARK-43427] spark protobuf: modify deserialization behavior of unsigned integer types

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on code in PR #41108:
URL: https://github.com/apache/spark/pull/41108#discussion_r1188906519


##########
connector/protobuf/src/test/scala/org/apache/spark/sql/protobuf/ProtobufFunctionsSuite.scala:
##########
@@ -1465,6 +1465,49 @@ class ProtobufFunctionsSuite extends QueryTest with SharedSparkSession with Prot
     }
   }
 
+  test("test unsigned integer types") {
+    // The Java types for uint32 and uint64 are signed int and long, respectively.
+    // Let's check that we're converting correctly.
+    val sample = spark.range(1).select(
+      lit(
+        UnsignedTypes
+          .newBuilder()
+          .setUint32Value(Integer.MIN_VALUE)
+          .setUint64Value(Long.MinValue)
+          .build()
+          .toByteArray
+      ).as("raw_proto"))
+
+    val expected = spark.range(1).select(
+      struct(
+        lit(Integer.toUnsignedLong(Integer.MIN_VALUE).longValue).as("uint32_value"),
+        lit(BigDecimal(java.lang.Long.toUnsignedString(Long.MinValue))).as("uint64_value")
+      ).as("proto")
+    )
+
+    val expectedWithLegacy = spark.range(1).select(
+      struct(
+        lit(Integer.MIN_VALUE).as("uint32_value"),

Review Comment:
   using `MIN_VALUE` because its highest bit is set in binary, so it's the canonical overflow case: interpreted as "unsigned" it becomes 2^31, which doesn't fit in a signed int





[GitHub] [spark] justaparth commented on pull request #41108: [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1556292297

   > Where is the information loss or overflow? Java code generated by Protobuf for a uint32 field also returns an `int`, not `long`.
   
   Sorry, I didn't get a chance to reply to this until now. There is no information loss, technically, as uint32 is 4 bytes and uint64 is 8 bytes, the same as int and long respectively. However, there is overflow in the representation.
   
   Here's an example:
   
   Consider a protobuf message like:
   ```proto
   syntax = "proto3";
   
   message Test {
     uint64 val = 1;
   }
   ```
   
   Generate a protobuf with a value above 2^63. I did this in python with a small script like:
   
   ```python
   import test_pb2
   
   s = test_pb2.Test()
   s.val = 9223372036854775809 # 2**63 + 1
   serialized = s.SerializeToString()
   print(serialized)
   ```
   
   This generates the binary representation:
   
   ```
   b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'
   ```
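   
   As a sanity check, those bytes can be decoded by hand (a minimal Python sketch of protobuf's base-128 varint decoding, not the protobuf library itself): the leading `0x08` is the tag for field 1 with wire type 0 (varint), and the remaining bytes encode 2^63 + 1.
   
```python
def decode_varint(data, pos=0):
    # Decode a protobuf base-128 varint starting at pos; return (value, next_pos).
    # Each byte contributes its low 7 bits, least-significant group first;
    # the high bit signals continuation.
    result, shift = 0, 0
    while True:
        b = data[pos]
        result |= (b & 0x7F) << shift
        pos += 1
        if not (b & 0x80):
            return result, pos
        shift += 7

raw = b"\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01"
tag, pos = decode_varint(raw)        # 0x08 -> field number 1, wire type 0
value, _ = decode_varint(raw, pos)
print(tag >> 3, value)               # 1 9223372036854775809  (== 2**63 + 1)
```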
   
   Then, deserialize this using `from_protobuf`. I did this in a notebook so it's easier to see, but it could be reproduced in a Scala test as well:
   
   <img width="597" alt="image" src="https://github.com/apache/spark/assets/1002986/a6c58c19-b9d3-44d4-8c2a-605991d3d5de">
   
   
   This is exactly what we'd expect when you take a 64-bit number whose highest bit is `1` and interpret it as a signed number (long).
   
   So this PR proposes some changes to the deserialization behavior. However, I don't know if it's right to change the default or to have an option to allow unpacking into a larger type.
   
   




[GitHub] [spark] justaparth commented on pull request #41108: [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1557723788

   > What if you have a UDF that converts this to BigDecimal? Will you get the value back? I guess that is the intention behind why protobuf-java casts unsigned to signed in its Java methods. I think it is simpler to go this way.
   
   Yeah, there is no information loss, so you can get the right value back the way I did in this PR (`Integer.toUnsignedLong`, `Long.toUnsignedString`). I think, though, it's useful if the `spark-protobuf` library can do this; pushing the burden of transforming the struct onto every user is cumbersome, in my opinion.
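   
   For readers unfamiliar with those Java helpers, the recovery is just a modular reinterpretation of the same bits; here's a rough Python sketch of the equivalent behavior (the function names are mine, for illustration only):
   
```python
def to_unsigned_long(i32):
    # Rough equivalent of Java's Integer.toUnsignedLong: keep the low 32 bits
    # and read them as an unsigned value, yielding a number in [0, 2**32).
    return i32 & 0xFFFFFFFF

def to_unsigned_string(i64):
    # Rough equivalent of Java's Long.toUnsignedString: reinterpret the 64 bits
    # as unsigned and render them as a decimal string.
    return str(i64 & 0xFFFFFFFFFFFFFFFF)

print(to_unsigned_long(-2**31))      # 2147483648  (== 2**31)
print(to_unsigned_string(-2**63))    # 9223372036854775808  (== 2**63)
```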
   
   However, one additional piece of information is that **for unsigned types in parquet, the default behavior is to represent them in larger types**. I put this in the PR description but see this ticket https://issues.apache.org/jira/browse/SPARK-34817 implemented in this PR: https://github.com/apache/spark/pull/31921. Or the existing code today https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L243-L247 which shows that **by default** parquet unsigned values are actually expanded to larger types in spark.
   
   So, since this same problem/solution exists in another storage format, I think it's useful to implement this behavior here as well. I also think that it actually _does_ make sense to do it by default, as parquet already does this. However, I'm also open to putting this transformation behind an option so that no existing usages are broken. Mainly, I want to make sure we do the most correct and broadly consistent thing (and I'm not really sure exactly what that is, and would love some other inputs). cc @HyukjinKwon as well here since you reviewed the original PR doing this for parquet!
   
   




[GitHub] [spark] github-actions[bot] commented on pull request #41108: [SPARK-43427][Protobuf] spark protobuf: modify serde behavior of unsigned integer types

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1722349658

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] rangadi commented on pull request #41108: [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1547235141

   Where is the information loss or overflow? Java code generated by Protobuf for a uint32 field also returns an `int`, not `long`.  




[GitHub] [spark] rangadi commented on pull request #41108: [SPARK-43427] spark protobuf: modify serde behavior of unsigned integer types

Posted by "rangadi (via GitHub)" <gi...@apache.org>.
rangadi commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1557649439

   >  So this PR proposes some changes to the deserialization behavior. However, I don't know if its right to change the default or have an option to allow unpacking as a larger type.
   
   What if you have a UDF that converts this to BigDecimal? Will you get the value back?
   I guess that is the intention behind why protobuf-java casts unsigned to signed in its Java methods.
   I think it is simpler to go this way. Given these kinds of issues, I guess it is not good practice to use unsigned types in protobuf. They can be interpreted correctly at the application level when they are in fact used this way.




[GitHub] [spark] justaparth commented on pull request #41108: [SPARK-43427] spark protobuf: modify deserialization behavior of unsigned integer types

Posted by "justaparth (via GitHub)" <gi...@apache.org>.
justaparth commented on PR #41108:
URL: https://github.com/apache/spark/pull/41108#issuecomment-1540562071

   cc @rangadi — I've made a draft implementation here but just wanted to get your thoughts quickly on:
   
   1. does the problem / solution make sense?
   2. should we make this behavior the default or should we add an option to turn it on? 
   
   thanks 🙏 




[GitHub] [spark] github-actions[bot] closed pull request #41108: [SPARK-43427][Protobuf] spark protobuf: modify serde behavior of unsigned integer types

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #41108: [SPARK-43427][Protobuf] spark protobuf: modify serde behavior of unsigned integer types
URL: https://github.com/apache/spark/pull/41108

