Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/04 11:00:37 UTC

[GitHub] [iceberg] sauliusvl opened a new issue #4038: UUID type support in Spark is incomplete?

sauliusvl opened a new issue #4038:
URL: https://github.com/apache/iceberg/issues/4038


   Consider a table created in Trino (it has a native UUID type that maps to UUID in Iceberg):
   
   ```sql
   CREATE TABLE test (id UUID) WITH (location = 'hdfs://rr-hdpz1/user/iceberg/test', format = 'PARQUET');
   insert into test values (uuid '12151fd2-7586-11e9-8f9e-2a86e4085a59')
   ```
   
   Everything looks good: I can query the table, and running `parquet meta` on the resulting file in HDFS shows it is written correctly according to the Parquet specification (a fixed-length byte array of 16 bytes with logical type `UUID`):
   
   ```
   message table {
     optional fixed_len_byte_array(16) id (UUID) = 1;
   }
   ```
   
   Now I try to read it in Spark:
   
   ```scala
   scala> spark.sql("select * from test").printSchema
   root
    |-- id: string (nullable = true)
   
   scala> spark.sql("select * from test").show(false)
   22/02/03 12:16:18 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (bn-hdpz1.vinted.net executor 2): java.lang.ClassCastException: class [B cannot be cast to class org.apache.spark.unsafe.types.UTF8String ([B is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app')
   	at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getUTF8String(rows.scala:46)
   ```
   The same thing happens when trying to insert from Spark:
   
   ```scala
   scala> spark.sql("insert into test values ('e9238c3e-3aa6-4668-aceb-f9507a8f8d59')")
   22/02/03 12:42:29 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4) (hx-hdpz2.vinted.net executor 3): java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class [B (org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'; [B is in module java.base of loader 'bootstrap')
   	at org.apache.iceberg.spark.data.SparkParquetWriters$ByteArrayWriter.write(SparkParquetWriters.java:291)
   ```
   
   
   The [docs](https://iceberg.apache.org/#spark-writes/#type-compatibility) seem to suggest that UUID should be converted to a string in Spark, but after reading the source code I don't see how this is supposed to work: the `UUID` type simply gets mapped to `String` [here](https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/TypeToSparkType.java#L112), which is not enough - the byte representation of a UUID cannot be cast to `String` directly.
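   To illustrate why a plain cast cannot work (this is a standalone sketch, not Iceberg's actual code path; the class and method names are hypothetical), recovering the familiar text form from the 16 stored bytes requires an explicit decode through `java.util.UUID`:
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.UUID;
   
   public class UuidDecode {
       // Hypothetical helper: decode the 16-byte big-endian value stored in
       // Parquet into java.util.UUID, whose toString() yields the text form.
       static UUID fromBytes(byte[] bytes) {
           ByteBuffer buf = ByteBuffer.wrap(bytes); // ByteBuffer is big-endian by default
           return new UUID(buf.getLong(), buf.getLong());
       }
   
       public static void main(String[] args) {
           byte[] raw = new byte[] {
               0x12, 0x15, 0x1f, (byte) 0xd2, 0x75, (byte) 0x86, 0x11, (byte) 0xe9,
               (byte) 0x8f, (byte) 0x9e, 0x2a, (byte) 0x86, (byte) 0xe4, 0x08, 0x5a, 0x59
           };
           System.out.println(fromBytes(raw)); // 12151fd2-7586-11e9-8f9e-2a86e4085a59
       }
   }
   ```
   
   Nothing like this runs on the read path today, which is why the raw `byte[]` reaches `getUTF8String` and the `ClassCastException` above is thrown.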
   
   The way I see it, Iceberg should do one of the following:
    - perform the conversion to and from string automatically on read/insert (not sure if this is possible, and it is not optimal performance-wise, as the text representation is 36 characters vs. the raw 16 bytes)
    - map the type to a byte array and require users to convert to `java.util.UUID` themselves if they really want the text representation - optimal, but not user friendly
    - implement a Spark [user defined type](https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/types/UserDefinedType.html) for UUID and convert to it - not sure about the performance implications here
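   As a rough sketch of what the first option would need on the write path (hedged: this is illustrative code, not an Iceberg API; the class and method names are made up), the string would have to be parsed and re-encoded as the 16-byte big-endian value Parquet expects:
   
   ```java
   import java.nio.ByteBuffer;
   import java.util.UUID;
   
   public class UuidEncode {
       // Hypothetical helper for the write path: parse the hex-and-dash text
       // form and emit the fixed 16-byte big-endian representation.
       static byte[] toBytes(String text) {
           UUID uuid = UUID.fromString(text);
           ByteBuffer buf = ByteBuffer.allocate(16);
           buf.putLong(uuid.getMostSignificantBits());
           buf.putLong(uuid.getLeastSignificantBits());
           return buf.array();
       }
   
       public static void main(String[] args) {
           byte[] raw = toBytes("e9238c3e-3aa6-4668-aceb-f9507a8f8d59");
           System.out.println(raw.length); // 16
       }
   }
   ```
   
   The parse/encode step is where the performance cost mentioned above would come from, since every inserted value pays a string-to-binary conversion.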
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


