Posted to issues@kudu.apache.org by "Grant Henke (JIRA)" <ji...@apache.org> on 2018/05/24 21:35:00 UTC

[jira] [Created] (KUDU-2454) Avro Import/Export does not round trip

Grant Henke created KUDU-2454:
---------------------------------

             Summary: Avro Import/Export does not round trip
                 Key: KUDU-2454
                 URL: https://issues.apache.org/jira/browse/KUDU-2454
             Project: Kudu
          Issue Type: Bug
            Reporter: Grant Henke


When exporting to Avro, columns of type Byte or Short are written as Integers because Avro has no Byte or Short type. When the data is re-imported, the job fails because the column types do not match.

Ideally, spark-avro would solve this by safely casting the values back to the smaller type. Guava has utilities that make this straightforward (e.g. Shorts.checkedCast(i)). We could send a pull request to spark-avro to fix this, or add special handling on the Kudu side to perform the safe down-conversion.
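As a rough illustration of the safe down-conversion, here is a minimal sketch that uses only the JDK. The class and method names (SafeDowncast, toShort, toByte) are hypothetical and are not part of Kudu or spark-avro. It throws IllegalArgumentException on overflow, which is what Guava's checkedCast helpers do as well.

```java
/** Hypothetical helper: range-checked narrowing of Avro ints to Kudu's smaller types. */
final class SafeDowncast {
    private SafeDowncast() {}

    /** Narrows an int to a short, failing instead of silently truncating. */
    static short toShort(int value) {
        if (value < Short.MIN_VALUE || value > Short.MAX_VALUE) {
            throw new IllegalArgumentException("Out of range for short: " + value);
        }
        return (short) value;
    }

    /** Narrows an int to a byte, failing instead of silently truncating. */
    static byte toByte(int value) {
        if (value < Byte.MIN_VALUE || value > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("Out of range for byte: " + value);
        }
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(toShort(1234));
        System.out.println(toByte(-128));
        try {
            toByte(128); // one past Byte.MAX_VALUE, so this is rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing loudly seems preferable to a plain cast here, since a plain (short) or (byte) cast would wrap around and silently write a wrong value.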

Another type issue when exporting is that Decimal values are written as Strings instead of as Avro's decimal logical type. There are a few unmerged pull requests to fix that here:
 * [https://github.com/databricks/spark-avro/pull/276]
 * [https://github.com/databricks/spark-avro/pull/121]
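
For context, Avro's decimal logical type stores the unscaled integer value as big-endian two's-complement bytes, with the scale fixed in the schema. The sketch below shows that encoding with the JDK only. The class name AvroDecimalSketch and its methods are hypothetical, and this is not the approach taken in either pull request above.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

/** Hypothetical illustration of the byte layout used by Avro's decimal logical type. */
final class AvroDecimalSketch {
    private AvroDecimalSketch() {}

    /** Encodes the value at the schema's scale; throws ArithmeticException if rounding would be needed. */
    static byte[] encode(BigDecimal value, int scale) {
        BigDecimal scaled = value.setScale(scale, RoundingMode.UNNECESSARY);
        // BigInteger.toByteArray() yields big-endian two's-complement bytes.
        return scaled.unscaledValue().toByteArray();
    }

    /** Rebuilds the decimal from the unscaled bytes and the schema's scale. */
    static BigDecimal decode(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal original = new BigDecimal("123.45");
        BigDecimal roundTripped = decode(encode(original, 2), 2);
        System.out.println(original.equals(roundTripped));
    }
}
```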

Additionally, Timestamp values are written as longs instead of Timestamp logical types (timestamp-micros). This is a data corruption issue because the long [value that is output|https://github.com/databricks/spark-avro/blob/0764d699015975acf87dc5210cca8a43db84196a/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala#L103] is in milliseconds (Timestamp.getTime()), but the expected long value for a Kudu Timestamp column is in microseconds.
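
To make the unit mismatch concrete: a millisecond count read as microseconds is about 1000 times closer to the epoch than intended, so recent timestamps land in early 1970. A minimal sketch of the conversion follows. The class name TimestampMicros and its methods are hypothetical and are not existing Kudu or spark-avro code.

```java
import java.sql.Timestamp;

/** Hypothetical helper: epoch microseconds from millisecond-based values. */
final class TimestampMicros {
    private TimestampMicros() {}

    /** Converts epoch milliseconds to epoch microseconds, failing on overflow. */
    static long fromMillis(long epochMillis) {
        return Math.multiplyExact(epochMillis, 1000L);
    }

    /** Converts a java.sql.Timestamp to epoch microseconds, keeping sub-millisecond precision. */
    static long fromTimestamp(Timestamp ts) {
        // getTime() is in milliseconds; getNanos() holds the whole fractional second.
        long epochSeconds = Math.floorDiv(ts.getTime(), 1000L);
        return Math.addExact(Math.multiplyExact(epochSeconds, 1_000_000L), ts.getNanos() / 1000L);
    }

    public static void main(String[] args) {
        Timestamp ts = new Timestamp(1527197700000L);
        ts.setNanos(123456000); // 123456 microseconds into the second
        System.out.println(ts.getTime());      // milliseconds, as written today
        System.out.println(fromTimestamp(ts)); // microseconds, as Kudu expects
    }
}
```

Going through getNanos() rather than multiplying getTime() by 1000 also preserves the microsecond digits that getTime() drops.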

Given all these issues, ImportExportFiles needs a lot more test coverage before we suggest its use. Currently it only tests importing Strings from a CSV and does not test Avro or Parquet support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)