Posted to issues@kudu.apache.org by "Grant Henke (JIRA)" <ji...@apache.org> on 2018/05/24 21:36:00 UTC

[jira] [Updated] (KUDU-2454) Avro Import/Export does not round trip

     [ https://issues.apache.org/jira/browse/KUDU-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke updated KUDU-2454:
------------------------------
    Affects Version/s: 1.5.0

> Avro Import/Export does not round trip
> --------------------------------------
>
>                 Key: KUDU-2454
>                 URL: https://issues.apache.org/jira/browse/KUDU-2454
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Grant Henke
>            Priority: Critical
>
> When exporting to Avro, columns of type Byte or Short are written as Integers because Avro has no Byte or Short type. When the data is re-imported, the job fails because the column types do not match.
> Ideally, spark-avro would solve this by safely casting the values back to the smaller type. Guava has utilities that make this straightforward (e.g. Shorts.checkedCast(i)). We could send a pull request to spark-avro to fix this, or add special handling on the Kudu side to perform the safe down-conversion.
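A minimal sketch of the "safe down-conversion" idea, written in plain Java so that it runs without Guava on the classpath. The class and method names (CheckedNarrowing, toShortExact, toByteExact) are hypothetical and are not part of Kudu or spark-avro; they only illustrate a range-checked cast of the kind that Guava's Shorts.checkedCast performs.

```java
final class CheckedNarrowing {
    private CheckedNarrowing() {}

    // Hypothetical helper: narrow an int (as read back from Avro) to a short,
    // throwing instead of silently wrapping when the value does not fit.
    static short toShortExact(int value) {
        short narrowed = (short) value;
        if (narrowed != value) {
            throw new IllegalArgumentException("Out of range for short: " + value);
        }
        return narrowed;
    }

    // Hypothetical helper: the same range check for byte columns.
    static byte toByteExact(int value) {
        byte narrowed = (byte) value;
        if (narrowed != value) {
            throw new IllegalArgumentException("Out of range for byte: " + value);
        }
        return narrowed;
    }

    public static void main(String[] args) {
        System.out.println(toShortExact(1234));
        System.out.println(toByteExact(-128));
        try {
            toShortExact(40000); // does not fit in 16 bits
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing on out-of-range values, rather than truncating, means a mismatch between the exported file and the target schema surfaces as an error instead of as corrupted rows.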
> Another type issue when exporting is that Decimal values are written as Strings instead of the decimal logical type. There are a few unmerged pull requests that fix this:
>  * [https://github.com/databricks/spark-avro/pull/276]
>  * [https://github.com/databricks/spark-avro/pull/121]
> Additionally, Timestamp values are written as longs instead of Timestamp logical types (timestamp-micros). This is a data corruption issue because the long [value that is output|https://github.com/databricks/spark-avro/blob/0764d699015975acf87dc5210cca8a43db84196a/src/main/scala/com/databricks/spark/avro/AvroOutputWriter.scala#L103] is in milliseconds (Timestamp.getTime()), whereas a Kudu Timestamp column expects the long value to be in microseconds.
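To make the unit mismatch concrete, here is a small illustrative sketch in plain Java. The class and method names (TimestampUnits, millisToMicros, readAsMicros) are hypothetical and do not come from Kudu or spark-avro, and the sample value is simply the send time of this message. It shows what happens when a millisecond value is read back as if it were microseconds, and the scaling that would be needed to avoid it.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class TimestampUnits {
    private TimestampUnits() {}

    // Hypothetical helper: scale epoch milliseconds to epoch microseconds,
    // throwing ArithmeticException on overflow instead of wrapping.
    static long millisToMicros(long epochMillis) {
        return Math.multiplyExact(epochMillis, 1000L);
    }

    // Interpret a raw long as microseconds since the epoch, which is how the
    // issue describes a Kudu Timestamp column reading its value.
    static Instant readAsMicros(long rawValue) {
        return Instant.EPOCH.plus(rawValue, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        // 2018-05-24T21:36:00Z expressed in milliseconds, as Timestamp.getTime() would return.
        long exportedMillis = 1527197760000L;

        // Read back unchanged: the value is treated as microseconds and lands in January 1970.
        System.out.println("unscaled: " + readAsMicros(exportedMillis));

        // Scaled to microseconds first: the original instant is recovered.
        System.out.println("scaled:   " + readAsMicros(millisToMicros(exportedMillis)));
    }
}
```

Because the unscaled value is still a valid long, nothing fails at import time; the rows simply carry timestamps that are off by a factor of 1000.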
> Given all these issues, ImportExportFiles needs a lot more test coverage before we suggest its use. Currently it only tests importing Strings from a CSV and does not test Avro or Parquet support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)