Posted to dev@avro.apache.org by "Almeida, Julius" <Ju...@intuit.com.INVALID> on 2020/10/16 18:25:49 UTC
Store timestamp with Logical Types
Hi Team,
I wish to store a timestamp (e.g. 2020-10-13 13:01:05) in Parquet using an Avro schema, so the files can be loaded into Hive.
I don't see any option to do so, so I pass it as a string, but Hive fails to read it because it expects a timestamp.
I am using ParquetIO from Apache Beam to write the output files in Parquet format.
Avro schema:
{
  "type": "record",
  "name": "TestRecord",
  "namespace": "com.test-v2.avro",
  "fields": [
    { "name": "reason", "type": "string" },
    { "name": "event", "type": "string" },
    { "name": "timestamp", "type": "string" }
  ]
}
Output event:
{
  "reason": "Invalid",
  "event": "dropped",
  "timestamp": "2020-10-13 13:01:05"
}
Hive schema:
  reason -> varchar(100),
  event -> varchar(100),
  timestamp -> timestamp
But when querying Hive, it fails to read the timestamp column, since the input event has string type.
I see this Jira in place: https://issues.apache.org/jira/browse/AVRO-2924
And I believe the code change could be made here: https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java
Correct me if I am wrong.
Thanks,
Julius
Re: Store timestamp with Logical Types
Posted by Jeremy Custenborder <jc...@gmail.com>.
Timestamps in Avro are represented as a long. Your best bet is to
convert the string to that long representation, for example
milliseconds since the epoch, then write it to Avro; there is a
logical type for exactly this. Hive will automatically read it as a
timestamp. If you already have the data in a Hive table as a string,
you can run a query that parses the string into an actual timestamp,
then write to Parquet. Long story short, I would change whatever is
writing the Avro file to use the proper timestamp encoding.
https://avro.apache.org/docs/current/spec.html#Timestamp+%28millisecond+precision%29
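The change the reply suggests can be sketched in Java (the class and method names below are illustrative, not from the thread): the `timestamp` field becomes a `long` carrying the `timestamp-millis` logical type, and the writer converts the string to epoch milliseconds before building the record. Interpreting the strings as UTC is an assumption; use whatever zone the source data is actually in.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampEncoding {

    // Sketch of the schema with the timestamp field switched from "string"
    // to a long annotated with the timestamp-millis logical type; this is
    // what Hive reads back as a timestamp column.
    static final String SCHEMA_JSON =
        "{\n"
      + "  \"type\": \"record\",\n"
      + "  \"name\": \"TestRecord\",\n"
      + "  \"fields\": [\n"
      + "    { \"name\": \"reason\", \"type\": \"string\" },\n"
      + "    { \"name\": \"event\", \"type\": \"string\" },\n"
      + "    { \"name\": \"timestamp\",\n"
      + "      \"type\": { \"type\": \"long\", \"logicalType\": \"timestamp-millis\" } }\n"
      + "  ]\n"
      + "}";

    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Convert "2020-10-13 13:01:05" to milliseconds since the epoch,
    // treating the string as UTC (an assumption; adjust the offset if
    // the source timestamps are in local time).
    static long toEpochMillis(String ts) {
        return LocalDateTime.parse(ts, FMT)
                            .toInstant(ZoneOffset.UTC)
                            .toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("2020-10-13 13:01:05")); // 1602594065000
    }
}
```

With the long value in hand, the record is written through ParquetIO as before, and Hive reads the column as a timestamp directly instead of failing on a string.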