Posted to issues@hive.apache.org by "Mithun Radhakrishnan (JIRA)" <ji...@apache.org> on 2016/09/19 20:54:21 UTC
[jira] [Updated] (HIVE-14789) Avro Table-reads bork when using SerDe-generated table-schema.
[ https://issues.apache.org/jira/browse/HIVE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mithun Radhakrishnan updated HIVE-14789:
----------------------------------------
Attachment: HIVE-14789-reproduce.patch
This attachment has a qfile-test that reproduces the error I'm talking about, including a scrubbed data-file that's readable with the schema-literal, but not without it.
This was a fairly common failure at Yahoo. Our current recommendation is for users to only use Avro tables with the schema-file with which they were produced. The metastore-based schema is to be ignored entirely.
I've already tried modifying how the Avro schema is generated from {{columns.list.types}}, but I find that the conversions (to and fro) are lossy, brittle and unreliable. :/
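To make the lossiness concrete, here is a minimal, hypothetical sketch (not Hive's actual code) of the round-trip: AvroSerDe infers Hive column types by unwrapping the {{["null", T]}} unions, so when a schema is later regenerated from {{columns.list.types}}, there is nothing left to say the field was a nullable union. The type names and mapping below are illustrative assumptions.

```python
import json

# Toy mapping between Avro primitive types and Hive column types.
AVRO_TO_HIVE = {"string": "string", "long": "bigint", "boolean": "boolean"}
HIVE_TO_AVRO = {v: k for k, v in AVRO_TO_HIVE.items()}

def avro_to_hive(avro_type):
    """Mimic the SerDe's inference: a ["null", T] union collapses to T."""
    if isinstance(avro_type, list):
        # Union: nullability is dropped at this step and never recorded.
        non_null = [t for t in avro_type if t != "null"]
        return avro_to_hive(non_null[0])
    return AVRO_TO_HIVE[avro_type]

def hive_to_avro(hive_type):
    """Naive reverse mapping: the original union cannot be reconstructed."""
    return HIVE_TO_AVRO[hive_type]

original = ["null", "string"]  # a nullable string field in the Avro schema
roundtrip = hive_to_avro(avro_to_hive(original))
print(json.dumps(original), "->", json.dumps(roundtrip))
```

The regenerated type is a plain {{"string"}}, not the {{["null", "string"]}} union the writer used, which is exactly the kind of incongruence that makes the reader fail.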
> Avro Table-reads bork when using SerDe-generated table-schema.
> --------------------------------------------------------------
>
> Key: HIVE-14789
> URL: https://issues.apache.org/jira/browse/HIVE-14789
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Affects Versions: 1.2.1, 2.0.1
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14789-reproduce.patch
>
>
> AvroSerDe allows one to skip the table-columns in a table-definition when creating a table, as long as the TBLPROPERTIES includes a valid {{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are inferred from processing the Avro schema file/literal.
> The problem is that the inferred schema might not be congruent with the actual schema in the Avro schema file/literal. Consider the following table definition:
> {code:sql}
> CREATE TABLE avro_schema_break_1
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
>   "type": "record",
>   "name": "Messages",
>   "namespace": "net.myth",
>   "fields": [
>     {
>       "name": "header",
>       "type": [
>         "null",
>         {
>           "type": "record",
>           "name": "HeaderInfo",
>           "fields": [
>             { "name": "inferred_event_type", "type": [ "null", "string" ], "default": null },
>             { "name": "event_type",          "type": [ "null", "string" ], "default": null },
>             { "name": "event_version",       "type": [ "null", "string" ], "default": null }
>           ]
>         }
>       ]
>     },
>     {
>       "name": "messages",
>       "type": {
>         "type": "array",
>         "items": {
>           "name": "MessageInfo",
>           "type": "record",
>           "fields": [
>             { "name": "message_id",    "type": [ "null", "string" ], "doc": "Message-ID" },
>             { "name": "received_date", "type": [ "null", "long" ],   "doc": "Received Date" },
>             { "name": "sent_date",     "type": [ "null", "long" ] },
>             { "name": "from_name",     "type": [ "null", "string" ] },
>             {
>               "name": "flags",
>               "type": [
>                 "null",
>                 {
>                   "type": "record",
>                   "name": "Flags",
>                   "fields": [
>                     { "name": "is_seen",    "type": [ "null", "boolean" ], "default": null },
>                     { "name": "is_read",    "type": [ "null", "boolean" ], "default": null },
>                     { "name": "is_flagged", "type": [ "null", "boolean" ], "default": null }
>                   ]
>                 }
>               ],
>               "default": null
>             }
>           ]
>         }
>       }
>     }
>   ]
> }');
> {code}
> This produces a table with the following schema:
> {noformat}
> 2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL: struct avro_schema_break_1 { struct<inferred_event_type:string,event_type:string,event_version:string> header, list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages}
> {noformat}
> Data written to this table using the Avro schema from {{avro.schema.literal}} (e.g. via Pig's {{AvroStorage}}) cannot then be read by Hive using the generated table schema. This is the exception one sees:
> {noformat}
> java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
> at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
> at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
> at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
> at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
> at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
> at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
> ...
> {noformat}
> The only way to read this table is with the original {{avro.schema.literal}} or {{avro.schema.url}} it was written with. This has implications for systems where data may be produced externally to Hive. It also affects table-replication via Falcon/GDM, in that the schema file/literal must be replicated along with the data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)