Posted to issues@hive.apache.org by "Mithun Radhakrishnan (JIRA)" <ji...@apache.org> on 2016/09/19 20:54:21 UTC
[jira] [Updated] (HIVE-14789) Avro Table-reads bork when using SerDe-generated table-schema.
[ https://issues.apache.org/jira/browse/HIVE-14789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mithun Radhakrishnan updated HIVE-14789:
----------------------------------------
Attachment: HIVE-14789-reproduce.patch
This attachment has a qfile-test that reproduces the error I'm talking about, including a scrubbed data-file that's readable with the schema-literal, but not without it.
This was a fairly common failure at Yahoo. Our current recommendation is for users to only use Avro tables with the schema-file with which they were produced. The metastore-based schema is to be ignored entirely.
I've already tried modifying how the Avro schema is generated from {{columns.list.types}}, but I find that the conversions (to and fro) are lossy, brittle and unreliable. :/
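To make the lossiness concrete, here is a minimal, hypothetical sketch (not Hive's actual code) of the round-trip: AvroSerDe infers Hive column types by unwrapping the {{["null", T]}} unions, so when a schema is later regenerated from {{columns.list.types}}, there is nothing left to say the field was a nullable union. The type names and mapping below are illustrative assumptions.

```python
import json

# Toy mapping between Avro primitive types and Hive column types.
AVRO_TO_HIVE = {"string": "string", "long": "bigint", "boolean": "boolean"}
HIVE_TO_AVRO = {v: k for k, v in AVRO_TO_HIVE.items()}

def avro_to_hive(avro_type):
    """Mimic the SerDe's inference: a ["null", T] union collapses to T."""
    if isinstance(avro_type, list):
        # Union: nullability is dropped at this step and never recorded.
        non_null = [t for t in avro_type if t != "null"]
        return avro_to_hive(non_null[0])
    return AVRO_TO_HIVE[avro_type]

def hive_to_avro(hive_type):
    """Naive reverse mapping: the original union cannot be reconstructed."""
    return HIVE_TO_AVRO[hive_type]

original = ["null", "string"]  # a nullable string field in the Avro schema
roundtrip = hive_to_avro(avro_to_hive(original))
print(json.dumps(original), "->", json.dumps(roundtrip))
```

The regenerated type is a plain {{"string"}}, not the {{["null", "string"]}} union the writer used, which is exactly the kind of incongruence that makes the reader fail.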
> Avro Table-reads bork when using SerDe-generated table-schema.
> --------------------------------------------------------------
>
> Key: HIVE-14789
> URL: https://issues.apache.org/jira/browse/HIVE-14789
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Affects Versions: 1.2.1, 2.0.1
> Reporter: Mithun Radhakrishnan
> Assignee: Mithun Radhakrishnan
> Attachments: HIVE-14789-reproduce.patch
>
>
> AvroSerDe allows one to skip the table-columns in a table-definition when creating a table, as long as the TBLPROPERTIES includes a valid {{avro.schema.url}} or {{avro.schema.literal}}. The table-columns are inferred from processing the Avro schema file/literal.
> The problem is that the inferred schema might not be congruent with the actual schema in the Avro schema file/literal. Consider the following table definition:
> {code:sql}
> CREATE TABLE avro_schema_break_1
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.literal'='{
>   "type": "record",
>   "name": "Messages",
>   "namespace": "net.myth",
>   "fields": [
>     {
>       "name": "header",
>       "type": [
>         "null",
>         {
>           "type": "record",
>           "name": "HeaderInfo",
>           "fields": [
>             { "name": "inferred_event_type", "type": [ "null", "string" ], "default": null },
>             { "name": "event_type",          "type": [ "null", "string" ], "default": null },
>             { "name": "event_version",       "type": [ "null", "string" ], "default": null }
>           ]
>         }
>       ]
>     },
>     {
>       "name": "messages",
>       "type": {
>         "type": "array",
>         "items": {
>           "name": "MessageInfo",
>           "type": "record",
>           "fields": [
>             { "name": "message_id",    "type": [ "null", "string" ], "doc": "Message-ID" },
>             { "name": "received_date", "type": [ "null", "long" ],   "doc": "Received Date" },
>             { "name": "sent_date",     "type": [ "null", "long" ] },
>             { "name": "from_name",     "type": [ "null", "string" ] },
>             {
>               "name": "flags",
>               "type": [
>                 "null",
>                 {
>                   "type": "record",
>                   "name": "Flags",
>                   "fields": [
>                     { "name": "is_seen",    "type": [ "null", "boolean" ], "default": null },
>                     { "name": "is_read",    "type": [ "null", "boolean" ], "default": null },
>                     { "name": "is_flagged", "type": [ "null", "boolean" ], "default": null }
>                   ]
>                 }
>               ],
>               "default": null
>             }
>           ]
>         }
>       }
>     }
>   ]
> }');
> {code}
> This produces a table with the following schema:
> {noformat}
> 2016-09-19T13:23:42,934 DEBUG [0ce7e586-13ea-4390-ac2a-6dac36e8a216 main] hive.log: DDL: struct avro_schema_break_1 { struct<inferred_event_type:string,event_type:string,event_version:string> header, list<struct<message_id:string,received_date:i64,sent_date:i64,from_name:string,flags:struct<is_seen:bool,is_read:bool,is_flagged:bool>>> messages}
> {noformat}
> Data written to this table using the Avro schema from {{avro.schema.literal}} (e.g. via Pig's {{AvroStorage}}) cannot then be read by Hive using the generated table schema. This is the exception one sees:
> {noformat}
> java.io.IOException: org.apache.avro.AvroTypeException: Found net.myth.HeaderInfo, expecting union
> at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:521)
> at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
> at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
> at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2019)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:253)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:336)
> at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:1162)
> at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:1136)
> at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:172)
> at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:104)
> at org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver(TestCliDriver.java:59)
> ...
> {noformat}
> The only way to read this table is with the original {{avro.schema.literal}} or {{avro.schema.url}} it was written with. This has implications for systems where data may be produced externally to Hive. It also affects table-replication via Falcon/GDM, in that the schema file/literal must be replicated along with the data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)