Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2016/06/22 14:13:57 UTC

[jira] [Updated] (SPARK-13572) HiveContext reads avro Hive tables incorrectly

     [ https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-13572:
-------------------------------
    Description: 
I am using PySpark to read Avro-based tables from Hive. The tables can be read, but some of the columns are read incorrectly: they show {{None}} instead of the actual value.

{noformat}
>>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest where year=2016 and month=2 and day=29 limit 3""")
>>> results_df.take(3)
[Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, statvalue=None, displayname=None, category=None, source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
 Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, statvalue=None, displayname=None, category=None, source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
 Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, statvalue=None, displayname=None, category=None, source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)]
{noformat}

Note the {{None}} values in most of the fields. Surprisingly, not every field is affected: {{source_filename}}, {{year}}, {{month}} and {{day}} are read correctly, while the rest show {{None}} instead of the real values. Nothing in the table definition sets the affected columns apart.
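
One quick way to list the affected columns is to check which fields are {{None}} in every returned row. The sketch below is only an illustration and not part of the original report: {{all_none_columns}} and {{sample_rows}} are hypothetical names, the sample mirrors just a few fields of the output above, and it assumes the rows have first been converted to plain dicts, for example with {{Row.asDict()}}.

```python
def all_none_columns(rows):
    """Return the column names whose value is None in every row.

    `rows` is a list of dicts, e.g. [r.asDict() for r in results_df.take(3)].
    """
    if not rows:
        return []
    # Walk the columns of the first row and keep those that are None everywhere.
    return [name for name in rows[0]
            if all(row.get(name) is None for row in rows)]


# Hypothetical sample mirroring a few fields of the PySpark output above.
sample_rows = [
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
    {"kafkaoffset": None, "uuid": None,
     "source_filename": "ops-20160228_23_35_01.gz", "year": 2016},
]

suspect_columns = all_none_columns(sample_rows)
print(suspect_columns)
```

A column that is legitimately null in the sampled rows would also be reported, so the result is only a starting point for comparing against the Hive output.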

Running the same query in Hive:

{noformat}
c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where year=2016 and month=2 and day=29 limit 3;
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| opsconsole_ingest.kafkaoffsetgeneration  | opsconsole_ingest.kafkapartition  | opsconsole_ingest.kafkaoffset  |      opsconsole_ingest.uuid       |         opsconsole_ingest.mid         |         opsconsole_ingest.iid         | opsconsole_ingest.product  | opsconsole_ingest.utctime  | opsconsole_ingest.statcode  | opsconsole_ingest.statvalue  | opsconsole_ingest.displayname  | opsconsole_ingest.category  | opsconsole_ingest.source_filename  | opsconsole_ingest.year  | opsconsole_ingest.month  | opsconsole_ingest.day  |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| 11.0                                     | 0.0                               | 3.83399394E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 8                           | 3.0 SP11 (8.110.7601.18923)  | MSXML 3.0 Version              | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399395E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 2                           | GenuineIntel                 | CPU Vendor                     | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399396E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 141                         | 4                            | Screens                        | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
3 rows selected (1.252 seconds)
{noformat}

The attached logs show that Spark generates no errors or warnings.

The table definition is also attached.


  was:
I am using PySpark to read avro-based tables from Hive and while the avro tables can be read, some of the columns are incorrectly read - showing value "None" instead of the actual value.

>>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest where year=2016 and month=2 and day=29 limit 3""")
>>> results_df.take(3)
[Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, statvalue=None, displayname=None, category=None, source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
 Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, statvalue=None, displayname=None, category=None, source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29),
 Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, statvalue=None, displayname=None, category=None, source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)]

Observe the "None" values at most of the fields. Surprisingly not all fields, only some of them are showing "None" instead of the real values. The table definition does not show anything specific about these columns.

Running the same query in Hive:
c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where year=2016 and month=2 and day=29 limit 3;
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| opsconsole_ingest.kafkaoffsetgeneration  | opsconsole_ingest.kafkapartition  | opsconsole_ingest.kafkaoffset  |      opsconsole_ingest.uuid       |         opsconsole_ingest.mid         |         opsconsole_ingest.iid         | opsconsole_ingest.product  | opsconsole_ingest.utctime  | opsconsole_ingest.statcode  | opsconsole_ingest.statvalue  | opsconsole_ingest.displayname  | opsconsole_ingest.category  | opsconsole_ingest.source_filename  | opsconsole_ingest.year  | opsconsole_ingest.month  | opsconsole_ingest.day  |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
| 11.0                                     | 0.0                               | 3.83399394E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 8                           | 3.0 SP11 (8.110.7601.18923)  | MSXML 3.0 Version              | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399395E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 2                           | GenuineIntel                 | CPU Vendor                     | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
| 11.0                                     | 0.0                               | 3.83399396E8                   | EF0D03C409681B98646F316CA1088973  | 174f53fb-ca9b-d3f9-64e1-7631bf906817  | 00000000-0000-0000-0000-000000000000  | est                        | 2016-01-13T06:58:19        | 141                         | 4                            | Screens                        | PC Information              | ops-20160228_23_35_01.gz           | 2016                    | 2                        | 29                     |
+------------------------------------------+-----------------------------------+--------------------------------+-----------------------------------+---------------------------------------+---------------------------------------+----------------------------+----------------------------+-----------------------------+------------------------------+--------------------------------+-----------------------------+------------------------------------+-------------------------+--------------------------+------------------------+--+
3 rows selected (1.252 seconds)

Attached shows that no error or warning logs are generated by Spark.
Also the table definition is attached.



> HiveContext reads avro Hive tables incorrectly 
> -----------------------------------------------
>
>                 Key: SPARK-13572
>                 URL: https://issues.apache.org/jira/browse/SPARK-13572
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.5.2, 1.6.0, 1.6.1
>         Environment: Hive 0.13.1, Spark 1.5.2
>            Reporter: Zoltan Fedor
>         Attachments: logs, table_definition


