Posted to commits@airflow.apache.org by "Siddharth Anand (JIRA)" <ji...@apache.org> on 2018/05/15 17:54:00 UTC

[jira] [Resolved] (AIRFLOW-2452) Document field_dict for HiveCliHook.load_file must be OrderedDict

     [ https://issues.apache.org/jira/browse/AIRFLOW-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Anand resolved AIRFLOW-2452.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

Issue resolved by pull request #3347
[https://github.com/apache/incubator-airflow/pull/3347]

> Document field_dict for HiveCliHook.load_file must be OrderedDict
> -----------------------------------------------------------------
>
>                 Key: AIRFLOW-2452
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2452
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: docs, Documentation, hive_hooks, hooks
>            Reporter: Kengo Seki
>            Assignee: Kengo Seki
>            Priority: Major
>             Fix For: 2.0.0
>
>
> HiveCliHook.load_file has a parameter called field_dict, which defines name-type pairs for the table's columns. It must be an OrderedDict; otherwise the columns can be created in an arbitrary order and users can get unexpected results. Example:
> Given the following input file:
> {code}
> $ head /tmp/baby_names.csv
> 1880,John,0.081541,boy
> 1880,William,0.080511,boy
> 1880,James,0.050057,boy
> 1880,Charles,0.045167,boy
> 1880,George,0.043292,boy
> 1880,Frank,0.02738,boy
> 1880,Joseph,0.022229,boy
> 1880,Thomas,0.021401,boy
> 1880,Henry,0.020641,boy
> {code}
> Load the file via HiveCliHook.load_file with field_dict as a normal dict:
> {code}
> In [1]: from airflow.hooks.hive_hooks import HiveCliHook
> In [2]: hook = HiveCliHook()
> [2018-05-10 19:49:31,819] {base_hook.py:85} INFO - Using connection to: localhost
> In [3]: field_dict = {
>    ...:     "year": "INT",
>    ...:     "name": "STRING",
>    ...:     "pct": "DOUBLE",
>    ...:     "sex": "STRING",
>    ...: }
> In [4]: hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True)
> [2018-05-10 19:51:53,854] {hive_hooks.py:424} INFO - DROP TABLE IF EXISTS baby_names;
> CREATE TABLE IF NOT EXISTS baby_names (
> sex STRING,
>     name STRING,
>     pct DOUBLE,
>     year INT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS textfile
> ;
> (snip)
> [2018-05-10 19:52:17,965] {hive_hooks.py:232} INFO - Table default.baby_names stats: [numFiles=1, numRows=0, totalSize=1289, rawDataSize=0]
> [2018-05-10 19:52:17,966] {hive_hooks.py:232} INFO - OK
> [2018-05-10 19:52:17,967] {hive_hooks.py:232} INFO - Time taken: 1.349 seconds
> {code}
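> The file is loaded, but the columns in the CREATE TABLE statement are out of order (sex, name, pct, year instead of year, name, pct, sex). Hive maps the file's fields to columns by position, so the value "boy" lands in the INT column year and is read back as NULL: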
> {code}
> hive> SELECT * FROM baby_names LIMIT 10;
> OK
> 1880    John    0.081541        NULL
> 1880    William 0.080511        NULL
> 1880    James   0.050057        NULL
> 1880    Charles 0.045167        NULL
> 1880    George  0.043292        NULL
> 1880    Frank   0.02738 NULL
> 1880    Joseph  0.022229        NULL
> 1880    Thomas  0.021401        NULL
> 1880    Henry   0.020641        NULL
> 1880    Robert  0.020404        NULL
> Time taken: 2.465 seconds, Fetched: 10 row(s)
> {code}
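
The workaround implied by the report can be sketched as follows: build field_dict as an OrderedDict whose key order matches the field order in the CSV file. The helper render_columns below is hypothetical and is not part of Airflow; it only mimics the layout of the column list in the logged CREATE TABLE statement, to show that the rendered order follows the dict's iteration order. On Python 3.7 and later a plain dict also preserves insertion order, but interpreters in common use when this issue was filed did not guarantee that.

```python
from collections import OrderedDict

# Key order must match the order of the fields in /tmp/baby_names.csv.
field_dict = OrderedDict([
    ("year", "INT"),
    ("name", "STRING"),
    ("pct", "DOUBLE"),
    ("sex", "STRING"),
])


def render_columns(fields):
    """Hypothetical helper (not an Airflow API): join name/type pairs
    in the mapping's iteration order, one column per line."""
    return ",\n    ".join(
        "{} {}".format(name, hive_type) for name, hive_type in fields.items()
    )


print(render_columns(field_dict))

# With Airflow installed and a reachable Hive, the call from the report
# would then be (not executed in this sketch):
# hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names",
#                field_dict=field_dict, recreate=True)
```

Passing a list of pairs to OrderedDict, rather than a dict literal, keeps the declared order intact even on interpreters where dict literals are unordered.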


