Posted to commits@airflow.apache.org by "Kengo Seki (JIRA)" <ji...@apache.org> on 2018/05/11 00:02:00 UTC

[jira] [Created] (AIRFLOW-2452) Document field_dict for HiveCliHook.load_file must be OrderedDict

Kengo Seki created AIRFLOW-2452:
-----------------------------------

             Summary: Document field_dict for HiveCliHook.load_file must be OrderedDict
                 Key: AIRFLOW-2452
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2452
             Project: Apache Airflow
          Issue Type: Improvement
          Components: docs, Documentation, hive_hooks, hooks
            Reporter: Kengo Seki
            Assignee: Kengo Seki


The field_dict parameter of HiveCliHook.load_file, which defines name-type pairs for columns, must be an OrderedDict. Otherwise, users can get unexpected results. Example:

Given the following input file:

{code}
$ head /tmp/baby_names.csv
1880,John,0.081541,boy
1880,William,0.080511,boy
1880,James,0.050057,boy
1880,Charles,0.045167,boy
1880,George,0.043292,boy
1880,Frank,0.02738,boy
1880,Joseph,0.022229,boy
1880,Thomas,0.021401,boy
1880,Henry,0.020641,boy
{code}

Load the file via HiveCliHook.load_file with field_dict as a normal dict:

{code}
In [1]: from airflow.hooks.hive_hooks import HiveCliHook

In [2]: hook = HiveCliHook()
[2018-05-10 19:49:31,819] {base_hook.py:85} INFO - Using connection to: localhost

In [3]: field_dict = {
   ...:     "year": "INT",
   ...:     "name": "STRING",
   ...:     "pct": "DOUBLE",
   ...:     "sex": "STRING",
   ...: }

In [4]: hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True)
[2018-05-10 19:51:53,854] {hive_hooks.py:424} INFO - DROP TABLE IF EXISTS baby_names;
CREATE TABLE IF NOT EXISTS baby_names (
sex STRING,
    name STRING,
    pct DOUBLE,
    year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS textfile
;

(snip)

[2018-05-10 19:52:17,965] {hive_hooks.py:232} INFO - Table default.baby_names stats: [numFiles=1, numRows=0, totalSize=1289, rawDataSize=0]
[2018-05-10 19:52:17,966] {hive_hooks.py:232} INFO - OK
[2018-05-10 19:52:17,967] {hive_hooks.py:232} INFO - Time taken: 1.349 seconds
{code}

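The file is loaded, but the columns in the CREATE TABLE statement are not in the order given in field_dict. Because Hive maps the delimited fields to columns by position, the data is not read back correctly. For example, the last column (year INT) receives the value "boy" and comes back as NULL: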

{code}
hive> SELECT * FROM baby_names LIMIT 10;
OK
1880    John    0.081541        NULL
1880    William 0.080511        NULL
1880    James   0.050057        NULL
1880    Charles 0.045167        NULL
1880    George  0.043292        NULL
1880    Frank   0.02738 NULL
1880    Joseph  0.022229        NULL
1880    Thomas  0.021401        NULL
1880    Henry   0.020641        NULL
1880    Robert  0.020404        NULL
Time taken: 2.465 seconds, Fetched: 10 row(s)
{code}
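
For comparison, here is a minimal sketch of building field_dict as an OrderedDict so that the columns iterate in the same order as the fields in the file. The render_columns helper is hypothetical and only illustrates how iteration order turns into column order; it is not Airflow's actual implementation. On Python 3.7 and later a plain dict also preserves insertion order, but that is not guaranteed on older interpreters.

```python
from collections import OrderedDict

# Column order must match the field order in /tmp/baby_names.csv.
field_dict = OrderedDict([
    ("year", "INT"),
    ("name", "STRING"),
    ("pct", "DOUBLE"),
    ("sex", "STRING"),
])


def render_columns(fields):
    """Hypothetical helper (not part of Airflow): join name-type pairs
    in iteration order, similar in spirit to a CREATE TABLE column list."""
    return ",\n    ".join(
        "{} {}".format(name, type_) for name, type_ in fields.items()
    )


print(render_columns(field_dict))

# With a working Hive connection, the call from the report would then be:
# hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names",
#                field_dict=field_dict, recreate=True)
```

With the OrderedDict, the columns are listed as year, name, pct, sex, which matches the layout of the input file.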


