Posted to commits@airflow.apache.org by "Kengo Seki (JIRA)" <ji...@apache.org> on 2018/05/11 00:02:00 UTC
[jira] [Created] (AIRFLOW-2452) Document field_dict for HiveCliHook.load_file must be OrderedDict
Kengo Seki created AIRFLOW-2452:
-----------------------------------
Summary: Document field_dict for HiveCliHook.load_file must be OrderedDict
Key: AIRFLOW-2452
URL: https://issues.apache.org/jira/browse/AIRFLOW-2452
Project: Apache Airflow
Issue Type: Improvement
Components: docs, Documentation, hive_hooks, hooks
Reporter: Kengo Seki
Assignee: Kengo Seki
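HiveCliHook.load_file has a parameter called field_dict, which defines name-type pairs for columns. It must be an OrderedDict, but this is not documented. If a normal dict is passed, the columns can be created in a different order from the input file and users get unexpected results. Example: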
Given the following input file:
{code}
$ head /tmp/baby_names.csv
1880,John,0.081541,boy
1880,William,0.080511,boy
1880,James,0.050057,boy
1880,Charles,0.045167,boy
1880,George,0.043292,boy
1880,Frank,0.02738,boy
1880,Joseph,0.022229,boy
1880,Thomas,0.021401,boy
1880,Henry,0.020641,boy
{code}
Load the file via HiveCliHook.load_file with field_dict as a normal dict:
{code}
In [1]: from airflow.hooks.hive_hooks import HiveCliHook
In [2]: hook = HiveCliHook()
[2018-05-10 19:49:31,819] {base_hook.py:85} INFO - Using connection to: localhost
In [3]: field_dict = {
...: "year": "INT",
...: "name": "STRING",
...: "pct": "DOUBLE",
...: "sex": "STRING",
...: }
In [4]: hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True)
[2018-05-10 19:51:53,854] {hive_hooks.py:424} INFO - DROP TABLE IF EXISTS baby_names;
CREATE TABLE IF NOT EXISTS baby_names (
sex STRING,
name STRING,
pct DOUBLE,
year INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS textfile
;
(snip)
[2018-05-10 19:52:17,965] {hive_hooks.py:232} INFO - Table default.baby_names stats: [numFiles=1, numRows=0, totalSize=1289, rawDataSize=0]
[2018-05-10 19:52:17,966] {hive_hooks.py:232} INFO - OK
[2018-05-10 19:52:17,967] {hive_hooks.py:232} INFO - Time taken: 1.349 seconds
{code}
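The file is loaded, but the columns in the CREATE TABLE statement are not in the order in which field_dict was written, so they no longer line up with the fields in the input file. The year values end up in the sex column, and the year column receives the non-numeric sex values, so it is NULL when selected from Hive: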
{code}
hive> SELECT * FROM baby_names LIMIT 10;
OK
1880 John 0.081541 NULL
1880 William 0.080511 NULL
1880 James 0.050057 NULL
1880 Charles 0.045167 NULL
1880 George 0.043292 NULL
1880 Frank 0.02738 NULL
1880 Joseph 0.022229 NULL
1880 Thomas 0.021401 NULL
1880 Henry 0.020641 NULL
1880 Robert 0.020404 NULL
Time taken: 2.465 seconds, Fetched: 10 row(s)
{code}
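For comparison, here is a minimal sketch of the same field_dict written as an OrderedDict. The build_create_table helper is hypothetical and is not part of Airflow. It only assumes that the column list is produced by iterating over field_dict.items(), which is consistent with the CREATE TABLE output logged above. Note that plain dicts keep insertion order only from Python 3.7 onwards, so the OrderedDict is what makes the order reliable on older interpreters.

```python
from collections import OrderedDict


def build_create_table(table, field_dict):
    """Hypothetical helper (not Airflow's API): render the column list
    by iterating over field_dict in its own iteration order."""
    fields = ",\n    ".join(
        "{} {}".format(name, col_type) for name, col_type in field_dict.items()
    )
    return "CREATE TABLE IF NOT EXISTS {} (\n    {})".format(table, fields)


# Columns listed in the same order as the fields in /tmp/baby_names.csv.
# A list of pairs is used so the order is fixed on every Python version.
field_dict = OrderedDict([
    ("year", "INT"),
    ("name", "STRING"),
    ("pct", "DOUBLE"),
    ("sex", "STRING"),
])

ddl = build_create_table("baby_names", field_dict)
print(ddl)
```

Passing this field_dict to hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True) should then create the columns in the order year, name, pct, sex, matching the input file.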