Posted to commits@airflow.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/05/15 17:54:00 UTC
[jira] [Commented] (AIRFLOW-2452) Document field_dict for
HiveCliHook.load_file must be OrderedDict
[ https://issues.apache.org/jira/browse/AIRFLOW-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16476259#comment-16476259 ]
ASF subversion and git services commented on AIRFLOW-2452:
----------------------------------------------------------
Commit 648b14b4d95bf3aca26e8b54ffe8585b52efc8fd in incubator-airflow's branch refs/heads/master from [~sekikn]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=648b14b ]
[AIRFLOW-2452] Document field_dict must be OrderedDict
field_dict, the HiveCliHook.load_file parameter
that defines name-type pairs for columns,
must be an OrderedDict to preserve column
order, but this is undocumented.
This PR adds a note about that and fixes the
HiveCliHook.load_df function, which calls
load_file internally.
Closes #3347 from sekikn/AIRFLOW-2452
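As a minimal sketch of the usage the note describes, the snippet below builds field_dict as an OrderedDict so that the column order matches the input file. The build_create_table helper is hypothetical and only illustrates how a column list follows the mapping's iteration order; it is not Airflow's implementation. The load_file call is left as a comment because it needs a working Hive connection.

```python
from collections import OrderedDict


def build_create_table(table, field_dict):
    """Hypothetical helper (not Airflow code): emit columns in iteration order."""
    columns = ",\n".join(
        "    {} {}".format(name, col_type) for name, col_type in field_dict.items()
    )
    return "CREATE TABLE IF NOT EXISTS {} (\n{})".format(table, columns)


# An OrderedDict built from a list of pairs keeps the declared order,
# which must match the column order of the file being loaded.
field_dict = OrderedDict([
    ("year", "INT"),
    ("name", "STRING"),
    ("pct", "DOUBLE"),
    ("sex", "STRING"),
])

ddl = build_create_table("baby_names", field_dict)
print(ddl)

# With a configured Hive connection, the same mapping would be passed as:
# hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names",
#                field_dict=field_dict, recreate=True)
```

Building the OrderedDict from a list of tuples, rather than from a dict literal, matters on interpreters where plain dicts do not guarantee insertion order.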
> Document field_dict for HiveCliHook.load_file must be OrderedDict
> -----------------------------------------------------------------
>
> Key: AIRFLOW-2452
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2452
> Project: Apache Airflow
> Issue Type: Improvement
> Components: docs, Documentation, hive_hooks, hooks
> Reporter: Kengo Seki
> Assignee: Kengo Seki
> Priority: Major
> Fix For: 2.0.0
>
>
> HiveCliHook.load_file has a parameter called field_dict, which defines name-type pairs for columns. It must be an OrderedDict; otherwise, users can get unexpected results. Example:
> Given the following input file:
> {code}
> $ head /tmp/baby_names.csv
> 1880,John,0.081541,boy
> 1880,William,0.080511,boy
> 1880,James,0.050057,boy
> 1880,Charles,0.045167,boy
> 1880,George,0.043292,boy
> 1880,Frank,0.02738,boy
> 1880,Joseph,0.022229,boy
> 1880,Thomas,0.021401,boy
> 1880,Henry,0.020641,boy
> {code}
> Load the file via HiveCliHook.load_file with field_dict as a normal dict:
> {code}
> In [1]: from airflow.hooks.hive_hooks import HiveCliHook
> In [2]: hook = HiveCliHook()
> [2018-05-10 19:49:31,819] {base_hook.py:85} INFO - Using connection to: localhost
> In [3]: field_dict = {
> ...: "year": "INT",
> ...: "name": "STRING",
> ...: "pct": "DOUBLE",
> ...: "sex": "STRING",
> ...: }
> In [4]: hook.load_file(filepath="/tmp/baby_names.csv", table="baby_names", field_dict=field_dict, recreate=True)
> [2018-05-10 19:51:53,854] {hive_hooks.py:424} INFO - DROP TABLE IF EXISTS baby_names;
> CREATE TABLE IF NOT EXISTS baby_names (
> sex STRING,
> name STRING,
> pct DOUBLE,
> year INT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS textfile
> ;
> (snip)
> [2018-05-10 19:52:17,965] {hive_hooks.py:232} INFO - Table default.baby_names stats: [numFiles=1, numRows=0, totalSize=1289, rawDataSize=0]
> [2018-05-10 19:52:17,966] {hive_hooks.py:232} INFO - OK
> [2018-05-10 19:52:17,967] {hive_hooks.py:232} INFO - Time taken: 1.349 seconds
> {code}
> The file is loaded, but the columns in the CREATE TABLE statement are out of order, so the loaded data does not map to the intended columns when selected from Hive:
> {code}
> hive> SELECT * FROM baby_names LIMIT 10;
> OK
> 1880 John 0.081541 NULL
> 1880 William 0.080511 NULL
> 1880 James 0.050057 NULL
> 1880 Charles 0.045167 NULL
> 1880 George 0.043292 NULL
> 1880 Frank 0.02738 NULL
> 1880 Joseph 0.022229 NULL
> 1880 Thomas 0.021401 NULL
> 1880 Henry 0.020641 NULL
> 1880 Robert 0.020404 NULL
> Time taken: 2.465 seconds, Fetched: 10 row(s)
> {code}
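The NULL in the last column can be explained by positional reading: the file's fields are assigned to the table's columns by position, so "boy" lands in the column declared as INT. The sketch below imitates that behaviour in plain Python. It assumes, as the output above suggests, that a value which cannot be parsed as the column's type is read back as NULL; the cast and read_row helpers are hypothetical stand-ins, not Hive or Airflow code.

```python
def cast(value, col_type):
    """Rough stand-in for reading a text field as a typed column.

    Assumption: an unparseable value is returned as None (NULL).
    """
    try:
        if col_type == "INT":
            return int(value)
        if col_type == "DOUBLE":
            return float(value)
        return value  # STRING columns accept any text
    except ValueError:
        return None


def read_row(line, columns):
    """Assign comma-separated fields to columns strictly by position."""
    return [cast(value, col_type)
            for value, (_, col_type) in zip(line.split(","), columns)]


row = "1880,John,0.081541,boy"

# Column order produced by the plain dict in the report above.
disordered = [("sex", "STRING"), ("name", "STRING"), ("pct", "DOUBLE"), ("year", "INT")]
# Column order that matches the file.
intended = [("year", "INT"), ("name", "STRING"), ("pct", "DOUBLE"), ("sex", "STRING")]

print(read_row(row, disordered))  # ['1880', 'John', 0.081541, None]
print(read_row(row, intended))    # [1880, 'John', 0.081541, 'boy']
```

With the disordered definition, the year is stored in the sex column as a string and "boy" fails to parse as an INT, which matches the 1880 / John / 0.081541 / NULL rows shown in the query result.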
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)