Posted to commits@airflow.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/05/10 08:36:00 UTC

[jira] [Commented] (AIRFLOW-2441) Fix bugs in HiveCliHook.load_df

    [ https://issues.apache.org/jira/browse/AIRFLOW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470074#comment-16470074 ] 

ASF subversion and git services commented on AIRFLOW-2441:
----------------------------------------------------------

Commit 74027c9a6ba5f54a7b6392f6dd79d5b8a8782d7b in incubator-airflow's branch refs/heads/master from [~sekikn]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=74027c9 ]

[AIRFLOW-2441] Fix bugs in HiveCliHook.load_df

This PR fixes HiveCliHook.load_df to:

1. encode the delimiter with the specified encoding
   before passing it to pandas.DataFrame.to_csv,
   so that the call does not fail

2. flush the output file written by
   pandas.DataFrame.to_csv before executing the
   LOAD DATA statement

3. omit the header and row index from the file
   written by pandas.DataFrame.to_csv, so that
   Hive reads it as expected
Closes #3334 from sekikn/AIRFLOW-2441


> Fix bugs in HiveCliHook.load_df
> -------------------------------
>
>                 Key: AIRFLOW-2441
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2441
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: hive_hooks, hooks
>            Reporter: Kengo Seki
>            Assignee: Kengo Seki
>            Priority: Major
>             Fix For: 1.10.0
>
>
> {{HiveCliHook.load_df}} has several bugs and currently does not work.
> 1. Executing it fails as follows:
> {code}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
> In [3]: from airflow.hooks.hive_hooks import HiveCliHook
> In [4]: hook = HiveCliHook()
> [2018-05-08 06:38:19,211] {base_hook.py:85} INFO - Using connection to: localhost
> In [5]: hook.load_df(df, "t")
> (snip)
> TypeError: "delimiter" must be string, not unicode
> {code}
> To solve this, the "delimiter" parameter should be encoded using the "encoding" parameter, which is declared but currently unused.
> 2. For a small dataset, it loads an empty file into Hive:
> {code}
> In [1]: import pandas as pd
>    ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
>    ...: from airflow.hooks.hive_hooks import HiveCliHook
>    ...: hook = HiveCliHook()
>    ...: hook.load_df(df, "t")
>    ...:
> (snip)
> [2018-05-08 20:46:48,883] {hive_hooks.py:231} INFO - Loading data to table default.t
> [2018-05-08 20:46:49,448] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
> {code}
> {code}
> hive> SELECT count(*) FROM t;
> (snip)
> OK
> 0
> Time taken: 4.962 seconds, Fetched: 1 row(s)
> {code}
> This is because the file contents are still in the buffer when the LOAD DATA statement is executed. They should be flushed first, just as {{HiveCliHook.run_cli}} does.
> 3. Even with fixes for #1 and #2, unexpected data is loaded into Hive:
> {code}
> In [1]: import pandas as pd
>    ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
>    ...: from airflow.hooks.hive_hooks import HiveCliHook
>    ...: hook = HiveCliHook()
>    ...: hook.load_df(df, "t")
>    ...:
> (snip)
> [2018-05-08 20:57:17,467] {hive_hooks.py:231} INFO - Loading data to table default.t
> [2018-05-08 20:57:18,163] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=21, rawDataSize=0]
> {code}
> {code}
> hive> SELECT * FROM t;
> OK
> 0
> 1
> 2
> Time taken: 2.317 seconds, Fetched: 4 row(s)
> {code}
> This is because {{pandas.DataFrame.to_csv}} writes a header row and the row index to the file by default.
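
As a small illustration of the third bug (not taken from the issue or the commit), the sketch below compares the default output of pandas.DataFrame.to_csv with the output when header=False and index=False are passed, using the same three-row frame as the report. The variable names are illustrative only. With one byte per line terminator, the default output comes to 21 bytes, which appears to match the totalSize=21 in the log above.

```python
import pandas as pd

df = pd.DataFrame({"c": ["foo", "bar", "baz"]})

# Default behaviour: a header line plus a leading row-index column.
default_lines = df.to_csv().splitlines()

# Data only: what a single-column Hive table would expect to read.
bare_lines = df.to_csv(header=False, index=False).splitlines()

# Size of the default output, counting one newline byte per line.
default_size = sum(len(line) + 1 for line in default_lines)

print(default_lines)
print(bare_lines)
print(default_size)
```

Because the index is the first field on every line, a single-column table reads the index values rather than the data, which is consistent with the 0, 1, 2 rows in the query result.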



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)