Posted to commits@airflow.apache.org by "Aizhamal Nurmamat kyzy (JIRA)" <ji...@apache.org> on 2019/05/17 20:29:04 UTC

[jira] [Updated] (AIRFLOW-2441) Fix bugs in HiveCliHook.load_df

     [ https://issues.apache.org/jira/browse/AIRFLOW-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aizhamal Nurmamat kyzy updated AIRFLOW-2441:
--------------------------------------------
    Labels: hive hive-hooks  (was: )

Moving hive_hooks to hooks, adding hooks label for component refactor.

> Fix bugs in HiveCliHook.load_df
> -------------------------------
>
>                 Key: AIRFLOW-2441
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2441
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: hive_hooks, hooks
>            Reporter: Kengo Seki
>            Assignee: Kengo Seki
>            Priority: Major
>              Labels: hive, hive-hooks
>             Fix For: 1.10.0
>
>
> {{HiveCliHook.load_df}} has several bugs and currently does not work.
> 1. Executing it fails as follows:
> {code}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
> In [3]: from airflow.hooks.hive_hooks import HiveCliHook
> In [4]: hook = HiveCliHook()
> [2018-05-08 06:38:19,211] {base_hook.py:85} INFO - Using connection to: localhost
> In [5]: hook.load_df(df, "t")
> (snip)
> TypeError: "delimiter" must be string, not unicode
> {code}
> To solve this, the "delimiter" parameter should be encoded using the "encoding" parameter, which is currently declared but unused.
> 2. For a small dataset, it loads an empty file into Hive:
> {code}
> In [1]: import pandas as pd
>    ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
>    ...: from airflow.hooks.hive_hooks import HiveCliHook
>    ...: hook = HiveCliHook()
>    ...: hook.load_df(df, "t")
>    ...:
> (snip)
> [2018-05-08 20:46:48,883] {hive_hooks.py:231} INFO - Loading data to table default.t
> [2018-05-08 20:46:49,448] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
> {code}
> {code}
> hive> SELECT count(*) FROM t;
> (snip)
> OK
> 0
> Time taken: 4.962 seconds, Fetched: 1 row(s)
> {code}
> This is because the file contents are still in the write buffer when the LOAD DATA statement is executed. The buffer should be flushed first, just as {{HiveCliHook.run_cli}} does.
> 3. Even with fixes for #1 and #2, unexpected data is loaded into Hive:
> {code}
> In [1]: import pandas as pd
>    ...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
>    ...: from airflow.hooks.hive_hooks import HiveCliHook
>    ...: hook = HiveCliHook()
>    ...: hook.load_df(df, "t")
>    ...:
> (snip)
> [2018-05-08 20:57:17,467] {hive_hooks.py:231} INFO - Loading data to table default.t
> [2018-05-08 20:57:18,163] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=21, rawDataSize=0]
> {code}
> {code}
> hive> SELECT * FROM t;
> OK
> 0
> 1
> 2
> Time taken: 2.317 seconds, Fetched: 4 row(s)
> {code}
> This is because {{pandas.DataFrame.to_csv}} writes the row index to the file by default.
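
The row-index behaviour described in item 3 of the quoted report can be reproduced without Hive. The following is only a minimal sketch, not the actual patch: it compares the default {{to_csv}} output with the output when the index is suppressed. Passing header=False as well is an assumption made here, on the reasoning that a header line would otherwise be loaded as a data row.

```python
import pandas as pd

# Same DataFrame as in the quoted report.
df = pd.DataFrame({"c": ["foo", "bar", "baz"]})

# Default call: a header line plus a leading row-index column.
default_csv = df.to_csv()

# Index suppressed; header=False is an assumption of this sketch.
data_only_csv = df.to_csv(index=False, header=False)

print(default_csv.splitlines())    # [',c', '0,foo', '1,bar', '2,baz']
print(data_only_csv.splitlines())  # ['foo', 'bar', 'baz']
```

With "\n" line endings the default output is 21 characters long, which matches the totalSize=21 in the quoted log, and its leading index column is consistent with the 0, 1, 2 values returned by the SELECT.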

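The buffering problem from item 2 of the quoted report can be illustrated with the standard library alone. This is a hedged sketch rather than the hook's real code: write_rows is a hypothetical helper that writes a few rows to a named temporary file and reports the size visible on disk while the file is still open, which is roughly what an external reader such as a LOAD DATA statement would see.

```python
import os
import tempfile


def write_rows(rows, flush):
    """Write rows to a named temp file and return its on-disk size
    while the file is still open (hypothetical helper for illustration)."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".csv") as f:
        f.write("\n".join(rows) + "\n")
        if flush:
            # Push buffered data to the OS before anything reads f.name.
            f.flush()
        return os.path.getsize(f.name)


size_flushed = write_rows(["foo", "bar", "baz"], flush=True)
print(size_flushed)
```

Calling write_rows with flush=False on such a small input will typically report a size of 0, because the data is still sitting in the Python-level buffer. That depends on the buffer size, so the sketch does not rely on it.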


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)