Posted to commits@airflow.apache.org by "Kengo Seki (JIRA)" <ji...@apache.org> on 2018/05/09 01:08:00 UTC
[jira] [Created] (AIRFLOW-2441) Fix bugs in HiveCliHook.load_df
Kengo Seki created AIRFLOW-2441:
-----------------------------------
Summary: Fix bugs in HiveCliHook.load_df
Key: AIRFLOW-2441
URL: https://issues.apache.org/jira/browse/AIRFLOW-2441
Project: Apache Airflow
Issue Type: Bug
Components: hive_hooks, hooks
Reporter: Kengo Seki
Assignee: Kengo Seki
{{HiveCliHook.load_df}} has several bugs and currently does not work.
1. Executing it fails as follows:
{code}
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
In [3]: from airflow.hooks.hive_hooks import HiveCliHook
In [4]: hook = HiveCliHook()
[2018-05-08 06:38:19,211] {base_hook.py:85} INFO - Using connection to: localhost
In [5]: hook.load_df(df, "t")
(snip)
TypeError: "delimiter" must be string, not unicode
{code}
To solve this, the "delimiter" parameter should be encoded using the "encoding" parameter, which is declared but currently unused.
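As a rough illustration of the idea (not the actual patch), the sketch below converts a delimiter to the native string type that the {{csv}} module expects. It assumes the error arises under Python 2, where a unicode delimiter is rejected; the helper name {{native_delimiter}} and the default encoding are hypothetical.

```python
import sys

PY2 = sys.version_info[0] == 2


def native_delimiter(delimiter, encoding="utf8"):
    """Return the delimiter as the native ``str`` type (hypothetical helper).

    On Python 2 the csv module rejects unicode delimiters, so a unicode
    value is encoded to a byte string. On Python 3 ``str`` is already the
    expected type and the value is returned unchanged.
    """
    if PY2 and not isinstance(delimiter, str):
        # Python 2 only: unicode -> byte string, using the given encoding.
        return delimiter.encode(encoding)
    return delimiter


delim = native_delimiter(u",")
```

The result can then be passed as the {{delimiter}} argument when the CSV is written.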
2. For a small dataset, it loads an empty file into Hive:
{code}
In [1]: import pandas as pd
...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
...: from airflow.hooks.hive_hooks import HiveCliHook
...: hook = HiveCliHook()
...: hook.load_df(df, "t")
...:
(snip)
[2018-05-08 20:46:48,883] {hive_hooks.py:231} INFO - Loading data to table default.t
[2018-05-08 20:46:49,448] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=0, rawDataSize=0]
{code}
{code}
hive> SELECT count(*) FROM t;
(snip)
OK
0
Time taken: 4.962 seconds, Fetched: 1 row(s)
{code}
This is because the file contents are still in the write buffer when the LOAD DATA statement is executed. The buffer should be flushed first, just as {{HiveCliHook.run_cli}} does.
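The buffering effect can be reproduced without Hive. The following minimal sketch uses only the standard library and is not the hook's code: it writes a few bytes to a temporary file and checks the on-disk size by path before and after flushing, which is roughly what an external reader such as LOAD DATA would see.

```python
import os
import tempfile

rows = b"foo\nbar\nbaz\n"

with tempfile.NamedTemporaryFile() as f:
    f.write(rows)
    # A small write usually sits in the in-process buffer, so a reader
    # that opens the file by path typically sees an empty file here.
    size_before = os.path.getsize(f.name)
    # Flushing pushes the buffered bytes out to the file.
    f.flush()
    size_after = os.path.getsize(f.name)

print(size_before, size_after)
```

Only after {{flush()}} does the size on disk equal the number of bytes written.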
3. Even with fixes for #1 and #2, unexpected data is loaded into Hive:
{code}
In [1]: import pandas as pd
...: df = pd.DataFrame({"c": ["foo", "bar", "baz"]})
...: from airflow.hooks.hive_hooks import HiveCliHook
...: hook = HiveCliHook()
...: hook.load_df(df, "t")
...:
(snip)
[2018-05-08 20:57:17,467] {hive_hooks.py:231} INFO - Loading data to table default.t
[2018-05-08 20:57:18,163] {hive_hooks.py:231} INFO - Table default.t stats: [numFiles=1, numRows=0, totalSize=21, rawDataSize=0]
{code}
{code}
hive> SELECT * FROM t;
OK
0
1
2
Time taken: 2.317 seconds, Fetched: 4 row(s)
{code}
This is because {{pandas.DataFrame.to_csv}} writes the row index to the file by default.
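For comparison, here is a small sketch of the default {{to_csv}} output next to the output with the index disabled. Passing {{header=False}} as well is an assumption on my part (the default output also starts with a header line), not something stated in this report.

```python
import pandas as pd

df = pd.DataFrame({"c": ["foo", "bar", "baz"]})

# Default: a header line plus a leading row-index column.
with_index = df.to_csv()

# Index and header disabled: only the column values are written.
without_index = df.to_csv(index=False, header=False)

print(with_index.splitlines())
print(without_index.splitlines())
```

With the defaults, each data line starts with the row index (0, 1, 2), which seems consistent with the values returned by the SELECT above.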
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)