Posted to commits@airflow.apache.org by "Yun Xu (JIRA)" <ji...@apache.org> on 2019/07/26 20:01:00 UTC

[jira] [Updated] (AIRFLOW-5053) Add support for configuring under-the-hood csv writer in MySqlToHiveTransfer Operator

     [ https://issues.apache.org/jira/browse/AIRFLOW-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Xu updated AIRFLOW-5053:
----------------------------
    Description: 
[https://github.com/apache/airflow/blob/master/airflow/operators/mysql_to_hive.py#L125]

MySqlToHiveTransfer uses csv.writer under the hood. However, when the MySQL table includes JSON columns, the writer by default doubles the quotechar wherever it appears inside a field (and quotes fields containing special characters), but those doubled quotes ("") are never converted back when the file is loaded into Hive, leaving invalid JSON payloads in the Hive columns.

e.g. '["true"]' (MySQL) => '[""true""]' (Hive, invalid JSON payload)

In our case, we fixed it by creating a customized MySqlToHiveTransfer operator that overrides the original class's execute method, essentially replacing the csv writer with our own configuration:
{code:python}
# mysql_to_hive.py does "import unicodecsv as csv", which is why this
# writer accepts an encoding argument, unlike the plain stdlib csv.writer.
# Configure csv_writer to support JSON columns: never quote fields,
# escape special characters instead of doubling quotechars.
csv_writer = csv.writer(f, delimiter=self.delimiter,
                        quoting=csv.QUOTE_NONE,
                        quotechar='',
                        escapechar='@',
                        encoding="utf-8")
{code}
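
With quoting disabled, the JSON payload round-trips unchanged. The same settings sketched with the standard library writer (quotechar=None stands in for unicodecsv's quotechar=''):
{code:python}
import csv
import io

# QUOTE_NONE disables quoting, and unsetting the quotechar keeps the
# writer from escaping the double quotes inside the JSON, so the field
# passes through verbatim; escapechar='@' still protects the delimiter.
buf = io.StringIO()
csv.writer(buf, delimiter='\x01', quoting=csv.QUOTE_NONE,
           quotechar=None, escapechar='@').writerow(['["true"]'])
print(buf.getvalue())  # ["true"]
{code}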
It'd be good if we could at least expose those csv configs, e.g. through csv.Dialect.
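
For example, the operator could accept a csv.Dialect subclass along these lines (the dialect class and the idea of passing it in are hypothetical, not an existing Airflow parameter):
{code:python}
import csv
import io

# Hypothetical dialect a configurable operator could accept; the class
# name and the dialect knob are illustrative, not an existing Airflow API.
class HiveJsonSafeDialect(csv.Dialect):
    delimiter = '\x01'       # the operator's default Hive field delimiter
    quoting = csv.QUOTE_NONE
    quotechar = None
    escapechar = '@'
    lineterminator = '\r\n'

csv_writer = csv.writer(io.StringIO(), dialect=HiveJsonSafeDialect)
{code}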

> Add support for configuring under-the-hood csv writer in MySqlToHiveTransfer Operator
> -------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-5053
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-5053
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: operators
>    Affects Versions: 1.10.3
>            Reporter: Yun Xu
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)