Posted to commits@airflow.apache.org by "Michael Ghen (JIRA)" <ji...@apache.org> on 2017/09/09 19:51:02 UTC
[jira] [Updated] (AIRFLOW-1586) MySQL to GCS to BigQuery fails for tables with date types
[ https://issues.apache.org/jira/browse/AIRFLOW-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Ghen updated AIRFLOW-1586:
----------------------------------
Description:
For MySQL tables that use a "date" type, a DAG that exports from MySQL to Google Cloud Storage and then loads from GCS to BigQuery will fail.
When the table is exported as JSON to a GCS bucket, date fields are parsed the same way as datetime fields and converted to timestamps. However, the schema file sets the field type to "STRING."
This creates a problem when you try to load the JSON into BigQuery using the schema file.
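For illustration, here is a minimal standalone sketch (not the operator's actual code) of the conversion behavior described above, assuming dates take the same epoch-seconds path as datetimes. It produces a float like the one in the error further down:
{code}
import datetime
import time

# Sketch of how the export converts date values before writing JSON:
# dates take the same path as datetimes and become epoch seconds.
value = datetime.date(2017, 9, 9)
print(time.mktime(value.timetuple()))  # e.g. 1504929600.0 (timezone-dependent)
{code}
BigQuery then rejects that float, because the generated schema file still declares the column as "STRING."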
*Steps to reproduce*
0. Set up a MySQL connection and GCP connection in Airflow.
1. Create a MySQL table with a "date" field and put some data into the table.
{code}
CREATE TABLE table_with_date (
    date_field date,
    datetime_field datetime
);
INSERT INTO table_with_date (date_field, datetime_field) VALUES ('2017-09-09', NOW());
{code}
2. Create a DAG that exports the data from MySQL to GCS and then loads it from GCS to BigQuery (using the schema file). For example:
{code}
from airflow.contrib.operators.mysql_to_gcs import MySqlToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

extract = MySqlToGoogleCloudStorageOperator(
    task_id="extract_table",
    mysql_conn_id='mysql_connection',
    google_cloud_storage_conn_id='gcp_connection',
    sql="SELECT * FROM table_with_date",
    bucket='gcs-bucket',
    filename='table_with_date.json',
    schema_filename='schemas/table_with_date.json',
    dag=dag)

load = GoogleCloudStorageToBigQueryOperator(
    task_id="load_table",
    bigquery_conn_id='gcp_connection',
    google_cloud_storage_conn_id='gcp_connection',
    bucket='gcs-bucket',
    destination_project_dataset_table="dataset.table_with_date",
    source_objects=['table_with_date.json'],
    schema_object='schemas/table_with_date.json',
    source_format='NEWLINE_DELIMITED_JSON',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    dag=dag)

load.set_upstream(extract)
{code}
3. Run the DAG.
Expected: The DAG runs successfully.
Actual: The `load_table` task fails with an error:
{code}
...
{u'reason': u'invalid',
u'message': u'JSON parsing error in row starting at position 0: Could not convert value to string. Field: date_field; Value: 1504929600.000000',
u'location': u'gs://gcs-bucket/table_with_date.json'
...
{code}
*Comments:*
Seems like this was just a simple oversight in the section of `airflow/contrib/operators/mysql_to_gcs.py` where the types get converted: in `convert_types`, both `date` and `datetime` values are converted to timestamps, but `type_map` has no mapping for `FIELD_TYPE.DATE`, so date columns fall back to "STRING" in the schema file. This small bug almost turned my team off from using Airflow, because we use the `date` type a lot and many of our tables wouldn't flow into BigQuery.
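A minimal sketch of the kind of fix this suggests: adding the missing `FIELD_TYPE.DATE` entry to the operator's MySQL-to-BigQuery type map, presumably mapping it to 'TIMESTAMP' to match the values `convert_types` already emits. The surrounding entries here are illustrative, not the full map:
{code}
from MySQLdb.constants import FIELD_TYPE

# Sketch of the mapping used when writing the schema file: MySQL column
# types map to BigQuery types, and anything absent falls back to 'STRING'.
d = {
    FIELD_TYPE.DATETIME: 'TIMESTAMP',
    FIELD_TYPE.TIMESTAMP: 'TIMESTAMP',
    FIELD_TYPE.DATE: 'TIMESTAMP',  # proposed addition; without it, date columns become 'STRING'
    # ... remaining integer/float mappings unchanged ...
}
{code}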
> MySQL to GCS to BigQuery fails for tables with date types
> ---------------------------------------------------------
>
> Key: AIRFLOW-1586
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1586
> Project: Apache Airflow
> Issue Type: Bug
> Components: gcp
> Environment: Google Cloud Platform
> Reporter: Michael Ghen
> Assignee: Michael Ghen
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)