You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/26 16:15:14 UTC

[GitHub] [airflow] dclandau opened a new issue, #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

dclandau opened a new issue, #25313:
URL: https://github.com/apache/airflow/issues/25313

   ### Apache Airflow Provider(s)
   
   google
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-google==6.8.0
   
   ### Apache Airflow version
   
   2.3.2
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   When using the `PostgresToGCSOperator(..., export_format='parquet', approx_max_file_size_bytes=Y, ...)`, when a temporary file exceeds the size defined by Y, the current file is not yielded, and no new chunk is created. Meaning that only 1 chunk will be uploaded irregardless of the size specified Y.
   
   I believe [this](https://github.com/apache/airflow/blob/d876b4aa6d86f589b9957a2e69484c9e5365eba8/airflow/providers/google/cloud/transfers/sql_to_gcs.py#L253) line of code which is responsible for verifying whether the temporary file has exceeded its size, to be the culprit, considering the call to `tmp_file_handle.tell()` is always returning 0 after a `parquet_writer.write_table(tbl)` call [[here]](https://github.com/apache/airflow/blob/d876b4aa6d86f589b9957a2e69484c9e5365eba8/airflow/providers/google/cloud/transfers/sql_to_gcs.py#L240).
   
   Therefore, regardless of the size of the temporary file already being bigger than the defined approximate limit Y, no new file will be created and only a single chunk will be uploaded.
   
   ### What you think should happen instead
   
   This behaviour is erroneous as when the temporary file exceeds the size defined by Y, it should upload the current temporary file and then create a new file to upload after successfully uploading the current file to GCS.
   
   A possible fix could be to use the `import os` package to determine the size of the temporary file with `os.stat(tmp_file_handle).st_size`, instead of using `tmp_file_handle.tell()`.
   
   ### How to reproduce
   
   1. Create a postgres connection on airflow with id `postgres_test_conn`.
   2. Create a gcp connection on airflow with id `gcp_test_conn`.
   3. In the database referenced by the `postgres_test_conn`, in the public schema create a table `large_table`, where the total amount of data In the table is big enough to exceed the 10MB limit defined in the `approx_max_file_size_bytes` parameter.
   4. Create a bucket named `issue_BaseSQLToGCSOperator_bucket`, in the gcp account referenced by the `gcp_test_conn`.
   5. Create the dag exemplified in the excerpt below, and manually trigger the dag to fetch all the data from `large_table`, to insert in the `issue_BaseSQLToGCSOperator_bucket`. We should expect multiple chunks to be created, but due to this bug, only 1 chunk will be uploaded with the whole data from `large_table`.
   
   ```python
   import pendulum
   
   from airflow import DAG
   from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
   
   
   with DAG(
       dag_id="issue_BaseSQLToGCSOperator",
       start_date=pendulum.parse("2022-01-01"),
   )as dag:
       task = PostgresToGCSOperator(
           task_id='extract_task',
           filename='uploading-{}.parquet',
           bucket="issue_BaseSQLToGCSOperator_bucket",
           export_format='parquet',
           approx_max_file_size_bytes=10_485_760,
           sql="SELECT * FROM large_table",
           postgres_conn_id='postgres_test_conn',
           gcp_conn_id='gcp_test_conn',
       )
   ```
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] koroder commented on issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
koroder commented on issue #25313:
URL: https://github.com/apache/airflow/issues/25313#issuecomment-1202288439

   @eladkal  I am looking to make my first contribution to Airflow. Would it be okay if I work on this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] boring-cyborg[bot] commented on issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #25313:
URL: https://github.com/apache/airflow/issues/25313#issuecomment-1195692164

   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #25313:
URL: https://github.com/apache/airflow/issues/25313#issuecomment-1198002078

   feel free to submit PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #25313:
URL: https://github.com/apache/airflow/issues/25313#issuecomment-1202365072

   Feel free but check if maybe @dclandau has not done something already 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dclandau commented on issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
dclandau commented on issue #25313:
URL: https://github.com/apache/airflow/issues/25313#issuecomment-1202449927

   Here is the [PR](https://github.com/apache/airflow/pull/25469). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dclandau commented on issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
dclandau commented on issue #25313:
URL: https://github.com/apache/airflow/issues/25313#issuecomment-1202406417

   I've got a commit ready. I am just preparing the commit message to be within the guidelines


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] koroder commented on issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
koroder commented on issue #25313:
URL: https://github.com/apache/airflow/issues/25313#issuecomment-1202415856

   > I've got a commit ready. I am just preparing the commit message to be within the guidelines
   
   Ok. I will find something new :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk closed issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #25313: BaseSQLToGCSOperator parquet export format not limiting file size bug
URL: https://github.com/apache/airflow/issues/25313


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org