Posted to commits@airflow.apache.org by "eric-ke-long (via GitHub)" <gi...@apache.org> on 2023/06/07 03:06:15 UTC

[GitHub] [airflow] eric-ke-long opened a new issue, #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

eric-ke-long opened a new issue, #31750:
URL: https://github.com/apache/airflow/issues/31750

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   `BaseSQLToGCSOperator` creates a separate row group for every row when generating Parquet files, which prevents compression from working effectively and inflates the file size.
   ![image](https://github.com/apache/airflow/assets/51909776/bf256065-c130-4354-81c7-8ca2ed4e8d93)
   
   
   ### What you think should happen instead
   
   _No response_
   
   ### How to reproduce
   
   from airflow.providers.google.cloud.transfers.oracle_to_gcs import OracleToGCSOperator

   OracleToGCSOperator(
       task_id='oracle_to_gcs_parquet_test',
       gcp_conn_id=GCP_FPDATALAKE,
       oracle_conn_id=ORACLE_CONNECTION,
       sql='',
       bucket=GCS_BUCKET_NAME,
       filename='',
       export_format='parquet',
   )
   
   ### Operating System
   
   CentOS Linux 7
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-apache-hive  2.1.0
   apache-airflow-providers-apache-sqoop 2.0.2
   apache-airflow-providers-celery       3.0.0
   apache-airflow-providers-common-sql   1.2.0
   apache-airflow-providers-ftp          3.1.0
   apache-airflow-providers-google       8.4.0
   apache-airflow-providers-http         4.0.0
   apache-airflow-providers-imap         3.0.0
   apache-airflow-providers-mysql        3.0.0
   apache-airflow-providers-oracle       2.1.0
   apache-airflow-providers-salesforce   5.3.0
   apache-airflow-providers-sqlite       3.2.1
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] hussein-awala commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "hussein-awala (via GitHub)" <gi...@apache.org>.
hussein-awala commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1585609145

   @eric-ke-long can you test #31831 and try changing the `parquet_row_group_size` parameter?




[GitHub] [airflow] boring-cyborg[bot] commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "boring-cyborg[bot] (via GitHub)" <gi...@apache.org>.
boring-cyborg[bot] commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1579801783

   Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.
   




[GitHub] [airflow] phanikumv commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "phanikumv (via GitHub)" <gi...@apache.org>.
phanikumv commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1582439746

   Thanks for your response. I will try to reproduce the issue and analyze it.




[GitHub] [airflow] eric-ke-long commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "eric-ke-long (via GitHub)" <gi...@apache.org>.
eric-ke-long commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1591081757

   @hussein-awala Thanks for the update, and sorry for the late response; I only just got a chance to check the chat box. I made a fix for this locally, and it looks like our ideas are the same. I believe your solution solves the issue above very well.




[GitHub] [airflow] phanikumv commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "phanikumv (via GitHub)" <gi...@apache.org>.
phanikumv commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1584012746

   @eric-ke-long how did you check the file size (how did you get the info shown in the screenshot)?




[GitHub] [airflow] hussein-awala commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "hussein-awala (via GitHub)" <gi...@apache.org>.
hussein-awala commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1584102613

   This is expected, since we read the rows and write them to the Parquet file one by one.
   I have a solution for this problem. I will implement it and let you test it before merging the PR, but I see the issue more as a feature request (specifying the row group size) than as a bug.
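
As an illustration of the behavior described above (not the actual Airflow implementation), the following minimal sketch shows why writing rows one at a time yields one row group per row, and how buffering rows into batches reduces that count. It assumes each flushed batch becomes exactly one row group; `batch_rows`, `count_writes`, and the sample data are hypothetical names invented for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[object, ...]


def batch_rows(rows: Iterable[Row], row_group_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most `row_group_size` rows, preserving order."""
    if row_group_size < 1:
        raise ValueError("row_group_size must be a positive integer")
    iterator = iter(rows)
    while True:
        # Pull the next chunk from the cursor-like iterator.
        batch = list(islice(iterator, row_group_size))
        if not batch:
            return
        yield batch


def count_writes(rows: Iterable[Row], row_group_size: int) -> int:
    """Count how many batches (assumed: one row group each) would be written."""
    return sum(1 for _ in batch_rows(rows, row_group_size))


# Hypothetical sample data standing in for a SQL cursor's result set.
rows = [(i, f"name-{i}") for i in range(10_000)]

one_by_one = count_writes(rows, 1)    # one write per row
batched = count_writes(rows, 2_500)   # one write per 2,500 rows

print(one_by_one, batched)
```

With a batch size of 1 the writer is called once per row, so per-group metadata is repeated for every row and the column encodings and compression have almost nothing to work with. Larger batches amortize that overhead, at the cost of holding more rows in memory before each write.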




[GitHub] [airflow] phanikumv commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "phanikumv (via GitHub)" <gi...@apache.org>.
phanikumv commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1580163114

   Are you getting any error while doing this operation? If not, I think this is a new feature for the `OracleToGCSOperator`.




[GitHub] [airflow] eric-ke-long commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "eric-ke-long (via GitHub)" <gi...@apache.org>.
eric-ke-long commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1580241746

   @phanikumv I don't receive any error; however, it turns a 300KB Parquet file into 156MB, which is a kind of defect.




[GitHub] [airflow] hussein-awala commented on issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "hussein-awala (via GitHub)" <gi...@apache.org>.
hussein-awala commented on issue #31750:
URL: https://github.com/apache/airflow/issues/31750#issuecomment-1591096314

   Great! This new feature will be released soon.




[GitHub] [airflow] hussein-awala closed issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation

Posted by "hussein-awala (via GitHub)" <gi...@apache.org>.
hussein-awala closed issue #31750: BaseSQLToGCSOperator creates row group for each rows during parquet generation
URL: https://github.com/apache/airflow/issues/31750

