You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/17 10:05:06 UTC

[GitHub] [airflow] danneaves-ee opened a new issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

danneaves-ee opened a new issue #20901:
URL: https://github.com/apache/airflow/issues/20901


   ### Apache Airflow Provider(s)
   
   google
   
   ### Versions of Apache Airflow Providers
   
   ```
   $ pip freeze | grep apache-airflow-providers
   apache-airflow-providers-ftp==2.0.1
   apache-airflow-providers-google==6.3.0
   apache-airflow-providers-http==2.0.2
   apache-airflow-providers-imap==2.1.0
   apache-airflow-providers-pagerduty==2.1.0
   apache-airflow-providers-sftp==2.4.0
   apache-airflow-providers-sqlite==2.0.1
   apache-airflow-providers-ssh==2.3.0
   ```
   
   ### Apache Airflow version
   
   2.2.3 (latest released)
   
   ### Operating System
   
   Ubuntu 20.04.3 LTS
   
   ### Deployment
   
   Composer
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   Using `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` to load the contents of a file into `xcom` unexpectedly casts the file bytes to string.
   
   ### What you expected to happen
   
   `GCSToLocalFilesystemOperator` should not cast to string
   
   ### How to reproduce
   
   Store a file on gcs;
   ```
   Hello World!
   ```
   
   Read file to xcom
   ```
   my_task = GCSToLocalFilesystemOperator(
       task_id='my_task',
       bucket=bucket,
       object_name=object_path,
       store_to_xcom_key='my_xcom_key',
   )
   ```
   
   Access via jinja;
   ```
   {{ ti.xcom_pull(task_ids="my_task", key="my_xcom_key") }}
   ```
   
   XCom result is;
   ```
   b'Hello World!'
   ```
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1014672576


   This is expected. We cannot treat object stored in GCS as "string". It's always retrieved as Bytes. If you know what encoding is used you can easily decode it `ti.xcom_pull(task_ids="my_task", key="my_xcom_key").decode('utf-8') for example.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1014819190


   Ah. I see - this actually converts the value of the bytes to "nice string" representation: https://docs.python.org/3/library/stdtypes.html#str
   
   > Passing a bytes object to str() without the encoding or errors arguments falls under the first case of returning the informal string representation (see also the -b command-line option to Python). For example:
   
   ```
   str(b'Zoot!')
   "b'Zoot!'"
   ```
   
   So those are not "bytes" but string containing `b'` :scream: 
   
   Yeah. Agree it's a bug then. Would you like to fix it maybe? What do you think should be the fix @danneaves-ee  ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] danneaves-ee commented on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
danneaves-ee commented on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1015218971


   Thanks for reopening @potiuk :smile: 
   
   GCS provides a "Content-Encoding" meta key, however, it's optional so I guess not many people would use it. 
   
   Since XComs must be serializable, we have replaced the lines as follows in our work-around;
   ```
   # airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator ~142
   - context['ti'].xcom_push(key=self.store_to_xcom_key, value=str(file_bytes))
   + context['ti'].xcom_push(key=self.store_to_xcom_key, value=str(file_bytes, encoding='utf-8'))
   ```
   
   But it could be safer to use `sys.getdefaultencoding()` instead of `utf-8`. Or even expose the encoding as a parameter to the class?
   
   What do you think @potiuk?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1014343662


   Thanks for opening your first issue here! Be sure to follow the issue template!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1014819190


   Ah. I see - this actually converts the value of the bytes to nice "string" representation: https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not
   
   > Passing a bytes object to str() without the encoding or errors arguments falls under the first case of returning the informal string representation (see also the -b command-line option to Python). For example:
   
   ```
   str(b'Zoot!')
   "b'Zoot!'"
   ```
   
   So those are not "bytes" but string containing `b'` :scream: 
   
   Yeah. Agree it's a bug then. Would you like to fix it maybe? What do you think should be the fix @danneaves-ee  ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1015256657


   I think the problem is that we really want to store binary data as string. What I thing should be really good solution is to convert the bytes into string assuming that the bytes follow the encoding specified. And if the data is encoded using different encoding, it will fail.
   
   I think adding optional "encoding" parameter which falls-back to`'utf-8` is a better idea. getdefaultencoding is something I do not really trust in Python. In many places (including str() built-in function, encoding=utf-8 is default rather than getdefaultencoding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] danneaves-ee commented on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
danneaves-ee commented on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1014702809


   @potiuk please look at the code here; https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/transfers/gcs_to_local.py#L142
   
   This is a bug.
   
   You cannot cast `byte` to `str` in this way and successfully decode it after. The `str()` method must be supplied with an `encoding` param like this;
   
   ```
   str(file_bytes, encoding='utf-8')
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1015256657


   I think the problem is that we really want to store binary data as string. What I thing should be really good solution is to convert the bytes into string assuming that the bytes follow the encoding specified. And if the data is encoded using different encoding, it will fail.
   
   I think adding optional "encoding" parameter which falls-back to `utf-8` is a better idea. getdefaultencoding is something I do not really trust in Python. In many places (including str() built-in function, encoding=utf-8 is default rather than getdefaultencoding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1015257641


   I assigned you to it :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1014819190


   Ah. I see - this actually converts the value of the bytes to "nice string" representation: https://meet.google.com/xmi-romh-zvd?authuser=0
   
   > Passing a bytes object to str() without the encoding or errors arguments falls under the first case of returning the informal string representation (see also the -b command-line option to Python). For example:
   
   ```
   str(b'Zoot!')
   "b'Zoot!'"
   ```
   
   So those are not "bytes" but string containing `b'` :scream: 
   
   Yeah. Agree it's a bug then. Would you like to fix it maybe? What do you think should be the fix @danneaves-ee  ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #20901:
URL: https://github.com/apache/airflow/issues/20901


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #20901:
URL: https://github.com/apache/airflow/issues/20901


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #20901: Bytes cast to String in `airflow.providers.google.cloud.transfers.gcs_to_local.GCSToLocalFilesystemOperator` ~142

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #20901:
URL: https://github.com/apache/airflow/issues/20901#issuecomment-1014819190


   Ah. I see - this actually converts the value of the bytes to "nice string" representation: https://docs.python.org/3/library/stdtypes.html#boolean-operations-and-or-not
   
   > Passing a bytes object to str() without the encoding or errors arguments falls under the first case of returning the informal string representation (see also the -b command-line option to Python). For example:
   
   ```
   str(b'Zoot!')
   "b'Zoot!'"
   ```
   
   So those are not "bytes" but string containing `b'` :scream: 
   
   Yeah. Agree it's a bug then. Would you like to fix it maybe? What do you think should be the fix @danneaves-ee  ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org