Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/05 14:07:36 UTC

[GitHub] [airflow] eskarimov opened a new pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

eskarimov opened a new pull request #17446:
URL: https://github.com/apache/airflow/pull/17446


   Re-opened PR #17284 
   ### **Issue:**
   Currently the Google Cloud Transfer Service operators `CloudDataTransferServiceGCSToGCSOperator` and `CloudDataTransferServiceS3ToGCSOperator` don't provide a way to specify a destination path, i.e. a prefix for transferred objects, so all objects land in the root of the bucket.
   
   ### **Solution:**
   [The Storage Transfer Service allows specifying a prefix for transferred objects](https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#GcsData); this PR adds that functionality via a new input parameter on the operators above.
   
   By default it is `None`, so transferred objects still land in the root of the bucket.
   
   Ideally I'd like to use a single input parameter for the destination location, similar to [`S3ToGCSOperator`](https://github.com/apache/airflow/blob/main/airflow/providers/google/cloud/transfers/s3_to_gcs.py), parsing the GCS URL, but that would break backward compatibility for existing pipelines. Hence, adding a new input parameter seems the more reasonable solution.
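
   For illustration only, the constraint GCS places on the `path` field (it must be an empty string, or a prefix that ends with `/` and does not begin with `/`) can be checked with a tiny validator. This is a hypothetical helper, not part of the PR:

   ```python
   # Hypothetical helper, not part of the PR: checks the TransferSpec "path"
   # constraint from the GCS docs (empty string, or a prefix that ends with
   # "/" and does not begin with "/").
   def is_valid_transfer_path(path: str) -> bool:
       if path == "":
           return True
       return path.endswith("/") and not path.startswith("/")
   ```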


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eskarimov edited a comment on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov edited a comment on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-946657445


   @uranusjr thank you so much for triggering the tests. Unfortunately I missed the failed static code check, so I've fixed it (and also properly set up pre-commit hooks to avoid such mistakes in the future).
   
   The test `Build and test provider packages wheel` is going to fail, particularly the job `Install and test provider packages and airflow on Airflow 2.1 files`, as some packages using [`GCSHook`](https://github.com/apache/airflow/blob/e2d6266ef9109e541babbb734c722d2204f31365/airflow/providers/google/cloud/hooks/gcs.py#L117) depend on [`normalize_directory_path()`](https://github.com/apache/airflow/blob/e2d6266ef9109e541babbb734c722d2204f31365/airflow/providers/google/cloud/hooks/gcs.py#L42), which has moved to `airflow.utils.helpers` in this PR.
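
   For readers unfamiliar with the helper, here is a minimal sketch of what `normalize_directory_path()` does. This is a hypothetical reimplementation for illustration; see the linked source for the real one:

   ```python
   from typing import Optional

   def normalize_directory_path(source_object: Optional[str]) -> Optional[str]:
       """Sketch: ensure a non-empty directory path ends with a trailing slash."""
       if source_object and not source_object.endswith("/"):
           return source_object + "/"
       return source_object
   ```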





[GitHub] [airflow] mik-laj merged pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
mik-laj merged pull request #17446:
URL: https://github.com/apache/airflow/pull/17446


   





[GitHub] [airflow] github-actions[bot] commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-937372966


   The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with `--force-with-lease`.







[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-953569881


   The function `normalize_directory_path` was kept inside the `google` provider package in order to remain compatible with Airflow 2.1.





[GitHub] [airflow] SamWheating commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r718913319



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -1053,8 +1075,8 @@ def _create_body(self) -> dict:
             DESCRIPTION: self.description,
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
-                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},
-                GCS_DATA_SINK: {BUCKET_NAME: self.destination_bucket},
+                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket, PATH: self.source_path},

Review comment:
       Do you think it's OK to pass `None` in here if the source / destination paths are unspecified? From the docs: 
   
   _"path: Must be an empty string or full path name that ends with a '/'. This field is treated as an object prefix. As such, it should generally not begin with a '/'."_
   
   https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#awss3data
   
   If this is the case then maybe the default value for paths should be `""`, or we should only add `PATH` to this dictionary when it is not `None`, something like:
   
   ```python
           body = {
               DESCRIPTION: self.description,
               STATUS: GcpTransferJobsStatus.ENABLED,
               TRANSFER_SPEC: {
                   GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},
                   GCS_DATA_SINK: {BUCKET_NAME: self.destination_bucket},
               },
           }
   
           if self.source_path is not None:
               body[TRANSFER_SPEC][GCS_DATA_SOURCE][PATH] = self.source_path
               
           if self.destination_path is not None:
               body[TRANSFER_SPEC][GCS_DATA_SINK][PATH] = self.destination_path
   ```
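
   An equivalent way to express this suggestion is to build the spec with the optional paths included and then prune `None` values before sending. A hedged sketch (the constant values below are illustrative; the real operator module defines its own):

   ```python
   # Illustrative constants; the real operator module defines its own.
   TRANSFER_SPEC = "transferSpec"
   GCS_DATA_SOURCE = "gcsDataSource"
   GCS_DATA_SINK = "gcsDataSink"
   BUCKET_NAME = "bucketName"
   PATH = "path"

   def build_gcs_transfer_spec(source_bucket, destination_bucket,
                               source_path=None, destination_path=None):
       spec = {
           GCS_DATA_SOURCE: {BUCKET_NAME: source_bucket, PATH: source_path},
           GCS_DATA_SINK: {BUCKET_NAME: destination_bucket, PATH: destination_path},
       }
       # Drop None values so we never send "path": null to the API.
       return {
           TRANSFER_SPEC: {
               key: {k: v for k, v in inner.items() if v is not None}
               for key, inner in spec.items()
           }
       }
   ```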
   
   







[GitHub] [airflow] iostreamdoth edited a comment on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
iostreamdoth edited a comment on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-893547790


   I see now: `PATH` was missing from the transfer spec for the GCS data sink.





[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-930156422


   > I think that the docstring on CloudDataTransferServiceS3ToGCSOperator will have to be updated to reflect these changes:
   > 
   > https://github.com/eskarimov/airflow/blob/0a3ae26c231f4b19fccb2133946fffa34baff975/airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py#L741-L744
   
   Done, added the newly created input params `s3_path` and `gcs_path`; kindly requesting a review.







[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-955168639


   Could someone trigger the tests workflow again, please? It should be green now.





[GitHub] [airflow] iostreamdoth commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
iostreamdoth commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-893530997


   Wondering if that should be part of `transfer_options`.
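
   For context, `transfer_options` maps to `TransferSpec.transferOptions`, which controls overwrite/delete behavior rather than object prefixes, which is why `path` ended up as a separate field on the data source/sink. A sketch of such a dict (field names are believed to match the Storage Transfer API, but verify against the current reference):

   ```python
   # Field names believed to match TransferOptions in the Storage Transfer
   # API; verify against the current API reference before use.
   transfer_options = {
       "overwriteObjectsAlreadyExistingInSink": True,
       "deleteObjectsFromSourceAfterTransfer": False,
   }
   ```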








[GitHub] [airflow] SamWheating commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r719463937



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -1053,8 +1075,8 @@ def _create_body(self) -> dict:
             DESCRIPTION: self.description,
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
-                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},
-                GCS_DATA_SINK: {BUCKET_NAME: self.destination_bucket},
+                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket, PATH: self.source_path},

Review comment:
       Great, thanks for clarifying.











[GitHub] [airflow] SamWheating commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r716284370



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -885,7 +892,7 @@ def _create_body(self) -> dict:
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
                 AWS_S3_DATA_SOURCE: {BUCKET_NAME: self.s3_bucket},

Review comment:
       Maybe we should add the option to include a source path here at the same time? It looks like the transferSpec supports a path argument for S3 bucket sources
   
   https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#AwsS3Data
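A minimal sketch of what that could look like. The constants below are defined locally for illustration (only the `bucketName`/`path` REST field names come from the TransferSpec docs), and the prefix is added conditionally so an unspecified path never reaches the API:

```python
from typing import Optional

# Illustrative constants mirroring the style of the Airflow module;
# the REST field names follow the TransferSpec reference.
AWS_S3_DATA_SOURCE = "awsS3DataSource"
BUCKET_NAME = "bucketName"
PATH = "path"


def build_s3_source(s3_bucket: str, s3_path: Optional[str] = None) -> dict:
    """Build the S3 data-source spec, adding the object prefix only when given."""
    source = {BUCKET_NAME: s3_bucket}
    if s3_path is not None:
        # Per the docs: must be "" or a full path ending with '/',
        # and should not begin with '/'.
        source[PATH] = s3_path
    return source


spec = {AWS_S3_DATA_SOURCE: build_s3_source("my-bucket", "exports/2021/")}
print(spec)  # {'awsS3DataSource': {'bucketName': 'my-bucket', 'path': 'exports/2021/'}}
```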







[GitHub] [airflow] SamWheating commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r718913319



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -1053,8 +1075,8 @@ def _create_body(self) -> dict:
             DESCRIPTION: self.description,
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
-                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},
-                GCS_DATA_SINK: {BUCKET_NAME: self.destination_bucket},
+                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket, PATH: self.source_path},

Review comment:
       Do you think it's ok to pass `None` in here if the source / destination paths are unspecified? From the docs:
   
   _"Must be an empty string or full path name that ends with a '/'. This field is treated as an object prefix. As such, it should generally not begin with a '/'."_
   
   https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#awss3data
   
   If this is the case, then maybe the default value for paths should be `""`, or we should only add `PATH` to this dictionary if it is not `None`, something like:
   
   ```python
           body = {
               DESCRIPTION: self.description,
               STATUS: GcpTransferJobsStatus.ENABLED,
               TRANSFER_SPEC: {
                   GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},
                   GCS_DATA_SINK: {BUCKET_NAME: self.destination_bucket},
               },
           }
   
           if self.source_path is not None:
               body[TRANSFER_SPEC][GCS_DATA_SOURCE][PATH] = self.source_path
               
           if self.destination_path is not None:
               body[TRANSFER_SPEC][GCS_DATA_SINK][PATH] = self.destination_path
   ```
   
   etc
   







[GitHub] [airflow] potiuk commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-930541630


   Hey @uranusjr @SamWheating -> any last look / approve? If it looks good, I plan to release provider packages tomorrow morning and could include this one (if tests pass, of course :)).





[GitHub] [airflow] eskarimov commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r718488859



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -885,7 +892,7 @@ def _create_body(self) -> dict:
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
                 AWS_S3_DATA_SOURCE: {BUCKET_NAME: self.s3_bucket},

Review comment:
       Thank you, Sam, I totally agree it'd bring great value 👍, and sorry I missed it

##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -1054,7 +1066,7 @@ def _create_body(self) -> dict:
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
                 GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},

Review comment:
       Done, similarly as for the above S3 -> GCS







[GitHub] [airflow] eskarimov edited a comment on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov edited a comment on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-946657445


   @uranusjr thank you so much for triggering the tests - I'd missed the static code checks unfortunately, so I had to fix the failed checks (and also properly set up pre-commit hooks to avoid such mistakes in the future).
   
   The test `Build and test provider packages wheel` is going to fail, particularly the job `Install and test provider packages and airflow on Airflow 2.1 files`, as some packages using [`GCSHook`](https://github.com/apache/airflow/blob/e2d6266ef9109e541babbb734c722d2204f31365/airflow/providers/google/cloud/hooks/gcs.py#L117) depend on [`normalize_directory_path()`](https://github.com/apache/airflow/blob/e2d6266ef9109e541babbb734c722d2204f31365/airflow/providers/google/cloud/hooks/gcs.py#L42), which has moved to `airflow.utils.helpers` in this PR.





[GitHub] [airflow] eskarimov commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r719148800



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -1053,8 +1075,8 @@ def _create_body(self) -> dict:
             DESCRIPTION: self.description,
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
-                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},
-                GCS_DATA_SINK: {BUCKET_NAME: self.destination_bucket},
+                GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket, PATH: self.source_path},

Review comment:
       I've checked for both source and sink - Google's API handles it correctly, omitting `None` args for path/prefix.







[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-951776824


   > @eskarimov You cannot create new methods in the Airflow core as we are trying to maintain backward compatibility with Airflow 2.1.
   
   Ah, I see, thanks @mik-laj. 
   I'd propose then to keep the function `normalize_directory_path` inside the `google` provider package by moving it into `airflow.providers.google.cloud.utils` - what do you think? If that's fine, I'll prepare the commit.





[GitHub] [airflow] mik-laj commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-951781725


   I'm fine with it. 





[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-946657445


   @uranusjr thank you so much for triggering the tests - I'd missed the static code checks unfortunately, so I had to fix the failed checks (and also properly set up pre-commit hooks to avoid such mistakes in the future).
   
   The only thing I'm not sure about is the failing tests for building and testing provider packages with wheel - some operators import and use the [`GCSHook`](https://github.com/apache/airflow/blob/e2d6266ef9109e541babbb734c722d2204f31365/airflow/providers/google/cloud/hooks/gcs.py#L117) class, which depends on [`normalize_directory_path()`](https://github.com/apache/airflow/blob/e2d6266ef9109e541babbb734c722d2204f31365/airflow/providers/google/cloud/hooks/gcs.py#L42), which has moved in this PR.
   
   At the same time, building with the `sdist` format completes successfully.





[GitHub] [airflow] eskarimov commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r686834461



##########
File path: airflow/providers/google/cloud/hooks/gcs.py
##########
@@ -1207,3 +1204,7 @@ def _parse_gcs_url(gsurl: str) -> Tuple[str, str]:
     # Remove leading '/' but NOT trailing one
     blob = parsed_url.path.lstrip('/')
     return bucket, blob
+
+
+def _normalize_directory_path(source_object: Optional[str]) -> Optional[str]:
+    return source_object + "/" if source_object and not source_object.endswith("/") else source_object

Review comment:
       Hi @uranusjr ,
    thank you for reviewing; I'd like to share why I think it'd be better to keep the function in its current place:
    - If we move it under `airflow.utils`, it might be treated as a general function for normalizing directory paths (for instance, removing redundant separators or converting slashes to backslashes for Windows, etc.), while here it's purely about having a slash at the end of the path.
    
    - Another option is moving it under `airflow.providers.google.cloud.utils`, but that seems less logical to me, because the function is about normalizing GCS paths - I'd expect to find it in the class/file related to GCS, together with the other functions.
    
    Regarding the underscore prefix: I understand and share your point about getting rid of the leading underscore, as we import this function in other modules, so it doesn't seem like something for internal use.
    On the other hand, I tried to follow the same pattern as the previous function in the same file, [`_parse_gcs_url`](https://github.com/apache/airflow/blob/8d2d264cdd7bb79b85b2b6124c7fc596ac9142f8/airflow/providers/google/cloud/hooks/gcs.py#L1192-L1206) - it's used across the [BigQuery](https://github.com/apache/airflow/blob/67cbb0f181f806edb16ca12fb7a2638b5f31eb58/airflow/providers/google/cloud/operators/bigquery.py#L39), [AzureFileShareToGCSOperator](https://github.com/apache/airflow/blob/866a601b76e219b3c043e1dbbc8fb22300866351/airflow/providers/google/cloud/transfers/azure_fileshare_to_gcs.py#L24) and [S3ToGCSOperator](https://github.com/apache/airflow/blob/866a601b76e219b3c043e1dbbc8fb22300866351/airflow/providers/google/cloud/transfers/s3_to_gcs.py#L25) operators. Please let me know your thoughts.
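For reference, the helper's behaviour can be exercised in isolation; the sketch below reproduces the one-liner from the diff and shows that `None` and `""` pass through unchanged, so optional or empty prefixes stay valid:

```python
from typing import Optional


# Reproduction of the helper under discussion: append a trailing '/'
# only to a non-empty path that doesn't already end with one.
def normalize_directory_path(source_object: Optional[str]) -> Optional[str]:
    return source_object + "/" if source_object and not source_object.endswith("/") else source_object


print(normalize_directory_path("data/raw"))   # data/raw/
print(normalize_directory_path("data/raw/"))  # data/raw/
print(normalize_directory_path(""))           # empty string unchanged
print(normalize_directory_path(None))         # None unchanged
```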







[GitHub] [airflow] SamWheating commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
SamWheating commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r716284540



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -1054,7 +1066,7 @@ def _create_body(self) -> dict:
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
                 GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},

Review comment:
       Also here, if we're going to modify this operator to accept a destination path, should we add a `source_path` argument as well?







[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-930156422


   > I think that the docstring on CloudDataTransferServiceS3ToGCSOperator will have to be updated to reflect these changes:
   > 
   > https://github.com/eskarimov/airflow/blob/0a3ae26c231f4b19fccb2133946fffa34baff975/airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py#L741-L744
   
   Done, added the newly created input params `s3_path` and `gcs_path`; kindly requesting a review





[GitHub] [airflow] eskarimov commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r718489377



##########
File path: airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py
##########
@@ -1054,7 +1066,7 @@ def _create_body(self) -> dict:
             STATUS: GcpTransferJobsStatus.ENABLED,
             TRANSFER_SPEC: {
                 GCS_DATA_SOURCE: {BUCKET_NAME: self.source_bucket},

Review comment:
       Done, similarly as for the above S3 -> GCS







[GitHub] [airflow] SamWheating commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
SamWheating commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-927399976


   I think that the docstring on CloudDataTransferServiceS3ToGCSOperator will have to be updated to reflect these changes:
   
   https://github.com/eskarimov/airflow/blob/0a3ae26c231f4b19fccb2133946fffa34baff975/airflow/providers/google/cloud/operators/cloud_storage_transfer_service.py#L741-L744
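A sketch of what the updated docstring could look like. The wording below is illustrative, not the actual merged text; the param names `s3_path`/`gcs_path` are the ones the author mentions adding later in this thread, and the format constraints come from the TransferSpec docs:

```python
class CloudDataTransferServiceS3ToGCSOperator:
    """
    Synchronizes an S3 bucket with a Google Cloud Storage bucket using the
    Google Cloud Storage Transfer Service.

    :param s3_bucket: The S3 bucket where to find the objects.
    :param s3_path: Optional root path inside the S3 bucket, treated as an
        object prefix; must be "" or end with '/' and should not begin
        with '/'.
    :param gcs_bucket: The destination Google Cloud Storage bucket.
    :param gcs_path: Optional root path inside the GCS bucket, with the
        same format constraints as ``s3_path``.
    """
```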
   
   





[GitHub] [airflow] uranusjr commented on a change in pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#discussion_r686506159



##########
File path: airflow/providers/google/cloud/hooks/gcs.py
##########
@@ -1207,3 +1204,7 @@ def _parse_gcs_url(gsurl: str) -> Tuple[str, str]:
     # Remove leading '/' but NOT trailing one
     blob = parsed_url.path.lstrip('/')
     return bucket, blob
+
+
+def _normalize_directory_path(source_object: Optional[str]) -> Optional[str]:
+    return source_object + "/" if source_object and not source_object.endswith("/") else source_object

Review comment:
       This function should be moved somewhere in the `utils` directory (and renamed without the leading underscore)







[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-945894875


   I had to make small changes in the most recent commit: as described in issue #14682, validation/normalisation of input parameters that could be templated should happen after the task preparation stage, as otherwise the normalisation function might receive an input like `{{ var.value.destination_path }}` or similar.
   
   Please let me know your thoughts, or anything I could improve in order to get the PR merged.
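The timing point can be shown with a minimal, purely illustrative skeleton (not the real Airflow operator): Airflow renders `template_fields` after `__init__` and before `execute()`, so a templated path is only safe to normalise inside `execute()`:

```python
from typing import Optional


class FakeTransferOperator:
    """Illustrative stand-in for an operator with a templated path field."""

    template_fields = ("destination_path",)

    def __init__(self, destination_path: Optional[str] = None):
        # At this point the value may still be an unrendered Jinja
        # template such as "{{ var.value.destination_path }}".
        self.destination_path = destination_path

    def execute(self, context: dict) -> Optional[str]:
        # Templates have been rendered by now, so normalising the
        # trailing slash here operates on the real value.
        if self.destination_path and not self.destination_path.endswith("/"):
            self.destination_path += "/"
        return self.destination_path


rendered = FakeTransferOperator(destination_path="landing/zone").execute({})
print(rendered)  # landing/zone/
```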
   
   





[GitHub] [airflow] eskarimov commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
eskarimov commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-951157425


   Sorry for the ping - just wanted to check whether there's anything else I could do on the PR in order to get it merged





[GitHub] [airflow] mik-laj commented on pull request #17446: Add support of `path` parameter for GCloud Storage Transfer Service operators

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #17446:
URL: https://github.com/apache/airflow/pull/17446#issuecomment-951647873


   @eskarimov You cannot create new methods in the Airflow core as we are trying to maintain backward compatibility with Airflow 2.1. 

