Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/20 12:22:46 UTC

[GitHub] [airflow] EmadMokhtar opened a new issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

EmadMokhtar opened a new issue #10426:
URL: https://github.com/apache/airflow/issues/10426


   **Description**
   
   Support passing multiple prefixes to `GoogleCloudStorageListOperator` and `GoogleCloudStorageDeleteOperator` operators.
   
   **Use case / motivation**
   
   I have this folder structure in a GCS bucket.
   
   ```
   +-- year={year}
   |   +-- month={month}
   |       +--day={day}
   |           +-- topic={topic1}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic2}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic3}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic4}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic5}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic6}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic7}
   |       +--day={day}
   |           +-- topic={topic1}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic2}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic3}
   |                 +--file 1
   |                 +--file 2
   |                 +--file 3
   |           +-- topic={topic4}
   |           +-- topic={topic5}
   |           +-- topic={topic6}
   |           +-- topic={topic7}
   |           ....
   ```
   
   What I need to do is delete one day of objects. For example, I need to delete the objects in `year=2020/month=08/day=19`. This is easy with `gsutil`, which can delete them with a wildcard such as `year=2020/month=08/day=19/*`, but with the REST API you can't, even if you use a prefix, because no single prefix returns all the objects inside a folder. I worked around this by using multiple prefixes and listing the objects for each prefix. Unfortunately, I can't pass more than one prefix to the operators.
   
   **Prefixes used**
   - `year=2020/month=08/day=19`
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-677630865


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] mik-laj commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-677971503


   @EmadMokhtar  I assigned you to this ticket. 





[GitHub] [airflow] EmadMokhtar commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
EmadMokhtar commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-677933790


   @mik-laj I would like to make a PR for this. Please assign the ticket to me and I will create a PR.





[GitHub] [airflow] eladkal commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-922951625


   @EmadMokhtar are you still working on this issue?





[GitHub] [airflow] eladkal commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-830793326


   > What I'm planning to do is modify the `GCSHook.list()` method to accept `prefixes` instead of `prefix`. How can we do that while keeping backward compatibility? Some old code will assume this hook accepts one prefix, and we need to raise a deprecation warning. Or maybe it is only used internally and I need to refactor the operators that use it?
   
   `prefix` is a parameter of `list_blobs`: https://googleapis.dev/python/storage/latest/client.html
   Even if you modify the parameter on the hook, in the end you will still be able to use only a single prefix per call.
   You can modify `prefix` to accept `Optional[Union[str, List[str]]]`; that way the modification is also backward compatible. This is similar to the approach suggested in https://github.com/apache/airflow/issues/15001
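
The idea of accepting either a single string or a list of strings can be sketched roughly as follows. This is only an illustration under stated assumptions: `normalize_prefixes`, `list_objects`, `fake_list_single`, and `FAKE_BUCKET` are hypothetical names, not part of Airflow, and `list_single` stands in for whatever single-prefix listing call the hook already makes.

```python
from typing import Callable, Dict, List, Optional, Union

# Hypothetical type alias: None, one prefix, or several prefixes.
PrefixArg = Optional[Union[str, List[str]]]


def normalize_prefixes(prefix: PrefixArg) -> List[Optional[str]]:
    """Turn None, a single string, or a list of strings into a list."""
    if prefix is None:
        return [None]  # one listing call with no prefix filter
    if isinstance(prefix, str):
        return [prefix]  # existing single-prefix callers keep working
    return list(prefix)


def list_objects(
    list_single: Callable[[str, Optional[str]], List[str]],
    bucket_name: str,
    prefix: PrefixArg = None,
) -> List[str]:
    """Call the single-prefix listing once per prefix and merge the results."""
    seen = set()
    result: List[str] = []
    for one_prefix in normalize_prefixes(prefix):
        for name in list_single(bucket_name, one_prefix):
            if name not in seen:  # overlapping prefixes must not repeat names
                seen.add(name)
                result.append(name)
    return result


# In-memory stand-in for a bucket, used only to exercise the sketch.
FAKE_BUCKET: Dict[str, List[str]] = {
    "my-bucket": [
        "year=2020/month=08/day=19/topic=a/file1",
        "year=2020/month=08/day=19/topic=b/file1",
        "year=2020/month=08/day=20/topic=a/file1",
    ]
}


def fake_list_single(bucket_name: str, prefix: Optional[str]) -> List[str]:
    """Assumed behaviour: keep object names that start with the prefix."""
    names = FAKE_BUCKET[bucket_name]
    if prefix is None:
        return list(names)
    return [name for name in names if name.startswith(prefix)]
```

Because a plain string is wrapped into a one-element list, callers that already pass a single prefix would not need to change, while new callers could pass a list.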





[GitHub] [airflow] potiuk commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-830846445


   @eladkal's proposal sounds good.





[GitHub] [airflow] mik-laj commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-677874120


   @EmadMokhtar Do you need help creating a PR? Your use case looks interesting and it looks like it's worth thinking about.





[GitHub] [airflow] EmadMokhtar commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
EmadMokhtar commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-727603579


   # Decision
   
   What I'm planning to do is modify the `GCSHook.list()` method to accept `prefixes` instead of `prefix`. How can we do that while keeping backward compatibility? Some old code will assume this hook accepts one prefix, and we need to raise a deprecation warning. Or maybe it is only used internally and I need to refactor the operators that use it?
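
One possible shape for the deprecation path mentioned above, as a rough sketch only: the helper name `resolve_prefixes` is hypothetical and not part of Airflow, and the sketch assumes the old `prefix` keyword stays accepted for a while alongside the new `prefixes` keyword.

```python
import warnings
from typing import List, Optional


def resolve_prefixes(
    prefix: Optional[str] = None,
    prefixes: Optional[List[str]] = None,
) -> List[str]:
    """Merge the old `prefix` argument and the new `prefixes` argument."""
    if prefix is not None and prefixes is not None:
        raise ValueError("Pass either 'prefix' or 'prefixes', not both.")
    if prefix is not None:
        # Old-style call: still works, but tells the caller to migrate.
        warnings.warn(
            "'prefix' is deprecated; pass 'prefixes' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return [prefix]
    # New-style call; an omitted value is treated as "no prefixes".
    return list(prefixes or [])
```

A method that used such a helper could keep its old signature working while steering callers toward the new keyword, at the cost of carrying two parameters until the old one is removed.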





[GitHub] [airflow] EmadMokhtar commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
EmadMokhtar commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-952712589


   > @EmadMokhtar are you still working on this issue?
   
   I want to, but I'm having trouble setting up the dev environment for Airflow. I will give it another try in the coming week.





[GitHub] [airflow] EmadMokhtar commented on issue #10426: Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Posted by GitBox <gi...@apache.org>.
EmadMokhtar commented on issue #10426:
URL: https://github.com/apache/airflow/issues/10426#issuecomment-677633259


   # Initial implementation
   
   I implemented it like this in my project:
   
   
   ```python
       def execute(self, context):
           hook = GoogleCloudStorageHook(
               google_cloud_storage_conn_id=self.google_cloud_storage_conn_id,
               delegate_to=self.delegate_to
           )
   
           if self.objects:
               # An explicit list of object names takes precedence over prefixes.
               objects = self.objects
           else:
               # Otherwise list the bucket once per prefix and merge the results.
               objects = []
               for prefix in self.prefixes:
                   prefix_objects = hook.list(bucket=self.bucket_name,
                                              prefix=prefix)
                   objects.extend(prefix_objects)
                   self.log.info("Objects under %s: %s", prefix, prefix_objects)
   
           self.log.info("Deleting %s objects from %s",
                         len(objects), self.bucket_name)
           # Delete the collected objects one at a time.
           for object_name in objects:
               hook.delete(bucket=self.bucket_name,
                           object=object_name)
   ```

