Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/07/04 04:49:38 UTC

[GitHub] [airflow] dstandish commented on issue #16627: add more filter options to list_keys of S3Hook

dstandish commented on issue #16627:
URL: https://github.com/apache/airflow/issues/16627#issuecomment-873521269


   I did this for our internal repo. What I did **was refactor list_keys to call a list_objects method**, so you can get the full objects and filter afterwards:
   
   ```python
       @provide_bucket_name
       def list_objects(
           self,
           bucket_name: Optional[str] = None,
           prefix: Optional[str] = None,
           delimiter: Optional[str] = None,
           page_size: Optional[int] = None,
           max_items: Optional[int] = None,
           start_after_key: Optional[str] = None,
           start_after_time: Optional['DateTime'] = None,
       ) -> List[S3Object]:
           """
           Lists objects in a bucket under prefix and not containing delimiter

           Args:
               bucket_name: the name of the bucket
               prefix: a key prefix
               delimiter: the delimiter that marks key hierarchy
               page_size: pagination size
               max_items: maximum items to return
               start_after_key: return only keys greater than this key
               start_after_time: return only keys with LastModified attr greater than this time
           """
           ...
   ```
   
   This lets you use either start-after-key (which `list_objects_v2` supports natively) or start-after-time (which is what you're after, and which requires listing every object under the prefix).
   
   And if people want to filter on other object info, that would be easy to add.
   
   I think that might not be a bad way to go here.  
   
   Then `list_keys` becomes something like this:
   
   ```python
       @provide_bucket_name
       def list_keys(
           self,
           bucket_name: Optional[str] = None,
           prefix: Optional[str] = None,
           delimiter: Optional[str] = None,
           page_size: Optional[int] = None,
           max_items: Optional[int] = None,
           start_after_key: Optional[str] = None,
           start_after_time: Optional['DateTime'] = None,
       ) -> List[str]:
           objects = self.list_objects(
               bucket_name=bucket_name,
               prefix=prefix,
               delimiter=delimiter,
               page_size=page_size,
               max_items=max_items,
               start_after_key=start_after_key,
               start_after_time=start_after_time,
           )
           return [o.Key for o in objects]
   ```
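
For illustration, here is a minimal sketch of how the elided `list_objects` body might look outside of the hook. It is not the code from the internal repo mentioned above. It assumes a boto3-style S3 client is passed in explicitly (the `client` argument is hypothetical, standing in for what the hook would obtain itself), and it returns the plain dicts that `list_objects_v2` yields rather than an `S3Object` type. `start_after_key` is forwarded as the `StartAfter` request parameter, while `start_after_time` is applied client-side against each object's `LastModified`.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def list_objects(
    client: Any,  # assumption: a boto3-style S3 client, passed in explicitly
    bucket_name: str,
    prefix: str = "",
    delimiter: str = "",
    page_size: Optional[int] = None,
    max_items: Optional[int] = None,
    start_after_key: Optional[str] = None,
    start_after_time: Optional[datetime] = None,
) -> List[Dict[str, Any]]:
    """Return object dicts from list_objects_v2, optionally filtered."""
    kwargs: Dict[str, Any] = {
        "Bucket": bucket_name,
        "Prefix": prefix,
        "Delimiter": delimiter,
        "PaginationConfig": {"PageSize": page_size, "MaxItems": max_items},
    }
    if start_after_key is not None:
        # Server-side filter: S3 only returns keys that sort after this one.
        kwargs["StartAfter"] = start_after_key

    paginator = client.get_paginator("list_objects_v2")
    objects: List[Dict[str, Any]] = []
    for page in paginator.paginate(**kwargs):
        # Pages with no matching objects carry no "Contents" entry.
        for obj in page.get("Contents", []):
            # Client-side filter: every object under the prefix is still listed.
            if start_after_time is None or obj["LastModified"] > start_after_time:
                objects.append(obj)
    return objects


def list_keys(client: Any, bucket_name: str, **filters: Any) -> List[str]:
    """Thin wrapper that keeps only the key of each listed object."""
    return [obj["Key"] for obj in list_objects(client, bucket_name, **filters)]
```

One consequence of this layout: `max_items` is enforced by the paginator before the time filter runs, so a call that combines `max_items` with `start_after_time` can return fewer than `max_items` objects even when more would match.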

