You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/07/07 04:18:28 UTC

[GitHub] [airflow] dstandish edited a comment on pull request #16821: Add more filter options to list_keys of S3Hook

dstandish edited a comment on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-875262841


   > Let me offer a recommendation.
   > 
   > Rather than adding `start_after_datetime` and `to_datetime`, just add a single param called `object_filter` that takes a function `(obj: dict) -> bool`
   > 
   > then this function can be applied like so:
   > 
   > ```python
   >         for page in response:
   >             if 'Contents' in page:
   >                 for k in page['Contents']:
   >                     <evaluate object filter here and only append contditionally>
   >                     keys.append(k['Key'])
   > ```
   > 
   > with obj filter people can apply whatevery logic they want and they don't have to understand what start_from_datetime / to_datetime mean and how to use (since they aren't part of the s3 api)
   > 
   > what do yall think
   
   Maybe this isn't the greatest idea...  I thought there were more attributes available in the response.
   
   Here's what you get in the response:
   
   ```
   {'Key': 'tmp/hello/1.txt',
    'LastModified': datetime.datetime(2021, 7, 6, 3, 13, 41, tzinfo=tzutc()),
    'ETag': '"49f68a5c8493ec2c0bf489821c21fc3b"',
    'Size': 2,
    'StorageClass': 'STANDARD'}
   ```
   
   So I guess there isn't much else in that response that you'd want to filter on.
   
   Still, putting the filtering in the user's hands (through the use of an `object_filter` callable would remove the ambiguity of strictly greater than vs greater than (and same with less than)?
   
   Also there's the problem of naming.
   
   If we include both parameters `from` and `to`, we should probably call them `from_datetime` and `to_datetime`, where `from` is `>=` and `to` is `<`.  
   
   One other thing.... I am not sure why we use the jmespath when we already have the datetime object available in the for loop in the existing code.  I think it is best to compare `LastModified` directly with the user supplied datetime rather than converting everything to string and doing a json search.  For one this is additional transformation that inherently could introduce an error through formatting difference.  I also suspect the json search is significantly less efficient but not sure.  Bottom line, if you already have datetime on both sides of the comparison, why convert them both to string to compare them?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org