Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/07/17 15:34:37 UTC

[GitHub] [airflow] sunank200 opened a new pull request #16821: Add more filter options to list_keys of S3Hook

sunank200 opened a new pull request #16821:
URL: https://github.com/apache/airflow/pull/16821


   Add more filter options to list_keys of S3Hook
       
   This commit adds the following filters for listing keys via `list_keys` of S3Hook (see the usage sketch below):
       - `start_after_key`: lists only keys that come after the specified key; `start_after_key` can be any key in the bucket.
       - `start_after_datetime`: lists only keys whose last-modified datetime is greater than or equal to `start_after_datetime`.
       - `to_datetime`: lists only keys whose last-modified datetime is less than or equal to `to_datetime`.
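   
   For illustration, a hedged usage sketch of the proposed parameters (the bucket, prefix, and key names below are made up, and the parameter names are the ones proposed in this PR, not necessarily the final API):
   
   ```python
   from datetime import datetime, timezone
   
   from airflow.providers.amazon.aws.hooks.s3 import S3Hook
   
   hook = S3Hook(aws_conn_id='aws_default')
   keys = hook.list_keys(
       bucket_name='my-bucket',
       prefix='data/',
       start_after_key='data/2021-07-01.csv',                           # only keys after this key
       start_after_datetime=datetime(2021, 7, 1, tzinfo=timezone.utc),  # LastModified >= this
       to_datetime=datetime(2021, 8, 1, tzinfo=timezone.utc),           # LastModified <= this
   )
   ```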
      
   closes: #16627  
   
   This change doesn't affect other consumers of `list_keys`, such as S3DeleteObjectsOperator, S3ListOperator, S3KeysUnchangedSensor, or the S3Hook methods `get_wildcard_key` and `delete_bucket`.
   
   Corresponding unit tests have been added to `test_s3.py`.
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #16627
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/main/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ferruzzi commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-874409253


   Is this functionality we want in the hooks?  Per the discussion [here](https://github.com/apache/airflow/pull/16571#discussion_r660433624) it sounds like that should be in the operator, maybe?





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r664912477



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(

Review comment:
       why are you filtering even when `StartAfter` is `None`?







[GitHub] [airflow] dstandish commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-875262841


   > Let me offer a recommendation.
   > 
   > Rather than adding `start_after_datetime` and `to_datetime`, just add a single param called `object_filter` that takes a function `(obj: dict) -> bool`
   > 
   > then this function can be applied like so:
   > 
   > ```python
   >         for page in response:
   >             if 'Contents' in page:
   >                 for k in page['Contents']:
    >                     if object_filter is None or object_filter(k):
    >                         keys.append(k['Key'])
   > ```
   > 
    > with obj filter people can apply whatever logic they want and they don't have to understand what start_from_datetime / to_datetime mean and how to use them (since they aren't part of the s3 api)
   > 
   > what do yall think
   
   Maybe this isn't the greatest idea...  I thought there were more attributes available in the response.
   
   Here's what you get in the response:
   
   ```
   {'Key': 'tmp/hello/1.txt',
    'LastModified': datetime.datetime(2021, 7, 6, 3, 13, 41, tzinfo=tzutc()),
    'ETag': '"49f68a5c8493ec2c0bf489821c21fc3b"',
    'Size': 2,
    'StorageClass': 'STANDARD'}
   ```
   
   So at least for now there isn't much else in that response that you'd want to filter on.
   
   Still, putting the filtering in the user's hands would remove the ambiguity of strictly greater than vs. greater than or equal (and likewise for less than).
   
   Also there's the problem of naming.
   
   If we include both parameters `from` and `to`, we should probably call them `from_datetime` and `to_datetime`, where `from` is `>=` and `to` is `<`.  
   
   One other thing.... I am not sure why we use the JMESPath search when we already have the datetime object available in the for loop in the existing code.  I think it is best to compare `LastModified` directly with the user-supplied datetime rather than converting everything to strings and doing a JSON search.  For one, that is an additional transformation that could introduce errors through formatting differences.  I also suspect the JSON search is significantly less efficient, though I'm not sure.  Bottom line: if you already have a datetime on both sides of the comparison, why convert them both to strings to compare them?
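   
   To make the callable-plus-direct-comparison idea concrete, here is a minimal self-contained sketch (the function name and sample page are illustrative, not code from this PR):
   
   ```python
   from datetime import datetime, timezone
   from typing import Callable, List, Optional
   
   def collect_keys(pages: List[dict],
                    object_filter: Optional[Callable[[dict], bool]] = None) -> List[str]:
       """Collect object keys from list_objects_v2 pages, applying an optional predicate."""
       keys = []
       for page in pages:
           for obj in page.get('Contents', []):
               if object_filter is None or object_filter(obj):
                   keys.append(obj['Key'])
       return keys
   
   # A page shaped like the response shown above; LastModified is compared directly,
   # with no string conversion involved.
   pages = [{'Contents': [{'Key': 'tmp/hello/1.txt',
                           'LastModified': datetime(2021, 7, 6, 3, 13, 41, tzinfo=timezone.utc)}]}]
   print(collect_keys(pages, lambda o: o['LastModified'] < datetime.now(timezone.utc)))
   # ['tmp/hello/1.txt']
   ```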
   
   





[GitHub] [airflow] dstandish edited a comment on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish edited a comment on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-875262841


   > Let me offer a recommendation.
   > 
   > Rather than adding `start_after_datetime` and `to_datetime`, just add a single param called `object_filter` that takes a function `(obj: dict) -> bool`
   > 
   > then this function can be applied like so:
   > 
   > ```python
   >         for page in response:
   >             if 'Contents' in page:
   >                 for k in page['Contents']:
    >                     if object_filter is None or object_filter(k):
    >                         keys.append(k['Key'])
   > ```
   > 
    > with obj filter people can apply whatever logic they want and they don't have to understand what start_from_datetime / to_datetime mean and how to use them (since they aren't part of the s3 api)
   > 
   > what do yall think
   
   Maybe this isn't the greatest idea...  I thought there were more attributes available in the response.
   
   Here's what you get in the response:
   
   ```
   {'Key': 'tmp/hello/1.txt',
    'LastModified': datetime.datetime(2021, 7, 6, 3, 13, 41, tzinfo=tzutc()),
    'ETag': '"49f68a5c8493ec2c0bf489821c21fc3b"',
    'Size': 2,
    'StorageClass': 'STANDARD'}
   ```
   
   So I guess there isn't much else in that response that you'd want to filter on.
   
   Still, putting the filtering in the user's hands (through the use of an `object_filter` callable) would remove the ambiguity of strictly greater than vs. greater than or equal (and likewise for less than).
   
   Also there's the problem of naming.
   
   If we include both parameters `from` and `to`, we should probably call them `from_datetime` and `to_datetime`, where `from` is `>=` and `to` is `<`.  
   
   One other thing.... I am not sure why we use the JMESPath search when we already have the datetime object available in the for loop in the existing code.  I think it is best to compare `LastModified` directly with the user-supplied datetime rather than converting everything to strings and doing a JSON search.  For one, that is an additional transformation that could introduce errors through formatting differences.  I also suspect the JSON search is significantly less efficient, though I'm not sure.  Bottom line: if you already have a datetime on both sides of the comparison, why convert them both to strings to compare them?
   
   





[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r667653648



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
       This makes sense. I will add the condition to use `StartAfter` only when it's not `None`.  Also, I will make the default `None` instead of an empty string, in line with the Airflow codebase.







[GitHub] [airflow] sunank200 commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-877741460


   @potiuk sure. I was planning to comment on the review and make the changes today. There is some situation at home, hence the delay. I will resolve the issues today.





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668288116



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       i'm not sure adding this ResponseFilter is the right approach.  i am not sure we need to add this abstraction layer.
   
   Why not just let the user put an arbitrary callable that does the filtering that the user needs?  
   
   For example, your docstring shows this example:
   
   ```python
   object_filter={"LastModified__lt": datetime.now()},
   ```
   
   But with arbitrary callable, this can be implemented about as simply this way:
   
   ```python
   object_filter=lambda x: x['LastModified'] < datetime.now(),
   ```
   
   And importantly with the latter approach there's no need to understand the options available within the class and how it works.
   
   ---
   
   That said, I still think it's probably best to go with your original thought, i.e. `from_datetime` and `to_datetime` (though with the updated naming / logic).  I know I initially suggested adding an object filter callable, but on further thought, since there aren't other useful filter attributes in the object metadata, it's not that helpful.
   
   And your approach of adding only a certain number of allowable operations doesn't really provide the flexibility of the callable.
   
   Simply adding the two optional datetime params, from and to, is simple, easy to understand, useful.  It's true that with simple from / to params, we can't provide flexibility with interval edge inclusivity, but to me that seems like an acceptable tradeoff for the simplicity and usability.  If we go with this approach I would probably vote for doing `>=` for `from` and `<` for `to`.
   
   But if _not_ going with simple from / to, I'm in support of simple arbitrary callable instead of the ResponseFilter apparatus.
   
   What do yall think?
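   
   For concreteness, the from / to semantics described above (inclusive `>=` lower bound, exclusive `<` upper bound) would presumably look something like this sketch (illustrative names, not the PR's code):
   
   ```python
   from datetime import datetime
   from typing import Optional
   
   def in_period(last_modified: datetime,
                 from_datetime: Optional[datetime] = None,
                 to_datetime: Optional[datetime] = None) -> bool:
       """Half-open interval: keep objects with from_datetime <= LastModified < to_datetime."""
       if from_datetime is not None and last_modified < from_datetime:
           return False
       if to_datetime is not None and last_modified >= to_datetime:
           return False
       return True
   ```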







[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668288116



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       i'm not sure adding this ResponseFilter is the right approach.  i am not sure we need to add this abstraction layer.
   
   Why not just let the user put an arbitrary callable that does the filtering that the user needs?  For example, then the user could pass `lambda x: x['LastModified'] > some_val` for greater than, or `lambda x: other_val > x['LastModified'] > some_val` for between.
   
   This is of course simpler to implement, but also it seems straightforward from user perspective.  This way we wouldn't be adding a new class that the user needs to learn how to use.
   
   Alternatively, we can go with your original thought, i.e. `from_datetime` and `to_datetime` (though with the updated naming / logic).  To me, simply adding the two optional datetime params, from and to, seems like the best approach because it's simple, easy to understand, useful, and there aren't a lot of other attributes that we would want to filter on (which lessens the utility of an arbitrary callable).  It's true that we can't provide flexibility with interval edge inclusivity, but to me that seems like an acceptable tradeoff for the simplicity and usability.  I would probably vote for doing `>=` for `from` and `<` for `to`.  
   
   What do yall think?







[GitHub] [airflow] kaxil commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-874673106


   Re-triggered CI





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668288116



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       i'm not sure adding this ResponseFilter is the right approach.  i am not sure we need to add this abstraction layer.
   
   Why not just let the user put an arbitrary callable that does the filtering that the user needs?  
   
   For example, your docstring shows this example:
   
   ```python
   object_filter={"LastModified__lt": datetime.now()},
   ```
   
   But with arbitrary callable, this can be implemented about as simply this way:
   
   ```python
   object_filter=lambda x: x['LastModified'] < datetime.now(),
   ```
   
   This is roughly as compact and roughly as simple, but the benefit is there's no need to understand the options available within the class and how it works, and if user wants to do arbitrary filtering they can.
   
   ---
   
   That said, I still think it's probably best to go with your original thought, i.e. `from_datetime` and `to_datetime` (though with the updated naming / logic).  
   
   _(I know I initially suggested adding object filter callable, but subsequently I walked back that suggestion; since there aren't other useful filter attributes in the object metadata, it's not as useful)_.
   
   And your approach of adding only a certain number of allowable operations doesn't really provide the flexibility of the callable.
   
   Simply adding the two optional datetime params, from and to, is simple, easy to understand, useful.  It's true that with simple from / to params, we can't provide flexibility with interval edge inclusivity, but to me that seems like an acceptable tradeoff for the simplicity and usability.  If we go with this approach I would probably vote for doing `>=` for `from` and `<` for `to`.
   
   ---
   
   In conclusion i think going with `from` and `to` seems best, but if not that then simple arbitrary callable instead of the ResponseFilter apparatus seems best.  Interested in what others think.
   







[GitHub] [airflow] sunank200 commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-878604778


   @potiuk @dstandish @pateash I have made the required changes. I would be more than happy to accommodate any further feedback.





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r664912477



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(

Review comment:
       why are you filtering even when start after is `None`?

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -283,6 +287,12 @@ def list_keys(
         :type page_size: int
         :param max_items: maximum items to return
         :type max_items: int
+        :param start_after_key: returns keys after this specified key in the bucket.
+        :type start_after_key: str
+        :param start_after_datetime: returns keys with last modified datetime greater than the specified datetime.
+        :type start_after_datetime: datetime , ISO8601: '%Y-%m-%dT%H:%M:%S%z', e.g. 2021-02-20T05:20:20+0000

Review comment:
       if the type is `datetime` then why do you need to specify string formatting here?  it makes it look like the type is actually string

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
       I think `None` is the standard default for optional string.  I think this is the most intuitive approach, compared with empty string.
   
   Then you can add the parameter only conditionally to `paginate`.  To me it makes sense to only specify parameters when you actually want to use them.  To me, `None` is a good default to mean "do not use this param" (unless supplying `StartAfter=None` in paginate has the same effect as omitting it, in which case leaving it in is fine).
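   
   For illustration, a sketch of the conditional approach, reusing the variable names from the diff above:
   
   ```python
   # Build the paginate() kwargs, including StartAfter only when the caller supplied a key.
   paginate_kwargs = {
       'Bucket': bucket_name,
       'Prefix': prefix,
       'Delimiter': delimiter,
       'PaginationConfig': config,
   }
   if start_after_key is not None:
       paginate_kwargs['StartAfter'] = start_after_key
   response = paginator.paginate(**paginate_kwargs)
   ```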

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       if it's going to be called `start_after_datetime` then this should be strictly greater than, i think







[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668268942



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
       Done with the changes







[GitHub] [airflow] potiuk commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-877626711


   Hey @sunank200 - how about applying the comments from the reviews? I am preparing to release the next wave of providers and it would be great to merge this one in.





[GitHub] [airflow] sunank200 edited a comment on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 edited a comment on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-878604778


   @potiuk @dstandish @pateash I have made the required changes. I would be more than happy to accommodate any further feedback.
   
   closes: #16821 





[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r667641239



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(

Review comment:
       When `StartAfter` is `None`, `paginator.paginate()` doesn't use it as a filter; it filters on `StartAfter` only when it's not `None`. So filtering works whether it's `None` or not.  But as you suggested above, I will add a condition so the parameters are only used when the user passes them as arguments.







[GitHub] [airflow] sunank200 commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-881834423


   @potiuk I have rebased to the latest main.





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r665031254



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       this does not handle the case where you have exactly _one of_ `to_datetime` and `start_after_datetime` (i.e. not both)
   
   also pendulum and datetime are formatted differently. observe:
   ```python
   >>> import pendulum
   >>> '{}'.format(pendulum.now())
   '2021-07-06T21:22:11.514220-07:00'
   >>> from datetime import datetime
   >>> '{}'.format(datetime.now())
   '2021-07-06 21:22:35.418199'
   ```
   
   if you compare the objects directly, your feature could support both pendulum _(which is commonly used with airflow)_ and datetime; but with string formatting you really cannot support both. 
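   
   To illustrate why direct comparison is preferable (a quick interactive check, not from the PR):
   ```python
   >>> import pendulum
   >>> from datetime import datetime, timezone
   >>> p = pendulum.datetime(2021, 7, 6, 22, tz='UTC')    # 22:00 UTC
   >>> d = datetime(2021, 7, 6, 23, tzinfo=timezone.utc)  # 23:00 UTC, one hour later
   >>> d > p  # comparing the objects directly works (pendulum's DateTime subclasses datetime)
   True
   >>> '{}'.format(d) > '{}'.format(p)  # string comparison sorts ' ' before 'T': wrong answer
   False
   ```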
   







[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668288116



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       i'm not sure adding this ResponseFilter is the right approach.  i am not sure we need to add this abstraction layer.
   
   Why not just let the user put an arbitrary callable that does the filtering that the user needs?  
   
   For example, your docstring shows this example:
   
   ```python
   object_filter={"LastModified__lt": datetime.now()},
   ```
   
   But with arbitrary callable, this can be implemented about as simply this way:
   
   ```python
   object_filter=lambda x: x['LastModified'] < datetime.now(),
   ```
   
   This is roughly as compact and roughly as simple, but the benefit is there's no need to understand the options available within the class and how it works, and if user wants to do arbitrary filtering they can.
   
   ---
   
   That said, I still think it's probably best to go with your original thought, i.e. `from_datetime` and `to_datetime` (though with the updated naming / logic).  
   
   _(I know I initially suggested adding object filter callable, but subsequently I walked back that suggestion; since there aren't other useful filter attributes in the object metadata, it's not as useful)_.
   
   And your approach of adding only a certain number of allowable operations doesn't really provide the flexibility of the callable.
   
   Simply adding the two optional datetime params, from and to, is simple, easy to understand, useful.  It's true that with simple from / to params, we can't provide flexibility with interval edge inclusivity, but to me that seems like an acceptable tradeoff for the simplicity and usability.  If we go with this approach I would probably vote for doing `>=` for `from` and `<` for `to`.
   
   But if _not_ going with simple from / to, I'm in support of simple arbitrary callable instead of the ResponseFilter apparatus.
   
   What do yall think?







[GitHub] [airflow] potiuk commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-877858089


   > @potiuk sure. I was planning to comment on the review and make the changes today. There is some situation at home, hence the delay. I will resolve the issues today.
   
   No worries. Hope all is good!





[GitHub] [airflow] github-actions[bot] commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-910933889


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r664911695



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
       I think `None` is the standard default for optional string.  I think this is the most intuitive approach, compared with empty string.
   
   Then you can add the parameter only conditionally to `paginate`.  To me it makes sense to only specify parameters when you actually want to use them.  `None` is a good default to mean "do not use this param" (unless supplying `StartAfter=None` in paginate has the same effect as omitting it, in which case leaving it in is fine).







[GitHub] [airflow] potiuk commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-882058541


   There are some static checks and a docs build that need fixes (I recommend installing pre-commit to fix the static checks)





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668288116



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       i'm not sure i like the ResponseFilter approach.
   
   if you want to use the `object_filter` approach, why not just let the user put an arbitrary callable that does the filtering that the user needs.  e.g. `lambda x: x['LastModified'] > some_val`. or `lambda x: other_val > x['LastModified'] > some_val`
   
   to me this seems much simpler to implement (since it removes the entire `ResponseFilter` class) and quite simple and transparent to use.
   
   i am not sure we need to add this abstraction layer -- that's just another thing to have to learn.
   
   alternatively, i would be fine doing `from_datetime` and `to_datetime` where from is evaluated `>=` and to is evaluated `<`
   
   what do yall think?







[GitHub] [airflow] sunank200 commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-944804074


   I have implemented the pre-commit hook changes.





[GitHub] [airflow] potiuk commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-879652607


   Can you please rebase to the latest main @sunank200? We had some failures in main recently and they are all fixed now.





[GitHub] [airflow] potiuk commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-881915952


   Reopened it to re-trigger the build





[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668288116



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       i'm not sure adding this ResponseFilter is the right approach.  i am not sure we need to add this abstraction layer.
   
   Why not just let the user put an arbitrary callable that does the filtering that the user needs?  For example, then the user could pass `lambda x: x['LastModified'] > some_val` for greater than, or `lambda x: other_val > x['LastModified'] > some_val` for between.
   
   This is of course simpler to implement, but also it seems straightforward from user perspective.  This way we wouldn't be adding a new class that the user needs to learn how to use.
   
   Alternatively, we can go with your original thought, i.e. `from_datetime` and `to_datetime` (though with the updated naming / logic).  I think this is probably the best approach.  Simply adding the two optional datetime params, from and to, is simple, easy to understand, useful, and importantly there aren't other attributes in the response that we would want to filter on (which lessens the utility of an arbitrary callable).  It's true that with simple from / to params, we can't provide flexibility with interval edge inclusivity, but to me that seems like an acceptable tradeoff for the simplicity and usability.  I would probably vote for doing `>=` for `from` and `<` for `to`.  
   
   But if not adding simple from / to, I'm in support of simple arbitrary callable instead of the ResponseFilter apparatus.
   
   What do yall think?







[GitHub] [airflow] sunank200 edited a comment on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 edited a comment on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-878604778


   @potiuk @dstandish @pateash I have made the required changes. I would be more than happy to accommodate any further feedback.
   
   closes: #16627 





[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r667641239



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(

Review comment:
       When `StartAfter` is `None`, `paginator.paginate()` doesn't use it as a filter; it filters on `StartAfter` only when it's not `None`. So filtering works whether it's `None` or not.  







[GitHub] [airflow] pateash commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
pateash commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r666683264



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
       Completely agree with @dstandish,
   I have rarely seen an empty string literal `''` used as a default in the Airflow codebase.







[GitHub] [airflow] github-actions[bot] closed pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #16821:
URL: https://github.com/apache/airflow/pull/16821


   





[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r667650552



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -283,6 +287,12 @@ def list_keys(
         :type page_size: int
         :param max_items: maximum items to return
         :type max_items: int
+        :param start_after_key: returns keys after this specified key in the bucket.
+        :type start_after_key: str
+        :param start_after_datetime: returns keys with last modified datetime greater than the specified datetime.
+        :type start_after_datetime: datetime , ISO8601: '%Y-%m-%dT%H:%M:%S%z', e.g. 2021-02-20T05:20:20+0000

Review comment:
       @dstandish I don't think we need to specify the string formatting here. A `datetime` object should be passed as the argument; it will work fine. A string passed as an argument won't work. I was just trying to give an example. I will update the example as:
    ```
    e.g.: to_datetime = datetime.strptime('2024-08-19T09:55:48+0000', '%Y-%m-%dT%H:%M:%S%z')
    ```







[GitHub] [airflow] boring-cyborg[bot] commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-874189163


   Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, mypy and type annotations). Our [pre-commits]( https://github.com/apache/airflow/blob/main/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks) will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in `docs/` directory). Adding a new operator? Check this short [guide](https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/custom-operator.rst). Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze environment](https://github.com/apache/airflow/blob/main/BREEZE.rst) for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
   - Please follow [ASF Code of Conduct](https://www.apache.org/foundation/policies/conduct) for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
   - Be sure to read the [Airflow Coding style]( https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it better 🚀.
   In case of doubts contact the developers at:
   Mailing List: dev@airflow.apache.org
   Slack: https://s.apache.org/airflow-slack
   





[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668264380



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       @potiuk  @dstandish @pateash I have pushed the changes just now.
   
       Created a `ResponseFilter` class to parse and filter the S3 boto3 response based on various operations defined by the user.  This class can be extended as required for parsing and filtering the response with different object filter operations. For now, I have added the following allowed operations.
    ```python
    allowed_operations = {
        "lte": operator.le,
        "gte": operator.ge,
        "gt": operator.gt,
        "lt": operator.lt,
        "eq": operator.eq,
    }
    ```
   
    `ResponseFilter.filter(object_filter=object_filter)` puts the filtering in the user's hands (through the use of a filter callable) and removes the ambiguity of strictly greater than vs. greater than or equal (and likewise for less than). Please refer to the commits in this [thread](https://github.com/apache/airflow/pull/16821/commits/4ef750ea486f76c98f7e5927b12fc62b4767913c) and the [test changes](https://github.com/apache/airflow/pull/16821/commits/a7468c0d1591c83fd1a1c7f3ca8e32a2750e4f5b).
   
    I have added comments about the implementation details in the code.
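
    For readers following the thread, here is a minimal sketch of what such a class could look like, assuming the field-and-operator-suffix convention implied by `allowed_operations`; the actual implementation lives in the linked commits and may differ:
    ```python
    import operator
    from typing import Any, Callable, Dict, List


    class ResponseFilter:
        """Sketch only: filter boto3 list_objects_v2 'Contents' entries.

        Filters are given as {"<Field>__<op>": value}, e.g.
        {"LastModified__gte": some_datetime}; a bare field name means equality.
        """

        allowed_operations: Dict[str, Callable[[Any, Any], bool]] = {
            "lte": operator.le,
            "gte": operator.ge,
            "gt": operator.gt,
            "lt": operator.lt,
            "eq": operator.eq,
        }

        def __init__(self, contents: List[Dict[str, Any]]) -> None:
            self.contents = contents

        def filter(self, object_filter: Dict[str, Any]) -> List[Dict[str, Any]]:
            results = self.contents
            for key, expected in object_filter.items():
                field, _, op_name = key.partition("__")
                op = self.allowed_operations[op_name or "eq"]
                results = [obj for obj in results if op(obj[field], expected)]
            return results
    ```
    With that sketch, `ResponseFilter(contents).filter({'LastModified__gte': from_dt, 'LastModified__lt': to_dt})` would keep only the objects modified inside the half-open window.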







[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r669332272



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
    Sure @dstandish, I think the `to` and `from` approach keeps the code simple and doesn't add extra complexity. A `ResponseFilter` with an `object_filter` callable, in my opinion, makes things more generic for any future filter operations, but it does add complexity. I would be happy to accommodate the change once others comment and we conclude on an approach.
   
   @potiuk what do you think?
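
    To make the trade-off concrete, the two candidate call styles would look roughly as follows; both keyword signatures are hypothetical at this point in the discussion, and the bucket name and dates are placeholders:
    ```python
    from datetime import datetime, timezone

    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    hook = S3Hook(aws_conn_id="aws_default")
    from_dt = datetime(2021, 7, 1, tzinfo=timezone.utc)
    to_dt = datetime(2021, 8, 1, tzinfo=timezone.utc)

    # Option A (proposed): dedicated range parameters, simple and self-documenting
    keys = hook.list_keys("my-bucket", from_datetime=from_dt, to_datetime=to_dt)

    # Option B (proposed): a generic user-supplied callable, open-ended but more to learn
    keys = hook.list_keys(
        "my-bucket",
        object_filter=lambda obj: from_dt <= obj["LastModified"] < to_dt,
    )
    ```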







[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r668270293



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
    I have removed `from_datetime` and `to_datetime` and am using the `ResponseFilter` class with a filter method that supports all the operations defined by the user: [commit](https://github.com/apache/airflow/pull/16821/commits/4ef750ea486f76c98f7e5927b12fc62b4767913c)







[GitHub] [airflow] potiuk closed pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
potiuk closed pull request #16821:
URL: https://github.com/apache/airflow/pull/16821


   








[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r665027960



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
    But I recommend just `from_datetime` and `to_datetime`.







[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r667641589



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
    Sure, `from_datetime` and `to_datetime` make more sense. I will make the change.
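
    For context, the truncated JMESPath expression in the hunk above follows a common boto3 pattern for filtering paginated results client-side with `PageIterator.search`. A rough reconstruction with `from_datetime`/`to_datetime` (bucket, prefix and dates are placeholders, and the exact expression in the PR may differ):
    ```python
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket="my-bucket", Prefix="data/")

    from_datetime = datetime(2021, 7, 1, tzinfo=timezone.utc)
    to_datetime = datetime(2021, 8, 1, tzinfo=timezone.utc)

    # to_string(LastModified) yields a JSON-encoded string (quotes included),
    # so the literals we compare against are wrapped in escaped quotes too.
    expression = (
        f"Contents[?to_string(LastModified)>='\"{from_datetime}\"' "
        f"&& to_string(LastModified)<='\"{to_datetime}\"'].Key"
    )
    keys = list(pages.search(expression))
    ```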







[GitHub] [airflow] ferruzzi commented on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
ferruzzi commented on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-874409253


   Is this functionality we want in the hooks? Per the discussion [here](https://github.com/apache/airflow/pull/16571#discussion_r660433624), it sounds like it should be in the operator, maybe?








[GitHub] [airflow] kaxil closed pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
kaxil closed pull request #16821:
URL: https://github.com/apache/airflow/pull/16821


   





[GitHub] [airflow] sunank200 commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r671414903



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
    @dstandish I thought about it; if we take the `ResponseFilter` route, it will add complexity to the code that people would need to understand. Hence, for simplicity of the codebase, I am going forward with the `from`/`to` approach, renaming the parameters to `lastmodified_datetime_gte` (>=) and `lastmodified_datetime_lt` (<).
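
    A minimal sketch of those half-open-interval semantics applied to a page of results (the parameter names come from the comment above; everything else is illustrative):
    ```python
    from datetime import datetime
    from typing import Any, Dict, List, Optional


    def filter_by_last_modified(
        objects: List[Dict[str, Any]],
        lastmodified_datetime_gte: Optional[datetime] = None,
        lastmodified_datetime_lt: Optional[datetime] = None,
    ) -> List[str]:
        """Keep keys whose LastModified falls in the half-open window [gte, lt)."""
        keys = []
        for obj in objects:
            last_modified = obj["LastModified"]
            if lastmodified_datetime_gte is not None and last_modified < lastmodified_datetime_gte:
                continue
            if lastmodified_datetime_lt is not None and last_modified >= lastmodified_datetime_lt:
                continue
            keys.append(obj["Key"])
        return keys
    ```
    One nice property of the >=/< pairing is that consecutive time windows partition the keys without overlap or gaps.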







[GitHub] [airflow] pateash commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
pateash commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r666683466



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
       ```suggestion
           start_after_key: Optional[str] = None,
   ```

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
    Completely agree with @dstandish. I have never seen a literal '' used as a default string anywhere in the Airflow codebase.
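
    The default matters in practice because boto3 validates parameter types, so passing `StartAfter=None` to `paginate` would fail validation; with a `None` default, the argument has to be added conditionally. A hedged sketch (the helper name is made up for illustration):
    ```python
    from typing import Any, Dict, Optional


    def build_paginate_kwargs(
        bucket_name: str,
        prefix: str = "",
        delimiter: str = "",
        start_after_key: Optional[str] = None,
    ) -> Dict[str, Any]:
        """Build kwargs for list_objects_v2, including StartAfter only when set."""
        kwargs: Dict[str, Any] = {
            "Bucket": bucket_name,
            "Prefix": prefix,
            "Delimiter": delimiter,
        }
        if start_after_key is not None:
            kwargs["StartAfter"] = start_after_key
        return kwargs
    ```
    The hook would then call `paginator.paginate(**build_paginate_kwargs(...), PaginationConfig=config)`.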







[GitHub] [airflow] sunank200 removed a comment on pull request #16821: Add more filter options to list_keys of S3Hook

Posted by GitBox <gi...@apache.org>.
sunank200 removed a comment on pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#issuecomment-944804074


   I have implemented the pre-commit hook changes.

