You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/07/06 22:25:04 UTC

[GitHub] [airflow] dstandish commented on a change in pull request #16821: Add more filter options to list_keys of S3Hook

dstandish commented on a change in pull request #16821:
URL: https://github.com/apache/airflow/pull/16821#discussion_r664912477



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(

Review comment:
       why are you filtering even when start after is `None`?

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -283,6 +287,12 @@ def list_keys(
         :type page_size: int
         :param max_items: maximum items to return
         :type max_items: int
+        :param start_after_key: returns keys after this specified key in the bucket.
+        :type start_after_key: str
+        :param start_after_datetime: returns keys with last modified datetime greater than the specified datetime.
+        :type start_after_datetime: datetime , ISO8601: '%Y-%m-%dT%H:%M:%S%z', e.g. 2021-02-20T05:20:20+0000

Review comment:
       if the type is `datetime` then why do you need to specify string formatting here?  it makes it look like the type is actually string

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -269,6 +270,9 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = '',

Review comment:
       I think `None` is the standard default for optional string.  I think this is the most intuitive approach, compared with empty string.
   
   Then you can you can add the parameter only conditionally to `paginate`.  To me it makes sense to only specify parameters when you actually want to use them.  No me `None` is a good default to mean, do not use this param (unless supplying `StartAfter=None` in paginate has the same effect as omitting it, in which case leaving it in is fine).

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -292,18 +302,24 @@ def list_keys(
             'PageSize': page_size,
             'MaxItems': max_items,
         }
-
         paginator = self.get_conn().get_paginator('list_objects_v2')
         response = paginator.paginate(
-            Bucket=bucket_name, Prefix=prefix, Delimiter=delimiter, PaginationConfig=config
+            Bucket=bucket_name,
+            Prefix=prefix,
+            Delimiter=delimiter,
+            PaginationConfig=config,
+            StartAfter=start_after_key,
+        )
+        # JMESPath to query directly on paginated results
+        filtered_response = response.search(
+            "Contents[?to_string("
+            "LastModified)<='\"{}\"' && "
+            "to_string(LastModified)>='\"{"

Review comment:
       if it's going to be called `start_after_datetime` then this should be strictly greater than, i think




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org