Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/03/14 15:55:08 UTC

[GitHub] [airflow] sunank200 opened a new pull request #22231: Add more filter to s3 hook list_key

sunank200 opened a new pull request #22231:
URL: https://github.com/apache/airflow/pull/22231


   Implemented as discussed in [closed PR](https://github.com/apache/airflow/pull/19018).
   
   Add more filter options to `list_keys` of `S3Hook`:
   - `start_after_key`: return only keys greater than this key.
   - `from_datetime`: return only keys whose `LastModified` attribute is greater than or equal to `from_datetime`.
   - `to_datetime`: return only keys whose `LastModified` attribute is less than `to_datetime`.
   - `object_filter`: a callable that receives the list of S3 objects, `from_datetime`, and `to_datetime`, and returns the list of matched keys.
   
   Add tests for the new arguments to `list_keys`.
   
   closes: #16627
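
A minimal sketch of what a custom `object_filter` callable might look like, based only on the description above. The function name `only_csv_in_period`, the `.csv` rule, and the sample data are hypothetical; the object shape (dicts with `Key` and `LastModified`) is taken from the diff in this pull request. The sketch runs on plain dictionaries and does not call S3.

```python
from datetime import datetime, timezone
from typing import Optional


def only_csv_in_period(
    objects: list,
    from_datetime: Optional[datetime] = None,
    to_datetime: Optional[datetime] = None,
) -> list:
    """Hypothetical object_filter: keep .csv keys modified in [from, to)."""
    matched = []
    for obj in objects:
        modified = obj["LastModified"]
        # Lower bound is inclusive, as described for from_datetime.
        if from_datetime is not None and modified < from_datetime:
            continue
        # Upper bound is exclusive, as described for to_datetime.
        if to_datetime is not None and modified >= to_datetime:
            continue
        if obj["Key"].endswith(".csv"):
            matched.append(obj["Key"])
    return matched


# Stand-in for the object list that S3 would return (assumed shape).
sample_objects = [
    {"Key": "data/a.csv", "LastModified": datetime(2022, 3, 1, tzinfo=timezone.utc)},
    {"Key": "data/b.txt", "LastModified": datetime(2022, 3, 5, tzinfo=timezone.utc)},
    {"Key": "data/c.csv", "LastModified": datetime(2022, 3, 10, tzinfo=timezone.utc)},
]

selected = only_csv_in_period(
    sample_objects,
    from_datetime=datetime(2022, 3, 2, tzinfo=timezone.utc),
    to_datetime=datetime(2022, 3, 14, tzinfo=timezone.utc),
)
print(selected)
```

Per the description, such a callable would presumably be passed to `list_keys` through the `object_filter` argument, alongside `from_datetime` and `to_datetime`.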
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #22231: S3 list key filter

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #22231:
URL: https://github.com/apache/airflow/pull/22231#discussion_r825527820



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -255,6 +256,23 @@ def list_prefixes(
 
         return prefixes
 
+    def _list_key_object_filter(
+        self, keys: list, from_datetime: Optional[DateTime] = None, to_datetime: Optional[DateTime] = None
+    ) -> list:
+        if from_datetime is None and to_datetime is None:
+            return [k['Key'] for k in keys]
+        elif to_datetime is None:
+            return [k['Key'] for k in keys if k['LastModified'] >= from_datetime]
+        elif from_datetime is None:
+            return [k['Key'] for k in keys if k['LastModified'] < to_datetime]
+        else:
+            return [
+                k['Key']
+                for k in keys
+                if k['LastModified'] >= from_datetime and k['LastModified'] < to_datetime
+            ]
+        return [k['Key'] for k in keys]

Review comment:
       How about
   
   ```suggestion
       def _list_key_object_filter(
           self, keys: list, from_datetime: Optional[DateTime] = None, to_datetime: Optional[DateTime] = None
       ) -> list:
           def _is_in_period(dt: datetime) -> bool:
               if from_datetime is not None and dt < from_datetime:
                   return False
               if to_datetime is not None and dt > to_datetime:
                   return False
               return True
   
           return [k['Key'] for k in keys if _is_in_period(k['LastModified'])]
   ```
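
For readers who want to try the suggestion outside the hook, here is a standalone sketch of the same logic as a module-level function (the name `list_key_object_filter` without `self` and the sample data are assumptions for illustration). One thing it makes visible: the suggestion tests `dt > to_datetime`, so an object whose `LastModified` equals `to_datetime` is kept, whereas the diff under review used `< to_datetime` and would drop it.

```python
from datetime import datetime, timezone
from typing import Optional


def list_key_object_filter(
    keys: list,
    from_datetime: Optional[datetime] = None,
    to_datetime: Optional[datetime] = None,
) -> list:
    """Standalone copy of the suggested filter, for experimentation only."""

    def _is_in_period(dt: datetime) -> bool:
        if from_datetime is not None and dt < from_datetime:
            return False
        # Note: '>' makes the upper bound inclusive in this variant.
        if to_datetime is not None and dt > to_datetime:
            return False
        return True

    return [k["Key"] for k in keys if _is_in_period(k["LastModified"])]


upper = datetime(2022, 3, 14, tzinfo=timezone.utc)
objects = [
    {"Key": "early", "LastModified": datetime(2022, 3, 1, tzinfo=timezone.utc)},
    {"Key": "on-boundary", "LastModified": upper},
    {"Key": "late", "LastModified": datetime(2022, 3, 20, tzinfo=timezone.utc)},
]

kept = list_key_object_filter(objects, to_datetime=upper)
print(kept)
```

Whether the upper bound should be inclusive or exclusive is a design choice that the thread does not settle; the PR description states "less than".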

##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -263,6 +281,10 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = None,
+        from_datetime: Optional[DateTime] = None,
+        to_datetime: Optional[DateTime] = None,

Review comment:
       I don’t think this needs to take `pendulum.DateTime`. Normal `datetime.datetime` works equally well (and is compatible with `pendulum.DateTime`).







[GitHub] [airflow] kaxil closed pull request #22231: Add more filter to s3 hook list_key

Posted by GitBox <gi...@apache.org>.
kaxil closed pull request #22231:
URL: https://github.com/apache/airflow/pull/22231


   





[GitHub] [airflow] uranusjr commented on a change in pull request #22231: S3 list key filter

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #22231:
URL: https://github.com/apache/airflow/pull/22231#discussion_r825716919



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -263,6 +281,10 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = None,
+        from_datetime: Optional[DateTime] = None,
+        to_datetime: Optional[DateTime] = None,

Review comment:
       Native datetime can be timezone-aware as well; on the other hand, using `pendulum.DateTime` still _does not_ guarantee the instance is timezone-aware.
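
The point about native datetimes can be checked with the standard library alone. The sketch below uses the awareness rule from the Python `datetime` documentation (a value is aware when `tzinfo` is set and `utcoffset()` returns a value); the helper name `is_aware` and the sample values are illustrative. It does not import `pendulum`, so the claim about `pendulum.DateTime` is not exercised here.

```python
from datetime import datetime, timedelta, timezone


def is_aware(dt: datetime) -> bool:
    """Return True when dt carries usable timezone information."""
    return dt.tzinfo is not None and dt.utcoffset() is not None


naive = datetime(2022, 3, 14, 15, 55)
aware_utc = datetime(2022, 3, 14, 15, 55, tzinfo=timezone.utc)
# The same instant expressed with a +05:30 offset.
aware_offset = datetime(
    2022, 3, 14, 21, 25, tzinfo=timezone(timedelta(hours=5, minutes=30))
)

print(is_aware(naive), is_aware(aware_utc), is_aware(aware_offset))
# Aware values compare by instant, regardless of their offsets.
print(aware_utc == aware_offset)
```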







[GitHub] [airflow] potiuk commented on pull request #22231: Add more filter to s3 hook list_key

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22231:
URL: https://github.com/apache/airflow/pull/22231#issuecomment-1067019182


   Seems like this is a wider outage of some inventories :(





[GitHub] [airflow] sunank200 commented on a change in pull request #22231: Add more filter to s3 hook list_key

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #22231:
URL: https://github.com/apache/airflow/pull/22231#discussion_r825950166



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -263,6 +281,10 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = None,
+        from_datetime: Optional[DateTime] = None,
+        to_datetime: Optional[DateTime] = None,

Review comment:
       Sure. Changed the type to `datetime.datetime`.







[GitHub] [airflow] sunank200 commented on a change in pull request #22231: S3 list key filter

Posted by GitBox <gi...@apache.org>.
sunank200 commented on a change in pull request #22231:
URL: https://github.com/apache/airflow/pull/22231#discussion_r825679975



##########
File path: airflow/providers/amazon/aws/hooks/s3.py
##########
@@ -263,6 +281,10 @@ def list_keys(
         delimiter: Optional[str] = None,
         page_size: Optional[int] = None,
         max_items: Optional[int] = None,
+        start_after_key: Optional[str] = None,
+        from_datetime: Optional[DateTime] = None,
+        to_datetime: Optional[DateTime] = None,

Review comment:
       `LastModified` in the key returned by boto3 is timezone-aware, so comparing it with a naive DateTime specified by the user would raise `TypeError: can't compare offset-naive and offset-aware datetimes`.
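
A small sketch reproducing that failure with stand-in values, plus one possible way a caller could avoid it. The helper `assume_utc` is hypothetical and encodes an assumption (naive inputs are meant as UTC) that the thread does not make; it is not part of the hook.

```python
from datetime import datetime, timezone

# Stand-in for the timezone-aware LastModified value described above.
last_modified = datetime(2022, 3, 14, 15, 55, tzinfo=timezone.utc)
# A naive bound, as a user might pass it.
naive_from = datetime(2022, 3, 1)

try:
    last_modified >= naive_from
    raised = False
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")


def assume_utc(dt: datetime) -> datetime:
    """Hypothetical helper: treat a naive datetime as UTC, leave aware ones alone."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt


# After normalization both operands are aware and the comparison works.
print(last_modified >= assume_utc(naive_from))
```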







[GitHub] [airflow] kaxil merged pull request #22231: Add more filter to s3 hook list_key

Posted by GitBox <gi...@apache.org>.
kaxil merged pull request #22231:
URL: https://github.com/apache/airflow/pull/22231


   





[GitHub] [airflow] potiuk commented on pull request #22231: Add more filter to s3 hook list_key

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #22231:
URL: https://github.com/apache/airflow/pull/22231#issuecomment-1068021828


   I think you will need to rebase that one @sunank200 





[GitHub] [airflow] potiuk closed pull request #22231: Add more filter to s3 hook list_key

Posted by GitBox <gi...@apache.org>.
potiuk closed pull request #22231:
URL: https://github.com/apache/airflow/pull/22231


   





[GitHub] [airflow] kaxil commented on pull request #22231: Add more filter to s3 hook list_key

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #22231:
URL: https://github.com/apache/airflow/pull/22231#issuecomment-1066985748


   Re-triggering the build to rerun doc build

