You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/04/16 18:56:39 UTC

[GitHub] [airflow] blcksrx opened a new pull request #15409: Support regex pattern in SFTPHOOK

blcksrx opened a new pull request #15409:
URL: https://github.com/apache/airflow/pull/15409


   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   In many cases, users are interested in some cases of files or types, not the whole thing. for example, users want to retrieve only CSV files in the `/tmp` directory and etc. Unfortunately, `Paramiko` does not support wi
   
   
   related: #15397
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] saveriogzz commented on a change in pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
saveriogzz commented on a change in pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#discussion_r619141253



##########
File path: airflow/providers/sftp/hooks/sftp.py
##########
@@ -166,16 +168,21 @@ def describe_directory(self, path: str) -> Dict[str, Dict[str, str]]:
             }
         return files
 
-    def list_directory(self, path: str) -> List[str]:
+    def list_directory(self, path: str, regex_pattern: Optional[str] = None) -> List[str]:
         """
         Returns a list of files on the remote system.
 
         :param path: full path to the remote directory to list
         :type path: str
+        :param regex_pattern: optional pattern to filter the remote_full_path files
+        :type: regex_pattern: Optional[str]
         """
         conn = self.get_conn()
-        files = conn.listdir(path)
-        return files
+        if regex_pattern:
+            pattern = re.compile(regex_pattern)
+            return [file for file in conn.listdir(path) if pattern.match(file)]

Review comment:
       would the usage of a [generator](https://docs.python.org/3/whatsnew/3.3.html#pep-380) be a more efficient solution when encountering directories with many files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] blcksrx edited a comment on pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
blcksrx edited a comment on pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#issuecomment-822290251


   @mik-laj  The author of this issue prefers to support `fnmacth`, I mean wildcards.  whats you opinion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] saveriogzz commented on a change in pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
saveriogzz commented on a change in pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#discussion_r619141253



##########
File path: airflow/providers/sftp/hooks/sftp.py
##########
@@ -166,16 +168,21 @@ def describe_directory(self, path: str) -> Dict[str, Dict[str, str]]:
             }
         return files
 
-    def list_directory(self, path: str) -> List[str]:
+    def list_directory(self, path: str, regex_pattern: Optional[str] = None) -> List[str]:
         """
         Returns a list of files on the remote system.
 
         :param path: full path to the remote directory to list
         :type path: str
+        :param regex_pattern: optional pattern to filter the remote_full_path files
+        :type: regex_pattern: Optional[str]
         """
         conn = self.get_conn()
-        files = conn.listdir(path)
-        return files
+        if regex_pattern:
+            pattern = re.compile(regex_pattern)
+            return [file for file in conn.listdir(path) if pattern.match(file)]

Review comment:
       would the usage of a [generator](https://docs.python.org/3/whatsnew/3.3.html#pep-380) a more efficient solution when encountering directories with many files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] saveriogzz commented on a change in pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
saveriogzz commented on a change in pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#discussion_r619141253



##########
File path: airflow/providers/sftp/hooks/sftp.py
##########
@@ -166,16 +168,21 @@ def describe_directory(self, path: str) -> Dict[str, Dict[str, str]]:
             }
         return files
 
-    def list_directory(self, path: str) -> List[str]:
+    def list_directory(self, path: str, regex_pattern: Optional[str] = None) -> List[str]:
         """
         Returns a list of files on the remote system.
 
         :param path: full path to the remote directory to list
         :type path: str
+        :param regex_pattern: optional pattern to filter the remote_full_path files
+        :type: regex_pattern: Optional[str]
         """
         conn = self.get_conn()
-        files = conn.listdir(path)
-        return files
+        if regex_pattern:
+            pattern = re.compile(regex_pattern)
+            return [file for file in conn.listdir(path) if pattern.match(file)]

Review comment:
       would the usage of a [generator](https://docs.python.org/3/whatsnew/3.3.html#pep-380) be a more efficient solution when encountering directories with many files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] saveriogzz commented on pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
saveriogzz commented on pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#issuecomment-823898217


   > @saveriogzz Do you want to implement it?
   
   yes, I will work on it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#issuecomment-885968331


   This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] saveriogzz commented on a change in pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
saveriogzz commented on a change in pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#discussion_r619141253



##########
File path: airflow/providers/sftp/hooks/sftp.py
##########
@@ -166,16 +168,21 @@ def describe_directory(self, path: str) -> Dict[str, Dict[str, str]]:
             }
         return files
 
-    def list_directory(self, path: str) -> List[str]:
+    def list_directory(self, path: str, regex_pattern: Optional[str] = None) -> List[str]:
         """
         Returns a list of files on the remote system.
 
         :param path: full path to the remote directory to list
         :type path: str
+        :param regex_pattern: optional pattern to filter the remote_full_path files
+        :type: regex_pattern: Optional[str]
         """
         conn = self.get_conn()
-        files = conn.listdir(path)
-        return files
+        if regex_pattern:
+            pattern = re.compile(regex_pattern)
+            return [file for file in conn.listdir(path) if pattern.match(file)]

Review comment:
       would the usage of a [generator](https://docs.python.org/3/whatsnew/3.3.html#pep-380) be a more efficient solution when encountering directories with many files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] blcksrx commented on pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
blcksrx commented on pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#issuecomment-822290251


   @mik-laj  The author of this issue prefers to support `fnmacth`, I mean wildcards. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#issuecomment-822532828


   It would be great if your methods were similar to the SFTP for GCS operator.
   https://github.com/apache/airflow/blob/master/airflow/providers/google/cloud/transfers/sftp_to_gcs.py#L138


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
mik-laj commented on pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#issuecomment-822474064


   I think, we should use fnmatchr because other operators also uses this function. This allows you to standardize the experience of using different operators so that you don't have to think about the parameter format.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on a change in pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
uranusjr commented on a change in pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#discussion_r616309811



##########
File path: airflow/providers/sftp/hooks/sftp.py
##########
@@ -198,7 +205,9 @@ def delete_directory(self, path: str) -> None:
         conn = self.get_conn()
         conn.rmdir(path)
 
-    def retrieve_file(self, remote_full_path: str, local_full_path: str) -> None:
+    def retrieve_file(
+        self, remote_full_path: str, local_full_path: str, regex_pattern: Optional[str] = None
+    ) -> None:

Review comment:
       One thing I really dislike about this implementation is it subtly changes semantics depending on the arguments. In this function, if `regex_pattern` is None, `remote_full_path` should point to a file, but when pattern matching is performed, it should point to a directory instead (and the file name matched with the pattern). This will be very unintuitive in practice and likely cause headaches to both users and maintainers.
   
   Pattern matching is definitely a useful feature here, but I really don’t think this is how it should be done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] blcksrx commented on pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
blcksrx commented on pull request #15409:
URL: https://github.com/apache/airflow/pull/15409#issuecomment-823889387


   @saveriogzz Do you want to implement it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] closed pull request #15409: Support regex pattern in SFTPHOOK

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #15409:
URL: https://github.com/apache/airflow/pull/15409


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org