You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "aru-trackunit (via GitHub)" <gi...@apache.org> on 2023/02/22 13:03:37 UTC

[GitHub] [airflow] aru-trackunit opened a new pull request, #29694: Fixing log message + adding more verbose documetation

aru-trackunit opened a new pull request, #29694:
URL: https://github.com/apache/airflow/pull/29694

   related: [#ISSUE](https://github.com/apache/airflow/issues/29552)
   
   Previously the log message didn't exactly tell the user where the has been uploaded to so I decided to fix that behavior. 
   
   If reviewers decide that solution is too complex to keep it then I am happy to remove the log as well or set a message to "File has been uploaded."
   
   Unfortunately google api doesn't return full path of the file and there is a need go up the tree. In the documentation they also say that drive can have up to 20 nested folders which I included in the solution to break a loop if more than 20 iterations have happened.
   https://support.google.com/a/users/answer/7338880?hl=en


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1121542791


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +161,59 @@ def exists(
             )
         )
 
+    def _get_file_info(self, file_id: str):
+        """
+        Returns Google API file_info object containing id, name, parents in the response
+        https://developers.google.com/drive/api/v3/reference/files/get
+
+        :param file_id: id as string representation of interested file
+        :return: file
+        """
+        file_info = (
+            self.get_conn()
+            .files()
+            .get(
+                fileId=file_id,
+                fields="id,name,parents",
+                supportsAllDrives=True,
+            )
+            .execute(num_retries=2)

Review Comment:
   @eladkal here is the number of retries



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #29694: Fixing log message + adding more verbose documetation

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1116764865


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,46 @@ def exists(
             )
         )
 
+    def _resolve_file_path(self, file_id: str) -> str:
+        """
+        Returns the full Google Drive path for given file_id
+
+        :param file_id: The id of a file in Google Drive
+        :return: Google Drive full path for a file
+        """
+        MAX_NESTED_FOLDERS_LEVEL = 20

Review Comment:
   Wouod be great to link to the docs as comment here so that it's not seen to be as magical as it looks.



##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,46 @@ def exists(
             )
         )
 
+    def _resolve_file_path(self, file_id: str) -> str:
+        """
+        Returns the full Google Drive path for given file_id
+
+        :param file_id: The id of a file in Google Drive
+        :return: Google Drive full path for a file
+        """
+        MAX_NESTED_FOLDERS_LEVEL = 20

Review Comment:
   Would be great to link to the docs as comment here so that it's not seen to be as magical as it looks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1119002091


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,57 @@ def exists(
             )
         )
 
+    def _get_file_info(self, file_id: str):
+        """
+        Returns Google API file_info object containing id, name, parents in the response
+        https://developers.google.com/drive/api/v3/reference/files/get
+
+        :param file_id: id as string representation of interested file
+        :return: file
+        """
+        file_info = (
+            self.get_conn().files()
+                .get(
+                fileId=file_id,
+                fields="id,name,parents",
+                supportsAllDrives=True,
+            ).execute()
+        )
+        return file_info

Review Comment:
   I think we need to handle case of exceptions from the API?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1118430932


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +284,13 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = (

Review Comment:
   @dimberman I have tested your first scenario. Google API doesn't return more `parents` than user can see. So having path: `/Drive/FolderA/FolderB/file.csv` and user is granted access to `FolderB` then what he sees is `/FolderB/file.csv` which makes sense because this is what he sees on the UI as well.
   
   I do agree that for performance reasons it is nice to switch it off when needed in case someone has a large tree.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1125572952


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +292,21 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = remote_location
+
+        if folder_id != "root":
+            try:
+                upload_location = self._resolve_file_path(folder_id)
+            except GoogleApiClientError as e:
+                self.log.warning("A problem has been encountered when trying to resolve file path: ", e)
+
+        if show_full_target_path:
+            self.log.info("File %s uploaded to gdrive://%s.", local_location, upload_location)
+        else:
+            self.log.info("File %s has been uploaded successfully to gdrive", local_location)

Review Comment:
    👌



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on PR #29694:
URL: https://github.com/apache/airflow/pull/29694#issuecomment-1453181311

   @josh-fell I decided to remove this nested limit, since we can rely on Google API instead, also I made a wrong assumption. 
   If google doesn't let user upload a file nested deeper that 20 levels then it will never generate longer path, but it was also protecting that in worst case scenario it makes 20 iterations instead of infinite loop if google introduces a bug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1124163133


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +161,59 @@ def exists(
             )
         )
 
+    def _get_file_info(self, file_id: str):
+        """
+        Returns Google API file_info object containing id, name, parents in the response
+        https://developers.google.com/drive/api/v3/reference/files/get
+
+        :param file_id: id as string representation of interested file
+        :return: file
+        """
+        file_info = (
+            self.get_conn()
+            .files()
+            .get(
+                fileId=file_id,
+                fields="id,name,parents",
+                supportsAllDrives=True,
+            )
+            .execute(num_retries=2)
+        )
+        return file_info
+
+    def _resolve_file_path(self, file_id: str) -> str:
+        """
+        Returns the full Google Drive path for given file_id
+
+        :param file_id: The id of a file in Google Drive
+        :return: Google Drive full path for a file
+        """
+        MAX_NESTED_FOLDERS_LEVEL = 20  # Link to docs https://support.google.com/a/users/answer/7338880?hl=en

Review Comment:
   I have tested it and it means that users can't create a folder:
   That's the error body - status code `403`
   ```
   {
    "error": {
     "errors": [
      {
       "domain": "global",
       "reason": "teamDriveHierarchyTooDeep",
       "message": "The shared drive hierarchy depth will exceed the limit."
      }
     ],
     "code": 403,
     "message": "The shared drive hierarchy depth will exceed the limit."
    }
   }



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk merged pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk merged PR #29694:
URL: https://github.com/apache/airflow/pull/29694


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on PR #29694:
URL: https://github.com/apache/airflow/pull/29694#issuecomment-1455785178

   Wohooo!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dimberman commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "dimberman (via GitHub)" <gi...@apache.org>.
dimberman commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1118985060


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,46 @@ def exists(
             )
         )
 
+    def _resolve_file_path(self, file_id: str) -> str:
+        """
+        Returns the full Google Drive path for given file_id
+
+        :param file_id: The id of a file in Google Drive
+        :return: Google Drive full path for a file
+        """
+        MAX_NESTED_FOLDERS_LEVEL = 20  # Link to docs https://support.google.com/a/users/answer/7338880?hl=en
+        iteration = 1
+
+        service = self.get_conn()
+        has_reached_root = False
+        _file_id = file_id
+        path: str = ""
+        while not has_reached_root:
+            file_info = (
+                service.files()
+                .get(
+                    fileId=_file_id,
+                    fields="id,name,parents",
+                    supportsAllDrives=True,
+                )
+                .execute()
+            )
+            if "parents" in file_info:
+                parent_directories = file_info["parents"]
+                path = f'{file_info["name"]}' if path == "" else f'{file_info["name"]}/{path}'
+
+                if len(parent_directories) > 1:
+                    self.log.warning("Google returned multiple parents, picking first")
+                _file_id = parent_directories[0]
+            else:
+                has_reached_root = True
+                path = f'{file_info["name"]}/{path}'

Review Comment:
   Looks great!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dimberman commented on pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "dimberman (via GitHub)" <gi...@apache.org>.
dimberman commented on PR #29694:
URL: https://github.com/apache/airflow/pull/29694#issuecomment-1446642090

   LGTM! Thanks for the explanations


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on pull request #29694: Fixing log message + adding more verbose documetation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on PR #29694:
URL: https://github.com/apache/airflow/pull/29694#issuecomment-1442981185

   I am not sure who should review it and no one has been attached, @eladkal -> If I made a mistake tagging you please assign to a review someone that could do that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1119926887


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,57 @@ def exists(
             )
         )
 
+    def _get_file_info(self, file_id: str):
+        """
+        Returns Google API file_info object containing id, name, parents in the response
+        https://developers.google.com/drive/api/v3/reference/files/get
+
+        :param file_id: id as string representation of interested file
+        :return: file
+        """
+        file_info = (
+            self.get_conn().files()
+                .get(
+                fileId=file_id,
+                fields="id,name,parents",
+                supportsAllDrives=True,
+            ).execute()
+        )
+        return file_info

Review Comment:
   You're right. Any exception that is thrown from google we want to handle within this feature, as it doesn't make any sense to fail the job due to unresolving file path. I updated the code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] josh-fell commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "josh-fell (via GitHub)" <gi...@apache.org>.
josh-fell commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1123963291


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +300,21 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = remote_location
+
+        if folder_id != "root":
+            try:
+                upload_location = self._resolve_file_path(folder_id)
+            except (GoogleApiClientError, AirflowException) as e:

Review Comment:
   I agree with @aru-trackunit. I don't think the task should fail simply because the file path can't be resolved for logging purposes. It's unfortunate the Google API doesn't surface this information easily. As long as the task does what it's supposed to do, Airflow shouldn't care what is in its user-facing task logs. Having this logic configurable also helps too.
   
   I do agree with @eladkal though that `AirflowException` is irrelevant. The operator is not failing for some orchestration or execution-related issue (IMO I think we use `AirflowException` with too broad of a brush sometimes). Maybe a custom exception is raised (e.g. `NestedFolderLimitReachedException` or something similar) just in case some real AirflowException is swallowed unknowingly?



##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +161,59 @@ def exists(
             )
         )
 
+    def _get_file_info(self, file_id: str):
+        """
+        Returns Google API file_info object containing id, name, parents in the response
+        https://developers.google.com/drive/api/v3/reference/files/get
+
+        :param file_id: id as string representation of interested file
+        :return: file
+        """
+        file_info = (
+            self.get_conn()
+            .files()
+            .get(
+                fileId=file_id,
+                fields="id,name,parents",
+                supportsAllDrives=True,
+            )
+            .execute(num_retries=2)
+        )
+        return file_info
+
+    def _resolve_file_path(self, file_id: str) -> str:
+        """
+        Returns the full Google Drive path for given file_id
+
+        :param file_id: The id of a file in Google Drive
+        :return: Google Drive full path for a file
+        """
+        MAX_NESTED_FOLDERS_LEVEL = 20  # Link to docs https://support.google.com/a/users/answer/7338880?hl=en

Review Comment:
   This still feels magical. But, out of curiosity, if the documentation states "A folder in a shared drive can have up to 20 levels of nested folders", does that mean users _can't_ physically create more than that number of nested folders or is it really a _shouldn't_ create situation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #29694: Fixing log message + adding more verbose documetation

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1116774627


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +284,13 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = (

Review Comment:
   I think having to do the traverse up the directories hierarchy is a convenience and I think maybe it should be guarded by a parameter ("show_full_target_path") in the `upload` method which might be set to `true` by default?
   
   If not for the performance, it will also give the escape hatch in case for some reason the traversal will be problematic. I can imagine a number of cases - for example it could be that the file is in a place that you can upload a file but you can only have permissions to look at some part of the tree. Not sure if it might happen in GDrive, but this can - for example - happen in regular POSIX filesystem.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #29694: Fixing log message + adding more verbose documetation

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1117220804


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +284,13 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = (

Review Comment:
   BTW. I have no idea how it works, and  whether it's possible, just raised it as possible scenario where such traversal might fail. There might be other cases (is this possible there is more than one parent maybe? Or some rate limiting when calling an API). So I am not really sure if the scenario is possible, and which scenario might be problematic, my only point here was that we shoulkd  have an escape hatch in such case. This traversal is only needed to log the full destination path, it's not used anywhere else, so IMHO we should be able to disable it in case any of similar things happen. I am not sure if those are likely scenarios/issues currently (they might also change in the future), so without spending a lot of time on analysing those, having a simple to use escape hatch seems to be a good idea :D)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1118471190


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,46 @@ def exists(
             )
         )
 
+    def _resolve_file_path(self, file_id: str) -> str:
+        """
+        Returns the full Google Drive path for given file_id
+
+        :param file_id: The id of a file in Google Drive
+        :return: Google Drive full path for a file
+        """
+        MAX_NESTED_FOLDERS_LEVEL = 20  # Link to docs https://support.google.com/a/users/answer/7338880?hl=en
+        iteration = 1
+
+        service = self.get_conn()
+        has_reached_root = False
+        _file_id = file_id
+        path: str = ""
+        while not has_reached_root:
+            file_info = (
+                service.files()
+                .get(
+                    fileId=_file_id,
+                    fields="id,name,parents",
+                    supportsAllDrives=True,
+                )
+                .execute()
+            )
+            if "parents" in file_info:
+                parent_directories = file_info["parents"]
+                path = f'{file_info["name"]}' if path == "" else f'{file_info["name"]}/{path}'
+
+                if len(parent_directories) > 1:
+                    self.log.warning("Google returned multiple parents, picking first")
+                _file_id = parent_directories[0]
+            else:
+                has_reached_root = True
+                path = f'{file_info["name"]}/{path}'

Review Comment:
   I've tried to make it simpler, let me know if this is what you meant



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1121866085


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +300,21 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = remote_location
+
+        if folder_id != "root":
+            try:
+                upload_location = self._resolve_file_path(folder_id)
+            except (GoogleApiClientError, AirflowException) as e:

Review Comment:
   I disagree, because of the order of tasks. First file is uploaded and then we are trying to resolve its path, so if we fail on path resolution then file got uploaded, we failed the task due to unexpected behavior and we can't revert this change. So in my head there are 2 options:
   1. We keep the solution as is
   2. I move the path resolution code before uploading a file so the transaction doesn't need to be reverted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1124323895


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +292,21 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = remote_location
+
+        if folder_id != "root":
+            try:
+                upload_location = self._resolve_file_path(folder_id)
+            except GoogleApiClientError as e:
+                self.log.warning("A problem has been encountered when trying to resolve file path: ", e)
+
+        if show_full_target_path:
+            self.log.info("File %s uploaded to gdrive://%s.", local_location, upload_location)
+        else:
+            self.log.info("File %s has been uploaded successfully to gdrive", local_location)

Review Comment:
   i think this parameter is redundant?
   It doesn't give any additional value. Lets just print the full path. if someone doesn't need the full path they can just remove `gdrive://` from the path
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] aru-trackunit commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "aru-trackunit (via GitHub)" <gi...@apache.org>.
aru-trackunit commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1121542511


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +300,21 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = remote_location
+
+        if folder_id != "root":
+            try:
+                upload_location = self._resolve_file_path(folder_id)
+            except (GoogleApiClientError, AirflowException) as e:

Review Comment:
   @eladkal Here is the catch for all exceptions - we don't want to fail the job due to path resolution



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1121729152


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +300,21 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = remote_location
+
+        if folder_id != "root":
+            try:
+                upload_location = self._resolve_file_path(folder_id)
+            except (GoogleApiClientError, AirflowException) as e:

Review Comment:
   Actually I think we shouldn't do that. The change you made to `_get_file_info` is enough.
   AirflowException is not relevant here and as for GoogleApiClientError if this is raised after the 2 retries set then I guess we want the task to fail.. not to catch the exception and to just log it.
   
   So lets revert this part... and I think we good to go.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1125569580


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +292,21 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = remote_location
+
+        if folder_id != "root":
+            try:
+                upload_location = self._resolve_file_path(folder_id)
+            except GoogleApiClientError as e:
+                self.log.warning("A problem has been encountered when trying to resolve file path: ", e)
+
+        if show_full_target_path:
+            self.log.info("File %s uploaded to gdrive://%s.", local_location, upload_location)
+        else:
+            self.log.info("File %s has been uploaded successfully to gdrive", local_location)

Review Comment:
   @eladkal -> this is what I asked for as an escape hatch. The problem is that there might be potentially problems that might lead to  exception raised when "_resolve_file_path". is called (I thought for example if you have no permission to read the whole path. GDrive is not a "normal" filesystem and traversing up the directory might in some cases turn problematic. Since we ONLY need that to log the path, this is something that we might want to disable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1119002091


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,57 @@ def exists(
             )
         )
 
+    def _get_file_info(self, file_id: str):
+        """
+        Returns Google API file_info object containing id, name, parents in the response
+        https://developers.google.com/drive/api/v3/reference/files/get
+
+        :param file_id: id as string representation of interested file
+        :return: file
+        """
+        file_info = (
+            self.get_conn().files()
+                .get(
+                fileId=file_id,
+                fields="id,name,parents",
+                supportsAllDrives=True,
+            ).execute()
+        )
+        return file_info

Review Comment:
   I think we need to handle case of exceptions from the API? (maybe retry?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] eladkal commented on a diff in pull request #29694: `GoogleDriveHook`: Fixing log message + adding more verbose documentation

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1120772006


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,57 @@ def exists(
             )
         )
 
+    def _get_file_info(self, file_id: str):
+        """
+        Returns Google API file_info object containing id, name, parents in the response
+        https://developers.google.com/drive/api/v3/reference/files/get
+
+        :param file_id: id as string representation of interested file
+        :return: file
+        """
+        file_info = (
+            self.get_conn().files()
+                .get(
+                fileId=file_id,
+                fields="id,name,parents",
+                supportsAllDrives=True,
+            ).execute()
+        )
+        return file_info

Review Comment:
   was it added?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on pull request #29694: Fixing log message + adding more verbose documetation

Posted by "potiuk (via GitHub)" <gi...@apache.org>.
potiuk commented on PR #29694:
URL: https://github.com/apache/airflow/pull/29694#issuecomment-1443393061

   >  I am not sure who should review it and no one has been attached, @eladkal -> If I made a mistake tagging you please assign to a review someone that could do that
   
   
   FYI. Just to dispel some misconception and misunderstanding and set expectations.
   
   We have no "assignments" for review, and asking others to assign others is not how things work here. This is not a corporate environment where people have "jobs". This is community driven organisation where people spend their time on tasks they find they might help, not because somoene "assigns" them to do stuff.  As an  author you should patiently wait until someone will take a look (and possibly ping in general if it does not happen in reasonable time (which is not well defined BTW - you need to use your own judgment, but generally 72 HRS in ASF is considered as a time that people who are in different time zone and potentially have families, day jobs, sometimes weekends and holidays for any reaction time.
   
   You need to exercise patience. And tagging people to assign others is more of "corporate delegation expectation" than "community trust" so it's quite a bit out-of-place.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] dimberman commented on a diff in pull request #29694: Fixing log message + adding more verbose documetation

Posted by "dimberman (via GitHub)" <gi...@apache.org>.
dimberman commented on code in PR #29694:
URL: https://github.com/apache/airflow/pull/29694#discussion_r1117188933


##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -159,6 +160,46 @@ def exists(
             )
         )
 
+    def _resolve_file_path(self, file_id: str) -> str:
+        """
+        Returns the full Google Drive path for given file_id
+
+        :param file_id: The id of a file in Google Drive
+        :return: Google Drive full path for a file
+        """
+        MAX_NESTED_FOLDERS_LEVEL = 20  # Link to docs https://support.google.com/a/users/answer/7338880?hl=en
+        iteration = 1
+
+        service = self.get_conn()
+        has_reached_root = False
+        _file_id = file_id
+        path: str = ""
+        while not has_reached_root:
+            file_info = (
+                service.files()
+                .get(
+                    fileId=_file_id,
+                    fields="id,name,parents",
+                    supportsAllDrives=True,
+                )
+                .execute()
+            )
+            if "parents" in file_info:
+                parent_directories = file_info["parents"]
+                path = f'{file_info["name"]}' if path == "" else f'{file_info["name"]}/{path}'
+
+                if len(parent_directories) > 1:
+                    self.log.warning("Google returned multiple parents, picking first")
+                _file_id = parent_directories[0]
+            else:
+                has_reached_root = True
+                path = f'{file_info["name"]}/{path}'

Review Comment:
   This code section is somewhat hard to read. Could you please either a) break this out into helper functions, b) write in some comments explaining exactly what's Happening Here, or c) both?



##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +284,13 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = (

Review Comment:
   Well two questions about this:
   
   1. In a scenario where user only has permission to look at part of the tree, would the API throw an error or would we just not be able to see it at all?
   
   2. Are there people who have such deep traversal trees in their Google Drive that this would be a large amount of time? How common is that?



##########
airflow/providers/google/suite/hooks/drive.py:
##########
@@ -243,8 +284,13 @@ def upload_file(
             .create(body=file_metadata, media_body=media, fields="id", supportsAllDrives=True)
             .execute(num_retries=self.num_retries)
         )
-        self.log.info("File %s uploaded to gdrive://%s.", local_location, remote_location)
-        return file.get("id")
+        file_id = file.get("id")
+
+        upload_location = (

Review Comment:
   I think it would be worth us investigating whether this is actually a scenario that could happen in Google drive as I have a feeling that the Google drive filesystem is probably closer to an s3 than a POSIX



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org