You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/06 18:37:09 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #5900: Improve file scan time opening metric to include start_next_file

alamb opened a new pull request, #5900:
URL: https://github.com/apache/arrow-datafusion/pull/5900

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes https://github.com/apache/arrow-datafusion/issues/5899
   
   # Rationale for this change
   
   See https://github.com/apache/arrow-datafusion/pull/5790#discussion_r1155734134 -- I was messing around in this code anyways so I figured I would also make the proposed change
   
   # What changes are included in this PR?
   
   1. Include the time to call `start_file_opening` in the `time_opening` FileStream metrics
   
   # Are these changes tested?
   No -- I don't have any good idea of how to test it
   
   # Are there any user-facing changes?
   (Slighly) better metrics


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5900: Improve file scan time opening metric to include start_next_file

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5900:
URL: https://github.com/apache/arrow-datafusion/pull/5900


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5900: Improve file scan time opening metric to include start_next_file

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5900:
URL: https://github.com/apache/arrow-datafusion/pull/5900#discussion_r1160617706


##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -264,9 +264,9 @@ impl<F: FileOpener> FileStream<F> {
                     Ok(reader) => {
                         let partition_values = mem::take(partition_values);
 
-                        let next = self.start_next_file().transpose();
-
+                        // include time needed to start opening in `start_next_file`

Review Comment:
   > return a Future and the call should take almost no time.
   > Sorry that I do not get why making this change.
   
   It was suggested by @thinkharderdev  in https://github.com/apache/arrow-datafusion/pull/5790#discussion_r1155734134: 
   
   > I think we want this call before self.start_next_file. In some cases (for example, when reading from the LocalFileSystem store outside of a Tokio context) this can block and we would not capture the time spent on disk IO.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #5900: Improve file scan time opening metric to include start_next_file

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on code in PR #5900:
URL: https://github.com/apache/arrow-datafusion/pull/5900#discussion_r1160409961


##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -264,9 +264,9 @@ impl<F: FileOpener> FileStream<F> {
                     Ok(reader) => {
                         let partition_values = mem::take(partition_values);
 
-                        let next = self.start_next_file().transpose();
-
+                        // include time needed to start opening in `start_next_file`

Review Comment:
   I think the `self.start_next_file().transpose()` return a `Future` and the call should take almost no time.  
   Sorry that I do not get why making this change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5900: Improve file scan time opening metric to include start_next_file

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5900:
URL: https://github.com/apache/arrow-datafusion/pull/5900#discussion_r1160136124


##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -264,9 +264,9 @@ impl<F: FileOpener> FileStream<F> {
                     Ok(reader) => {
                         let partition_values = mem::take(partition_values);
 
-                        let next = self.start_next_file().transpose();
-
+                        // include time needed to start opening in `start_next_file`

Review Comment:
   With this change, the FileStream now consistently includes the time for `start_next_file` in `time_opening` -- both here and immediately above:
   
   
   https://github.com/apache/arrow-datafusion/blob/5c0fe0d7ebe0b205ce10c2b07ee2230ff3df6cba/datafusion/core/src/physical_plan/file_format/file_stream.rs#L244-L246



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org