You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/06 18:30:55 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #5898: Minor: Improve doc comments in FileStream

alamb opened a new pull request, #5898:
URL: https://github.com/apache/arrow-datafusion/pull/5898

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/5788.
   Closes https://github.com/apache/arrow-datafusion/pull/5790
   
   # Rationale for this change
   
   It is clear that the metrics are confusing but they are still useful as described by @thinkharderdev  in https://github.com/apache/arrow-datafusion/pull/5790#pullrequestreview-1368680282
   
   # What changes are included in this PR?
   1. Add doc comments to try and explain in more detail what the metrics mean. 
    
   # Are these changes tested?
   N/A
   # Are there any user-facing changes?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #5898: Minor: Improve doc comments in FileStream

Posted by "waynexia (via GitHub)" <gi...@apache.org>.
waynexia commented on code in PR #5898:
URL: https://github.com/apache/arrow-datafusion/pull/5898#discussion_r1163675746


##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -140,14 +141,32 @@ impl StartableTime {
     }
 }
 
+/// Metrics for [`FileStream`]
+///
+/// Note that all of these metrics are in terms of wall clock time
+/// (not cpu time) so they include time spent waiting on I/O as well
+/// as other operators.
 struct FileStreamMetrics {
-    /// Time elapsed for file opening
+    /// Wall clock time elapsed for file opening.
+    ///
+    /// Time between when [`FileReader::open`] is called and when the
+    /// [`FileStream`] receives a stream for reading.
     pub time_opening: StartableTime,
-    /// Time elapsed for file scanning + first record batch of decompression + decoding
+    /// Wall clock time elapsed for file scanning + first record batch of decompression + decoding
+    ///
+    /// Time between when the [`FileStream`] requests data from the
+    /// stream and when the first [`RecordBatch`] is produced.
     pub time_scanning_until_data: StartableTime,
-    /// Total elapsed time for for scanning + record batch decompression / decoding
+    /// total elapsed wall clock time for for scanning + record batch decompression / decoding

Review Comment:
   nit:
   ```suggestion
       /// Total elapsed wall clock time for for scanning + record batch decompression / decoding
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing commented on a diff in pull request #5898: Minor: Improve doc comments in FileStream

Posted by "yahoNanJing (via GitHub)" <gi...@apache.org>.
yahoNanJing commented on code in PR #5898:
URL: https://github.com/apache/arrow-datafusion/pull/5898#discussion_r1160414885


##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -140,14 +141,31 @@ impl StartableTime {
     }
 }
 
+/// Metrics for [`FileStream`]
+///
+/// Note that all of these metrics are in terms of wall clock time
+/// (not cpu time) so they include time spent waiting on I/O as well
+/// as other operators.
 struct FileStreamMetrics {
-    /// Time elapsed for file opening
+    /// Wall clock time elapsed for file opening.
+    ///
+    /// Time between when [`FileReader::open`] is called and when the
+    /// [`FileStream`] receives a stream for reading.
     pub time_opening: StartableTime,
-    /// Time elapsed for file scanning + first record batch of decompression + decoding
+    /// Wall clock time elapsed for file scanning + first record batch of decompression + decoding
+    ///
+    /// Time between when the [`FileStream`] requests data from the
+    /// stream and when the first [`RecordBatch`] is produced.
     pub time_scanning_until_data: StartableTime,
-    /// Total elapsed time for for scanning + record batch decompression / decoding
+    /// total elapsed wall clock time for for scanning + record batch decompression / decoding
+    ///
+    /// Sum of time between when the [`FileStream`] requests data from
+    /// the stream and when a [`RecordBatch`] is produced for all
+    /// record batches in the stream.
     pub time_scanning_total: StartableTime,

Review Comment:
   This time may include the time of the parent operator's execution. It's better to indicate this info. And maybe it's better to change `time_scanning_total` to other names.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5898: Minor: Improve doc comments in FileStream

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5898:
URL: https://github.com/apache/arrow-datafusion/pull/5898#issuecomment-1504041202

   @thinkharderdev  and @nenorbot since you recently looked at this code I wonder if you have some time to review this PR (it adds comments, no code changes)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5898: Minor: Improve doc comments in FileStream

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5898:
URL: https://github.com/apache/arrow-datafusion/pull/5898


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5898: Minor: Improve doc comments in FileStream

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5898:
URL: https://github.com/apache/arrow-datafusion/pull/5898#discussion_r1161792996


##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -140,14 +141,31 @@ impl StartableTime {
     }
 }
 
+/// Metrics for [`FileStream`]
+///
+/// Note that all of these metrics are in terms of wall clock time
+/// (not cpu time) so they include time spent waiting on I/O as well
+/// as other operators.
 struct FileStreamMetrics {
-    /// Time elapsed for file opening
+    /// Wall clock time elapsed for file opening.
+    ///
+    /// Time between when [`FileReader::open`] is called and when the
+    /// [`FileStream`] receives a stream for reading.
     pub time_opening: StartableTime,
-    /// Time elapsed for file scanning + first record batch of decompression + decoding
+    /// Wall clock time elapsed for file scanning + first record batch of decompression + decoding
+    ///
+    /// Time between when the [`FileStream`] requests data from the
+    /// stream and when the first [`RecordBatch`] is produced.
     pub time_scanning_until_data: StartableTime,
-    /// Total elapsed time for for scanning + record batch decompression / decoding
+    /// total elapsed wall clock time for for scanning + record batch decompression / decoding
+    ///
+    /// Sum of time between when the [`FileStream`] requests data from
+    /// the stream and when a [`RecordBatch`] is produced for all
+    /// record batches in the stream.
     pub time_scanning_total: StartableTime,

Review Comment:
   Updated in 94ecfc49fa



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] thinkharderdev commented on a diff in pull request #5898: Minor: Improve doc comments in FileStream

Posted by "thinkharderdev (via GitHub)" <gi...@apache.org>.
thinkharderdev commented on code in PR #5898:
URL: https://github.com/apache/arrow-datafusion/pull/5898#discussion_r1163904273


##########
datafusion/core/src/physical_plan/file_format/file_stream.rs:
##########
@@ -140,14 +141,32 @@ impl StartableTime {
     }
 }
 
+/// Metrics for [`FileStream`]
+///
+/// Note that all of these metrics are in terms of wall clock time
+/// (not cpu time) so they include time spent waiting on I/O as well
+/// as other operators.
 struct FileStreamMetrics {
-    /// Time elapsed for file opening
+    /// Wall clock time elapsed for file opening.
+    ///
+    /// Time between when [`FileReader::open`] is called and when the
+    /// [`FileStream`] receives a stream for reading.

Review Comment:
   ```suggestion
       /// Wall clock time elapsed for file opening.
       ///
       /// Time between when [`FileReader::open`] is called and when the
       /// [`FileStream`] receives a stream for reading.
       ///
       /// If there are multiple files being scanned, the stream
       /// will open the next file in the background while scanning the 
       /// current file. This metric will only capture time spent opening
       /// while not also scanning. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org