You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/26 10:49:40 UTC

[GitHub] [arrow-datafusion] yjshen opened a new pull request #1682: Add gauge

yjshen opened a new pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792532430



##########
File path: datafusion/src/physical_plan/metrics/aggregated.rs
##########
@@ -35,25 +35,31 @@ pub struct AggregatedMetricsSet {
     final_: Arc<std::sync::Mutex<Vec<ExecutionPlanMetricsSet>>>,
 }
 
+impl Default for AggregatedMetricsSet {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
 impl AggregatedMetricsSet {
     /// Create a new aggregated set
-    pub(crate) fn new() -> Self {
+    pub fn new() -> Self {

Review comment:
       To create a memory-managed version of [`ShuffleWriter`](https://github.com/blaze-init/blaze-rs/blob/master/datafusion-ext/src/shuffle_writer_exec.rs#L457-L508), we need to expose these four APIs for dependent crate usage. 
   I'm not sure this should be in its own PR, or I can have it here since this PR is also metric-related. I can remove this if needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] liukun4515 commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792535023



##########
File path: datafusion/src/physical_plan/metrics/value.rs
##########
@@ -77,6 +77,62 @@ impl Count {
     }
 }
 
+/// A gauge is the simplest metrics type. It just returns a value.
+/// For example, you can easily expose current memory consumption with a gauge.
+///
+/// Note `clone`ing gauge update the same underlying metrics
+#[derive(Debug, Clone)]
+pub struct Gauge {
+    /// value of the metric counter

Review comment:
       this comment does not match the value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792548957



##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -51,6 +51,11 @@ pub trait RecordBatchStream: Stream<Item = ArrowResult<RecordBatch>> {
     /// Implementation of this trait should guarantee that all `RecordBatch`'s returned by this
     /// stream should have the same schema as returned from this method.
     fn schema(&self) -> SchemaRef;
+
+    /// Returns the current memory usage for this stream.
+    fn mem_used(&self) -> usize {
+        0
+    }

Review comment:
       This line adds a `mem_used` method in our essential `RecordBatchStream` trait.
   
   A baby step to tracking Non-Limited-Operators' memory usage since I think `SendableRecordBatchStream` is the fundamental entity that holds memory during execution.  However, I didn't quite find a way to register these streams generated during `async execute` to our memory manager.
   
   I would love to hear your thoughts. 
   
   If considered not appropriate, I will remove it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792544773



##########
File path: datafusion/src/physical_plan/metrics/value.rs
##########
@@ -77,6 +77,62 @@ impl Count {
     }
 }
 
+/// A gauge is the simplest metrics type. It just returns a value.
+/// For example, you can easily expose current memory consumption with a gauge.
+///
+/// Note `clone`ing gauge update the same underlying metrics
+#[derive(Debug, Clone)]
+pub struct Gauge {
+    /// value of the metric counter
+    value: std::sync::Arc<AtomicUsize>,
+}
+
+impl PartialEq for Gauge {
+    fn eq(&self, other: &Self) -> bool {
+        self.value().eq(&other.value())
+    }
+}
+
+impl Display for Gauge {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        write!(f, "{}", self.value())
+    }
+}
+
+impl Default for Gauge {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl Gauge {
+    /// create a new gauge
+    pub fn new() -> Self {
+        Self {
+            value: Arc::new(AtomicUsize::new(0)),
+        }
+    }
+
+    /// Add `n` to the metric's value
+    pub fn add(&self, n: usize) {

Review comment:
       We discussed this previously in https://github.com/apache/arrow-datafusion/issues/1569. I will create a new issue later.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792543242



##########
File path: datafusion/src/physical_plan/metrics/value.rs
##########
@@ -77,6 +77,62 @@ impl Count {
     }
 }
 
+/// A gauge is the simplest metrics type. It just returns a value.
+/// For example, you can easily expose current memory consumption with a gauge.
+///
+/// Note `clone`ing gauge update the same underlying metrics
+#[derive(Debug, Clone)]
+pub struct Gauge {
+    /// value of the metric counter
+    value: std::sync::Arc<AtomicUsize>,
+}
+
+impl PartialEq for Gauge {
+    fn eq(&self, other: &Self) -> bool {
+        self.value().eq(&other.value())
+    }
+}
+
+impl Display for Gauge {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        write!(f, "{}", self.value())
+    }
+}
+
+impl Default for Gauge {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl Gauge {
+    /// create a new gauge
+    pub fn new() -> Self {
+        Self {
+            value: Arc::new(AtomicUsize::new(0)),
+        }
+    }
+
+    /// Add `n` to the metric's value
+    pub fn add(&self, n: usize) {

Review comment:
       Currently, I haven't found a use case that needs to decrease the value of the gauge instead of setting the value to a new one. for example, during `spill` I set it to 0. therefore the API is not added for the moment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792548957



##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -51,6 +51,11 @@ pub trait RecordBatchStream: Stream<Item = ArrowResult<RecordBatch>> {
     /// Implementation of this trait should guarantee that all `RecordBatch`'s returned by this
     /// stream should have the same schema as returned from this method.
     fn schema(&self) -> SchemaRef;
+
+    /// Returns the current memory usage for this stream.
+    fn mem_used(&self) -> usize {
+        0
+    }

Review comment:
       This line adds a `mem_used` method in our essential `RecordBatchStream` trait.
   
   A baby step to tracking Non-Limited-Operators' memory usage since I think `SendableRecordBatchStream` is the fundamental entity that holds memory during execution.  However, I didn't quite find a way to register these streams generated during `async execute` to our memory manager.
   
   I would love to hear your thoughts. If considered not appropriate, I will remove it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #1682: Add a new metric type: `Gauge` + `CurrentMemoryUsage` to metrics

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] liukun4515 commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792538389



##########
File path: datafusion/src/physical_plan/metrics/value.rs
##########
@@ -77,6 +77,62 @@ impl Count {
     }
 }
 
+/// A gauge is the simplest metrics type. It just returns a value.
+/// For example, you can easily expose current memory consumption with a gauge.
+///
+/// Note `clone`ing gauge update the same underlying metrics
+#[derive(Debug, Clone)]
+pub struct Gauge {
+    /// value of the metric counter
+    value: std::sync::Arc<AtomicUsize>,
+}
+
+impl PartialEq for Gauge {
+    fn eq(&self, other: &Self) -> bool {
+        self.value().eq(&other.value())
+    }
+}
+
+impl Display for Gauge {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        write!(f, "{}", self.value())
+    }
+}
+
+impl Default for Gauge {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl Gauge {
+    /// create a new gauge
+    pub fn new() -> Self {
+        Self {
+            value: Arc::new(AtomicUsize::new(0)),
+        }
+    }
+
+    /// Add `n` to the metric's value
+    pub fn add(&self, n: usize) {

Review comment:
       better to add `dec` function if we need dec the value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792532430



##########
File path: datafusion/src/physical_plan/metrics/aggregated.rs
##########
@@ -35,25 +35,31 @@ pub struct AggregatedMetricsSet {
     final_: Arc<std::sync::Mutex<Vec<ExecutionPlanMetricsSet>>>,
 }
 
+impl Default for AggregatedMetricsSet {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
 impl AggregatedMetricsSet {
     /// Create a new aggregated set
-    pub(crate) fn new() -> Self {
+    pub fn new() -> Self {

Review comment:
       To create a memory-managed version of [`ShuffleWriter`](https://github.com/blaze-init/blaze-rs/blob/master/datafusion-ext/src/shuffle_writer_exec.rs#L457-L508), we need to expose these four APIs for dependent crate usage. 
   I'm not sure this should be in its own PR, or I can have it here since it is also metric-related. I can remove this if needed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] liukun4515 commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792620579



##########
File path: datafusion/src/physical_plan/metrics/value.rs
##########
@@ -77,6 +77,62 @@ impl Count {
     }
 }
 
+/// A gauge is the simplest metrics type. It just returns a value.
+/// For example, you can easily expose current memory consumption with a gauge.
+///
+/// Note `clone`ing gauge update the same underlying metrics
+#[derive(Debug, Clone)]
+pub struct Gauge {
+    /// value of the metric counter
+    value: std::sync::Arc<AtomicUsize>,
+}
+
+impl PartialEq for Gauge {
+    fn eq(&self, other: &Self) -> bool {
+        self.value().eq(&other.value())
+    }
+}
+
+impl Display for Gauge {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        write!(f, "{}", self.value())
+    }
+}
+
+impl Default for Gauge {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl Gauge {
+    /// create a new gauge
+    pub fn new() -> Self {
+        Self {
+            value: Arc::new(AtomicUsize::new(0)),
+        }
+    }
+
+    /// Add `n` to the metric's value
+    pub fn add(&self, n: usize) {

Review comment:
       If we don't use the `dec/sub`, we no need to add it.
   Thanks @yjshen 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1682: Add a new metric type: `Gauge` + `CurrentMemoryUsage` to metrics

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792944104



##########
File path: datafusion/src/physical_plan/metrics/aggregated.rs
##########
@@ -35,25 +35,31 @@ pub struct AggregatedMetricsSet {
     final_: Arc<std::sync::Mutex<Vec<ExecutionPlanMetricsSet>>>,
 }
 
+impl Default for AggregatedMetricsSet {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
 impl AggregatedMetricsSet {
     /// Create a new aggregated set
-    pub(crate) fn new() -> Self {
+    pub fn new() -> Self {

Review comment:
       I think it is fine to be in this PR

##########
File path: datafusion/src/physical_plan/sorts/sort.rs
##########
@@ -51,10 +51,9 @@ use std::fmt::{Debug, Formatter};
 use std::fs::File;
 use std::io::BufReader;
 use std::path::{Path, PathBuf};
-use std::sync::atomic::{AtomicUsize, Ordering};
 use std::sync::Arc;
 use tempfile::NamedTempFile;
-use tokio::sync::mpsc::{Receiver as TKReceiver, Sender as TKSender};
+use tokio::sync::mpsc::{Receiver, Sender};

Review comment:
       👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1682: Add a new metric type: `Gauge` + `CurrentMemoryUsage` to metrics

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#issuecomment-1022495615


   And thank you for the review @liukun4515 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] yjshen commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
yjshen commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792652252



##########
File path: datafusion/src/physical_plan/mod.rs
##########
@@ -51,6 +51,11 @@ pub trait RecordBatchStream: Stream<Item = ArrowResult<RecordBatch>> {
     /// Implementation of this trait should guarantee that all `RecordBatch`'s returned by this
     /// stream should have the same schema as returned from this method.
     fn schema(&self) -> SchemaRef;
+
+    /// Returns the current memory usage for this stream.
+    fn mem_used(&self) -> usize {
+        0
+    }

Review comment:
       After re-consider this for a while. I cannot think of a solution to register `RecordBatchStream` somewhere else since each stream is used through mutable reference. I don't think there's a way sharing it except for Arc<dyn RecordBatchStream ....>, which we have discussed earlier and is not acceptable.
   
   I'm going to revert this last commit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] liukun4515 commented on a change in pull request #1682: Add a new metric type: `Gauge`

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on a change in pull request #1682:
URL: https://github.com/apache/arrow-datafusion/pull/1682#discussion_r792535023



##########
File path: datafusion/src/physical_plan/metrics/value.rs
##########
@@ -77,6 +77,62 @@ impl Count {
     }
 }
 
+/// A gauge is the simplest metrics type. It just returns a value.
+/// For example, you can easily expose current memory consumption with a gauge.
+///
+/// Note `clone`ing gauge update the same underlying metrics
+#[derive(Debug, Clone)]
+pub struct Gauge {
+    /// value of the metric counter

Review comment:
       this comment is not match the value.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org