Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/23 16:31:54 UTC

[GitHub] [arrow-datafusion] crepererum opened a new pull request, #4342: add `{TDigest,ScalarValue,Accumulator}::size`

crepererum opened a new pull request, #4342:
URL: https://github.com/apache/arrow-datafusion/pull/4342

   # Which issue does this PR close?
   None, but helps with #3940.
   
   # Rationale for this change
   I need some helper methods to determine the in-memory sizes of certain objects.
   
   # What changes are included in this PR?
   Add `size` methods that determine the in-memory size of `TDigest`, `ScalarValue`, and `Accumulator`.
   
   # Are these changes tested?
   \-
   
   # Are there any user-facing changes?
   **:warning: Breaking Change: `Accumulator` now requires `fn size(&self) -> usize;`!**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4342: add `{TDigest,ScalarValue,Accumulator}::size`

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4342:
URL: https://github.com/apache/arrow-datafusion/pull/4342#discussion_r1030687602


##########
datafusion/common/src/scalar.rs:
##########
@@ -2290,6 +2290,76 @@ impl ScalarValue {
             ScalarValue::Null => array.data().is_null(index),
         }
     }
+
+    /// Estimate size if bytes including `Self`

Review Comment:
   ```suggestion
       /// Estimate size in bytes including `Self`. For values with internal containers such as `String`,
       /// this includes the allocated size (`capacity`) rather than the current length (`len`).
   ```
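The capacity-versus-length distinction in the suggestion above can be illustrated with a plain `String`. This is only a minimal sketch: `string_size` is a hypothetical helper written for illustration, not a function from the PR, and it ignores allocator overhead.

```rust
use std::mem::size_of;

/// Hypothetical helper (not part of the PR): estimated in-memory size of a
/// `String` in bytes. Counts the stack part (pointer, length, capacity) plus
/// the heap buffer the string has reserved, i.e. `capacity`, not `len`.
fn string_size(s: &String) -> usize {
    size_of::<String>() + s.capacity()
}
```

A string created with `String::with_capacity(64)` that holds only three bytes would still be reported with at least 64 bytes of heap, which is the behavior the suggested docstring describes.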



##########
datafusion/physical-expr/src/aggregate/array_agg_distinct.rs:
##########
@@ -156,6 +156,17 @@ impl Accumulator for DistinctArrayAggAccumulator {
             self.datatype.clone(),
         ))
     }
+
+    fn size(&self) -> usize {
+        // TODO(crepererum): `DataType` is NOT fixed size, add `DataType::size` method to arrow (https://github.com/apache/arrow-rs/issues/3147)
+        std::mem::size_of_val(self)
+            + (std::mem::size_of::<ScalarValue>() * self.values.capacity())
+            + self
+                .values
+                .iter()
+                .map(|sv| sv.size() - std::mem::size_of_val(sv))
+                .sum::<usize>()
+    }

Review Comment:
   This pattern is so common (and I found the difference between size and capacity subtle initially) -- I wonder if it could be refactored -- like
   
   ```rust
   ScalarValue::size_of_vec(&self.values)
   ```
   
   Or something to allow a single location and some docstrings?
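One possible shape for such a helper is sketched below, assuming a reduced stand-in type. `Value` is a hypothetical two-variant enum used here in place of `ScalarValue`, and the bodies only mirror the arithmetic in the diff above; they are not the code that was merged.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical stand-in for `ScalarValue`, reduced to two variants.
#[derive(Debug, Clone)]
enum Value {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

impl Value {
    /// Estimated size in bytes including `Self`.
    /// Heap buffers are counted by `capacity`, not `len`.
    fn size(&self) -> usize {
        size_of_val(self)
            + match self {
                Value::Int64(_) => 0,
                Value::Utf8(s) => s.as_ref().map(|s| s.capacity()).unwrap_or(0),
            }
    }

    /// Estimated size in bytes of a `Vec<Value>`, including the `Vec` itself.
    fn size_of_vec(values: &Vec<Value>) -> usize {
        // The `Vec` header (pointer, length, capacity).
        size_of_val(values)
            // Every allocated slot, whether it is in use or not.
            + size_of::<Value>() * values.capacity()
            // Heap data owned by each element; the inline part of the
            // element is already covered by the slot term above.
            + values
                .iter()
                .map(|v| v.size() - size_of_val(v))
                .sum::<usize>()
    }
}
```

Keeping the subtraction of `size_of_val(v)` in one place avoids double-counting the inline part of each element, which is the subtle step each accumulator otherwise has to repeat.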



##########
datafusion/expr/src/accumulator.rs:
##########
@@ -54,6 +54,9 @@ pub trait Accumulator: Send + Sync + Debug {
 
     /// returns its value based on its current state.
     fn evaluate(&self) -> Result<ScalarValue>;
+
+    /// Size in bytes including `Self`.

Review Comment:
   ```suggestion
       /// Allocated size required for this accumulator, in bytes, including `Self`.
       /// Allocated means that for internal containers such as `Vec`, the `capacity` should be used,
       /// not the `len`.
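To make the contract in this docstring concrete, here is a minimal sketch using a trimmed-down trait. `SizedAccumulator`, `SumAccumulator`, and `CollectAccumulator` are hypothetical names invented for illustration; the real `Accumulator` trait has further required methods that are omitted here.

```rust
use std::mem::{size_of, size_of_val};

/// Hypothetical, trimmed-down stand-in for the `Accumulator` trait that
/// keeps only the new `size` method.
trait SizedAccumulator {
    /// Allocated size in bytes, including `Self`.
    fn size(&self) -> usize;
}

/// Fixed-size state: the struct itself is the whole footprint.
struct SumAccumulator {
    sum: i64,
}

impl SizedAccumulator for SumAccumulator {
    fn size(&self) -> usize {
        size_of_val(self)
    }
}

/// Growable state: the `Vec` buffer is counted by `capacity`, not `len`.
struct CollectAccumulator {
    values: Vec<i64>,
}

impl SizedAccumulator for CollectAccumulator {
    fn size(&self) -> usize {
        size_of_val(self) + size_of::<i64>() * self.values.capacity()
    }
}
```

An accumulator with only fixed-size fields can return `size_of_val(self)` directly, while one that owns containers has to add the reserved heap space on top.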
   ```





[GitHub] [arrow-datafusion] ursabot commented on pull request #4342: add `{TDigest,ScalarValue,Accumulator}::size`

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4342:
URL: https://github.com/apache/arrow-datafusion/pull/4342#issuecomment-1327312643

   Benchmark runs are scheduled for baseline = e1204a5bf72c119123404463befb716adbdcff25 and contender = e0150439e89e1d98ff362bddcafdb980ee499a7a. e0150439e89e1d98ff362bddcafdb980ee499a7a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/8ed0f886a903406aabacc619934a1e10...a61adb79b5d9492e827be1ff110ab071/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/76204f17489d4f04a609d6c412c02397...629cba7714404900960b9d8dbdfe015d/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/1b18a5cf8d1040558b7c969b3d4c96ff...4c7851c25cec40369853b899545481e6/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/427bd059bb6645afadb21cfd3e3f8c18...5c1bc547bc144c5bab985c6485d4a325/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow-datafusion] alamb merged pull request #4342: add `{TDigest,ScalarValue,Accumulator}::size`

Posted by GitBox <gi...@apache.org>.
alamb merged PR #4342:
URL: https://github.com/apache/arrow-datafusion/pull/4342




[GitHub] [arrow-datafusion] crepererum commented on a diff in pull request #4342: add `{TDigest,ScalarValue,Accumulator}::size`

Posted by GitBox <gi...@apache.org>.
crepererum commented on code in PR #4342:
URL: https://github.com/apache/arrow-datafusion/pull/4342#discussion_r1030721591


##########
datafusion/physical-expr/src/aggregate/array_agg_distinct.rs:
##########
@@ -156,6 +156,17 @@ impl Accumulator for DistinctArrayAggAccumulator {
             self.datatype.clone(),
         ))
     }
+
+    fn size(&self) -> usize {
+        // TODO(crepererum): `DataType` is NOT fixed size, add `DataType::size` method to arrow (https://github.com/apache/arrow-rs/issues/3147)
+        std::mem::size_of_val(self)
+            + (std::mem::size_of::<ScalarValue>() * self.values.capacity())
+            + self
+                .values
+                .iter()
+                .map(|sv| sv.size() - std::mem::size_of_val(sv))
+                .sum::<usize>()
+    }

Review Comment:
   done


