You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/23 17:02:07 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4342: add `{TDigest,ScalarValue,Accumulator}::size`

alamb commented on code in PR #4342:
URL: https://github.com/apache/arrow-datafusion/pull/4342#discussion_r1030687602


##########
datafusion/common/src/scalar.rs:
##########
@@ -2290,6 +2290,76 @@ impl ScalarValue {
             ScalarValue::Null => array.data().is_null(index),
         }
     }
+
+    /// Estimate size if bytes including `Self`

Review Comment:
   ```suggestion
       /// Estimate size if bytes including `Self`. For values with internal containers such as `String` 
       /// includes the allocated size (`capacity`) rather than the current length (`len`)
   ```



##########
datafusion/physical-expr/src/aggregate/array_agg_distinct.rs:
##########
@@ -156,6 +156,17 @@ impl Accumulator for DistinctArrayAggAccumulator {
             self.datatype.clone(),
         ))
     }
+
+    fn size(&self) -> usize {
+        // TODO(crepererum): `DataType` is NOT fixed size, add `DataType::size` method to arrow (https://github.com/apache/arrow-rs/issues/3147)
+        std::mem::size_of_val(self)
+            + (std::mem::size_of::<ScalarValue>() * self.values.capacity())
+            + self
+                .values
+                .iter()
+                .map(|sv| sv.size() - std::mem::size_of_val(sv))
+                .sum::<usize>()
+    }

Review Comment:
   This pattern is so common (and I found the difference between size and capacity subtle initially) -- I wonder if it could be refactored -- like
   
   ```rust
   ScalarValue::size_of_vec(&self.values)
   ```
   
   Or something to allow a single location and some docstrings?



##########
datafusion/expr/src/accumulator.rs:
##########
@@ -54,6 +54,9 @@ pub trait Accumulator: Send + Sync + Debug {
 
     /// returns its value based on its current state.
     fn evaluate(&self) -> Result<ScalarValue>;
+
+    /// Size in bytes including `Self`.

Review Comment:
   ```suggestion
       /// Allocated size required for this accumulator, in bytes, including `Self`.
       /// Allocated means that for internal containers such as `Vec`, the `capacity` should be used
       /// not the `len`
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org