You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/04/30 23:33:32 UTC

[GitHub] [arrow] tustvold opened a new pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

tustvold opened a new pull request #7076:
URL: https://github.com/apache/arrow/pull/7076


   Both ListBuilder and FixedSizeListBuilder accept a values_builder as a constructor argument and then set the capacity of their internal builders based off the length of this values_builder. Unfortunately at construction time this values_builder is normally empty, and consequently programs spend an unnecessary amount of time reallocating memory.
   
   This PR adds new constructor methods that allow specifying the desired capacity upfront.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tustvold commented on a change in pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #7076:
URL: https://github.com/apache/arrow/pull/7076#discussion_r418774678



##########
File path: rust/parquet/src/arrow/converter.rs
##########
@@ -128,7 +128,10 @@ pub struct Utf8ArrayConverter {}
 
 impl Converter<Vec<Option<ByteArray>>, StringArray> for Utf8ArrayConverter {
     fn convert(source: Vec<Option<ByteArray>>) -> Result<StringArray> {
-        let mut builder = StringBuilder::new(source.len());
+        let mut builder = StringBuilder::with_capacity(
+            source.len(),
+            source.len() * std::mem::size_of::<ByteArray>(),

Review comment:
       Having revisited this, not only is it pessimistic - it's plain wrong :). The ByteArray type here is just a wrapper pointer struct and its size in memory has no semantic value. This is fixed in the latest version which computes the correct array capacity.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7076:
URL: https://github.com/apache/arrow/pull/7076#issuecomment-622178164


   https://issues.apache.org/jira/browse/ARROW-8659


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7076:
URL: https://github.com/apache/arrow/pull/7076#issuecomment-622616144


   Merged. Thanks @tustvold !


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tustvold commented on a change in pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #7076:
URL: https://github.com/apache/arrow/pull/7076#discussion_r418650048



##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -527,11 +527,18 @@ pub struct ListBuilder<T: ArrayBuilder> {
 impl<T: ArrayBuilder> ListBuilder<T> {
     /// Creates a new `ListArrayBuilder` from a given values array builder
     pub fn new(values_builder: T) -> Self {
-        let mut offsets_builder = Int32BufferBuilder::new(values_builder.len() + 1);
+        let capacity = values_builder.len();
+        Self::with_capacity(values_builder, capacity)
+    }
+
+    /// Creates a new `ListArrayBuilder` from a given values array builder
+    /// `append` may be called up to `capacity` times without triggering reallocation

Review comment:
       I went with something I hope was in the spirit of what you wanted, as technically it allocates for an extra offset and also pre-allocates the bitmap builder. Let me know if you'd prefer I stuck with your exact wording




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tustvold commented on a change in pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #7076:
URL: https://github.com/apache/arrow/pull/7076#discussion_r418347285



##########
File path: rust/parquet/src/arrow/converter.rs
##########
@@ -128,7 +128,10 @@ pub struct Utf8ArrayConverter {}
 
 impl Converter<Vec<Option<ByteArray>>, StringArray> for Utf8ArrayConverter {
     fn convert(source: Vec<Option<ByteArray>>) -> Result<StringArray> {
-        let mut builder = StringBuilder::new(source.len());
+        let mut builder = StringBuilder::with_capacity(
+            source.len(),
+            source.len() * std::mem::size_of::<ByteArray>(),

Review comment:
       This might be a touch on the pessimistic side, but appears to be the same approximation made in `ComplexObjectArrayReader`. It would be nice to somehow feed the data from `ComplexObjectArrayReader` through to the converter, but this was a non-trivial change so I avoided it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on a change in pull request #7076: ARROW-8659: [Rust] ListBuilder allocate with_capacity

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7076:
URL: https://github.com/apache/arrow/pull/7076#discussion_r418460308



##########
File path: rust/parquet/src/arrow/converter.rs
##########
@@ -128,7 +128,10 @@ pub struct Utf8ArrayConverter {}
 
 impl Converter<Vec<Option<ByteArray>>, StringArray> for Utf8ArrayConverter {
     fn convert(source: Vec<Option<ByteArray>>) -> Result<StringArray> {
-        let mut builder = StringBuilder::new(source.len());
+        let mut builder = StringBuilder::with_capacity(
+            source.len(),
+            source.len() * std::mem::size_of::<ByteArray>(),

Review comment:
       Yeah this could be overly pessimistic. A more accurate way is to calculate the sum of all lens. It may be worth it.

##########
File path: rust/arrow/src/array/builder.rs
##########
@@ -527,11 +527,18 @@ pub struct ListBuilder<T: ArrayBuilder> {
 impl<T: ArrayBuilder> ListBuilder<T> {
     /// Creates a new `ListArrayBuilder` from a given values array builder
     pub fn new(values_builder: T) -> Self {
-        let mut offsets_builder = Int32BufferBuilder::new(values_builder.len() + 1);
+        let capacity = values_builder.len();
+        Self::with_capacity(values_builder, capacity)
+    }
+
+    /// Creates a new `ListArrayBuilder` from a given values array builder
+    /// `append` may be called up to `capacity` times without triggering reallocation

Review comment:
       nit: maybe rephrase this to: "`capacity` is the number of items to pre-allocate space for in offset buffer of this builder"? similarly below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org