
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 11:23:53 UTC

[GitHub] [arrow-rs] alamb opened a new issue #43: [Parquet] Use IntoIter trait for write_batch/write_mini_batch

alamb opened a new issue #43:
URL: https://github.com/apache/arrow-rs/issues/43


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-5153
   
   Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:
   
       struct MyData {
         name: String,
         address: Option<String>
       }
   
   While working with sets of this data, you'll have the bulk data in a Vec<MyData>, the names column in a Vec<&String>, and the address column in a Vec<Option<String>>. This puts extra memory pressure on the system: at a minimum, we have to allocate a Vec the same size as the bulk data, even if we are using references.
   
   What I'm proposing is to use an IntoIterator style. This will maintain backward compatibility, as a slice automatically implements IntoIterator. ColumnWriterImpl#write_batch would go from "values: &[T::T]" to "values: IntoIterator<Item=T::T>". Then you can do things like
   
       write_batch(bulk.iter().map(|x| x.name), None, None)
       write_batch(bulk.iter().map(|x| x.address), Some(bulk.iter().map(|x| x.is_some())), None)
   
   and you can see there's no need for an intermediate Vec, so no short-term allocations to write out the data.
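   As a rough illustration of the idea, here is a minimal, self-contained sketch. It does not use the parquet crate: `SketchColumnWriter`, its fields, and `write_columns` are hypothetical stand-ins, the byte encoding is made up, and repetition levels are left out for brevity. It only shows that an `IntoIterator` bound lets the caller stream column values straight out of the bulk `Vec<MyData>` without building a per-column Vec first.

```rust
// Hypothetical stand-in for a column writer; NOT the parquet crate's ColumnWriterImpl.
struct MyData {
    name: String,
    address: Option<String>,
}

#[derive(Default)]
struct SketchColumnWriter {
    // A flat byte buffer stands in for real page encoding.
    buf: Vec<u8>,
    // Definition levels: 1 = value present, 0 = null (assumed flat, optional column).
    def_levels: Vec<i16>,
    values_written: usize,
}

impl SketchColumnWriter {
    /// Accepts any iterable of string slices, plus optional definition levels.
    /// Returns the number of values consumed from `values`.
    fn write_batch<'a, V, D>(&mut self, values: V, def_levels: Option<D>) -> usize
    where
        V: IntoIterator<Item = &'a str>,
        D: IntoIterator<Item = i16>,
    {
        let mut n = 0;
        for v in values {
            // Length-prefixed bytes; an invented encoding for this sketch only.
            self.buf.extend_from_slice(&(v.len() as u32).to_le_bytes());
            self.buf.extend_from_slice(v.as_bytes());
            n += 1;
        }
        if let Some(levels) = def_levels {
            self.def_levels.extend(levels);
        }
        self.values_written += n;
        n
    }
}

/// Writes both columns directly from the bulk data, with no per-column Vec.
fn write_columns(bulk: &[MyData]) -> (SketchColumnWriter, SketchColumnWriter) {
    let mut names = SketchColumnWriter::default();
    // A bare `None` needs a type annotation once the parameter is generic.
    names.write_batch(
        bulk.iter().map(|x| x.name.as_str()),
        None::<std::iter::Empty<i16>>,
    );

    let mut addresses = SketchColumnWriter::default();
    addresses.write_batch(
        // Only non-null values are passed; nulls are described by the levels.
        bulk.iter().filter_map(|x| x.address.as_deref()),
        Some(bulk.iter().map(|x| x.address.is_some() as i16)),
    );
    (names, addresses)
}
```

   One wrinkle the sketch exposes: with generic `Option<impl IntoIterator>` parameters, a plain `None` no longer infers its type, so existing callers passing `None` might need an annotation unless the API is shaped differently.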
   
   I am writing data with many columns and I think this would really help to speed things up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #43: [Parquet] Use IntoIter trait for write_batch/write_mini_batch

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #43:
URL: https://github.com/apache/arrow-rs/issues/43#issuecomment-826757072


   Comment from Xavier Lange(xrl) @ 2019-04-09T19:22:29.534+0000:
   <pre>[~csun] [~sadikovi] what do you think of this potentially breaking change? I need to confirm the backwards compatibility but I think it might still be a useful change.</pre>
   
   Comment from Ivan Sadikov(sadikovi) @ 2019-04-09T19:33:52.271+0000:
   <pre>[~xrl] Yes, sure. I will be happy to review if you open a PR with the changes. We can create a new method "write_batch_iter" that implements the new API, and have "write_batch" call the new method, since, as you pointed out, a slice implements IntoIterator.</pre>
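   The delegation suggested in this comment could look roughly like the sketch below. `SketchWriter` and its `written` field are hypothetical and unrelated to the real parquet writer types; only the method names `write_batch` and `write_batch_iter` come from the thread. Note that iterating a slice yields `&T`, so this sketch assumes `T: Clone` and clones each element when forwarding.

```rust
// Hypothetical writer used only to illustrate the delegation pattern.
struct SketchWriter<T> {
    written: Vec<T>,
}

impl<T: Clone> SketchWriter<T> {
    fn new() -> Self {
        Self { written: Vec::new() }
    }

    /// New iterator-based entry point; returns how many values were written.
    fn write_batch_iter<I>(&mut self, values: I) -> usize
    where
        I: IntoIterator<Item = T>,
    {
        let before = self.written.len();
        self.written.extend(values);
        self.written.len() - before
    }

    /// Existing slice-based signature, kept so current callers still compile.
    /// It forwards to the iterator version, cloning each element.
    fn write_batch(&mut self, values: &[T]) -> usize {
        self.write_batch_iter(values.iter().cloned())
    }
}
```

   Keeping both methods sidesteps the backward-compatibility question raised above: slice callers are untouched, while new callers can pass any iterator, for example `w.write_batch_iter(rows.iter().map(|r| r.id))` for some hypothetical `rows`.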

