Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2021/04/26 11:24:05 UTC
[jira] [Commented] (ARROW-5153) [Rust] [Parquet] Use IntoIter trait
for write_batch/write_mini_batch
[ https://issues.apache.org/jira/browse/ARROW-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332027#comment-17332027 ]
Andrew Lamb commented on ARROW-5153:
------------------------------------
Migrated to github: https://github.com/apache/arrow-rs/issues/43
> [Rust] [Parquet] Use IntoIter trait for write_batch/write_mini_batch
> --------------------------------------------------------------------
>
> Key: ARROW-5153
> URL: https://issues.apache.org/jira/browse/ARROW-5153
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Xavier Lange
> Priority: Major
>
> Writing data to a parquet file requires a lot of copying and intermediate Vec creation. Take a record struct like:
> struct MyData {
>     name: String,
>     address: Option<String>
> }
> While working with sets of this data, you'll have the bulk data in a Vec<MyData>, the names column in a Vec<&String>, and the address column in a Vec<Option<String>>. This puts extra memory pressure on the system: at a minimum, we have to allocate a Vec the same size as the bulk data, even if we are using references.
> What I'm proposing is to use an IntoIter style. This will maintain backward compatibility, as a slice automatically implements IntoIter. ColumnWriterImpl#write_batch would go from "values: &[T::T]" to "values: IntoIter<Item=T::T>". Then you can do things like
>     write_batch(bulk.iter().map(|x| &x.name), None, None)
>     write_batch(bulk.iter().map(|x| &x.address), Some(bulk.iter().map(|x| x.address.is_some())), None)
> and you can see there's no need for an intermediate Vec, so no short-term allocations to write out the data.
> I am writing data with many columns and I think this would really help to speed things up.
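
The proposal above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions, not the parquet crate's API: SketchColumnWriter and its write_batch signature are hypothetical stand-ins, the bound uses the standard library's IntoIterator trait (the description's "IntoIter"), and definition levels are modelled as plain i16 values. The writer buffers what it receives only so the effect is observable; a real writer would encode values as they arrive.

```rust
// Record type from the issue description.
struct MyData {
    name: String,
    address: Option<String>,
}

// Hypothetical stand-in for a column writer; NOT the parquet crate's type.
#[derive(Default)]
struct SketchColumnWriter {
    values: Vec<String>,
    def_levels: Vec<i16>,
}

impl SketchColumnWriter {
    // Accepts any iterable of values and definition levels, so the caller
    // does not have to build a per-column Vec first.
    // Returns the number of values consumed.
    fn write_batch<'a, V, D>(&mut self, values: V, def_levels: D) -> usize
    where
        V: IntoIterator<Item = &'a str>,
        D: IntoIterator<Item = i16>,
    {
        let before = self.values.len();
        self.values.extend(values.into_iter().map(str::to_owned));
        self.def_levels.extend(def_levels);
        self.values.len() - before
    }
}

fn main() {
    let bulk = vec![
        MyData { name: "alice".to_string(), address: None },
        MyData { name: "bob".to_string(), address: Some("1 Main St".to_string()) },
    ];

    // Required column: stream the names straight out of the bulk data.
    let mut names = SketchColumnWriter::default();
    let n = names.write_batch(
        bulk.iter().map(|x| x.name.as_str()),
        std::iter::empty::<i16>(),
    );

    // Optional column: only present values are written, and the definition
    // level (1 = present, 0 = null) is derived lazily from the same data.
    let mut addresses = SketchColumnWriter::default();
    let m = addresses.write_batch(
        bulk.iter().filter_map(|x| x.address.as_deref()),
        bulk.iter().map(|x| i16::from(x.address.is_some())),
    );

    println!("wrote {} names and {} addresses", n, m);
}
```

One caveat on backward compatibility: iterating a slice &[T] yields &T items rather than T, so existing slice callers would likely need something like .iter().cloned(), or the bound would have to accept borrowed items. Passing the optional levels as Option<impl IntoIterator> also forces a type annotation on a bare None, which is why the sketch takes an empty iterator instead.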
--
This message was sent by Atlassian Jira
(v8.3.4#803005)