Posted to jira@arrow.apache.org by "Neville Dipale (Jira)" <ji...@apache.org> on 2021/03/31 05:00:00 UTC

[jira] [Resolved] (ARROW-12121) [Rust] [Parquet] Arrow writer benchmarks

     [ https://issues.apache.org/jira/browse/ARROW-12121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale resolved ARROW-12121.
------------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 9825
[https://github.com/apache/arrow/pull/9825]

> [Rust] [Parquet] Arrow writer benchmarks
> ----------------------------------------
>
>                 Key: ARROW-12121
>                 URL: https://issues.apache.org/jira/browse/ARROW-12121
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Neville Dipale
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The common concern with Parquet's Arrow readers and writers is that they're slow.
> My diagnosis is that we rely on a chain of processes, which introduces overhead.
> For example, writing an Arrow RecordBatch involves the following:
> 1. Iterate through arrays to create def/rep levels
> 2. Extract Parquet primitive values from arrays using these levels
> 3. Write primitive values, validating them in the process (even though they should already be valid at this point)
> 4. Split the already materialised values into small batches for Parquet chunks (consider where we have 1e6 values in a batch)
> 5. Write these batches, computing the stats of each batch, and encoding values
> The above is a side effect of convenience: bypassing some of these steps would likely take considerably more effort.
> I have ideas for going from step 1 to step 5 directly, but without performance benchmarks I won't know whether that is actually faster. I also struggle to tell whether I'm making improvements while I clean up the writer code, especially when removing the allocations that I introduced to reduce the complexity of the level calculations.
> With ARROW-12120 (random array & batch generator), it becomes more convenient to benchmark (and test many combinations of) the Arrow writer.
> I would thus like to start adding benchmarks for the Arrow writer.
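
The write path described in the issue can be sketched in miniature to show where the intermediate allocations come from and how a timing loop would expose them. This is a minimal, standard-library-only illustration and not the parquet crate's implementation: the function names (levels_and_values, batch_stats, pseudo_random_column, time_it) are hypothetical, the column is flat and nullable (so only definition levels are needed, no repetition levels), and the small LCG merely stands in for the random array generator from ARROW-12120.

```rust
use std::time::{Duration, Instant};

/// Steps 1 and 2 (simplified): walk a flat nullable column, emit one
/// definition level per slot (0 = null, 1 = present) and collect the
/// non-null primitive values into a second buffer.
fn levels_and_values(column: &[Option<i32>]) -> (Vec<i16>, Vec<i32>) {
    let mut def_levels = Vec::with_capacity(column.len());
    let mut values = Vec::with_capacity(column.len());
    for slot in column {
        match slot {
            Some(v) => {
                def_levels.push(1);
                values.push(*v);
            }
            None => def_levels.push(0),
        }
    }
    (def_levels, values)
}

/// Steps 4 and 5 (simplified): split the already materialised values into
/// small batches and compute (min, max) statistics for each batch.
/// Assumes batch_size > 0; `chunks` panics on a zero size.
fn batch_stats(values: &[i32], batch_size: usize) -> Vec<(i32, i32)> {
    values
        .chunks(batch_size)
        .map(|chunk| {
            // `chunks` never yields an empty slice, so min/max always exist.
            let min = *chunk.iter().min().unwrap();
            let max = *chunk.iter().max().unwrap();
            (min, max)
        })
        .collect()
}

/// Hypothetical stand-in for a random array generator: a deterministic
/// LCG that produces non-negative values with roughly 10% nulls.
fn pseudo_random_column(len: usize, seed: u64) -> Vec<Option<i32>> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let bits = (state >> 33) as u32; // 31 significant bits, fits in i32
            if bits % 10 == 0 {
                None
            } else {
                Some(bits as i32)
            }
        })
        .collect()
}

/// Crude timing loop: average wall-clock time per call of `f`.
/// Assumes iterations > 0. A real benchmark would use a proper harness
/// with warm-up and statistical analysis.
fn time_it<F: FnMut()>(iterations: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        f();
    }
    start.elapsed() / iterations
}
```

Timing levels_and_values and batch_stats separately over a generated column of, say, 1e6 values would give a rough per-stage cost, which is the kind of baseline the issue asks for before attempting to fuse the steps.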



--
This message was sent by Atlassian Jira
(v8.3.4#803005)