Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/06 20:23:27 UTC

[GitHub] [arrow] jorgecarleitao commented on pull request #8118: ARROW-9922: [Rust] Add StructArray::TryFrom and deprecate StructBuilder (+40%)

jorgecarleitao commented on pull request #8118:
URL: https://github.com/apache/arrow/pull/8118#issuecomment-687891730


   > Maybe I am misunderstanding, but I think there may be a flaw with this approach and we're not comparing apples with apples when looking at the benchmarks.
   > 
   > The original code is dynamically building a struct using the builder. The new code starts with a `vec!` where everything is known at compile time. In theory, the builders should be more efficient than building a `Vec` and then converting it.
   
   I thought that `criterion::black_box()` would prevent the compiler from optimizing away the code wrapped in it, so that the benchmark would not be tainted by compiler optimizations. I use it in both the Builder and the `From` benchmarks.
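
To make the `black_box` point concrete, here is a minimal sketch of the pattern, not the benchmark from this PR. It assumes the standard library's `std::hint::black_box` as a stand-in for `criterion::black_box`, and the names `sum` and `time_sum` are hypothetical. The input is passed through `black_box` so the compiler cannot treat it as a compile-time constant, and the output is passed through it so the call cannot be discarded as dead code.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the work being measured.
fn sum(values: &[i64]) -> i64 {
    values.iter().sum()
}

/// Calls `sum` `iterations` times and returns the last result together
/// with the total elapsed time. Both the input and the output go through
/// `black_box`, which is a best-effort optimization barrier.
fn time_sum(values: &[i64], iterations: u32) -> (i64, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iterations {
        last = black_box(sum(black_box(values)));
    }
    (last, start.elapsed())
}
```

Note that the standard library documents `black_box` as best-effort only, so it reduces the risk of the compiler exploiting a literal input but does not strictly guarantee it.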
   
   Regardless, I used this approach because I looked through the code for the places where we use Builders, and I found two main kinds of input:
   
   * a vector:
       * constructed from reading batches of rows (e.g. `StringRecord` in CSV, `&[Value]` in json)
       * constructed in memory from some external source (e.g. `MemoryScan`)
   * an Arrow Array, which is the input in most in-memory calculations (e.g. `RecordBatch` and `ArrayRef`, in `compute` and DataFusion)
   
   In these cases, we use the builders to append values row by row (parquet being the exception):
   * see [here](https://github.com/apache/arrow/blob/master/rust/arrow/src/csv/reader.rs#L432) for CSV
   * see [here](https://github.com/apache/arrow/blob/master/rust/arrow/src/json/reader.rs#L491) for JSON
   * in parquet [we do not use Array builders](https://github.com/apache/arrow/blob/master/rust/parquet/src/arrow/array_reader.rs#L27)
   * see [here](https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs#L207) for an example in compute
   
   Based on this analysis, I thought that:
   * this benchmark was a good representation of our use-cases
   * we can use `[Try]From` to build our results instead of a builder. `from` is essentially `builder.append_data().finish()`, with a significantly simpler API
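
As an illustration of the claim that `from` is essentially a builder append followed by `finish()`, here is a minimal sketch using a hypothetical nullable column type. `Int32Column`, `Int32ColumnBuilder`, and their methods are invented for this example and are not the Arrow API. The `From` implementation just drives the builder over the input vector, so callers get the same result from a single call.

```rust
/// Hypothetical column: values plus a validity flag per row.
#[derive(Debug, PartialEq)]
struct Int32Column {
    values: Vec<i32>,
    validity: Vec<bool>,
}

/// Hypothetical builder that appends one row at a time.
struct Int32ColumnBuilder {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl Int32ColumnBuilder {
    fn with_capacity(capacity: usize) -> Self {
        Self {
            values: Vec::with_capacity(capacity),
            validity: Vec::with_capacity(capacity),
        }
    }

    /// Appends a row; nulls store a default value and a `false` flag.
    fn append(&mut self, value: Option<i32>) {
        self.values.push(value.unwrap_or_default());
        self.validity.push(value.is_some());
    }

    fn finish(self) -> Int32Column {
        Int32Column {
            values: self.values,
            validity: self.validity,
        }
    }
}

/// `from` is the builder loop plus `finish()`, behind a simpler API.
impl From<Vec<Option<i32>>> for Int32Column {
    fn from(rows: Vec<Option<i32>>) -> Self {
        let mut builder = Int32ColumnBuilder::with_capacity(rows.len());
        for row in rows {
            builder.append(row);
        }
        builder.finish()
    }
}
```

With this shape, `Int32Column::from(vec![Some(1), None, Some(3)])` replaces the explicit create, append, and finish sequence at the call site.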
   

