Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/07/23 23:24:39 UTC

[GitHub] [incubator-iceberg] prodeezy edited a comment on issue #9: Vectorize reads and deserialize to Arrow

URL: https://github.com/apache/incubator-iceberg/issues/9#issuecomment-514419084
 
 
   I've added a WIP branch with a working POC of vectorized reads for primitive types in Iceberg:
    https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
   
   **Implementation Notes:**
    - Iceberg's `Reader` adds the `SupportsScanColumnarBatch` mixin to instruct `DataSourceV2ScanExec` to use `planBatchPartitions()` instead of the usual `planInputPartitions()`, so each iteration returns a `ColumnarBatch` instead of `InternalRow`s (a minimal sketch follows this list).
    - `ArrowSchemaUtil` contains the Iceberg-to-Arrow type conversion. This was copied from [3]. Added by @danielcweeks. Thanks for that! (See the type-mapping sketch after this list.)
    - `VectorizedParquetValueReaders` contains the `ParquetValueReader`s used for reading and decoding the Parquet row groups (referred to as page stores in the code).
    - `VectorizedSparkParquetReaders` contains the visitor implementations that map Parquet types to the appropriate value readers. I implemented the struct visitor so that the root schema can be mapped properly. This has the added benefit of vectorization support for structs, so yay!
    - For this initial version the value readers read an entire row group into a single Arrow Field Vector. I imagine this will need tuning to get the batch size right, but I've gone with one batch per row group for now.
    - Arrow Field Vectors are wrapped using `ArrowColumnVector`, which is Spark's `ColumnVector` implementation backed by Arrow. This is the first contact point between the Spark and Arrow interfaces.
    - `ArrowColumnVector`s are stitched together into a `ColumnarBatch` by `ColumnarBatchReader`, my replacement for `InternalRowReader`, which maps structs to columnar batches. This allows us to have nested structs, where each level of nesting becomes a nested columnar batch (see the stitching sketch after this list). Lemme know what you think of this approach.
    - I've added value readers for all the supported primitive types listed in `AvroDataTest`. There's a corresponding test for the vectorized reader under `TestSparkParquetVectorizedReader`.
    - I haven't fixed all the Checkstyle errors yet, so you'll need to turn Checkstyle off in `build.gradle` or skip the Checkstyle tasks, and skip tests while building (see the command after this list). Sorry! :-(
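
   To make the planning change concrete, here's a minimal sketch of the mixin wiring, assuming Spark 2.4's DataSourceV2 reader API (the class name and method bodies are illustrative, not the actual code in the branch):

   ```java
   import java.util.Collections;
   import java.util.List;

   import org.apache.spark.sql.sources.v2.reader.InputPartition;
   import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
   import org.apache.spark.sql.types.StructType;
   import org.apache.spark.sql.vectorized.ColumnarBatch;

   class VectorizedReader implements SupportsScanColumnarBatch {
     @Override
     public StructType readSchema() {
       return new StructType();  // the real reader derives this from the Iceberg schema
     }

     // With this mixin present, DataSourceV2ScanExec calls planBatchPartitions()
     // instead of planInputPartitions(); each partition's reader then emits
     // ColumnarBatch instances instead of InternalRow.
     @Override
     public List<InputPartition<ColumnarBatch>> planBatchPartitions() {
       return Collections.emptyList();  // the real reader plans one partition per scan task
     }
   }
   ```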
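
   The heart of the schema conversion is a per-primitive-type mapping along these lines (a hedged sketch; `toArrowType` is my illustrative name, not `ArrowSchemaUtil`'s actual API):

   ```java
   import org.apache.arrow.vector.types.FloatingPointPrecision;
   import org.apache.arrow.vector.types.pojo.ArrowType;
   import org.apache.iceberg.types.Type;

   // Maps an Iceberg primitive type to the corresponding Arrow type.
   static ArrowType toArrowType(Type.PrimitiveType type) {
     switch (type.typeId()) {
       case BOOLEAN: return ArrowType.Bool.INSTANCE;
       case INTEGER: return new ArrowType.Int(32, true);   // signed 32-bit
       case LONG:    return new ArrowType.Int(64, true);   // signed 64-bit
       case FLOAT:   return new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE);
       case DOUBLE:  return new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE);
       case STRING:  return ArrowType.Utf8.INSTANCE;
       case BINARY:  return ArrowType.Binary.INSTANCE;
       default:      throw new UnsupportedOperationException("Unsupported type: " + type);
     }
   }
   ```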
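
   And the stitching in `ColumnarBatchReader` boils down to wrapping each Arrow vector in Spark's `ArrowColumnVector` and assembling one batch per row group (again a sketch; `toBatch` is a hypothetical helper, not the actual signature):

   ```java
   import org.apache.arrow.vector.FieldVector;
   import org.apache.spark.sql.vectorized.ArrowColumnVector;
   import org.apache.spark.sql.vectorized.ColumnVector;
   import org.apache.spark.sql.vectorized.ColumnarBatch;

   static ColumnarBatch toBatch(FieldVector[] arrowVectors, int numRows) {
     // Wrap each Arrow FieldVector in Spark's Arrow-backed ColumnVector ...
     ColumnVector[] columns = new ColumnVector[arrowVectors.length];
     for (int i = 0; i < arrowVectors.length; i++) {
       columns[i] = new ArrowColumnVector(arrowVectors[i]);
     }
     // ... then stitch them into a single batch (here, one batch per row group).
     ColumnarBatch batch = new ColumnarBatch(columns);
     batch.setNumRows(numRows);
     return batch;
   }
   ```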
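
   In the meantime, a build along the lines of `./gradlew build -x test -x checkstyleMain -x checkstyleTest` should get you going (task names assume the standard Gradle Checkstyle plugin).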
   
   P.S. There's some unused code under `ArrowReader.java`. Ignore it; it's left over from my previous vectorization implementation, and I've kept it around to compare performance.
   
   Lemme know what folks think of the approach. I'm getting this working for our scale test benchmark and will report back with numbers. Feel free to run your own benchmarks and share. 
