Posted to dev@iceberg.apache.org by Peter Vary <pv...@cloudera.com.INVALID> on 2021/05/27 11:00:36 UTC

Hive Vectorization

Hi Team,

Currently we are working on enabling vectorization for reading Iceberg tables through Hive. This will bring a significant performance benefit on its own, and we would like to contribute the code to the Iceberg codebase as well.

Adam Szita created a pull request for it: "Hive: Vectorized ORC reads for Hive #2613".
See: https://github.com/apache/iceberg/pull/2613

He wrote a good summary there.

I could review and merge the code myself, but we would really value input from the community on these changes.

- We have seen that any conversion between data formats is costly and seriously hurts performance.
- We have taken a look at the Flink / Spark vectorized reads, which use a middle layer between the readers and the engines. When we tried that approach, performance suffered because of the conversion.
- Currently the storage-api ships the Hive classes shaded to org.apache.orc.storage, so Hive cannot use them directly. Even though the classes are identical, we had to copy the data manually, which degraded performance again.
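To illustrate the last point, here is a minimal, self-contained sketch of why the shading forces a copy. The two inner classes below are hypothetical stand-ins for the shaded org.apache.orc.storage column vectors and the original org.apache.hadoop.hive ones: even when two classes have identical layouts, Java treats them as unrelated types, so a batch cannot be handed over by reference and its data must be duplicated on every read.

```java
// Sketch only: ShadedLongColumnVector and HiveLongColumnVector are hypothetical
// stand-ins for the shaded (org.apache.orc.storage.*) and original
// (org.apache.hadoop.hive.*) vector classes, which are identical in shape
// but distinct types to the JVM.
public class ShadedCopyDemo {
  static class ShadedLongColumnVector {        // stand-in for the shaded class
    final long[] vector;
    ShadedLongColumnVector(int size) { vector = new long[size]; }
  }

  static class HiveLongColumnVector {          // stand-in for the Hive class
    final long[] vector;
    HiveLongColumnVector(int size) { vector = new long[size]; }
  }

  // Without reshading, every batch read by the ORC side must be duplicated
  // into the Hive-side type before the engine can use it.
  static HiveLongColumnVector copy(ShadedLongColumnVector src) {
    HiveLongColumnVector dst = new HiveLongColumnVector(src.vector.length);
    System.arraycopy(src.vector, 0, dst.vector, 0, src.vector.length);
    return dst;
  }

  public static void main(String[] args) {
    ShadedLongColumnVector shaded = new ShadedLongColumnVector(1024);
    for (int i = 0; i < shaded.vector.length; i++) {
      shaded.vector[i] = i;
    }
    HiveLongColumnVector hive = copy(shaded);  // extra work on every single batch
    System.out.println(hive.vector[1023]);
  }
}
```

Reshading the storage-api back to the original Hive packages makes the two types one and the same, so this copy disappears entirely.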

Because of the problems above, we:
- reshaded the storage-api back to the original Hive objects, to prevent object conversion, and
- use org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch as the HIVE IN_MEMORY_DATA_MODEL.

I would like to know what the Iceberg community thinks about this solution, especially the contributors and reviewers of the other vectorization implementations.

Thanks,
Peter