You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/05/19 12:50:24 UTC

[GitHub] [iceberg] szlta opened a new issue #2612: Vectorized ORC reads for Hive

szlta opened a new issue #2612:
URL: https://github.com/apache/iceberg/issues/2612


   Currently Hive relies on org.apache.iceberg.mr.mapreduce.IcebergInputFormat for reading data from Iceberg. This in turn relies on the internal reader implementations for the various file formats, such as GenericOrcReader(s). Unfortunately for Hive, the whole read pipeline is record based, whereas GenericRecord instances hold the information on which Hive has to use the Iceberg object inspectors to retrieve the encoded data from.
   
   From a design point of view I see 3 layers where during reading data flows upwards:
   - Execution engine: Hive
   - Table format: Iceberg
   - File format: ORC/Parquet/Avro
   The current design however combines the two bottom layers, and although Hive already has support for vectorized execution, and already has implementations for vectorized reading from various file formats; with Iceberg in the picture it can't have the advantages of vectorization.
   
   In this change I propose to swap out the bottom layer with the implementations Hive already has, so in the case of ORC: with VectorizedOrcInputFormat. This produces VectorizedRowBatch instances which Hive expects as its vectorized format. This could be used as the 'Hive' in memory data model, something IcebergInputFormat already distinguishes upon.
   
   One big obstacle is that Iceberg depends on a type of ORC that has the nohive classifier. This means that classes that are actually coming from Hive/storage-api are shaded inside ORC with an ORC classname. In order to revert this while also causing the minimal disruption to other parts of Iceberg I propose to create an intermediary module for iceberg-hive3 to depend on (rather then depending on iceberg-orc and iceberg-data)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szlta closed issue #2612: Vectorized ORC reads for Hive

Posted by GitBox <gi...@apache.org>.

szlta closed issue #2612:
URL: https://github.com/apache/iceberg/issues/2612


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org