Posted to issues@hive.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2016/05/02 17:16:12 UTC

[jira] [Commented] (HIVE-9660) store end offset of compressed data for RG in RowIndex in ORC

    [ https://issues.apache.org/jira/browse/HIVE-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266769#comment-15266769 ] 

Owen O'Malley commented on HIVE-9660:
-------------------------------------

After looking at this patch, I feel like we can do it more cleanly. I'd propose that we:
* add a capability to register callbacks on PositionedOutputStream that get called immediately if there are no uncompressed bytes, or after the next compression block finishes.
* add a similar capability to the run length encoders, which wait until the end of the current run and then pass the callback down to the PositionedOutputStream.
* the ORC WriterImpl then creates callbacks that finalize the RowIndexEntry when all of the streams for that column have completed their run length encoding block and compression block.
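
To make the three steps above concrete, here is a rough, self-contained sketch of the callback chain. It is not the patch and not the real ORC API: BlockStream, RunEncoder, IndexEntry and every method name are hypothetical stand-ins, "compression" is a no-op, and a run is assumed to encode to exactly two bytes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

class CallbackSketch {

    /** Hypothetical stand-in for a compressed output stream; compression is a no-op here. */
    static class BlockStream {
        private final int blockSize;
        private int pending = 0;     // bytes buffered, not yet in a finished block
        private long endOffset = 0;  // end offset of the last finished block
        private final List<LongConsumer> waiting = new ArrayList<>();

        BlockStream(int blockSize) { this.blockSize = blockSize; }

        /** Fires at once if nothing is buffered, else after the next block finishes. */
        void registerCallback(LongConsumer cb) {
            if (pending == 0) {
                cb.accept(endOffset);
            } else {
                waiting.add(cb);
            }
        }

        void write(int numBytes) {
            pending += numBytes;
            while (pending >= blockSize) {
                finishBlock(blockSize);
            }
        }

        private void finishBlock(int size) {
            pending -= size;
            endOffset += size;
            List<LongConsumer> ready = new ArrayList<>(waiting);
            waiting.clear();
            for (LongConsumer cb : ready) {
                cb.accept(endOffset);
            }
        }
    }

    /** Hypothetical run-length encoder: holds callbacks until the open run ends. */
    static class RunEncoder {
        private final BlockStream out;
        private long runValue;
        private int runLength = 0;
        private final List<LongConsumer> deferred = new ArrayList<>();

        RunEncoder(BlockStream out) { this.out = out; }

        void registerCallback(LongConsumer cb) {
            if (runLength == 0) {
                out.registerCallback(cb);  // no open run, pass straight down
            } else {
                deferred.add(cb);          // wait for the current run to end
            }
        }

        void write(long value) {
            if (runLength > 0 && value != runValue) {
                endRun();
            }
            runValue = value;
            runLength++;
        }

        void endRun() {
            if (runLength == 0) {
                return;
            }
            out.write(2);  // assumed encoding: one length byte plus one value byte
            runLength = 0;
            List<LongConsumer> ready = new ArrayList<>(deferred);
            deferred.clear();
            for (LongConsumer cb : ready) {
                out.registerCallback(cb);
            }
        }
    }

    /** Hypothetical index entry, finalized once every stream has reported an end offset. */
    static class IndexEntry {
        final List<Long> endOffsets = new ArrayList<>();
        private int remaining;

        IndexEntry(int streamCount) { remaining = streamCount; }

        LongConsumer callback() {
            return offset -> { endOffsets.add(offset); remaining--; };
        }

        boolean isFinalized() { return remaining == 0; }
    }

    public static void main(String[] args) {
        BlockStream data = new BlockStream(4);
        RunEncoder encoder = new RunEncoder(data);
        IndexEntry entry = new IndexEntry(1);
        encoder.write(7);
        encoder.write(7);
        encoder.registerCallback(entry.callback());  // run still open: deferred
        encoder.write(9);                            // run ends, block still open
        encoder.write(3);                            // second run fills the block
        System.out.println(entry.isFinalized() + " " + entry.endOffsets);
    }
}
```

In this sketch the entry is only finalized once both the run and the enclosing block have closed, so the recorded offset is an exact block boundary rather than an estimate. A column with several streams would pass a larger stream count to IndexEntry and register one callback per stream.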

This makes most of the column types really straightforward. The only messy case is the string column types, because the dictionary delays their writing.

I should have a first draft of such a patch today for everyone to look at.

Thoughts?

> store end offset of compressed data for RG in RowIndex in ORC
> -------------------------------------------------------------
>
>                 Key: HIVE-9660
>                 URL: https://issues.apache.org/jira/browse/HIVE-9660
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-9660.01.patch, HIVE-9660.02.patch, HIVE-9660.03.patch, HIVE-9660.04.patch, HIVE-9660.05.patch, HIVE-9660.06.patch, HIVE-9660.07.patch, HIVE-9660.07.patch, HIVE-9660.08.patch, HIVE-9660.09.patch, HIVE-9660.10.patch, HIVE-9660.10.patch, HIVE-9660.11.patch, HIVE-9660.patch, HIVE-9660.patch
>
>
> Right now the end offset is estimated, which in some cases results in tons of extra data being read.
> We can add a separate array to RowIndex (positions_v2?) that stores the number of compressed buffers for each RG, or the end offset, or something similar, to remove this estimation magic.


