Posted to issues@hive.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2015/09/14 20:52:45 UTC

[jira] [Commented] (HIVE-11807) Set ORC buffer size in relation to set stripe size

    [ https://issues.apache.org/jira/browse/HIVE-11807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744025#comment-14744025 ] 

Owen O'Malley commented on HIVE-11807:
--------------------------------------

Ok, there are a couple changes that I'd propose:
* Use the stripe size rather than the available memory. The stripe size is the more relevant limit because the stripe is flushed once the buffered data reaches the stripe size.
* Count all of the columns, not just the top-level ones.
* Most columns have at most 2 large streams, so if we use 20 buffers per column, that will give us a reasonable balance between internal fragmentation and throughput.
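
A rough sketch of how the three points above could fit together, in Python for brevity. This is an illustration, not the actual patch: the function names, the (name, children) schema representation, counting the root as a column, the rounding down to a power of two, and the 4K floor are all assumptions made for the example.

```python
def count_columns(schema):
    """Count a column plus all of its nested children.

    `schema` is a hypothetical (name, [children]) tuple, used here only
    to show that nested columns are counted, not just top-level ones.
    """
    _name, children = schema
    return 1 + sum(count_columns(child) for child in children)


def estimate_buffer_size(stripe_size, num_columns, configured_buffer_size,
                         buffers_per_column=20, min_buffer_size=4 * 1024):
    """Shrink the buffer size so each column gets about 20 buffers per stripe."""
    # Base the estimate on the stripe size, not on available JVM memory.
    estimate = stripe_size // (buffers_per_column * num_columns)

    # Assumption: round down to a power of two, starting from a 4K floor.
    size = min_buffer_size
    while size * 2 <= estimate:
        size *= 2

    # Never exceed the buffer size the user configured.
    return min(size, configured_buffer_size)


# The case from the issue: 64MB stripe, 54 columns, 256K configured buffer.
print(estimate_buffer_size(64 * 1024 * 1024, 54, 256 * 1024))  # 32768
```

With these assumptions, the 64MB / 54 column case from the issue description would drop from 256K to 32K buffers, while a table with only a few columns would keep the configured 256K.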


> Set ORC buffer size in relation to set stripe size
> --------------------------------------------------
>
>                 Key: HIVE-11807
>                 URL: https://issues.apache.org/jira/browse/HIVE-11807
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> A customer produced ORC files with very small stripes (10k rows/stripe) by setting a small 64MB stripe size and a 256K buffer size for a 54-column table. At that size, each stream gets only a buffer or two before the stripe size is reached. The current code uses the available memory instead of the stripe size and thus doesn't shrink the buffer size if the JVM has much more memory than the stripe size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)