Posted to issues@orc.apache.org by "Prasanth Jayachandran (JIRA)" <ji...@apache.org> on 2017/07/31 20:55:02 UTC

[jira] [Commented] (ORC-220) Stripe size too small for wide tables

    [ https://issues.apache.org/jira/browse/ORC-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107938#comment-16107938 ] 

Prasanth Jayachandran commented on ORC-220:
-------------------------------------------

[~shardulm] How are you generating the ORC files? Are you using Hive? The stripe size is affected when there is not much memory available for the ORC writers, since concurrent writers share the available memory. For example, if you are using dynamic partitioning in Hive, the reducers keep many ORC writers open at the same time, which reduces the stripe size of each individual writer. You could provide more memory, reduce the stripe size, or enable hive.optimize.sort.dynamic.partition, which ensures that only one writer is open at a time in the dynamic-partitioning case. By default, the ORC memory manager uses only 50% of the heap (hive.exec.orc.memory.pool), leaving the rest for aggregation, sort buffers, etc.
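To illustrate the effect (a minimal sketch, not the actual ORC MemoryManager API — the class, method, and numbers here are hypothetical): when N writers split an equal share of the pool, each writer must flush a stripe once its buffered data would exceed its share, so the effective stripe size can fall well below the configured one.

```java
// Illustrative sketch only: how a shared memory pool shrinks the effective
// stripe size per writer. Not the real ORC memory manager implementation.
public class StripeMemorySketch {
    // Fraction of the heap the ORC memory manager may use
    // (the hive.exec.orc.memory.pool default of 0.5).
    static final double POOL_FRACTION = 0.5;

    // Each open writer gets an equal share of the pool; a writer flushes a
    // stripe early once its buffered data would exceed that share.
    static long effectiveStripeBytes(long heapBytes, long configuredStripeBytes,
                                     int openWriters) {
        long pool = (long) (heapBytes * POOL_FRACTION);
        long sharePerWriter = pool / openWriters;
        return Math.min(configuredStripeBytes, sharePerWriter);
    }

    public static void main(String[] args) {
        long heap = 4L << 30;      // hypothetical 4 GB heap
        long stripe = 128L << 20;  // configured 128 MB stripe size
        // One writer: the full configured stripe fits in its share.
        System.out.println(effectiveStripeBytes(heap, stripe, 1));   // 134217728 (128 MB)
        // 100 concurrent writers (e.g. dynamic partitioning): ~20 MB each.
        System.out.println(effectiveStripeBytes(heap, stripe, 100)); // 21474836 (~20 MB)
    }
}
```

With a single writer the configured 128 MB stripe is honored; with 100 open writers each one is capped at roughly 20 MB, which is why enabling hive.optimize.sort.dynamic.partition (one writer at a time) restores larger stripes.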

I don't think using ByteBuffer.position() would be correct here, as the size has to account for memory usage on the heap. It doesn't matter whether the ORC stream uses the buffer fully or not; the memory manager has to account for the total allocation.
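The distinction can be seen directly with java.nio.ByteBuffer (a standalone sketch of the accounting question, not ORC code): a compression buffer is allocated at its full capacity even when the stream has written only a little into it, so heap usage follows capacity(), while position() only reflects the bytes written so far.

```java
import java.nio.ByteBuffer;

// Sketch of capacity() vs position() accounting. A 256 KB buffer is fully
// allocated on the heap regardless of how much has been written into it.
public class BufferAccounting {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(256 * 1024); // 256 KB allocated on heap
        buf.put(new byte[10 * 1024]);                     // stream has written only 10 KB

        System.out.println(buf.position()); // 10240  -> what position()-based estimation counts
        System.out.println(buf.capacity()); // 262144 -> what is actually allocated
    }
}
```

With many columns, each holding mostly-empty buffers, the per-buffer gap (here 246 KB) multiplies across streams, which is why a position()-based estimate can sit far below the real heap footprint the memory manager must track.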

> Stripe size too small for wide tables
> -------------------------------------
>
>                 Key: ORC-220
>                 URL: https://issues.apache.org/jira/browse/ORC-220
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
>            Reporter: Shardul Mahadik
>
> For a wide table with, e.g., 100 columns, I observed that very small stripes were generated.
> As an example, for a table with 133 columns and a stripe size of 128MB with ZLIB, Hive 1.1 generated 35k stripes of 0.03MB each; with Hive 2 the situation improved to 1.2k stripes of 0.8MB each (mostly because Hive 2 selected a 64KB compression buffer size instead of the specified 256KB).
> I came across this PR https://github.com/apache/hive/pull/118 which was sent to the Hive repo. The PR suggests using ByteBuffer.position() instead of ByteBuffer.capacity() to estimate the stripe size. This is really useful for wide tables, where the difference between the position and capacity of the buffers can add up significantly. In our case, with this patch, the number of stripes went down to 115, each stripe being 8.3MB. The patch reduced the value returned by estimateStripeSize() by approx. 15MB, which delayed the flushing of the stripes.
> Would like to know your thoughts on this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)