
Posted to dev@orc.apache.org by "Shardul Mahadik (JIRA)" <ji...@apache.org> on 2017/07/31 18:03:00 UTC

[jira] [Created] (ORC-220) Stripe size too small for wide tables

Shardul Mahadik created ORC-220:
-----------------------------------

             Summary: Stripe size too small for wide tables
                 Key: ORC-220
                 URL: https://issues.apache.org/jira/browse/ORC-220
             Project: ORC
          Issue Type: Bug
    Affects Versions: 1.4.0, 1.3.0, 1.2.0, 1.1.0, 1.0.0
            Reporter: Shardul Mahadik


For a wide table, e.g. one with 100 columns, I observed that very small stripes were generated.
As an example, for a table with 133 columns and a stripe size of 128MB with ZLIB compression, Hive 1.1 generated 35k stripes of 0.03MB each; with Hive 2 the situation improved to 1.2k stripes of 0.8MB each (mostly because Hive 2 selected a 64KB compression buffer size instead of the specified 256KB).
I came across this PR, https://github.com/apache/hive/pull/118, which was sent to the Hive repo. The PR suggests using ByteBuffer.position() instead of ByteBuffer.capacity() to estimate the stripe size. This is really useful for wide tables, where the difference between the position and the capacity of the buffers can add up significantly. In our case, with this patch, the number of stripes went down to 115, each stripe being 8.3MB. The patch reduced the value returned by estimateStripeSize() by approximately 15MB, which delayed the flushing of the stripes.
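To illustrate why the two estimates diverge on wide tables, here is a minimal sketch. It is not the ORC or Hive writer code: the class and method names (StripeEstimateSketch, simulateColumns, estimateByCapacity, estimateByPosition) are hypothetical, and it assumes one partially filled buffer per column, which is a simplification of how the writer actually manages its streams.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT the actual ORC writer logic.
final class StripeEstimateSketch {

    // Allocates one buffer per column and writes a few bytes into each.
    // Assumption: one buffer per column (the real writer may hold several).
    static List<ByteBuffer> simulateColumns(int columns, int bufferSize, int bytesWritten) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
            buffer.put(new byte[bytesWritten]); // advances position by bytesWritten
            buffers.add(buffer);
        }
        return buffers;
    }

    // Counts the full allocated size of every buffer, written or not.
    static long estimateByCapacity(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer buffer : buffers) {
            total += buffer.capacity();
        }
        return total;
    }

    // Counts only the bytes actually written into each buffer so far.
    static long estimateByPosition(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer buffer : buffers) {
            total += buffer.position();
        }
        return total;
    }

    public static void main(String[] args) {
        // 133 columns, 256KB buffers, 1KB written to each (illustrative values).
        List<ByteBuffer> buffers = simulateColumns(133, 256 * 1024, 1024);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers) + " bytes");
        System.out.println("position-based estimate: " + estimateByPosition(buffers) + " bytes");
    }
}
```

With these illustrative values, the capacity-based sum is about 33MB while the position-based sum is about 133KB, so an estimate built on capacity() would reach a flush threshold far sooner than the data actually written justifies.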
I would like to know your thoughts on this.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)