Posted to dev@parquet.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/09/18 11:42:00 UTC
[jira] [Commented] (PARQUET-1337) Current block alignment logic may lead to several row groups per block

    [ https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618985#comment-16618985 ] 

ASF GitHub Bot commented on PARQUET-1337:
-----------------------------------------

zivanfi opened a new pull request #523: PARQUET-1337: Current block alignment logic may lead to several row groups per block
URL: https://github.com/apache/parquet-mr/pull/523
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Current block alignment logic may lead to several row groups per block
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1337
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1337
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>              Labels: pull-request-available
>
> When the size of buffered data gets near the desired row group size, Parquet flushes the data to a row group. However, at this point the data for the last page has been neither encoded nor compressed yet, so the row group may end up significantly smaller than intended.
> If the row group ends up so small that its end is farther from the next disk block boundary than the maximum padding, Parquet will try to create a new row group in the same disk block, this time targeting the remaining space. This one may also be flushed prematurely, producing an even smaller row group, which may in turn lead to an even smaller one... This repeats until we get close enough to the block boundary for padding to be applied. The resulting superfluous row groups can lead to bad performance.
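
The cascade described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Parquet's actual code: the function name `simulate_block`, the fixed `shrink_percent` (how much of its target a prematurely flushed row group really occupies on disk) and the `max_padding` value of 500000 are all hypothetical and chosen for illustration.

```python
def simulate_block(block_size, max_padding, shrink_percent):
    """Return the on-disk sizes of the row groups written into one disk block.

    Toy model (assumption): every row group targets the space left in the
    block but ends up occupying only shrink_percent of that target.
    """
    sizes = []
    remaining = block_size
    # Keep starting new row groups while the gap is too large to pad away.
    while remaining > max_padding:
        size = remaining * shrink_percent // 100
        if size == 0:
            break  # guard against a degenerate shrink factor
        sizes.append(size)
        remaining -= size
    return sizes


# Hypothetical parameters: 10000000-byte block, 500000-byte maximum padding,
# row groups that come out at 66% of their target size.
sizes = simulate_block(block_size=10_000_000, max_padding=500_000, shrink_percent=66)
print(sizes)
```

Under these assumed parameters the model writes three row groups of decreasing size into the block before the remaining gap is small enough to be padded. Real files will not shrink by a constant factor, so the exact counts and sizes differ.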
> An example of the structure of a Parquet file suffering from this problem can be seen below. For easier interpretation, the row groups are visually grouped by disk blocks:
> {noformat}
> row group 1:  RC:18774 TS:22182960 OFFSET:       4
> row group 2:  RC: 2896 TS: 3428160 OFFSET: 6574564
> row group 3:  RC: 1964 TS: 2322560 OFFSET: 7679844
> row group 4:  RC: 1074 TS: 1268880 OFFSET: 8732964
> {noformat}
> {noformat}
> row group 5:  RC:18808 TS:22228560 OFFSET:10000000
> row group 6:  RC: 2872 TS: 3389520 OFFSET:16612640
> row group 7:  RC: 1930 TS: 2284960 OFFSET:17716800
> row group 8:  RC: 1040 TS: 1233520 OFFSET:18768240
> {noformat}
> {noformat}
> row group 9:  RC:18852 TS:22275520 OFFSET:20000000
> row group 10: RC: 2831 TS: 3345680 OFFSET:26656320
> row group 11: RC: 1893 TS: 2244640 OFFSET:27757200
> row group 12: RC: 1008 TS: 1195520 OFFSET:28806560
> {noformat}
> {noformat}
> row group 13: RC:18841 TS:22263360 OFFSET:30000000
> row group 14: RC: 2835 TS: 3350480 OFFSET:36652000
> row group 15: RC: 1900 TS: 2249040 OFFSET:37753600
> row group 16: RC: 1016 TS: 1198640 OFFSET:38803600
> {noformat}
> {noformat}
> row group 17: RC: 1466 TS: 1740320 OFFSET:40000000
> {noformat}
> In this example, both the disk block size and the row group size were set to 10000000. The data would fit in 5 row groups of this size, but instead, each of the disk blocks (except the last) is split into 4 row groups of progressively decreasing size.
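
The grouping in the listing above can be checked mechanically. The following is a minimal sketch, assuming the OFFSET values are byte offsets from the start of the file and that disk blocks start at multiples of 10000000. The helper name `group_by_block` is hypothetical and not part of any Parquet tool.

```python
from collections import defaultdict

BLOCK_SIZE = 10_000_000  # disk block size stated in the issue description

# (row group number, starting offset) pairs copied from the listing above
ROW_GROUP_OFFSETS = [
    (1, 4), (2, 6574564), (3, 7679844), (4, 8732964),
    (5, 10000000), (6, 16612640), (7, 17716800), (8, 18768240),
    (9, 20000000), (10, 26656320), (11, 27757200), (12, 28806560),
    (13, 30000000), (14, 36652000), (15, 37753600), (16, 38803600),
    (17, 40000000),
]


def group_by_block(offsets, block_size):
    """Map each disk block index to the row groups that start inside it."""
    blocks = defaultdict(list)
    for row_group, offset in offsets:
        blocks[offset // block_size].append(row_group)
    return dict(blocks)


blocks = group_by_block(ROW_GROUP_OFFSETS, BLOCK_SIZE)
for index in sorted(blocks):
    print(f"block {index}: {len(blocks[index])} row groups {blocks[index]}")
```

For the offsets in the listing, blocks 0 to 3 each contain four row groups and block 4 contains one, which matches the description.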



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)