Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2015/06/09 23:45:01 UTC

[jira] [Commented] (PARQUET-306) Improve alignment between row groups and HDFS blocks

    [ https://issues.apache.org/jira/browse/PARQUET-306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579618#comment-14579618 ] 

Ryan Blue commented on PARQUET-306:
-----------------------------------

The PR implements options 1 and 3 from the description. Option 2 is not implemented because HDFS-3689 isn't available yet.

> Improve alignment between row groups and HDFS blocks
> ----------------------------------------------------
>
>                 Key: PARQUET-306
>                 URL: https://issues.apache.org/jira/browse/PARQUET-306
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>
> To avoid remote reads, row groups should not span HDFS blocks. There are 3 things we can use to prevent this:
> 1. Set the next row group's size to the remaining bytes in the current HDFS block
> 2. Use HDFS-3689, variable-length HDFS blocks, when available
> 3. When a row group ends close to a block boundary, pad the rest of the block so the next row group starts at the start of the next block
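
A rough sketch of how items 1 and 3 might fit together, to make the arithmetic concrete. This is not the code from the PR: the class name, the method names, and the maxPadding threshold (how many leftover bytes count as "close to the block boundary") are all hypothetical and chosen only for illustration.

```java
/** Illustrative only; not the parquet-mr implementation. All names are hypothetical. */
final class RowGroupAlignment {
    private RowGroupAlignment() {}

    /** Bytes between the current write position and the next HDFS block boundary. */
    static long remainingInBlock(long position, long blockSize) {
        // Exactly on a boundary, the whole next block is available.
        return blockSize - (position % blockSize);
    }

    /**
     * Item 3: number of padding bytes to write before the next row group.
     * Pads only when the leftover space is at most maxPadding (an assumed threshold).
     */
    static long paddingBefore(long position, long blockSize, long maxPadding) {
        long remaining = remainingInBlock(position, blockSize);
        if (remaining == blockSize) {
            return 0; // already aligned to a block boundary
        }
        return remaining <= maxPadding ? remaining : 0;
    }

    /**
     * Item 1: target size for the next row group, capped at the bytes left in
     * the block where the row group will start (after any padding).
     */
    static long nextRowGroupSize(long position, long blockSize,
                                 long maxPadding, long defaultRowGroupSize) {
        long start = position + paddingBefore(position, blockSize, maxPadding);
        return Math.min(defaultRowGroupSize, remainingInBlock(start, blockSize));
    }
}
```

With a 128-byte block and maxPadding of 8 (toy numbers), a writer at position 100 would get a 28-byte target for the next row group, while a writer at position 124 would pad 4 bytes and start a full-size row group at offset 128.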


