Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/05/05 17:23:04 UTC

[jira] [Commented] (DRILL-5379) Set Hdfs Block Size based on Parquet Block Size

    [ https://issues.apache.org/jira/browse/DRILL-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998610#comment-15998610 ] 

ASF GitHub Bot commented on DRILL-5379:
---------------------------------------

GitHub user ppadma opened a pull request:

    https://github.com/apache/drill/pull/826

    DRILL-5379: Set Hdfs Block Size based on Parquet Block Size

    Provide an option to specify the block size during file creation.
    This helps create Parquet files that fit in a single block on HDFS, improving
    performance when those files are read.
    See DRILL-5379 for details.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ppadma/drill DRILL-5379

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/826.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #826
    
----
commit ae77a26aa950e401f2ca40488021431ebfde7156
Author: Padma Penumarthy <pp...@yahoo.com>
Date:   2017-04-20T00:25:20Z

    DRILL-5379: Set Hdfs Block Size based on Parquet Block Size
    Provide an option to specify the block size during file creation.
    This helps create Parquet files that fit in a single block on HDFS, improving
    performance when those files are read.
    See DRILL-5379 for details.

----


> Set Hdfs Block Size based on Parquet Block Size
> -----------------------------------------------
>
>                 Key: DRILL-5379
>                 URL: https://issues.apache.org/jira/browse/DRILL-5379
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: F Méthot
>             Fix For: Future
>
>
> It seems there should be a way to force Drill to store a CTAS-generated Parquet file as a single block when using HDFS. The Java HDFS API allows this: files could be created with the Parquet block size set in a session or system config,
> since a single Parquet file per HDFS block is ideal.
> Here is the HDFS API call that allows this:
> http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
> Drill uses the Hadoop ParquetFileWriter (https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java).
> This is where the file creation occurs, so changing it might be tricky.
> However, ParquetRecordWriter.java (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java) in Drill creates the ParquetFileWriter with a Hadoop Configuration object.
> Something to explore: could the block size be set as a property on that Configuration object before passing it to the ParquetFileWriter constructor?
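The `FileSystem.create` overload linked above takes the block size as its final `long` parameter. One wrinkle (an assumption worth verifying, not stated in the ticket): HDFS rejects block sizes that are not a multiple of the checksum chunk size (`dfs.bytes-per-checksum`, 512 bytes by default), so a Parquet block size would need rounding up before being passed in. A minimal sketch of that rounding, with the Hadoop call shown only in a comment so the example stays self-contained:

```java
// Sketch (not Drill's actual code): pick an HDFS block size large enough to
// hold the whole Parquet file. The 512-byte checksum chunk is the HDFS
// default and an assumption here; real code should read it from the conf.
class BlockSizeHelper {
    static final long BYTES_PER_CHECKSUM = 512L;

    /** Round the Parquet block size up to the next legal HDFS block size. */
    static long hdfsBlockSizeFor(long parquetBlockSize) {
        long remainder = parquetBlockSize % BYTES_PER_CHECKSUM;
        return remainder == 0 ? parquetBlockSize
                              : parquetBlockSize + (BYTES_PER_CHECKSUM - remainder);
    }

    // With Hadoop on the classpath, the result would feed the overload from
    // the Javadoc linked above:
    //   fs.create(path, true /* overwrite */, bufferSize, replication,
    //             hdfsBlockSizeFor(parquetBlockSize));
}
```

Common Parquet block sizes (e.g. 512 MB) are already multiples of 512, so the rounding only matters for unusual hand-set values.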
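The Configuration-property idea from the last paragraph can be sketched as follows. This uses a small map-backed stand-in for Hadoop's `Configuration` class so the example is self-contained; `dfs.blocksize` is the client-side property the HDFS client consults when creating a file (the pre-Hadoop-2 name was `dfs.block.size`). Copying the shared conf before setting the property keeps the override local to one writer:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "set block size as a Configuration property" idea. The
// Configuration class below is a hypothetical stand-in for
// org.apache.hadoop.conf.Configuration, kept minimal so this compiles alone.
class ConfSketch {
    static class Configuration {
        private final Map<String, String> props = new HashMap<>();
        Configuration() {}
        Configuration(Configuration other) { props.putAll(other.props); } // copy ctor
        void setLong(String key, long value) { props.put(key, Long.toString(value)); }
        long getLong(String key, long dflt) {
            String v = props.get(key);
            return v == null ? dflt : Long.parseLong(v);
        }
    }

    /** Copy the shared conf and pin the HDFS block size to the Parquet block size. */
    static Configuration confForWriter(Configuration shared, long parquetBlockSize) {
        Configuration writerConf = new Configuration(shared);
        writerConf.setLong("dfs.blocksize", parquetBlockSize);
        // This per-writer conf would then be handed to the ParquetFileWriter
        // constructor, as the comment above suggests; the shared conf is untouched.
        return writerConf;
    }
}
```

The copy-then-set step matters because the Configuration object handed to ParquetFileWriter is often shared; mutating it in place would change the block size for every file the process writes afterward.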



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)