Posted to dev@drill.apache.org by kkhatua <gi...@git.apache.org> on 2017/06/29 01:25:59 UTC

[GitHub] drill issue #826: DRILL-5379: Set Hdfs Block Size based on Parquet Block Siz...

Github user kkhatua commented on the issue:

    https://github.com/apache/drill/pull/826
  
    @ppadma, Khurram [~khfaraaz] and I were looking at the details in the PR, and it's not clear what new behavior the PR allows. If you need to specify the block size as described in the [comment](https://issues.apache.org/jira/browse/DRILL-5379?focusedCommentId=15981366&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15981366) by @fmethot, doesn't Drill already do that? I thought Drill implicitly creates files with a single row group anyway.
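
    For reference, this is how the existing knob is set today (the 256 MB value here is just illustrative):

    ```sql
    -- Cap the Parquet row-group ("block") size at 256 MB for this session.
    -- If Drill indeed writes a single row group per file, as noted above,
    -- this also bounds the size of each output file.
    ALTER SESSION SET `store.parquet.block-size` = 268435456;
    ```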
    
    My understanding of the JIRA's problem statement was this: if the Parquet block size (i.e. the row-group size) is set to a value that exceeds the HDFS block size, the flag would let Drill ignore the larger option value and write with a Parquet block size that matches the target HDFS location. So I could have `store.parquet.block-size=1073741824` (i.e. 1 GB), but when writing 512 MB of output, instead of 1 file, Drill would read the HDFS block size (say, 128 MB) and apply that as the Parquet block size, writing 4 files.
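
    As a concrete sketch of that reading (the table and source paths are made up, and the PR's actual flag is not named here):

    ```sql
    -- The session asks for 1 GB Parquet row groups, but suppose the target
    -- HDFS directory uses 128 MB blocks.
    ALTER SESSION SET `store.parquet.block-size` = 1073741824;

    -- With the proposed flag enabled, Drill would cap the row-group size at
    -- the HDFS block size, so 512 MB of output would become
    -- ceil(512 MB / 128 MB) = 4 files instead of 1.
    CREATE TABLE dfs.tmp.`t_out` AS SELECT * FROM dfs.`/data/source`;
    ```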
    
    @fmethot, is that what you were looking for? An **automatic scaling down** of the Parquet file's size to match (and be contained within) the HDFS block size?

