Posted to dev@drill.apache.org by François Méthot <fm...@gmail.com> on 2017/03/22 18:54:56 UTC

Single Hdfs block per parquet file

Hi,

Is there a way to force Drill to store CTAS-generated parquet files as a
single block when using HDFS? The Java HDFS API allows this: files can be
created with a block size equal to the Parquet block size.

We are using Drill on HDFS configured with a block size of 128 MB. Changing
this size is not an option at this point.

It would be ideal for us to have a single parquet file per HDFS block.
Setting store.parquet.block-size to 128 MB would fix our issue, but then we
end up with a lot more files to deal with.
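
For reference, this is roughly the HDFS call I have in mind; a minimal
sketch with an illustrative path and sizes, not code taken from Drill:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SingleBlockWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path out = new Path("/tmp/example/0_0_0.parquet");   // illustrative path
        long blockSize = 128L * 1024 * 1024;                 // match the Parquet block size

        // This create() overload takes a per-file HDFS block size as its last
        // argument, so the file can sit in a single HDFS block as long as the
        // data written to it stays within that size.
        FSDataOutputStream stream = fs.create(
            out,
            true,                                            // overwrite
            conf.getInt("io.file.buffer.size", 4096),
            fs.getDefaultReplication(out),
            blockSize);
        stream.close();
      }
    }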

Thanks
Francois

Re: Single Hdfs block per parquet file

Posted by François Méthot <fm...@gmail.com>.
Done,
Thanks for the feedback

https://issues.apache.org/jira/browse/DRILL-5379

Re: Single Hdfs block per parquet file

Posted by Kunal Khatua <kk...@mapr.com>.
This seems like a reasonable feature request. It could also be expanded to detect the underlying block size for the location being written to.
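
As a rough sketch of that idea (the location argument here is just an example), the writer could ask the destination filesystem for its block size and line the Parquet block size up with it:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class BlockSizeProbe {
      // Returns the default block size of the filesystem backing the given
      // location; a writer could use this as its Parquet block size so the
      // two line up automatically.
      static long blockSizeFor(String location, Configuration conf) throws IOException {
        Path target = new Path(location);
        FileSystem fs = target.getFileSystem(conf);
        return fs.getDefaultBlockSize(target);   // e.g. 128 MB on this cluster
      }
    }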


Could you file a JIRA for this?


Thanks

Kunal

Re: Single Hdfs block per parquet file

Posted by François Méthot <fm...@gmail.com>.
After further investigation: Drill uses the Hadoop ParquetFileWriter (
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
).
This is where the file creation occurs, so it might be tricky after all.

However, ParquetRecordWriter.java in Drill (
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetRecordWriter.java
) creates the ParquetFileWriter with a Hadoop Configuration object.

So, something to explore: could the block size be set as a property on the
Configuration object before passing it to the ParquetFileWriter constructor?
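
For example, something along these lines; just a sketch of the idea, not
tested, and whether the writer's internal create() call honours it is
exactly what would need to be verified:

    import org.apache.hadoop.conf.Configuration;

    class BlockSizeConf {
      // Sketch: build a Configuration whose default block size matches the
      // Parquet block size before handing it to the ParquetFileWriter constructor.
      static Configuration withParquetBlockSize(Configuration base, long parquetBlockSize) {
        Configuration conf = new Configuration(base);
        // "dfs.blocksize" is the Hadoop 2.x property name ("dfs.block.size" is
        // the older, deprecated key); any fs.create() that relies on the
        // filesystem default would then pick up this value.
        conf.setLong("dfs.blocksize", parquetBlockSize);
        return conf;
      }
    }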

François

Re: Single Hdfs block per parquet file

Posted by Padma Penumarthy <pp...@mapr.com>.
Yes, it seems it is possible to create files with different block sizes.
We could potentially pass the configured store.parquet.block-size to the create call.
I will try it out and see; I will let you know.

Thanks,
Padma

Re: Single Hdfs block per parquet file

Posted by François Méthot <fm...@gmail.com>.
Here are 2 links I could find:

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)

http://archive.cloudera.com/cdh4/cdh/4/hadoop/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)
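
Both point at the create(Path, boolean, int, short, long) overload; the
trailing long is the per-file block size, which is what would let the file
match the Parquet block size.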

Francois

Re: Single Hdfs block per parquet file

Posted by Padma Penumarthy <pp...@mapr.com>.
I think we create one file for each Parquet block.
If the underlying HDFS block size is 128 MB and the Parquet block size is > 128 MB,
it will create more blocks on HDFS.
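For example, with the default store.parquet.block-size of 512 MB and a
128 MB HDFS block size, each file a CTAS writes would span four HDFS blocks,
and those blocks are not guaranteed to end up on the same node.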
Can you let me know which HDFS API would allow you to do otherwise?

Thanks,
Padma