Posted to user@drill.apache.org by Jim <ji...@gmail.com> on 2014/09/05 22:40:59 UTC

Parquet file partition size

Hello all,

I've been experimenting with Drill to load data into Parquet files. I
noticed rather large variability in the size of each Parquet chunk. Is
there a way to control this?

The documentation seems a little sparse on configuring some of the finer 
details. My apologies if I missed something obvious.

Thanks
Jim


Re: Parquet file partition size

Posted by Jacques Nadeau <ja...@apache.org>.
I haven't tried it, but theoretically, if you set the slice target to 1, you
can control the output file count via width per node times the number of
nodes.  Setting it to 1 drives parallelization to an artificially high
maximum, and then you are just constraining the quantity.

Note, however, that this will cause Drill to make poor decisions in the
rest of the query plan, so it is generally ill advised.
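
A rough sketch of that idea, assuming the option names discussed elsewhere
in this thread (untested; the target table and source path are placeholders):

-- Untested sketch: a slice target of 1 drives parallelization to its maximum,
-- so the output file count is roughly width_per_node * number_of_nodes.
ALTER SESSION SET `planner.slice_target` = 1;
ALTER SESSION SET `planner.width.max_per_node` = 2;  -- e.g. ~2 writer slices per node
CREATE TABLE dfs.tmp.`my_output` AS SELECT * FROM dfs.`/data/source`;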

Re: Parquet file partition size

Posted by Jim <ji...@gmail.com>.
Thanks Jim and Jacques,

I'll play with those settings, but I'm using Drill to generate Parquet
files for use with Spark, because Spark's generation of Parquet files
doesn't provide enough control over slicing and I end up with files that
can't be read on a cluster (e.g.
http://apache-spark-user-list.1001560.n3.nabble.com/Querying-a-parquet-file-in-s3-with-an-ec2-install-tt13737.html
).

So I'm using Drill as a tool for preprocessing and don't intend to read
the slices on the same machines that generate them. Is there a way to
simply set the number of slices?

Thanks again,
Jim



Re: Parquet file partition size

Posted by Ted Dunning <te...@gmail.com>.
Cool.

Re: Parquet file partition size

Posted by Jacques Nadeau <ja...@apache.org>.
For all three of these variables, you can use the ALTER SESSION or ALTER
SYSTEM statements.  See more here:

https://cwiki.apache.org/confluence/display/DRILL/SQL+Commands+Summary
https://cwiki.apache.org/confluence/display/DRILL/Planning+and+Execution+Options

example usage:

ALTER SESSION SET `planner.slice_target` = 100000;
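
The same options can also be set cluster-wide rather than per session; a
small sketch (same option name, value purely illustrative):

ALTER SYSTEM SET `planner.width.max_per_node` = 2;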




Re: Parquet file partition size

Posted by Ted Dunning <te...@gmail.com>.
Where are these variables best modified?

Re: Parquet file partition size

Posted by Jacques Nadeau <ja...@apache.org>.
Drill's default behavior is to use estimates to determine the number of
files that will be written.  The equation is fairly complicated.  However,
there are three key variables that will impact file splits.  These are:

planner.slice_target: targeted number of records to allow within a single
slice before increasing parallelization (defaults to 1 million in 0.4, 100k
in 0.5)
planner.width.max_per_node: maximum number of slices run per node
(defaults to 0.7 * core count)
store.parquet.block-size: largest allowed row group when generating
Parquet files (defaults to 512 MB)

If you end up with more files than you would like, you can
decrease planner.width.max_per_node.
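
As a hedged sketch of turning those knobs before a CTAS (the values below
are purely illustrative, not recommendations):

-- Sketch only: fewer writer slices per node, larger row groups per file.
ALTER SESSION SET `planner.width.max_per_node` = 2;
-- store.parquet.block-size is expressed in bytes; 536870912 = 512 MB.
ALTER SESSION SET `store.parquet.block-size` = 536870912;
-- ...then run the CREATE TABLE AS (CTAS) statement in the same session.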

It's likely that Jim Scott's experience of getting fewer files was due to
running on a machine with fewer cores, or to the optimizer estimating a
smaller amount of data in the output.  The behavior is data and machine
dependent.

thanks,
Jacques



Re: Parquet file partition size

Posted by Jim Scott <js...@maprtech.com>.
I have created tables with Drill in Parquet format, and it created 2 files.



-- 
*Jim Scott*
Director, Enterprise Strategy & Architecture

MapR Technologies <http://www.mapr.com>

Re: Parquet file partition size

Posted by Jim <ji...@gmail.com>.
Actually, it looks like it always breaks the output into 6 pieces by
default. Is there a way to fix the partition size rather than the number
of partitions?
