Posted to dev@drill.apache.org by Jason Altekruse <al...@gmail.com> on 2015/01/10 02:09:24 UTC

Centralizing batch size bounds for testing and tuning

Hello Drillers,

Currently, each of the physical operators in Drill has its own way of
specifying how many records it will try to produce in a single batch. For
some operators, like project, the outgoing batch will be the same size as
the incoming one when the projection involves no evaluations. If the
projection changes the size of the data, such as converting a numeric type
to varchar, we cannot predict how much memory the outgoing buffer will
need, so we may have to cut off the batch once we run out of space and
separately handle the overflowing data.
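
To make the overflow case concrete, here is a toy Java sketch (not Drill
code; the buffer capacity, names, and data are invented for illustration)
of cutting off a batch when the outgoing buffer fills and resuming with the
overflowing rows in the next batch:

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    // Toy illustration of space-based batch cutoff: convert int rows to
    // varchar bytes into a fixed-capacity output buffer, stop when the next
    // value would overflow, and remember where the next batch should resume.
    public class BatchCutoffSketch {
      static final int OUTGOING_CAPACITY_BYTES = 16; // stand-in for an output buffer

      public static void main(String[] args) {
        int[] incoming = {1, 22, 333, 4444, 55555, 666666, 7777777, 88888888};
        int resumeAt = 0;
        while (resumeAt < incoming.length) {
          List<String> outgoingBatch = new ArrayList<>();
          int usedBytes = 0;
          int i = resumeAt;
          for (; i < incoming.length; i++) {
            byte[] asVarchar = Integer.toString(incoming[i]).getBytes(StandardCharsets.UTF_8);
            if (usedBytes + asVarchar.length > OUTGOING_CAPACITY_BYTES) {
              break; // outgoing buffer is full: cut off the batch here
            }
            outgoingBatch.add(new String(asVarchar, StandardCharsets.UTF_8));
            usedBytes += asVarchar.length;
          }
          System.out.println("emit batch of " + outgoingBatch.size() + " rows, " + usedBytes + " bytes");
          resumeAt = i; // the overflowing row becomes the first row of the next batch
        }
      }
    }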

In other operators, where the incoming streams cause new outgoing records
to be spawned, we cannot guess the outgoing batch size at all; we just need
to keep producing rows and cutting off batches as we run out of space.
Rather than hit exceptions in these cases, many of the operators terminate
their loop based on some expected number of rows per batch, generally
around 4096. The record readers also define such limits.
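
In other words, most of these loops look roughly like the following toy
Java sketch, where MAX_ROWS_PER_BATCH stands in for the per-operator
constants that a centralized, configurable value could replace (the class
and constant names are made up):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.stream.IntStream;

    // Sketch of row-count-based loop termination: a single shared constant
    // bounds how many rows any "operator" emits per batch, instead of each
    // operator hard-coding its own limit.
    public class RowLimitSketch {
      static final int MAX_ROWS_PER_BATCH = 4096; // hypothetical centralized, configurable bound

      static List<List<Integer>> drain(Iterator<Integer> incoming) {
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        while (incoming.hasNext()) {
          current.add(incoming.next());
          if (current.size() == MAX_ROWS_PER_BATCH) { // loop termination on the shared bound
            batches.add(current);
            current = new ArrayList<>();
          }
        }
        if (!current.isEmpty()) {
          batches.add(current);
        }
        return batches;
      }

      public static void main(String[] args) {
        Iterator<Integer> rows = IntStream.range(0, 10_000).boxed().iterator();
        System.out.println(drain(rows).size() + " batches"); // 3 batches: 4096 + 4096 + 1808
      }
    }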

I believe standardizing this value and making it configurable would be
useful for both debugging and tuning Drill. We have often found bugs around
batch boundary conditions, and reproducing them (and writing unit tests
once the issues are fixed) has usually required generating large test
cases. If we could lower this value, we could write more concise tests that
demonstrate the same boundary conditions with smaller input files and test
definitions.
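
For example, a boundary-condition test could look something like the
following JDBC sketch, where the option name `exec.batch.max_rows` is
purely hypothetical (no such option exists today) and the data path is a
stand-in; it assumes a local Drillbit and the Drill JDBC driver on the
classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch of a boundary-condition test: shrink the (hypothetical)
    // batch-size option for the session so a tiny input file still spans
    // several batches, then run the query under test.
    public class SmallBatchBoundaryTest {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement()) {
          // Hypothetical option: if the batch size were centralized and
          // configurable, a test could drop it to a handful of rows per batch.
          stmt.execute("ALTER SESSION SET `exec.batch.max_rows` = 10");
          try (ResultSet rs = stmt.executeQuery(
              "SELECT count(*) FROM dfs.`/tmp/tiny_input.json`")) {
            while (rs.next()) {
              System.out.println("rows: " + rs.getLong(1));
            }
          }
        }
      }
    }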

This could also be useful for tuning Drill. While we may not want to expose
this option in production, we could use it in the meantime to identify the
best values for different scenarios as we stretch Drill's limits. In a
brief discussion, Steven mentioned that in some of his testing he saw
performance gains from increasing the value from 4000 to 32k. That is not a
strong argument by itself for raising the default, as a larger batch size
increases memory requirements and would likely hurt us in multi-user
environments and deployments running many concurrent queries. In those
cases we may need to automatically throttle back the batch size to reduce
the overall memory usage of any particular operation.
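
As a rough illustration of that kind of throttling, here is a toy Java
sketch (the memory budget, row width, and bounds are invented numbers, not
Drill's actual accounting) that shrinks the per-batch row count as the
number of concurrent queries grows:

    // Sketch of throttling the per-batch row count as concurrency grows:
    // divide a fixed memory budget for batch buffers across active queries
    // and clamp the result between a small floor and the large "fast" setting.
    public class BatchThrottleSketch {
      static final long BATCH_MEMORY_BUDGET_BYTES = 256L * 1024 * 1024; // assumed budget
      static final int ESTIMATED_ROW_WIDTH_BYTES = 256;                 // assumed average row width
      static final int MIN_ROWS = 1024;
      static final int MAX_ROWS = 32 * 1024;                            // the 32k value from this thread

      static int rowsPerBatch(int concurrentQueries) {
        long perQueryBytes = BATCH_MEMORY_BUDGET_BYTES / Math.max(1, concurrentQueries);
        long rows = perQueryBytes / ESTIMATED_ROW_WIDTH_BYTES;
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
      }

      public static void main(String[] args) {
        for (int q : new int[] {1, 8, 64, 512}) {
          System.out.println(q + " concurrent queries -> " + rowsPerBatch(q) + " rows per batch");
        }
      }
    }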

Making this change would touch a fairly large number of files, but I think
the possible benefits justify it; I just wanted to collect thoughts from
the community.

- Jason

Re: Centralizing batch size bounds for testing and tuning

Posted by MapR <ad...@maprtech.com>.
For testing purposes, having a static, configurable batch size (in number of rows) will definitely help.


Re: Centralizing batch size bounds for testing and tuning

Posted by Aman Sinha <as...@maprtech.com>.
Yes, in the past we have talked about this primarily in the context of unit
testing...certainly we want to be able to write tests with small input data
that still exercise the batch boundaries.  Performance tuning is an added
benefit once we can do some performance characterization.
We might want to think about whether the batch size should be a static
value or something that is determined once the 'fast schema' is known for
the leaf operators.  The batch size could be a function of the row width...
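
To sketch that idea: the following toy Java example (the type widths, the
target batch size, and all names are assumptions, not anything Drill does
today) derives a per-batch row count from the estimated row width once a
schema is available:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of deriving the batch row count from the row width once a
    // schema is known: fix a target number of bytes per batch and divide by
    // the estimated width of one row.
    public class RowWidthBatchSizeSketch {
      static final long TARGET_BATCH_BYTES = 1024 * 1024; // assumed target per batch

      static int estimatedWidth(String type) {
        switch (type) {
          case "INT":     return 4;
          case "BIGINT":  return 8;
          case "FLOAT8":  return 8;
          case "VARCHAR": return 50; // guessed average width for variable-length data
          default:        return 16;
        }
      }

      static int rowsPerBatch(Map<String, String> schema) {
        int rowWidth = schema.values().stream()
            .mapToInt(RowWidthBatchSizeSketch::estimatedWidth).sum();
        long rows = TARGET_BATCH_BYTES / Math.max(1, rowWidth);
        return (int) Math.min(rows, 64 * 1024); // keep an upper bound on rows per batch
      }

      public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "BIGINT");
        schema.put("name", "VARCHAR");
        schema.put("amount", "FLOAT8");
        System.out.println(rowsPerBatch(schema) + " rows per batch for a "
            + schema.size() + "-column row");
      }
    }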

Aman
