Posted to dev@drill.apache.org by Jason Altekruse <al...@gmail.com> on 2015/06/16 21:38:05 UTC

[DISCUSS] Making the drill codebase easier to unit test

Hello Drill devs,

I would like to propose a proactive effort to make the Drill codebase
easier to unit test.
Many JIRAs have been created for bugs that should have been prevented by
better unit testing, and we are still fixing these kinds of bugs today as
they crop up. I have a few ideas, and I plan on creating JIRAs for specific
refactoring and test infrastructure improvements. Before I do, I would like
to collect thoughts from everyone on what can get us the most benefit for
our work.

As a short overview of the situation today, most of the tests in Drill take
the form of running a SQL query on a local drillbit and verifying the
results. This has often been described as integration testing rather than
unit testing, and it has caused several common testing pains and gaps.

1. Batch boundaries - because we cannot control where batches are cut off
during a query, complete queries make it hard to test how an operator
processes an incoming stream of data with given properties.
         - examples of issues: inconsistent behavior between operators;
           some operators failed to handle empty batches, or a batch full
           of nulls, until we wrote a test that happened to have the right
           input file and plan to produce these scenarios
2. Valid planning changes can make tests that were designed to test
execution fail in new ways, because the data now flows differently
through the operators.
3. SQL queries as test specifications make it hard to test "everything":
all types, all possible data properties/structures, and all possible
switches flipped in the planner or in the configuration for an operator.
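
To make the batch boundary problem concrete, here is a minimal sketch of
what a test could do if it controlled the boundaries itself: enumerate every
way of cutting a small input into contiguous batches and check that the
result does not depend on where the cuts fall. Everything in it is
hypothetical and for illustration only: BatchSplits, allSplits and the toy
sum "operator" are not Drill classes, and a batch is modelled as a plain
List rather than a set of value vectors.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical test helper, not part of Drill. */
final class BatchSplits {
    private BatchSplits() {}

    /**
     * Returns every way to cut rows into contiguous, non-empty batches.
     * For n rows there are 2^(n-1) such splits, so keep n small (n < 32).
     */
    static <T> List<List<List<T>>> allSplits(List<T> rows) {
        List<List<List<T>>> result = new ArrayList<>();
        int n = rows.size();
        if (n == 0) {
            result.add(new ArrayList<>()); // exactly one split: no batches
            return result;
        }
        // Bit i of mask set means "cut between row i and row i + 1".
        for (int mask = 0; mask < (1 << (n - 1)); mask++) {
            List<List<T>> batches = new ArrayList<>();
            List<T> current = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                current.add(rows.get(i));
                boolean cutHere = (i == n - 1) || (mask & (1 << i)) != 0;
                if (cutHere) {
                    batches.add(current);
                    current = new ArrayList<>();
                }
            }
            result.add(batches);
        }
        return result;
    }

    /** Toy stand-in for an operator: sums all values, skipping nulls. */
    static long sum(List<List<Integer>> batches) {
        long total = 0;
        for (List<Integer> batch : batches) {
            for (Integer value : batch) {
                if (value != null) {
                    total += value;
                }
            }
        }
        return total;
    }
}
```

A test would loop over allSplits(rows) and assert that the operator output
is the same for every split. Empty batches are not generated here; a fuller
version could also interleave them between the non-empty ones.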

I would like to start the discussion with a proposal to fix some of these
problems. We need a way to run an operator easily in isolation. Possible
steps to achieve this include a new operator, configurable from a test, that
produces data in explicitly provided batches. This can serve as a universal
input for unit testing operators. We would also need some way to consume and
verify the output of the operators. This could share code with the current
query execution, or possibly sidestep it to avoid having to mock or
instantiate the whole query context.
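
As a rough sketch of the shape this could take, assuming a drastically
simplified pull-based operator interface: a scripted source replays exactly
the batches a test hands it, and a small drain helper collects whatever the
operator under test emits. BatchSource, ScriptedSource, DropNulls and
OperatorTestHarness are all made-up names for illustration; none of them
correspond to existing Drill interfaces, and real batches would carry value
vectors rather than a List.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for an operator: a pull-based stream of batches. */
interface BatchSource {
    /** Returns the next batch, or null once the stream is exhausted. */
    List<Integer> next();
}

/** Test-only source that replays exactly the batches the test provides. */
final class ScriptedSource implements BatchSource {
    private final Iterator<List<Integer>> batches;

    @SafeVarargs
    ScriptedSource(List<Integer>... batches) {
        this.batches = Arrays.asList(batches).iterator();
    }

    @Override
    public List<Integer> next() {
        return batches.hasNext() ? batches.next() : null;
    }
}

/** Toy operator under test: drops nulls and keeps batch boundaries. */
final class DropNulls implements BatchSource {
    private final BatchSource upstream;

    DropNulls(BatchSource upstream) {
        this.upstream = upstream;
    }

    @Override
    public List<Integer> next() {
        List<Integer> in = upstream.next();
        if (in == null) {
            return null; // upstream is done, so this operator is done too
        }
        List<Integer> out = new ArrayList<>();
        for (Integer value : in) {
            if (value != null) {
                out.add(value);
            }
        }
        return out;
    }
}

/** Consumes an operator's output so a test can assert on it. */
final class OperatorTestHarness {
    private OperatorTestHarness() {}

    static List<List<Integer>> drain(BatchSource operator) {
        List<List<Integer>> out = new ArrayList<>();
        for (List<Integer> b = operator.next(); b != null; b = operator.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        // One mixed batch, one empty batch, and one batch full of nulls.
        BatchSource input = new ScriptedSource(
                Arrays.asList(1, null, 2),
                Collections.<Integer>emptyList(),
                Arrays.asList((Integer) null, null));
        List<List<Integer>> out = drain(new DropNulls(input));
        if (out.size() != 3 || !out.get(0).equals(Arrays.asList(1, 2))) {
            throw new AssertionError("unexpected output: " + out);
        }
    }
}
```

The point of the design is that the empty batch and the all-null batch are
written down explicitly in the test, instead of depending on an input file
and plan that happen to produce them.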

This proposal itself treats a relatively large part of the system as a
single "unit". I would be interested to hear opinions on the utility vs
extra effort of trying to refactor more classes so that they can be created
in tests and have their individual methods tested. This is already being
done for some classes like the value vectors, but it is far from
exhaustive. I don't expect us to start rigidly enforcing this level of
testing granularity everywhere, but there are components of the system that
really need to be resilient and be guaranteed to stay that way as the
project evolves.

Please chime in with your thoughts.

Re: [DISCUSS] Making the drill codebase easier to unit test

Posted by Jason Altekruse <al...@gmail.com>.
I agree that code refactoring is necessary to make some components of the
project more testable. Do you have some ideas in particular about coupling
that is blocking this kind of testing today? I know that there are several
context objects like DrillbitContext, FragmentContext and QueryContext that
are relatively heavy and shared among a number of components.

Do you think that some cases could be fixed with less refactoring, relying
instead on test infrastructure enhancements that can generate these
contexts? This should be done in a generalized manner, where tests grab the
contexts they need from some kind of static initialization function,
avoiding code duplication in the tests themselves. I haven't tried to do a
lot of this testing myself, but I have been under the impression that this
might solve some of our issues. If we have a few small methods in the unit
tests that create these objects, rather than trying to mock subsets of
them, we might be able to get some of these benefits without major core
code refactoring.
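
To illustrate what I mean by a static initialization function, here is a
minimal sketch. It assumes a much-simplified stand-in for a context object;
SimpleFragmentContext, TestContexts and the "batch.size" option are invented
for this example and do not reflect the real FragmentContext or its
dependencies.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, much-simplified stand-in for a heavyweight context. */
final class SimpleFragmentContext {
    private final long memoryLimitBytes;
    private final Map<String, String> options;

    SimpleFragmentContext(long memoryLimitBytes, Map<String, String> options) {
        this.memoryLimitBytes = memoryLimitBytes;
        // Defensive copy so later changes to the caller's map have no effect.
        this.options = Collections.unmodifiableMap(new HashMap<>(options));
    }

    long memoryLimitBytes() {
        return memoryLimitBytes;
    }

    String option(String name) {
        return options.get(name);
    }
}

/** Shared test fixture: the one place that knows how to build contexts. */
final class TestContexts {
    private TestContexts() {}

    /** A context with defaults that should suit most tests. */
    static SimpleFragmentContext defaultFragmentContext() {
        return fragmentContext(64L * 1024 * 1024,
                Collections.<String, String>emptyMap());
    }

    /** The same defaults, with per-test overrides applied on top. */
    static SimpleFragmentContext fragmentContext(
            long memoryLimitBytes, Map<String, String> optionOverrides) {
        Map<String, String> options = new HashMap<>();
        options.put("batch.size", "4096"); // hypothetical option name
        options.putAll(optionOverrides);
        return new SimpleFragmentContext(memoryLimitBytes, options);
    }
}
```

Most tests would call TestContexts.defaultFragmentContext(), and only the
tests that care about a particular limit or option would pass overrides.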

On Wed, Jun 17, 2015 at 2:22 PM, Hanifi Gunes <hg...@maprtech.com> wrote:

> Some sub-systems that I know of, particularly around readers, writers, VVs
> and operators are not unit-testing friendly by design: First, they involve
> much more logic than one could define as a unit. Second, it is relatively
> tough if not impossible to control their behavior, mock or inject
> dependencies because they are tightly coupled with other parts of the
> system. I would propose starting off with very fundamental yet minor code
> refactoring that aims to have self-contained, cohesive pieces abstracted
> away so that we could get these unit-tested first. Applying this
> idea iteratively should bring better test coverage. Then we can focus on
> testing operators or other components that rely on these well tested units.
> Either way I would prefer a piece-meal approach rather than trying to
> unit-test an entire sub-system.
>
> -Hanifi

Re: [DISCUSS] Making the drill codebase easier to unit test

Posted by Hanifi Gunes <hg...@maprtech.com>.
Some sub-systems that I know of, particularly around readers, writers, VVs
(value vectors) and operators, are not unit-testing friendly by design:
First, they involve much more logic than one could define as a unit. Second,
it is relatively tough, if not impossible, to control their behavior, mock
or inject dependencies, because they are tightly coupled with other parts of
the system. I would propose starting off with very fundamental yet minor
code refactoring that aims to have self-contained, cohesive pieces
abstracted away so that we could get these unit-tested first. Applying this
idea iteratively should bring better test coverage. Then we can focus on
testing operators or other components that rely on these well-tested units.
Either way I would prefer a piecemeal approach rather than trying to
unit-test an entire sub-system.
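
A minimal sketch of the kind of refactoring I have in mind, using a made-up
delimited-text reader rather than any real Drill reader: the parsing logic
is pulled out into a pure function, and the reader is handed its source of
lines instead of opening a file itself. DelimitedLineParser and
RecordReaderSketch are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Pure, dependency-free logic extracted from a (hypothetical) reader. */
final class DelimitedLineParser {
    private DelimitedLineParser() {}

    /** Splits one line on commas, mapping empty fields to null. */
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        for (String raw : line.split(",", -1)) {
            fields.add(raw.isEmpty() ? null : raw);
        }
        return fields;
    }
}

/** Reader that receives its line source instead of opening a file. */
final class RecordReaderSketch {
    private final Iterator<String> lines;

    RecordReaderSketch(Iterator<String> lines) {
        this.lines = lines;
    }

    /** Returns the next parsed record, or null at end of input. */
    List<String> nextRecord() {
        return lines.hasNext() ? DelimitedLineParser.parse(lines.next()) : null;
    }
}
```

The parser can then be unit-tested with plain strings, and the reader with
an in-memory iterator, before either is exercised through an operator.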

-Hanifi

On Wed, Jun 17, 2015 at 1:53 PM, Abdel Hakim Deneche <ad...@maprtech.com>
wrote:

> I don't know how much work this involves (it seems a lot!) but this would be
> really useful. Like you said, with the current model coming up with good
> unit tests can be really tricky especially when testing the edge cases, and
> the worst part is that any changes to how queries are planned or for
> example the size of the batches can make some tests useless.

Re: [DISCUSS] Making the drill codebase easier to unit test

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.
I don't know how much work this involves (it seems a lot!) but this would be
really useful. Like you said, with the current model, coming up with good
unit tests can be really tricky, especially when testing the edge cases, and
the worst part is that any change to how queries are planned or, for
example, to the size of the batches can make some tests useless.




-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>

