Posted to dev@mrunit.apache.org by Steve Loughran <st...@apache.org> on 2011/05/26 12:35:24 UTC

adding more test stuff to mrunit

I'm thinking: could MRUnit be the place to put other Hadoop testing code?

Specifically:

== JUnit on multiple hosts ==

I have some prototype code that executes JUnit test cases as MR jobs and
collects the results (including serialized throwables). It runs one test
per line of text (the name of the package). It could be better to support
lines of tests and config options, or other ways to explore the config
space. I'd also really like to be able to deploy the JUnit tests to all
the workers in the cluster; the reduction would then identify which boxes
are playing up.
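
A minimal sketch of that map/reduce shape, in plain Java with no Hadoop or JUnit dependency so it can be read and run on its own. It is not the prototype code mentioned above. The class name HostTestReport, the TestRunner interface (a stand-in for actually invoking JUnit), and the host and test names are all hypothetical, and serialized throwables are reduced to a simple pass/fail flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the map and reduce steps (hypothetical names throughout). */
class HostTestReport {

    /** One test outcome, as a map task might emit it keyed by host. */
    static final class Outcome {
        final String host;
        final String testName;
        final boolean passed;

        Outcome(String host, String testName, boolean passed) {
            this.host = host;
            this.testName = testName;
            this.passed = passed;
        }
    }

    /** Hypothetical stand-in for running the named test with JUnit. */
    interface TestRunner {
        boolean run(String testName);
    }

    /** "Map" step: each non-blank input line names one test to run on this host. */
    static List<Outcome> map(String host, List<String> lines, TestRunner runner) {
        List<Outcome> out = new ArrayList<>();
        for (String line : lines) {
            String name = line.trim();
            if (name.isEmpty()) {
                continue; // skip blank lines
            }
            boolean passed;
            try {
                passed = runner.run(name);
            } catch (RuntimeException e) {
                passed = false; // a thrown exception counts as a failure
            }
            out.add(new Outcome(host, name, passed));
        }
        return out;
    }

    /** "Reduce" step: list failed tests per host; healthy hosts are omitted. */
    static Map<String, List<String>> reduce(List<Outcome> outcomes) {
        Map<String, List<String>> failures = new TreeMap<>();
        for (Outcome o : outcomes) {
            if (!o.passed) {
                failures.computeIfAbsent(o.host, h -> new ArrayList<>()).add(o.testName);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("org.example.DiskTest", "org.example.NetTest");
        List<Outcome> all = new ArrayList<>();
        // Pretend worker-2 is misbehaving: DiskTest fails only there.
        all.addAll(map("worker-1", lines, name -> true));
        all.addAll(map("worker-2", lines, name -> !name.endsWith("DiskTest")));
        System.out.println(reduce(all)); // prints {worker-2=[org.example.DiskTest]}
    }
}
```

In a real job the host would come from the task's environment and the outcome would carry the serialized throwable rather than a boolean; both are left out here to keep the sketch small.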

== Sampling for testing ==

Good desktop tests need real data, which means sampling from the live 
datasets. Some standard MR jobs to do the sampling (which themselves use 
MRUnit to self-test) could make it easier to sample.
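
As a rough illustration of what such a sampling step could compute, here are two common strategies in plain Java. Neither is taken from an existing MR job; the class name RecordSampler and its method names are hypothetical. Sampling by rate fits a map-only job, since each record is kept or dropped independently. Reservoir sampling gives a fixed-size uniform sample in one pass, which in an MR setting would need a single reducer or a per-task reservoir that is merged afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Two sampling strategies a sampling job could use (hypothetical helper class). */
class RecordSampler {

    /** Keep each record independently with probability {@code rate}. */
    static <T> List<T> sampleByRate(Iterable<T> records, double rate, long seed) {
        if (rate < 0.0 || rate > 1.0) {
            throw new IllegalArgumentException("rate must be in [0, 1]");
        }
        Random rng = new Random(seed); // fixed seed keeps the sample reproducible
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            if (rng.nextDouble() < rate) {
                kept.add(record);
            }
        }
        return kept;
    }

    /** Reservoir sampling: a uniform sample of min(k, n) records in one pass. */
    static <T> List<T> reservoir(Iterable<T> records, int k, long seed) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        Random rng = new Random(seed);
        List<T> sample = new ArrayList<>();
        long seen = 0;
        for (T record : records) {
            seen++;
            if (sample.size() < k) {
                sample.add(record); // fill the reservoir first
            } else {
                // Pick a slot in [0, seen); replace only if it falls inside the reservoir.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    sample.set((int) slot, record);
                }
            }
        }
        return sample;
    }
}
```

Seeding the generator is a deliberate choice for test data: the same input and seed always give the same sample, so a desktop test built on it stays stable.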

Thoughts?

Re: adding more test stuff to mrunit

Posted by Aaron Kimball <ak...@gmail.com>.

Steve,

I'm interested in the "sampling for testing" direction; the ability to
"fuzz test" your MR job over a subset of the input sounds like a really
good idea. Figuring out how to handle
InputFormat/serialization issues regarding records extracted in such a way
is likely to be a big challenge there -- though not insurmountable. The
other interesting challenge is "how do you write an output specification for
that sampled input run, that can be easily verified as correct?"
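
One possible answer to that last question, offered only as a sketch: instead of spelling out the exact expected output for a sample nobody has inspected, check invariants that must hold for any input. The example below assumes a word-count job purely for illustration; the class SampledRunCheck and its methods are hypothetical and not part of MRUnit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Invariant-based check for a word-count run over sampled input (illustrative only). */
class SampledRunCheck {

    /** Split a line on whitespace, dropping empty tokens. */
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String t : line.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Stand-in for the job under test: counts occurrences of each word. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokens(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** True if the output is consistent with the input, without knowing exact counts. */
    static boolean satisfiesInvariants(List<String> input, Map<String, Integer> output) {
        Set<String> inputWords = new HashSet<>();
        long totalTokens = 0;
        for (String line : input) {
            List<String> words = tokens(line);
            inputWords.addAll(words);
            totalTokens += words.size();
        }
        long totalCounted = 0;
        for (Map.Entry<String, Integer> e : output.entrySet()) {
            // Every output key must occur in the input, with a positive count.
            if (!inputWords.contains(e.getKey()) || e.getValue() <= 0) {
                return false;
            }
            totalCounted += e.getValue();
        }
        // The counts must add up to the number of input tokens.
        return totalCounted == totalTokens;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("to be or not to be", "that is the question");
        System.out.println(satisfiesInvariants(sample, wordCount(sample)));
    }
}
```

Checks like these are weaker than an exact expected output, but they can be written once and applied to any sample drawn from the live data.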

I'm a bit more confused about the topicality of the "distributing JUnit
tests" problem. In my mind, the direction MRUnit should pursue is to make it
easier to test MapReduce jobs, and more specifically, their constituent
components (currently, Mappers and Reducers). The ability to parallelize
test execution of generic JUnit tests across a number of machines is a
pretty different problem; can you explain where something like
Hudson/Jenkins that already has the ability to aggregate parallel testing
workloads isn't the right choice, and what we could do better/differently?

Cheers,
- Aaron

On Fri, May 27, 2011 at 4:36 PM, Patrick Hunt <ph...@apache.org> wrote:

> Even subprojects are considered separate communities (at least that's
> my understanding of it). In general Apache frowns on subs. I believe
> another option is to have the code base as a separate repo from the
> tlp, but still part of the tlp, with separate dev/release cycles but a
> single "community". This is the ideal, not sure how it reconciles with
> the real world.
>
> Patrick

Re: adding more test stuff to mrunit

Posted by Patrick Hunt <ph...@apache.org>.

Even subprojects are considered separate communities (at least that's
my understanding of it). In general, Apache frowns on subprojects. I
believe another option is to keep the code base in a separate repo from
the TLP's, but still part of the TLP, with separate dev/release cycles
but a single "community". This is the ideal; I'm not sure how it
reconciles with the real world.

Patrick

On Thu, May 26, 2011 at 11:04 AM, Eric Sammer <es...@cloudera.com> wrote:
> And I think that leads in to the conversation about where mrunit goes when
> we graduate. The original purpose of a breakout was mostly to allow separate
> release cycles and to be able to support multiple versions of Hadoop (if we
> wanted to do such things) without having circular dependencies across
> versions. If we graduate to a standalone subproject of Hadoop (which may be
> an option, subject to the Hadoop PMC's approval) we could "reunite" the
> communities while still remaining independent. Just a thought.

Re: adding more test stuff to mrunit

Posted by Eric Sammer <es...@cloudera.com>.

And I think that leads into the conversation about where MRUnit goes when
we graduate. The original purpose of a breakout was mostly to allow separate
release cycles and to be able to support multiple versions of Hadoop (if we
wanted to do such things) without having circular dependencies across
versions. If we graduate to a standalone subproject of Hadoop (which may be
an option, subject to the Hadoop PMC's approval) we could "reunite" the
communities while still remaining independent. Just a thought.

On Thu, May 26, 2011 at 9:58 AM, Patrick Hunt <ph...@apache.org> wrote:

> I had suggested something like this in one of the original "remove
> MRUNIT from hadoop contrib" threads... There was some push back about
> community fragmentation (tests should live in hadoop), but I
> personally don't see why not, we could course correct as things
> mature.

-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com

Re: adding more test stuff to mrunit

Posted by Patrick Hunt <ph...@apache.org>.

I had suggested something like this in one of the original "remove
MRUNIT from hadoop contrib" threads... There was some pushback about
community fragmentation (tests should live in Hadoop), but I
personally don't see why not; we could course-correct as things
mature.

On Thu, May 26, 2011 at 3:35 AM, Steve Loughran <st...@apache.org> wrote:
> I'm thinking, could MRUnit be the place to put in other hadoop-testing code.
>
> specifically
>
> == Junit on multiple hosts ==
>
>
> I have some prototype code to exec junit test cases as MR jobs, collect the
> results (including serialized throwables). It runs one test per line of text
> (the name of the package). It could be better to support lines of tests and
> config options, or other ways to explore the config space. And I'd really
> like to be able to deploy the junit tests to all the workers in the cluster,
> the reduction would be to identify which boxes are playing up.
>
> == Sampling for testing ==
>
> Good desktop tests need real data, which means sampling from the live
> datasets. Some standard MR jobs to do the sampling (which themselves use MR
> Unit to self-test) could make it easier to sample.
>
> thoughts?
>