You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Isabel Drost <is...@apache.org> on 2012/07/04 20:25:25 UTC

Swedish refactoring

Hi,


recently I spend a week in Sweden - getting there involved spending 6h on a 
ferry. I spend that time on some (potentially crazy) refactoring idaes I wanted 
to try out with Mahout for quite some time already. This is to share the results 
with you and maybe get some feedback. When looking at the final result, keep in 
mind that this is the result of a spike - so it's more a proof of concept than 
anything clean and polished.

The goals I wanted to accomplish were two-fold:

a) Running the Mahout tests takes ages at least on my machine (I admit that that 
one is pretty dated, however even running them on a faster laptop with ssd and 
three times more memory did not help).

b) When telling the story of Mahout I am faced with comments along the lines of 
"I don't want to install Hadoop to run it" over and over again. It seems non-
obvious that Mahout actually aims to be a stable, scalable Java library that 
comes with the added benefit of having quite a few of it's algorithms 
implemented on Hadoop.

The third goal was pretty selfish, namely to get some experience with the random 
testing framework used over in Lucene-land.

Here's what I did:

- introduce the RandomTestRunner

- fix all issues that bubbled up as a result (I ran into quite a few problems 
due to threads not being shut down either as part of the test (not too critical) 
or even during regular computation - happy to isolate what I ran into here and 
file separate issues (including fixes obviously) for each of these.

- mark all long running tests with the nightly annotation - my goal here is not 
to switch them off forever but rather draw contributors' attention to those 
running particularly long (>20s) and fix them

- convert our current core module into a parent, move any code in that into a 
submodule called stuff

- move anything out of stuff and into module write that concerns serialization 
and is reasonably algorithm independent

- move anything out of stuff and into module hadoop that really needs mapreduce 
to run

- move anything out of stuff and into cli that offers just a command line 
interface to implementations (I might have missed some jobs here that still 
contain logic in addition to the command line stuff, all I did was to go through 
and fix failing tests, for several jobs I factored the parameters into separate 
beans to deal with default values, I suppose some of Frank's work could come in 
handy when doing that right.)

- factor some of the unit testing utils into their own modules (those two could 
be collapsed actually) to avoid depending on running the tests just for 
compiling all the source code.


Here's where you can look at the results:

https://github.com/MaineC/mahout/tree/swedish-refactoring

Unfortunately there's one huge commit on June 26th - I was so naive to believe 
in "let me just do a tiny refactoring and commit the successfully building code 
back" kind of assumption...

If there's any interest in any of the results above, I'd be happy to
go through the refactoring work in a non-feasibility-study mode and
change thing piece by piece. There are still several open questions - like 
enable running tests in parallel (the changes I made to get that in the parent 
pom are right now disabled as I ran into some errors due to tests failing when 
running in parallel to other tests), our tests could benefit a lot from using 
more of the random testing functionality. Naming is more than sub-optimal. In 
the end I'd love to have a "all mahout included" jar in addition to the 
individual modules, etc.


Feedback welcome - also feel free to ignore in case that stuff is just too far 
from the current roadmap,

Isabel


PS: If you read that - Alan, thanks for getting us to drive to the coast instead 
of into the mountains close to the arctic circle - that recommendation was 
awesome!


Re: Swedish refactoring

Posted by Isabel Drost <is...@apache.org>.
On Wed, Jul 4, 2012 at 10:19 PM, Sean Owen <sr...@gmail.com> wrote:
> There's a scene in a movie I love called Kicking and Screaming (not
> the Will Ferrell one), where someone drops a glass on the kitchen
> floor. The housemates are quintessentially lazy, and so its left to
> the next day. But one takes action the next day -- he places a
> carefully lettered sign reading "Watch Out" over the broken glass and
> moves on.
>
> Annotations with this purpose remind me of this scene.

<irony>Isn't that what we are already doing by telling our users to
run the build with tests set to skip?</irony>

This is not to say that taking long-running stuff out of the build by
an annotation is a whole lot better - committers won't be annoyed by
long running tests anymore as  a result and might be even less likely
to fix them. That part of my branch just helped me to iterate more
quickly by having a build that fails quickly in case of trivial
mistakes (due to the faster tests failing) and having a few slow final
iterations to fix the rest.

Isabel

Re: Swedish refactoring

Posted by Ted Dunning <te...@gmail.com>.
Frankly, the Hadoop dependency doesn't really cause a problem here.  I
disagree with Sean about the essential identity of Hadoop and Mahout, but
having the Hadoop dependency doesn't cause me any anxiety when doing
non-Hadoop stuff.

On Wed, Jul 4, 2012 at 1:50 PM, Dawid Weiss <da...@cs.put.poznan.pl>wrote:

> > Sure, does this annotation make that happen automatically? so it
> > doesn't run unless I set a certain flag, that the Hudson build sets?
>
> Yes. Test groups are controlled by system properties (you pick the
> names). The only one predefined is @Nightly which is enabled with
> "-Dtests.nightly=on". See the definition here and you'll see how to
> build up your own:
>
>
> https://github.com/carrotsearch/randomizedtesting/blob/master/randomized-runner/src/main/java/com/carrotsearch/randomizedtesting/annotations/Nightly.java
>
> Lucene's test groups are nested static class inside LuceneTestCase.
>
> D.
>

Re: Swedish refactoring

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Sure, does this annotation make that happen automatically? so it
> doesn't run unless I set a certain flag, that the Hudson build sets?

Yes. Test groups are controlled by system properties (you pick the
names). The only one predefined is @Nightly which is enabled with
"-Dtests.nightly=on". See the definition here and you'll see how to
build up your own:

https://github.com/carrotsearch/randomizedtesting/blob/master/randomized-runner/src/main/java/com/carrotsearch/randomizedtesting/annotations/Nightly.java

Lucene's test groups are nested static class inside LuceneTestCase.

D.

Re: Swedish refactoring

Posted by Sean Owen <sr...@gmail.com>.
Sure, does this annotation make that happen automatically? so it
doesn't run unless I set a certain flag, that the Hudson build sets?
makes sense but I didn't know whether that was the idea, or just to
flag these things. It's the just flagging things that I am not so keen
on. They've been flagged for a long time in JIRA.

On Wed, Jul 4, 2012 at 11:24 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> These aren't going to get fixed;
>
> I think the idea is to allow long-running tests on build servers where
> they don't really annoy people so much and have a quick run for
> developers to check before they commit etc.
>
> Dawid

Re: Swedish refactoring

Posted by Frank Scholten <fr...@frankscholten.nl>.
On Fri, Jul 6, 2012 at 3:26 PM, Isabel Drost <is...@apache.org> wrote:
> On Fri, Jul 6, 2012 at 9:28 AM, Frank Scholten <fr...@frankscholten.nl> wrote:
>> Yeah, I think that would be useful to have. An alternative is to use
>> Maven profiles and use conventions like *IntegrationTest or
>> *MapReduceTest as part of the name of the test class.
>
> Jepp - true. One thing though: I never managed to convince my IDE to
> not run tests named *ITest* when right-clicking on the package or
> folder containing the tests.

This can be solved by having subpackages like 'integration' and
'mapreduce' in the test source tree.

> (And yes, I have to admit that I also> like some other features the random testing component provides - e.g.
> tracking leaking threads, executing tests out of order etc. and needed
> an excuse to play with them during my holiday.)
>
>> Would it make a difference in test speed if we used MRUnit for some of
>> the tests? Just curious.
>
> I haven't looked too deeply into what our tests do - the ones I did
> mark @Nightly mostly looked like integration tests that check the
> whole workflow instead of the individual units. Which does not mean
> that using MRUnit for some of our tests might not actually speed
> things up.

I think there are also a few tests that use more iterations / data
then necessary.
>
>
> Isabel
>

Re: Swedish refactoring

Posted by Isabel Drost <is...@apache.org>.
On Fri, Jul 6, 2012 at 9:28 AM, Frank Scholten <fr...@frankscholten.nl> wrote:
> Yeah, I think that would be useful to have. An alternative is to use
> Maven profiles and use conventions like *IntegrationTest or
> *MapReduceTest as part of the name of the test class.

Jepp - true. One thing though: I never managed to convince my IDE to
not run tests named *ITest* when right-clicking on the package or
folder containing the tests. (And yes, I have to admit that I also
like some other features the random testing component provides - e.g.
tracking leaking threads, executing tests out of order etc. and needed
an excuse to play with them during my holiday.)

> Would it make a difference in test speed if we used MRUnit for some of
> the tests? Just curious.

I haven't looked too deeply into what our tests do - the ones I did
mark @Nightly mostly looked like integration tests that check the
whole workflow instead of the individual units. Which does not mean
that using MRUnit for some of our tests might not actually speed
things up.


Isabel

Re: Swedish refactoring

Posted by Frank Scholten <fr...@frankscholten.nl>.
Yeah, I think that would be useful to have. An alternative is to use
Maven profiles and use conventions like *IntegrationTest or
*MapReduceTest as part of the name of the test class.

Users have the option of running only the unit test, MR tests or do a
full build by activating a Maven profile.

Another benefit of these naming conventions is it integrates nicely in
an IDE because with autocompletion and so on you can quickly see what
kind of tests are available.

The nightly on Jenkins will run a full build.

Would it make a difference in test speed if we used MRUnit for some of
the tests? Just curious.

Frank

On Wed, Jul 4, 2012 at 10:24 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> These aren't going to get fixed;
>
> I think the idea is to allow long-running tests on build servers where
> they don't really annoy people so much and have a quick run for
> developers to check before they commit etc.
>
> Dawid
>

Re: Swedish refactoring

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> These aren't going to get fixed;

I think the idea is to allow long-running tests on build servers where
they don't really annoy people so much and have a quick run for
developers to check before they commit etc.

Dawid

Re: Swedish refactoring

Posted by Sean Owen <sr...@gmail.com>.
On Wed, Jul 4, 2012 at 9:25 PM, Isabel Drost <is...@apache.org> wrote:

> b) When telling the story of Mahout I am faced with comments along the lines of
> "I don't want to install Hadoop to run it" over and over again. It seems non-
> obvious that Mahout actually aims to be a stable, scalable Java library that
> comes with the added benefit of having quite a few of it's algorithms
> implemented on Hadoop.

Personally, I think that Mahout equals Hadoop-based. I don't think
it's so useful to pretend otherwise, and do not see an issue with this
identity. It does not have consistent implementations based on
anything else. Implementations based on a very different framework
sounds like a different project in itself.


> - mark all long running tests with the nightly annotation - my goal here is not
> to switch them off forever but rather draw contributors' attention to those
> running particularly long (>20s) and fix them

There's a scene in a movie I love called Kicking and Screaming (not
the Will Ferrell one), where someone drops a glass on the kitchen
floor. The housemates are quintessentially lazy, and so its left to
the next day. But one takes action the next day -- he places a
carefully lettered sign reading "Watch Out" over the broken glass and
moves on.

Annotations with this purpose remind me of this scene.

These aren't going to get fixed; these half-hour tests have been in
for a year with notes to fix. They're kinda broken tests; fix them by
removing them. It's less good than fixing, but better than people not
running tests since they take like an hour to finish.


> - convert our current core module into a parent, move any code in that into a
> submodule called stuff
>
> - move anything out of stuff and into module write that concerns serialization
> and is reasonably algorithm independent
>
> - move anything out of stuff and into module hadoop that really needs mapreduce
> to run
>
> - move anything out of stuff and into cli that offers just a command line
> interface to implementations (I might have missed some jobs here that still
> contain logic in addition to the command line stuff, all I did was to go through
> and fix failing tests, for several jobs I factored the parameters into separate
> beans to deal with default values, I suppose some of Frank's work could come in
> handy when doing that right.)
>
> - factor some of the unit testing utils into their own modules (those two could
> be collapsed actually) to avoid depending on running the tests just for
> compiling all the source code.

I think I'm against this -- it's complex, and i think it supposes that
the code is meaningfully separable along these lines. I don't think it
is. Everything is fairly tied in with Hadoop and its serialization.
Weigh that against the complication of making users -- who are going
to want all the Hadoop stuff -- now import 4-5 JARs / artifacts
instead of 2-3, which already feels like 1-2 too many.

There is a lot of messy stuff that needs to be refactored, and maybe
these ideas encompass some of them, but I don't think any I can see
are at the module level. It's dirtier and harder than that.

Re: Swedish refactoring

Posted by Isabel Drost <is...@apache.org>.
On Wed, Jul 4, 2012 at 8:43 PM, Dawid Weiss
<da...@cs.put.poznan.pl> wrote:
>> - introduce the RandomTestRunner
>
> Awesome :)

Yeah - I really liked what I saw in Poznan.


>> - fix all issues that bubbled up as a result (I ran into quite a few problems
>> due to threads not being shut down either as part of the test (not too critical)
>> or even during regular computation - happy to isolate what I ran into here and
>> file separate issues (including fixes obviously) for each of these.
>
> You can always mark the superclass with:
>
> @ThreadLeaks(failTestIfLeaking = false)

Jepp, I know that one. However in my case I really wanted them to fail
as there were a few issues with executor pools being left open etc. -
so having that part did help.


> For the most part thread leaks should be handled gracefully but there
> are situations in which recovery from a thread leak is hard. I'm
> currently fighting the subject in the Lucene land and I expect a few
> refactorings to take place -- hope it's not going to be problematic
> for you. If you have any questions or problems -- fire at me directly
> (don't want to miss something on the list).

Will certainly do!

>> - mark all long running tests with the nightly annotation - my goal here is not
>> to switch them off forever but rather draw contributors' attention to those
>> running particularly long (>20s) and fix them
>
> You can also design your own thread group and disable it by default. Look at:
>
> LuceneTestCase.@Slow
>
> annotation -- I just recently marked a lot of tests as slow. They can
> be bulk enabled or disabled with a system property.

Awesome. Thanks for pointing that out.


Isabel

Re: Swedish refactoring

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> - introduce the RandomTestRunner

Awesome :)

>
> - fix all issues that bubbled up as a result (I ran into quite a few problems
> due to threads not being shut down either as part of the test (not too critical)
> or even during regular computation - happy to isolate what I ran into here and
> file separate issues (including fixes obviously) for each of these.

You can always mark the superclass with:

@ThreadLeaks(failTestIfLeaking = false)

For the most part thread leaks should be handled gracefully but there
are situations in which recovery from a thread leak is hard. I'm
currently fighting the subject in the Lucene land and I expect a few
refactorings to take place -- hope it's not going to be problematic
for you. If you have any questions or problems -- fire at me directly
(don't want to miss something on the list).

> - mark all long running tests with the nightly annotation - my goal here is not
> to switch them off forever but rather draw contributors' attention to those
> running particularly long (>20s) and fix them

You can also design your own thread group and disable it by default. Look at:

LuceneTestCase.@Slow

annotation -- I just recently marked a lot of tests as slow. They can
be bulk enabled or disabled with a system property.

Dawid