Posted to dev@mahout.apache.org by Drew Farris <dr...@apache.org> on 2010/10/08 21:22:37 UTC

integration tests / example scripts?

It sure would be really nice if we had more integration tests /
example scripts for the various algorithms, like the build-reuters.sh
script. These capture problems with the system in the way real users
are likely to first encounter it, and provide an easy way for new
users to understand the steps of using Mahout outside of the wiki.
If we were really smart, we'd run them automatically from Hudson as a
separate sanity check and then use something like gist to publish them
to Confluence automatically so our examples would always be up to
date. But I get ahead of myself.

Would something like the script attached to
https://issues.apache.org/jira/browse/MAHOUT-520, which adds a script
to run the bayes 20newsgroups example, be appropriate to commit at
this point?

Drew

Re: integration tests / example scripts?

Posted by Joe Kumar <jo...@gmail.com>.
Drew,

Thanks for your suggestions.

I have modified the script to enable a non-interactive mode, invoked by
calling the script with the parameter "-ni". This will just run Canopy
clustering. I am not sure if it should run all the clustering algorithms.
Any thoughts?
By default the script will be in interactive mode, so when users invoke
the script they'll choose the clustering algorithm interactively.
Let me know if this is OK, and I'll test the script and upload it
tonight.
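
A minimal sketch of how such a "-ni" switch could be handled in a shell
script. This is not the script attached to MAHOUT-520: the function name
choose_algorithm and the two-entry menu (canopy, kmeans) are assumptions
made for illustration only.

```shell
#!/bin/sh
# Sketch only: choose_algorithm and the menu entries are hypothetical,
# not taken from the script attached to MAHOUT-520.

# Print the name of the clustering algorithm to run.
# With "-ni" as the first argument, skip the prompt and default to canopy.
choose_algorithm() {
  if [ "$1" = "-ni" ]; then
    echo "canopy"
    return 0
  fi
  # Interactive mode: show the menu on stderr, read the choice from stdin.
  echo "1) canopy  2) kmeans" >&2
  printf "Choose a clustering algorithm: " >&2
  read -r choice || choice=""
  case "$choice" in
    1) echo "canopy" ;;
    2) echo "kmeans" ;;
    *) echo "unknown choice: $choice" >&2; return 1 ;;
  esac
}
```

A calling script could then use something like
algo=$(choose_algorithm "$@") || exit 1
and dispatch on $algo, so the same script serves both an interactive user
and an unattended build.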

Regarding the check on Hadoop, I was thinking of "hadoop fs -ls" but
wasn't really sure. Anyhow, I am going with this for now, and if there is a
better way we can adopt it later, I guess.

Also, shall I change build-reuters.sh to support the interactive /
non-interactive modes?

regards
Joe.

On Mon, Oct 11, 2010 at 9:14 AM, Drew Farris <dr...@apache.org> wrote:

> On Sun, Oct 10, 2010 at 11:36 PM, Joe Kumar <jo...@gmail.com> wrote:
> > Drew / all,
> >
> > I have written a script (80% done) for running the clustering job on
> > synthetic control data.
> > Should I upload this to MAHOUT-520 or should I open a new JIRA issue?
>
> Great! I've revised MAHOUT-520's description to accommodate this, so
> why don't you go ahead and attach the script/patch there.
>
> > I'm thinking of modifying build-reuters.sh to make it more interactive.
> > Currently it says "uncomment lines for kmeans or lda" but we can ask the
> > user to select whether they want to run kmeans or lda and invoke the
> > command for those algorithms accordingly. I have done something similar
> > for the synthetic control data example.
>
> Interactivity is >ok<, but it would be excellent if the script did
> not >require< interactivity, e.g. was able to run automatically with
> command-line arguments, perhaps. This way the scripts could be run as
> part of the nightly build in Hudson.
>
> > When we are running some of the examples, we are checking if HADOOP_HOME
> > is set. Sometimes HADOOP_HOME might be set but Hadoop is not running, and
> > then our examples would fail. So I am trying to see what would be the best
> > way to check and make sure Hadoop is up from a shell script. Once I get
> > this, the script for synthetic control data should be complete.
>
> I'm not aware of best practices here, but 'hadoop dfs -ls' can be used
> to check that the namenode is available, and 'hadoop job -list' can be
> used to check if the jobtracker is available. Each of these will retry
> up to 10 times to contact the namenode or jobtracker, so there will
> be a bit of a pause before an error if the service in question isn't
> available.
>
> Not sure the best way to obtain datanode/tasktracker health.
>
> Drew
>

Re: integration tests / example scripts?

Posted by Drew Farris <dr...@apache.org>.
On Sun, Oct 10, 2010 at 11:36 PM, Joe Kumar <jo...@gmail.com> wrote:
> Drew / all,
>
> I have written a script (80% done) for running the clustering job on
> synthetic control data.
> Should I upload this to MAHOUT-520 or should I open a new JIRA issue?

Great! I've revised MAHOUT-520's description to accommodate this, so
why don't you go ahead and attach the script/patch there.

> I'm thinking of modifying build-reuters.sh to make it more interactive.
> Currently it says "uncomment lines for kmeans or lda" but we can ask the
> user to select whether they want to run kmeans or lda and invoke the
> command for those algorithms accordingly. I have done something similar
> for the synthetic control data example.

Interactivity is >ok<, but it would be excellent if the script did
not >require< interactivity, e.g. was able to run automatically with
command-line arguments, perhaps. This way the scripts could be run as
part of the nightly build in Hudson.

> When we are running some of the examples, we are checking if HADOOP_HOME is
> set. Sometimes HADOOP_HOME might be set but Hadoop is not running, and then
> our examples would fail. So I am trying to see what would be the best way to
> check and make sure Hadoop is up from a shell script. Once I get this, the
> script for synthetic control data should be complete.

I'm not aware of best practices here, but 'hadoop dfs -ls' can be used
to check that the namenode is available, and 'hadoop job -list' can be
used to check if the jobtracker is available. Each of these will retry
up to 10 times to contact the namenode or jobtracker, so there will
be a bit of a pause before an error if the service in question isn't
available.
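
A rough sketch of how these two checks might be wrapped in a shell
function, using the 'hadoop dfs -ls' and 'hadoop job -list' forms of the
commands. It assumes both commands exit non-zero when the service cannot
be contacted. The names hadoop_is_up and HADOOP_CMD are hypothetical;
HADOOP_CMD exists only so the launcher can be overridden, and it defaults
to bin/hadoop under HADOOP_HOME. It does not say anything about
datanode/tasktracker health.

```shell
#!/bin/sh
# Sketch only: hadoop_is_up and HADOOP_CMD are hypothetical names.

# Return 0 if HADOOP_HOME is set and both the namenode and the
# jobtracker answer; otherwise print a reason on stderr and return 1.
hadoop_is_up() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  # Launcher to call; defaults to the one shipped under HADOOP_HOME.
  hadoop_cmd="${HADOOP_CMD:-$HADOOP_HOME/bin/hadoop}"

  # Listing the root directory needs a reachable namenode.
  if ! "$hadoop_cmd" dfs -ls / >/dev/null 2>&1; then
    echo "namenode does not appear to be available" >&2
    return 1
  fi
  # Listing jobs needs a reachable jobtracker.
  if ! "$hadoop_cmd" job -list >/dev/null 2>&1; then
    echo "jobtracker does not appear to be available" >&2
    return 1
  fi
  return 0
}
```

An example script could call hadoop_is_up || exit 1 before submitting any
job. Because of the retries mentioned above, a failed check may take a
while to return.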

Not sure the best way to obtain datanode/tasktracker health.

Drew

Re: integration tests / example scripts?

Posted by Joe Kumar <jo...@gmail.com>.
Drew / all,

I have written a script (80% done) for running the clustering job on
synthetic control data.
Should I upload this to MAHOUT-520 or should I open a new JIRA issue?
I'm thinking of modifying build-reuters.sh to make it more interactive.
Currently it says "uncomment lines for kmeans or lda" but we can ask the
user to select whether they want to run kmeans or lda and invoke the
command for those algorithms accordingly. I have done something similar
for the synthetic control data example.

When we are running some of the examples, we are checking if HADOOP_HOME is
set. Sometimes HADOOP_HOME might be set but Hadoop is not running, and then
our examples would fail. So I am trying to see what would be the best way to
check and make sure Hadoop is up from a shell script. Once I get this, the
script for synthetic control data should be complete.

I searched Google to see if there are any best practices / approaches
regarding this but couldn't really find anything solid.
I'd appreciate your thoughts.

regards
Joe.

On Sat, Oct 9, 2010 at 1:10 PM, Gangadhar Nittala
<np...@gmail.com> wrote:

> I think scripts which help users understand the usage of the various
> algorithms will be helpful. For the 0.5 release, if some of the
> algorithms have necessary scripts associated with them, it will make
> it easy for people interested in contributing to run the tests and
> look at the code. While testing the Bayes classifier that was one of
> the issues I faced.
>
> On Fri, Oct 8, 2010 at 8:40 PM, Ted Dunning <te...@gmail.com> wrote:
> > I will build a few SGD based classifier scripts.
> >
> > On Fri, Oct 8, 2010 at 12:29 PM, Drew Farris <dr...@apache.org> wrote:
> >
> >> Perhaps it would be easy for the individuals doing tests for 0.4 to at
> >> least take a transcript of the commands they're using so that they can
> >> eventually be changed into these sorts of scripts.
> >>
> >> On Fri, Oct 8, 2010 at 3:25 PM, Robin Anil <ro...@gmail.com> wrote:
> >> > +1 for integration script
> >> >
> >> > On Sat, Oct 9, 2010 at 12:52 AM, Drew Farris <dr...@apache.org> wrote:
> >> >
> >> >> It sure would be really nice if we had more integration tests /
> >> >> example scripts for the various algorithms like build-reuters.sh
> >> >> script. These capture problems with the system in the way real users
> >> >> are likely to first encounter it, and provide an easy way for new
> >> >> users to understand the steps of using mahout externally to the wiki.
> >> >> If we were really smart, we'd run them automatically from hudson as a
> >> >> separate sanity check and then use something like gist to publish them
> >> >> to confluence automatically so our examples would always be up to
> >> >> date. But I get ahead of myself.
> >> >>
> >> >> Would something like the script attached to
> >> >> https://issues.apache.org/jira/browse/MAHOUT-520, which adds a script
> >> >> to run the bayes 20newsgroups example, be appropriate to commit at
> >> >> this point?
> >> >>
> >> >> Drew
> >> >>
> >> >
> >>
> >
>

Re: integration tests / example scripts?

Posted by Gangadhar Nittala <np...@gmail.com>.
I think scripts which help users understand the usage of the various
algorithms will be helpful. For the 0.5 release, if some of the
algorithms have necessary scripts associated with them, it will make
it easy for people interested in contributing to run the tests and
look at the code. While testing the Bayes classifier that was one of
the issues I faced.

On Fri, Oct 8, 2010 at 8:40 PM, Ted Dunning <te...@gmail.com> wrote:
> I will build a few SGD based classifier scripts.
>
> On Fri, Oct 8, 2010 at 12:29 PM, Drew Farris <dr...@apache.org> wrote:
>
>> Perhaps it would be easy for the individuals doing tests for 0.4 to at
>> least take a transcript of the commands they're using so that they can
>> eventually be changed into these sorts of scripts.
>>
>> On Fri, Oct 8, 2010 at 3:25 PM, Robin Anil <ro...@gmail.com> wrote:
>> > +1 for integration script
>> >
>> > On Sat, Oct 9, 2010 at 12:52 AM, Drew Farris <dr...@apache.org> wrote:
>> >
>> >> It sure would be really nice if we had more integration tests /
>> >> example scripts for the various algorithms like build-reuters.sh
>> >> script. These capture problems with the system in the way real users
>> >> are likely to first encounter it, and provide an easy way for new
>> >> users to understand the steps of using mahout externally to the wiki.
>> >> If we were really smart, we'd run them automatically from hudson as a
>> >> separate sanity check and then use something like gist to publish them
>> >> to confluence automatically so our examples would always be up to
>> >> date. But I get ahead of myself.
>> >>
>> >> Would something like the script attached to
>> >> https://issues.apache.org/jira/browse/MAHOUT-520, which adds a script
>> >> to run the bayes 20newsgroups example, be appropriate to commit at
>> >> this point?
>> >>
>> >> Drew
>> >>
>> >
>>
>

Re: integration tests / example scripts?

Posted by Ted Dunning <te...@gmail.com>.
I will build a few SGD based classifier scripts.

On Fri, Oct 8, 2010 at 12:29 PM, Drew Farris <dr...@apache.org> wrote:

> Perhaps it would be easy for the individuals doing tests for 0.4 to at
> least take a transcript of the commands they're using so that they can
> eventually be changed into these sorts of scripts.
>
> On Fri, Oct 8, 2010 at 3:25 PM, Robin Anil <ro...@gmail.com> wrote:
> > +1 for integration script
> >
> > On Sat, Oct 9, 2010 at 12:52 AM, Drew Farris <dr...@apache.org> wrote:
> >
> >> It sure would be really nice if we had more integration tests /
> >> example scripts for the various algorithms like build-reuters.sh
> >> script. These capture problems with the system in the way real users
> >> are likely to first encounter it, and provide an easy way for new
> >> users to understand the steps of using mahout externally to the wiki.
> >> If we were really smart, we'd run them automatically from hudson as a
> >> separate sanity check and then use something like gist to publish them
> >> to confluence automatically so our examples would always be up to
> >> date. But I get ahead of myself.
> >>
> >> Would something like the script attached to
> >> https://issues.apache.org/jira/browse/MAHOUT-520, which adds a script
> >> to run the bayes 20newsgroups example, be appropriate to commit at
> >> this point?
> >>
> >> Drew
> >>
> >
>

Re: integration tests / example scripts?

Posted by Drew Farris <dr...@apache.org>.
Perhaps it would be easy for the individuals doing tests for 0.4 to at
least take a transcript of the commands they're using so that they can
eventually be changed into these sorts of scripts.
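
One possible way to capture such a transcript, sketched as a small shell
helper. The names run and TRANSCRIPT are hypothetical, and the default
file name commands.txt is an arbitrary choice for illustration.

```shell
#!/bin/sh
# Sketch only: run and TRANSCRIPT are hypothetical names.

# File that collects the commands; override by setting TRANSCRIPT.
TRANSCRIPT="${TRANSCRIPT:-commands.txt}"

# Append the command line to the transcript, then execute it.
# Note: "$*" flattens the arguments, so quoting inside arguments that
# contain spaces is not preserved in the transcript.
run() {
  echo "$*" >> "$TRANSCRIPT"
  "$@"
}
```

A tester would prefix each command with run, and the transcript file then
holds the sequence of commands in the order they were issued, ready to be
edited into an example script.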

On Fri, Oct 8, 2010 at 3:25 PM, Robin Anil <ro...@gmail.com> wrote:
> +1 for integration script
>
> On Sat, Oct 9, 2010 at 12:52 AM, Drew Farris <dr...@apache.org> wrote:
>
>> It sure would be really nice if we had more integration tests /
>> example scripts for the various algorithms like build-reuters.sh
>> script. These capture problems with the system in the way real users
>> are likely to first encounter it, and provide an easy way for new
>> users to understand the steps of using mahout externally to the wiki.
>> If we were really smart, we'd run them automatically from hudson as a
>> separate sanity check and then use something like gist to publish them
>> to confluence automatically so our examples would always be up to
>> date. But I get ahead of myself.
>>
>> Would something like the script attached to
>> https://issues.apache.org/jira/browse/MAHOUT-520, which adds a script
>> to run the bayes 20newsgroups example, be appropriate to commit at
>> this point?
>>
>> Drew
>>
>

Re: integration tests / example scripts?

Posted by Robin Anil <ro...@gmail.com>.
+1 for integration script

On Sat, Oct 9, 2010 at 12:52 AM, Drew Farris <dr...@apache.org> wrote:

> It sure would be really nice if we had more integration tests /
> example scripts for the various algorithms like build-reuters.sh
> script. These capture problems with the system in the way real users
> are likely to first encounter it, and provide an easy way for new
> users to understand the steps of using mahout externally to the wiki.
> If we were really smart, we'd run them automatically from hudson as a
> separate sanity check and then use something like gist to publish them
> to confluence automatically so our examples would always be up to
> date. But I get ahead of myself.
>
> Would something like the script attached to
> https://issues.apache.org/jira/browse/MAHOUT-520, which adds a script
> to run the bayes 20newsgroups example, be appropriate to commit at
> this point?
>
> Drew
>