Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/03/17 16:41:27 UTC
Demos/Tutorials
Now that we have some code in place for clustering, I think it would
be cool to put together some examples/demos of real world problems.
Things like clustering text (perhaps we can use the wikipedia download
or the reuters download that Lucene contrib/benchmark uses) or
clustering other pieces of data.
We could set up a demo area of code and use Lucene's analysis code to
create document vectors.
Ideas and/or thoughts or volunteers?
Cheers,
Grant
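A minimal sketch of what "creating document vectors" from analyzed text could look like. It does not use Lucene: the analyze() function and the STOP_WORDS list are hypothetical stand-ins for a Lucene analysis chain, and the vectors are plain term-frequency counts, which is only one of several possible weightings.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analyzer would supply its own.
STOP_WORDS = {"a", "an", "and", "of", "or", "the", "to"}


def analyze(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def document_vectors(docs):
    """Return (vocabulary, vectors): one term-frequency vector per document."""
    counts = [Counter(analyze(d)) for d in docs]
    # The vocabulary is every distinct term seen, in sorted order,
    # so each vector position always refers to the same term.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


docs = ["The price of oil rose.", "Oil and gas prices rose and rose."]
vocabulary, vectors = document_vectors(docs)
```

Each vector has one slot per vocabulary term, so documents can be compared with any vector distance measure before being handed to a clustering step.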
Re: Demos/Tutorials
Posted by Grant Ingersoll <gs...@apache.org>.
Yeah, I hear you there. I have a project I am working on that will
require me to generate examples, but it is a couple of weeks away.
The gene expression stuff is great. Text-based ones would be really
cool too. I haven't done too much clustering work (other than using
Dawid's excellent Carrot2 project), so it is a learning experience for
me, and demos and tutorials would be great.
-Grant
On Mar 18, 2008, at 5:31 AM, Dawid Weiss wrote:
>
> This is absolutely necessary, if not for just showing off with the
> project, then certainly for verification of correctness of
> algorithms inside it.
>
> I will certainly hop in to such a subtask to the extent of my
> current available time resources (not much, sadly).
>
> D.
>
> Grant Ingersoll wrote:
>> Now that we have some code in place for clustering, I think it
>> would be cool to put together some examples/demos of real world
>> problems. Things like clustering text (perhaps we can use the
>> wikipedia download or the reuters download that Lucene contrib/
>> benchmark uses) or clustering other pieces of data.
>> We could set up a demo area of code and use Lucene's analysis code
>> to create document vectors.
>> Ideas and/or thoughts or volunteers?
>> Cheers,
>> Grant
Re: Demos/Tutorials
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
This is absolutely necessary, if not just for showing off the project, then
certainly for verifying the correctness of the algorithms inside it.
I will certainly hop in on such a subtask to the extent that my currently
available time allows (not much, sadly).
D.
Grant Ingersoll wrote:
> Now that we have some code in place for clustering, I think it would be
> cool to put together some examples/demos of real world problems. Things
> like clustering text (perhaps we can use the wikipedia download or the
> reuters download that Lucene contrib/benchmark uses) or clustering other
> pieces of data.
>
> We could set up a demo area of code and use Lucene's analysis code to
> create document vectors.
>
> Ideas and/or thoughts or volunteers?
>
> Cheers,
> Grant
Re: Demos/Tutorials
Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 20, 2008, at 9:15 AM, Grant Ingersoll wrote:
>
> On Mar 19, 2008, at 9:56 PM, Karl Wettin wrote:
>
>> Grant Ingersoll wrote:
>>> Now that we have some code in place for clustering, I think it
>>> would be cool to put together some examples/demos of real world
>>> problems. Things like clustering text (perhaps we can use the
>>> wikipedia download or the reuters download that Lucene contrib/
>>> benchmark uses) or clustering other pieces of data.
>>> We could set up a demo area of code and use Lucene's analysis code
>>> to create document vectors.
>>> Ideas and/or thoughts or volunteers?
>>
>> Should a demo make sense enough so people who never heard about
>> machine learning before understand what's going on? Or should it
>> mainly show how to use the API? Or is it something that is just
>> built to show off working or large data set?
>>
>
> I think it is more about working with the APIs, at least for now.
> In the longer run, intro to ML would be cool, but there is lots
> available on that. I don't think it should be that large, as I
> don't think we can really show scale.
Clarifying: I mean I don't know that we can really show scale in a
simple demo. The goal would be that someone can take it and scale it, sure,
but scaling requires infrastructure, etc.
> Just something that shows how to get the source, set it up to run
> against a test set of data and somehow see the results, even if it
> is trivial cmd. line stuff.
Re: Demos/Tutorials
Posted by Isabel Drost <ap...@isabel-drost.de>.
On Wednesday 26 March 2008, Grant Ingersoll wrote:
> On Mar 24, 2008, at 4:48 PM, Isabel Drost wrote:
> > I think Mahout is not really suitable to build demos that explain
> > the inner workings of the algorithms implemented.
>
> I agree, but as we develop, we will probably have programmer's guides,
> etc. which may go into some of the theory in a practical way.
+1 That sounds great.
> > I agree with that. I think once we offer enough functionality to be
> > usable for commercial projects it would be nice to gather a list of links
> > to users.
>
> I added a PoweredBy page on the Wiki.
Already saw it on the mahout_commit list :)
> > I think for that we should rely on datasets that are manageable with
> a few machines. I would guess people evaluating our library or wanting to add
> > more functionality do not necessarily have a huge cluster of machines at
> > their disposal.
>
> Definitely, even a single machine would be fine, but will then easily
> scale up (in other words, it does all the Hadoop setup). You don't
> really want a demo that runs for more than a few minutes, I don't think.
+1
--
When you live in a sick society, just about everything you do is wrong.
|\ _,,,---,,_ Web: <http://www.isabel-drost.de>
/,`.-'`' -. ;-;;,_
|,4- ) )-,_..;\ ( `'-'
'---''(_/--' `-'\_) (fL) IM: <xm...@spaceboyz.net>
Re: Demos/Tutorials
Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 24, 2008, at 4:48 PM, Isabel Drost wrote:
> On Thursday 20 March 2008, Grant Ingersoll wrote:
>> In the longer run, intro to ML would be cool, but there is lots
>> available
>> on that.
>
> I think Mahout is not really suitable to build demos that explain
> the inner
> workings of the algorithms implemented.
I agree, but as we develop, we will probably have programmer's guides,
etc. which may go into some of the theory in a practical way.
>
>
>
>> I don't think it should be that large, as I don't think we
>> can really show scale.
>
> I agree with that. I think once we offer enough functionality to be
> usable for
> commercial projects it would be nice to gather a list of links to
> users.
>
I added a PoweredBy page on the Wiki.
> I would also love to see our name mentioned in a few research
> publications or
> at one of the machine learning competitions - the blog track would
> be a
> really great start ;)
+1
>
>
>
>> Just something that shows how to get the source, set it up to run
>> against a
>> test set of data and somehow see the results, even if it is trivial
>> cmd.
>> line stuff.
>
> I think for that we should rely on datasets that are manageable with
> a few
> machines. I would guess people evaluating our library or wanting to add
> more
> functionality do not necessarily have a huge cluster of machines at
> their
> disposal.
Definitely, even a single machine would be fine, but it should then scale
up easily (in other words, it does all the Hadoop setup). I don't think you
really want a demo that runs for more than a few minutes.
Something simple like Hadoop's WordCount example comes to mind.
>
>
> Isabel
>
>
> --
> God must have loved calories, she made so many of them.
> |\ _,,,---,,_ Web: <http://www.isabel-drost.de>
> /,`.-'`' -. ;-;;,_
> |,4- ) )-,_..;\ ( `'-'
> '---''(_/--' `-'\_) (fL) IM: <xm...@spaceboyz.net>
--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Demos/Tutorials
Posted by Isabel Drost <ap...@isabel-drost.de>.
On Thursday 20 March 2008, Grant Ingersoll wrote:
> In the longer run, intro to ML would be cool, but there is lots available
> on that.
I think Mahout is not really suitable to build demos that explain the inner
workings of the algorithms implemented.
> I don't think it should be that large, as I don't think we
> can really show scale.
I agree with that. I think once we offer enough functionality to be usable for
commercial projects it would be nice to gather a list of links to users.
I would also love to see our name mentioned in a few research publications or
at one of the machine learning competitions - the blog track would be a
really great start ;)
> Just something that shows how to get the source, set it up to run against a
> test set of data and somehow see the results, even if it is trivial cmd.
> line stuff.
I think for that we should rely on datasets that are manageable with a few
machines. I would guess people evaluating our library or wanting to add more
functionality do not necessarily have a huge cluster of machines at their
disposal.
Isabel
--
God must have loved calories, she made so many of them.
|\ _,,,---,,_ Web: <http://www.isabel-drost.de>
/,`.-'`' -. ;-;;,_
|,4- ) )-,_..;\ ( `'-'
'---''(_/--' `-'\_) (fL) IM: <xm...@spaceboyz.net>
Re: Demos/Tutorials
Posted by Grant Ingersoll <gs...@apache.org>.
On Mar 19, 2008, at 9:56 PM, Karl Wettin wrote:
> Grant Ingersoll wrote:
>> Now that we have some code in place for clustering, I think it
>> would be cool to put together some examples/demos of real world
>> problems. Things like clustering text (perhaps we can use the
>> wikipedia download or the reuters download that Lucene contrib/
>> benchmark uses) or clustering other pieces of data.
>> We could set up a demo area of code and use Lucene's analysis code
>> to create document vectors.
>> Ideas and/or thoughts or volunteers?
>
> Should a demo make sense enough so people who never heard about
> machine learning before understand what's going on? Or should it
> mainly show how to use the API? Or is it something that is just
> built to show off working or large data set?
>
I think it is more about working with the APIs, at least for now. In
the longer run, intro to ML would be cool, but there is lots available
on that. I don't think it should be that large, as I don't think we
can really show scale. Just something that shows how to get the
source, set it up to run against a test set of data and somehow see
the results, even if it is trivial cmd. line stuff.
Re: Demos/Tutorials
Posted by Andrzej Bialecki <ab...@getopt.org>.
Karl Wettin wrote:
> Grant Ingersoll wrote:
>> Now that we have some code in place for clustering, I think it would
>> be cool to put together some examples/demos of real world problems.
>> Things like clustering text (perhaps we can use the wikipedia download
>> or the reuters download that Lucene contrib/benchmark uses) or
>> clustering other pieces of data.
>>
>> We could set up a demo area of code and use Lucene's analysis code to
>> create document vectors.
>>
>> Ideas and/or thoughts or volunteers?
>
> Should a demo make sense enough so people who never heard about machine
> learning before understand what's going on? Or should it mainly show how
> to use the API? Or is it something that is just built to show off
> working or large data set?
>
>
> Wikinews is generally speaking less good than the Reuters data, but some
> articles exist in multiple languages and they often reference parts of
> texts to Wikipedia articles.
>
> I can't think of any clustering use case with the mentioned data sets
> that makes that much sense. Something grouping articles or stories that are
> the same but from different sources makes sense, but we only have this
> one source that often tries to merge things that are the same.
>
> There are these tags describing categories and what not, but testing
> this feels more of a classifier- than a cluster problem.
There are many other corpora that are free and good enough for a demo.
For example, the "20 newsgroups" for clustering, the EuroParl for
multi-lingual IR (language detection, machine translation etc), WebKB
for web page clustering, the Acquis corpus
(http://wt.jrc.it/lt/Acquis/), etc, etc ...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Demos/Tutorials
Posted by Karl Wettin <ka...@gmail.com>.
Grant Ingersoll wrote:
> Now that we have some code in place for clustering, I think it would be
> cool to put together some examples/demos of real world problems. Things
> like clustering text (perhaps we can use the wikipedia download or the
> reuters download that Lucene contrib/benchmark uses) or clustering other
> pieces of data.
>
> We could set up a demo area of code and use Lucene's analysis code to
> create document vectors.
>
> Ideas and/or thoughts or volunteers?
Should a demo make enough sense that people who have never heard of machine
learning can understand what's going on? Or should it mainly show how
to use the API? Or is it something built just to show off that it works, or
that it handles a large data set?
Wikinews is, generally speaking, not as good as the Reuters data, but some
articles exist in multiple languages, and they often link parts of their
text to Wikipedia articles.
I can't think of any clustering use case with the mentioned data sets
that makes that much sense. Grouping articles or stories that are
the same but come from different sources makes sense, but we only have this
one source, which often tries to merge things that are the same.
There are these tags describing categories and whatnot, but testing
this feels more like a classification problem than a clustering problem.
I suppose text mining means Lucene tokenization, so clustering search
results is not too far-fetched. But it is still clustering this one
source we have.
Wikibooks:cookbook could be a great source for fun examples (clustering
applicable recipes, feature selection for shopping lists, a product ethnicity
classifier, market basket analysis, collaborative filtering, etc.), but I
fear it would take a bit of work to parse the recipes.
karl
Re: Demos/Tutorials
Posted by Isabel Drost <ap...@isabel-drost.de>.
On Monday 17 March 2008, Allen Day wrote:
> I'll be trying out Mahout to do some microarray gene expression
> clustering pretty soon. I would be happy to do a small write-up.
That sounds really great. It would be a great demo of applications beyond the
obvious tasks in the area of clustering texts.
Looking forward to this demo/tutorial,
Isabel
--
Just remember, wherever you go, there you are. -- Buckaroo Banzai
|\ _,,,---,,_ Web: <http://www.isabel-drost.de>
/,`.-'`' -. ;-;;,_
|,4- ) )-,_..;\ ( `'-'
'---''(_/--' `-'\_) (fL) IM: <xm...@spaceboyz.net>
Re: Demos/Tutorials
Posted by Allen Day <al...@gmail.com>.
Hi,
I'll be trying out Mahout to do some microarray gene expression
clustering pretty soon. I would be happy to do a small write-up.
-Allen
On Mon, Mar 17, 2008 at 7:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Now that we have some code in place for clustering, I think it would
> be cool to put together some examples/demos of real world problems.
> Things like clustering text (perhaps we can use the wikipedia download
> or the reuters download that Lucene contrib/benchmark uses) or
> clustering other pieces of data.
>
> We could set up a demo area of code and use Lucene's analysis code to
> create document vectors.
>
> Ideas and/or thoughts or volunteers?
>
> Cheers,
> Grant
>
--
allenday.skype
+1 (415) 335-4654 (office)
+1 (310) 804-5304 (mobile)
+1 (515) 474-9337 (fax)
Re: Demos/Tutorials
Posted by Isabel Drost <ap...@isabel-drost.de>.
On Monday 17 March 2008, Grant Ingersoll wrote:
> Now that we have some code in place for clustering, I think it would
> be cool to put together some examples/demos of real world problems.
One idea that came to me while reading Allen's proposal: I think it might also
be great if people using, or trying to use, our framework in a research
context would post their experiences. Links to results would be even better,
but I think it is a bit too early for this.
Isabel
--
Anyone can do any amount of work provided it isn't the work he is supposed to
be doing at the moment. -- Robert Benchley
|\ _,,,---,,_ Web: <http://www.isabel-drost.de>
/,`.-'`' -. ;-;;,_
|,4- ) )-,_..;\ ( `'-'
'---''(_/--' `-'\_) (fL) IM: <xm...@spaceboyz.net>
RE: Demos/Tutorials
Posted by Jeff Eastman <je...@windwardsolutions.com>.
I've been using canopy clustering to cluster Apache log time slices by
URL frequency. Typical results show several big clusters containing the
"business as usual" access patterns and several small clusters
containing the unusual patterns. It's a little difficult to interpret beyond that
but still intriguing. Since everybody has such logs, it might be a useful
demo application that people could run over their own data.
Jeff
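A minimal sketch of the idea described above, not the code actually used: canopy clustering over per-time-slice URL frequency vectors. The URL columns, the counts, the Euclidean distance, and the thresholds t1 and t2 are all assumptions made for illustration. As a simplification, canopy membership here is drawn only from the points still on the candidate list.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_cluster(points, t1, t2):
    """Group point indices into canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight
    threshold (removal from the candidate list); 0 < t2 < t1 is required.
    """
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = points[remaining[0]]
        # Every remaining point within t1 of the center joins this canopy.
        canopies.append([i for i in remaining if distance(center, points[i]) < t1])
        # Points within t2 of the center (including the center itself)
        # can no longer start a canopy of their own.
        remaining = [i for i in remaining if distance(center, points[i]) >= t2]
    return canopies


# Hypothetical hit counts per time slice for the URLs /index, /login, /admin.
labels = ["09:00", "10:00", "11:00", "03:00"]
slices = [
    [120, 30, 0],
    [115, 28, 1],
    [125, 33, 0],
    [5, 2, 40],
]
canopies = canopy_cluster(slices, t1=20.0, t2=10.0)
for members in canopies:
    print([labels[i] for i in members])
```

With these made-up numbers the three daytime slices fall into one large canopy and the night-time slice with heavy /admin traffic ends up alone, which mirrors the "business as usual" versus "unusual" split described above.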
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Monday, March 17, 2008 8:41 AM
> To: mahout-dev@lucene.apache.org
> Subject: Demos/Tutorials
>
> Now that we have some code in place for clustering, I think it would
> be cool to put together some examples/demos of real world problems.
> Things like clustering text (perhaps we can use the wikipedia download
> or the reuters download that Lucene contrib/benchmark uses) or
> clustering other pieces of data.
>
> We could set up a demo area of code and use Lucene's analysis code to
> create document vectors.
>
> Ideas and/or thoughts or volunteers?
>
> Cheers,
> Grant