Posted to dev@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/03/17 16:41:27 UTC

Demos/Tutorials

Now that we have some code in place for clustering, I think it would
be cool to put together some examples/demos of real-world problems.
Things like clustering text (perhaps we can use the Wikipedia download
or the Reuters download that Lucene contrib/benchmark uses) or
clustering other kinds of data.

We could set up a demo area of code and use Lucene's analysis code to
create document vectors.
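
As a rough sketch of what creating document vectors could look like, the
Python below builds term-frequency vectors from a few made-up sentences.
The regex tokenizer and the tiny stopword list are stand-ins (assumptions
for illustration) for Lucene's analysis chain, not Lucene's API, and the
names analyze and document_vectors are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analyzer would use a fuller one.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}


def analyze(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def document_vectors(docs):
    """Return (vocabulary, vectors) with one term-frequency vector per document."""
    counts = [Counter(analyze(d)) for d in docs]
    # Shared, sorted vocabulary so every vector has the same dimensions.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


# Made-up sample documents, not taken from any real corpus.
docs = [
    "Oil prices rise in the Gulf",
    "Gulf oil exports rise",
    "The election results",
]
vocabulary, vectors = document_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each vector has one slot per vocabulary term, so the first two sentences
end up sharing several non-zero dimensions while the third shares none,
which is the kind of signal a clustering algorithm can work with.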

Ideas and/or thoughts or volunteers?

Cheers,
Grant

Re: Demos/Tutorials

Posted by Grant Ingersoll <gs...@apache.org>.

Yeah, I hear you there.  I have a project I am working on that will  
require me to generate examples, but it is a couple of weeks away.   
The gene expression stuff is great.  Text-based ones would be really
cool too.  I haven't done too much clustering work (other than using
Dawid's excellent Carrot2 project), so it is a learning experience for
me, and demos and tutorials would be great.

-Grant

On Mar 18, 2008, at 5:31 AM, Dawid Weiss wrote:

>
> This is absolutely necessary, if not for just showing off with the  
> project, then certainly for verification of correctness of  
> algorithms inside it.
>
> I will certainly hop in to such a subtask to the extent of my  
> current available time resources (not much, sadly).
>
> D.
>
> Grant Ingersoll wrote:
>> Now that we have some code in place for clustering, I think it  
>> would be cool to put together some examples/demos of real world  
>> problems.  Things like clustering text (perhaps we can use the  
>> wikipedia download or the reuters download that Lucene contrib/ 
>> benchmark uses) or clustering other pieces of data.
>> We could setup a demo area of code and use Lucene's analysis code  
>> to create document vectors.
>> Ideas and/or thoughts or volunteers?
>> Cheers,
>> Grant

Re: Demos/Tutorials

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

This is absolutely necessary, if not for just showing off with the project, then 
certainly for verification of correctness of algorithms inside it.

I will certainly hop in to such a subtask to the extent of my current available 
time resources (not much, sadly).

D.

Grant Ingersoll wrote:
> Now that we have some code in place for clustering, I think it would be 
> cool to put together some examples/demos of real world problems.  Things 
> like clustering text (perhaps we can use the wikipedia download or the 
> reuters download that Lucene contrib/benchmark uses) or clustering other 
> pieces of data.
> 
> We could setup a demo area of code and use Lucene's analysis code to 
> create document vectors.
> 
> Ideas and/or thoughts or volunteers?
> 
> Cheers,
> Grant

Re: Demos/Tutorials

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 20, 2008, at 9:15 AM, Grant Ingersoll wrote:

>
> On Mar 19, 2008, at 9:56 PM, Karl Wettin wrote:
>
>> Grant Ingersoll skrev:
>>> Now that we have some code in place for clustering, I think it  
>>> would be cool to put together some examples/demos of real world  
>>> problems.  Things like clustering text (perhaps we can use the  
>>> wikipedia download or the reuters download that Lucene contrib/ 
>>> benchmark uses) or clustering other pieces of data.
>>> We could setup a demo area of code and use Lucene's analysis code  
>>> to create document vectors.
>>> Ideas and/or thoughts or volunteers?
>>
>> Should a demo make sense enough so people who never heard about  
>> machine learning before understand what's going on? Or should it  
>> mainly show how to use the API? Or is it something that is just  
>> built to show off working or large data set?
>>
>
> I think it is more about working with the APIs, at least for now.   
> In the longer run, intro to ML would be cool, but there is lots  
> available on that.  I don't think it should be that large, as I  
> don't think we can really show scale.

Clarifying:  I mean I don't know that we can really show scale in a
simple demo.  The goal would be that someone can take it and scale it,
sure, but scaling requires infrastructure, etc.

> Just something that shows how to get the source, set it up to run  
> against a test set of data and somehow see the results, even if it  
> is trivial cmd. line stuff.

Re: Demos/Tutorials

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Wednesday 26 March 2008, Grant Ingersoll wrote:
> On Mar 24, 2008, at 4:48 PM, Isabel Drost wrote:
> > I think Mahout is not really suitable to build demos that explain
> > the inner workings of the algorithms implemented.
>
> I agree, but as we develop, we will probably have programmer's guides,
> etc. which may go into some of the theory in a practical way.

+1 That sounds great.


> > I agree with that. I think once we offer enough functionality to be
> > usable for commercial projects it would be nice to gather a list of links
> > to users.
>
> I added a PoweredBy page on the Wiki.

Already saw it on the mahout_commit list :)


> > I think for that we should rely on datasets that are manageable with
> > a few machines. I would guess people evaluating our library or want to add
> > more functionality do not necessarily  have a huge cluster of machines at
> > their disposal.
>
> Definitely, even a single machine would be fine, but will then easily
> scale up (in other words, it does all the Hadoop setup).  You don't
> really want a demo that runs for more than a few minutes, I don't think.

+1

-- 
When you live in a sick society, just about everything you do is wrong.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Demos/Tutorials

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 24, 2008, at 4:48 PM, Isabel Drost wrote:

> On Thursday 20 March 2008, Grant Ingersoll wrote:
>> In the longer run, intro to ML would be cool, but there is lots  
>> available
>> on that.
>
> I think Mahout is not really suitable to build demos that explain  
> the inner
> workings of the algorithms implemented.

I agree, but as we develop, we will probably have programmer's guides,  
etc. which may go into some of the theory in a practical way.

>
>
>
>> I don't think it should be that large, as I don't think we
>> can really show scale.
>
> I agree with that. I think once we offer enough functionality to be  
> usable for
> commercial projects it would be nice to gather a list of links to  
> users.
>

I added a PoweredBy page on the Wiki.

> I would also love to see our name mentioned in a few research  
> publications or
> at one of the machine learning competitions - the blog track would  
> be a
> really great start ;)

+1

>
>
>
>> Just something that shows how to get the source, set it up to run  
>> against a
>> test set of data and somehow see the results, even if it is trivial  
>> cmd.
>> line stuff.
>
> I think for that we should rely on datasets that are manageable with  
> a few
> machines. I would guess people evaluating our library or want to add  
> more
> functionality do not necessarily  have a huge cluster of machines at  
> their
> disposal.

Definitely, even a single machine would be fine, but it should then
scale up easily (in other words, it does all the Hadoop setup).  I don't
think you really want a demo that runs for more than a few minutes.

Something simple like Hadoop's WordCount example comes to mind.

>
>
> Isabel
>
>
> -- 
> God must have loved calories, she made so many of them.
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
> |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Demos/Tutorials

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Thursday 20 March 2008, Grant Ingersoll wrote:
> In the longer run, intro to ML would be cool, but there is lots available
> on that. 

I think Mahout is not really suitable for building demos that explain the
inner workings of the algorithms implemented.

> I don't think it should be that large, as I don't think we 
> can really show scale.

I agree with that. I think once we offer enough functionality to be usable for 
commercial projects it would be nice to gather a list of links to users.

I would also love to see our name mentioned in a few research publications or 
at one of the machine learning competitions - the blog track would be a 
really great start ;)

> Just something that shows how to get the source, set it up to run against a
> test set of data and somehow see the results, even if it is trivial cmd.
> line stuff. 

I think for that we should rely on datasets that are manageable with a few
machines. I would guess people evaluating our library or wanting to add more
functionality do not necessarily have a huge cluster of machines at their
disposal.

Isabel

-- 
God must have loved calories, she made so many of them.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Demos/Tutorials

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 19, 2008, at 9:56 PM, Karl Wettin wrote:

> Grant Ingersoll skrev:
>> Now that we have some code in place for clustering, I think it  
>> would be cool to put together some examples/demos of real world  
>> problems.  Things like clustering text (perhaps we can use the  
>> wikipedia download or the reuters download that Lucene contrib/ 
>> benchmark uses) or clustering other pieces of data.
>> We could setup a demo area of code and use Lucene's analysis code  
>> to create document vectors.
>> Ideas and/or thoughts or volunteers?
>
> Should a demo make sense enough so people who never heard about  
> machine learning before understand what's going on? Or should it  
> mainly show how to use the API? Or is it something that is just  
> built to show off working or large data set?
>

I think it is more about working with the APIs, at least for now.  In
the longer run, an intro to ML would be cool, but there is lots available
on that.  I don't think it should be that large, as I don't think we
can really show scale. Just something that shows how to get the
source, set it up to run against a test set of data and somehow see
the results, even if it is trivial command-line stuff.

Re: Demos/Tutorials

Posted by Andrzej Bialecki <ab...@getopt.org>.

Karl Wettin wrote:
> Grant Ingersoll skrev:
>> Now that we have some code in place for clustering, I think it would 
>> be cool to put together some examples/demos of real world problems.  
>> Things like clustering text (perhaps we can use the wikipedia download 
>> or the reuters download that Lucene contrib/benchmark uses) or 
>> clustering other pieces of data.
>>
>> We could setup a demo area of code and use Lucene's analysis code to 
>> create document vectors.
>>
>> Ideas and/or thoughts or volunteers?
> 
> Should a demo make sense enough so people who never heard about machine 
> learning before understand what's going on? Or should it mainly show how 
> to use the API? Or is it something that is just built to show off 
> working or large data set?
> 
> 
> Wikinews is generally speaking less good than the Reuters data, but some 
> articles exist in multiple languages and they often reference parts of 
> texts to Wikipedia articles.
> 
> I can't think of any clustering use case with the mentioned data sets 
> that makes that sense. Something grouping articles or stories that are 
> the same but from different sources makes sense, but we only have this 
> one source that often tries to merge things that are the same.
> 
> There are these tags describing categories and what not, but testing 
> this feels more of a classifier- than a cluster problem.

There are many other corpora, which are free and good enough for a demo. 
For example, the "20 newsgroups" for clustering, the EuroParl for 
multi-lingual IR (language detection, machine translation etc), WebKB 
for web page clustering, the Acquis corpus 
(http://wt.jrc.it/lt/Acquis/), etc, etc ...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Demos/Tutorials

Posted by Karl Wettin <ka...@gmail.com>.

Grant Ingersoll skrev:
> Now that we have some code in place for clustering, I think it would be 
> cool to put together some examples/demos of real world problems.  Things 
> like clustering text (perhaps we can use the wikipedia download or the 
> reuters download that Lucene contrib/benchmark uses) or clustering other 
> pieces of data.
> 
> We could setup a demo area of code and use Lucene's analysis code to 
> create document vectors.
> 
> Ideas and/or thoughts or volunteers?

Should a demo make enough sense that people who have never heard of machine
learning can understand what's going on? Or should it mainly show how
to use the API? Or is it just something built to show off a working
system or a large data set?


Wikinews is, generally speaking, not as good as the Reuters data, but some
articles exist in multiple languages, and they often link parts of their
text to Wikipedia articles.

I can't think of any clustering use case with the mentioned data sets
that makes much sense. Grouping articles or stories that are
the same but come from different sources makes sense, but we only have this
one source, which often tries to merge things that are the same.

There are these tags describing categories and whatnot, but testing
against them feels more like a classification problem than a clustering
problem.

I suppose text mining means Lucene tokenization, so clustering search
results is not too far-fetched. But it is still clustering this one
source we have.


Wikibooks:cookbook could be a great source for fun examples (clustering
applicable recipes, feature-selecting a shopping list, a product ethnicity
classifier, market basket analysis, collaborative filtering, etc.) but I
fear it would take a bit of work to parse the recipes.


     karl

Re: Demos/Tutorials

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Monday 17 March 2008, Allen Day wrote:
> I'll be trying out Mahout to do some microarray gene expression
> clustering pretty soon.  I would be happy to do a small write-up.

That sounds really great. It would be a great demo of applications beyond
the obvious text-clustering tasks.

Looking forward to this demo/tutorial,
Isabel

-- 
Just remember, wherever you go, there you are.		-- Buckaroo Bonzai
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Demos/Tutorials

Posted by Allen Day <al...@gmail.com>.

Hi,

I'll be trying out Mahout to do some microarray gene expression
clustering pretty soon.  I would be happy to do a small write-up.

-Allen

On Mon, Mar 17, 2008 at 7:41 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Now that we have some code in place for clustering, I think it would
>  be cool to put together some examples/demos of real world problems.
>  Things like clustering text (perhaps we can use the wikipedia download
>  or the reuters download that Lucene contrib/benchmark uses) or
>  clustering other pieces of data.
>
>  We could setup a demo area of code and use Lucene's analysis code to
>  create document vectors.
>
>  Ideas and/or thoughts or volunteers?
>
>  Cheers,
>  Grant
>



-- 
allenday.skype
+1 (415) 335-4654 (office)
+1 (310) 804-5304 (mobile)
+1 (515) 474-9337 (fax)

Re: Demos/Tutorials

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Monday 17 March 2008, Grant Ingersoll wrote:
> Now that we have some code in place for clustering, I think it would
> be cool to put together some examples/demos of real world problems.

One idea that occurred to me while reading Allen's proposal: I think it would
also be great if people using, or trying to use, our framework in a research
context would post their experiences. Links to results would be even better,
but I think it is a bit too early for this.

Isabel

-- 
Anyone can do any amount of work provided it isn't the work he is supposed to 
be doing at the moment.		-- Robert Benchley
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

RE: Demos/Tutorials

Posted by Jeff Eastman <je...@windwardsolutions.com>.

I've been using canopy clustering to cluster Apache log time slices by
URL frequency. Typical results show several big clusters containing the
"business as usual" access patterns and then several small clusters
with the unusual patterns. It's a little difficult to interpret beyond that
but still intriguing. Since everybody has such logs, it might be a useful
demo application that people could run over their own data.
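
For anyone unfamiliar with canopy clustering, here is a minimal Python
sketch of one common formulation of it, not Mahout's implementation. The
sample time slices, the three URL columns, the thresholds t1 and t2, and
the function names are all made up for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_cluster(points, t1, t2, distance=euclidean):
    """Return a list of (center_index, member_indices) canopies.

    t1 is the loose threshold and t2 the tight one, with t1 > t2 >= 0.
    """
    if not t1 > t2 >= 0:
        raise ValueError("thresholds must satisfy t1 > t2 >= 0")
    candidates = list(range(len(points)))  # points that may still become centers
    canopies = []
    while candidates:
        center = candidates[0]
        dists = [distance(points[center], p) for p in points]
        # Every point within t1 joins this canopy; canopies may overlap.
        members = [i for i, d in enumerate(dists) if d <= t1]
        canopies.append((center, members))
        # Points within t2 of the center can no longer start a new canopy.
        candidates = [i for i in candidates if dists[i] > t2]
    return canopies


# Hypothetical hit counts per time slice for three URLs:
# (/index, /search, /admin)
slices = [
    (120, 30, 0),
    (115, 35, 1),
    (125, 28, 0),
    (5, 2, 90),
    (118, 33, 0),
    (4, 1, 95),
]
canopies = canopy_cluster(slices, t1=40, t2=20)
for center, members in canopies:
    print("center slice", center, "-> member slices", members)
```

With this toy data the four "business as usual" slices fall into one
canopy and the two slices dominated by the third URL fall into another,
which mirrors the big-cluster/small-cluster split described above.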

Jeff

> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Monday, March 17, 2008 8:41 AM
> To: mahout-dev@lucene.apache.org
> Subject: Demos/Tutorials
> 
> Now that we have some code in place for clustering, I think it would
> be cool to put together some examples/demos of real world problems.
> Things like clustering text (perhaps we can use the wikipedia download
> or the reuters download that Lucene contrib/benchmark uses) or
> clustering other pieces of data.
> 
> We could setup a demo area of code and use Lucene's analysis code to
> create document vectors.
> 
> Ideas and/or thoughts or volunteers?
> 
> Cheers,
> Grant