Posted to user@mahout.apache.org by Aleksander Stensby <al...@integrasco.com> on 2009/09/17 09:36:50 UTC

Some basic introductory questions

Hi all,
I've been following the development of Mahout for quite a while now and
figured it was time for me to get my hands dirty:)

I've gone through the examples and Grant's excellent IBM article (great work
on that Grant!).
So, now I'm at the point where I want to figure out where I go next.
Specifically, I'm a bit fuzzy about common practices when it comes to
utilizing Mahout in my own applications...

Case scenario:
I have my own project, add the dependencies to Mahout (through maven), and
make my own little kMeans test class.
I guess my question is a bit stupid, but how would you go about using Mahout
out of the box?

Ideally (or maybe not?), I figured that I could just take care of providing
the Vectors -> push it into mahout and run the kMeans clustering...
But when I started looking at the kMeans clustering example, I notice that
there is actually a lot of implementation in the example itself... Is it
really necessary for me to implement all of those methods in every project
where I want to do kMeans? Can't they be reused? The methods I talk about
are for instance:
  static List<Canopy> populateCanopies(DistanceMeasure measure, List<Vector> points, double t1, double t2)
  private static void referenceKmeans(List<Vector> points, List<List<Cluster>> clusters, DistanceMeasure measure, int maxIter)
  private static boolean iterateReference(List<Vector> points, List<Cluster> clusters, DistanceMeasure measure)

In my narrow-minded head I would think that the input would be the List<Vector>
and that the output would be a List<List<Cluster>> from some general kMeans
method that did all the internals for me... Or am I missing something? Or do
I have to use the KMeansDriver.runJob and read input from serialized vectors
files?

Appreciate any guidance here guys :)

Cheers,
 Aleksander




-- 
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail

Re: Some basic introductory questions

Posted by Ted Dunning <te...@gmail.com>.
Probably should.  I think I will keep this in mind for the next survey that
I write.

On Thu, Sep 17, 2009 at 11:32 PM, Aleksander Stensby <
aleksander.stensby@integrasco.com> wrote:

> You should probably add a few follow-up questions to questions like:
> Do you currently use or develop with Mahout?
> - if i answer yes, but not in production - but I plan on using it in
> production:)
> Same goes for the second question:)
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Some basic introductory questions

Posted by Aleksander Stensby <al...@integrasco.com>.
Of course, I'm happy to.
You should probably add a few follow-up questions to questions like:
Do you currently use or develop with Mahout?
- if i answer yes, but not in production - but I plan on using it in
production:)
Same goes for the second question:)

As for the last question, "standalone batch programs with defined file-based
inputs and outputs" is obviously "acceptable" to me, but ideally I would
like the second and third option.

Cheers,
 Aleks

On Thu, Sep 17, 2009 at 11:02 PM, Ted Dunning <te...@gmail.com> wrote:

> Aleksander,
>
> As a (temporarily) naive user of the system, you are in a special position
> to answer a few use-case questions.  Because I think that we need to
> collect
> some of these impressions, I have created a simple form with less than a
> dozen questions about intended use and preferred shape of the software.
>
> Could you go to the URL below to answer those questions?
>
>
> http://spreadsheets.google.com/viewform?formkey=dGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E6MA
>
> On Thu, Sep 17, 2009 at 11:59 AM, Aleksander Stensby <
> aleksander.stensby@integrasco.com> wrote:
>
> > Thanks for all the replies guys!
> > I understand the flow of things and it makes sense, but like Shawn
> pointed
> > out there could still be more abstraction (and once I get my hands dirty
> > I'll try to do my best to contribute here as well :) )
> >
> > And to Levy: your proposed flow of things makes sense, but what I wanted
> > was
> > to do all that from one entry point. (Ideally, I don't want to do manual
> > stuff here, I want everything to be able to run on a regular basis from a
> > single entrypoint - and then I mean any algorithm etc). And I can
> probably
> > do that just fine by using the Drivers etc.
> >
> > Again, thanks for the replies!
> >
> > Cheers,
> >  Aleks
> >
> > On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <gsingers@apache.org
> > >wrote:
> >
> > >
> > > On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
> > >
> > >  Hi Aleksander,
> > >>
> > >> I've also been learning how to run mahout's clustering and LDA on our
> > >> cluster.
> > >>
> > >> For k-means, the following series of steps has worked for me:
> > >>
> > >> * build mahout from trunk
> > >>
> > >> * write a program to convert your data to mahout Vectors.  You can
> base
> > >> this on one of the Drivers in the mahout.utils.vectors package (which
> > >> seem designed to work locally).  For bigger datasets you'll probably
> > >> need to  write a simple map reduce job, more like
> > >> mahout.clustering.syntheticcontrol.canopy.InputDriver.  In either
> event
> > >> your Vectors need to end up on the dfs.
> > >>
> > >
> > > Yeah, they are designed for local so far, but we should work to extend
> > > them.  I think as Mahout matures, this problem will become less and
> less.
> > >  Ultimately, I'd like to see utilities that simply ingest whatever is
> up
> > on
> > > HDFS (office docs, PDFs, mail, etc.) and just works, but that is a
> _long_
> > > way off, unless someone wants to help drive that.
> > >
> > > Those kinds of utilities would be great contributions from someone
> > looking
> > > to get started contributing.  As I see it, we could leverage Apache
> Tika
> > > with a M/R job to produce the appropriate kinds of things for our
> various
> > > algorithms.
> > >
> > >
> > >> * run clustering with
> org.apache.mahout.clustering.kmeans.KMeansDriver,
> > >> something like:
> > >>  hadoop jar mahout-core-0.2-SNAPSHOT.job
> > >> org.apache.mahout.clustering.kmeans.KMeansDriver -i
> /dfs/input/data/dir
> > >> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
> > >> -x <maxIters>
> > >>
> > >> * possibly fix the problem described here
> > >>
> > http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run
> > >> -of-KMeans-td24505889.html (solution is at the bottom of the page)
> > >>
> > >> * get all the output files locally
> > >>
> > >> * convert the output to text format with
> > >> org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer
> to
> > >> do this on the cluster, but the code seems to expect local files.  If
> > >> you set the name field in your input Vectors in the conversion step to
> a
> > >> suitable ID, then the final output can be a set of cluster centroids,
> > >> each followed by the list of Vector IDs in the corresponding cluster.
> > >>
> > >> Hope this is useful.
> > >>
> > >> More importantly, if anything here is very wrong then please can a
> > >> mahout person correct me!
> > >>
> > >
> > > Looks good to me.  Suggestions/patches are welcome!
> > >
> > >
> >
> >
> > --
> > Aleksander M. Stensby
> > Lead Software Developer and System Architect
> > Integrasco A/S
> > E-mail: aleksander.stensby@integrasco.com
> > Tel.: +47 41 22 82 72
> > www.integrasco.com
> > http://twitter.com/Integrasco
> > http://facebook.com/Integrasco
> >
> > Please consider the environment before printing all or any of this e-mail
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



-- 
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail

Re: Some basic introductory questions

Posted by Ted Dunning <te...@gmail.com>.
The summary is available to anyone who fills out the questionnaire.  I have
also shared the spreadsheet with the results:

http://spreadsheets.google.com/ccc?key=0Art2iY7e93hUdGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E&hl=en

For reference, the summary can be viewed here (my theory is that you will be
able to see it):

http://spreadsheets.google.com/gform?key=tgY1sR-PpYxn_a4re_6bwGQ&hl=en#chart

On Fri, Sep 18, 2009 at 10:42 AM, Ted Dunning <te...@gmail.com> wrote:

>
> I will make the summary publicly available.
>
>
> On Fri, Sep 18, 2009 at 12:03 AM, Isabel Drost <is...@apache.org> wrote:
>
>> On Thu, 17 Sep 2009 14:02:42 -0700
>> Ted Dunning <te...@gmail.com> wrote:
>>
>> > Because I think that we need to collect some of these impressions, I
>> > have created a simple form with less than a dozen questions about
>> > intended use and preferred shape of the software.
>>
>> Great idea. I am really interested in the outcome of this little
>> questionnaire.
>>
>> Isabel
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
>


-- 
Ted Dunning, CTO
DeepDyve

Re: Some basic introductory questions

Posted by Ted Dunning <te...@gmail.com>.
I will make the summary publicly available.

On Fri, Sep 18, 2009 at 12:03 AM, Isabel Drost <is...@apache.org> wrote:

> On Thu, 17 Sep 2009 14:02:42 -0700
> Ted Dunning <te...@gmail.com> wrote:
>
> > Because I think that we need to collect some of these impressions, I
> > have created a simple form with less than a dozen questions about
> > intended use and preferred shape of the software.
>
> Great idea. I am really interested in the outcome of this little
> questionnaire.
>
> Isabel
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Some basic introductory questions

Posted by Isabel Drost <is...@apache.org>.
On Thu, 17 Sep 2009 14:02:42 -0700
Ted Dunning <te...@gmail.com> wrote:

> Because I think that we need to collect some of these impressions, I
> have created a simple form with less than a dozen questions about
> intended use and preferred shape of the software.

Great idea. I am really interested in the outcome of this little
questionnaire.

Isabel

Re: Some basic introductory questions

Posted by Ted Dunning <te...@gmail.com>.
Aleksander,

As a (temporarily) naive user of the system, you are in a special position
to answer a few use-case questions.  Because I think that we need to collect
some of these impressions, I have created a simple form with less than a
dozen questions about intended use and preferred shape of the software.

Could you go to the URL below to answer those questions?

http://spreadsheets.google.com/viewform?formkey=dGdZMXNSLVBwWXhuX2E0cmVfNmJ3R1E6MA

On Thu, Sep 17, 2009 at 11:59 AM, Aleksander Stensby <
aleksander.stensby@integrasco.com> wrote:

> Thanks for all the replies guys!
> I understand the flow of things and it makes sense, but like Shawn pointed
> out there could still be more abstraction (and once I get my hands dirty
> I'll try to do my best to contribute here as well :) )
>
> And to Levy: your proposed flow of things makes sense, but what I wanted
> was
> to do all that from one entry point. (Ideally, I don't want to do manual
> stuff here, I want everything to be able to run on a regular basis from a
> single entrypoint - and then I mean any algorithm etc). And I can probably
> do that just fine by using the Drivers etc.
>
> Again, thanks for the replies!
>
> Cheers,
>  Aleks
>
> On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <gsingers@apache.org
> >wrote:
>
> >
> > On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
> >
> >  Hi Aleksander,
> >>
> >> I've also been learning how to run mahout's clustering and LDA on our
> >> cluster.
> >>
> >> For k-means, the following series of steps has worked for me:
> >>
> >> * build mahout from trunk
> >>
> >> * write a program to convert your data to mahout Vectors.  You can base
> >> this on one of the Drivers in the mahout.utils.vectors package (which
> >> seem designed to work locally).  For bigger datasets you'll probably
> >> need to  write a simple map reduce job, more like
> >> mahout.clustering.syntheticcontrol.canopy.InputDriver.  In either event
> >> your Vectors need to end up on the dfs.
> >>
> >
> > Yeah, they are designed for local so far, but we should work to extend
> > them.  I think as Mahout matures, this problem will become less and less.
> >  Ultimately, I'd like to see utilities that simply ingest whatever is up
> on
> > HDFS (office docs, PDFs, mail, etc.) and just works, but that is a _long_
> > way off, unless someone wants to help drive that.
> >
> > Those kinds of utilities would be great contributions from someone
> looking
> > to get started contributing.  As I see it, we could leverage Apache Tika
> > with a M/R job to produce the appropriate kinds of things for our various
> > algorithms.
> >
> >
> >> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
> >> something like:
> >>  hadoop jar mahout-core-0.2-SNAPSHOT.job
> >> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
> >> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
> >> -x <maxIters>
> >>
> >> * possibly fix the problem described here
> >>
> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run
> >> -of-KMeans-td24505889.html (solution is at the bottom of the page)
> >>
> >> * get all the output files locally
> >>
> >> * convert the output to text format with
> >> org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer to
> >> do this on the cluster, but the code seems to expect local files.  If
> >> you set the name field in your input Vectors in the conversion step to a
> >> suitable ID, then the final output can be a set of cluster centroids,
> >> each followed by the list of Vector IDs in the corresponding cluster.
> >>
> >> Hope this is useful.
> >>
> >> More importantly, if anything here is very wrong then please can a
> >> mahout person correct me!
> >>
> >
> > Looks good to me.  Suggestions/patches are welcome!
> >
> >
>
>
> --
> Aleksander M. Stensby
> Lead Software Developer and System Architect
> Integrasco A/S
> E-mail: aleksander.stensby@integrasco.com
> Tel.: +47 41 22 82 72
> www.integrasco.com
> http://twitter.com/Integrasco
> http://facebook.com/Integrasco
>
> Please consider the environment before printing all or any of this e-mail
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Some basic introductory questions

Posted by Aleksander Stensby <al...@integrasco.com>.
Thanks for all the replies guys!
I understand the flow of things and it makes sense, but like Shawn pointed
out there could still be more abstraction (and once I get my hands dirty
I'll try to do my best to contribute here as well :) )

And to Levy: your proposed flow of things makes sense, but what I wanted was
to do all that from one entry point. (Ideally, I don't want to do manual
stuff here, I want everything to be able to run on a regular basis from a
single entrypoint - and then I mean any algorithm etc). And I can probably
do that just fine by using the Drivers etc.

Again, thanks for the replies!

Cheers,
 Aleks

On Thu, Sep 17, 2009 at 3:35 PM, Grant Ingersoll <gs...@apache.org>wrote:

>
> On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:
>
>  Hi Aleksander,
>>
>> I've also been learning how to run mahout's clustering and LDA on our
>> cluster.
>>
>> For k-means, the following series of steps has worked for me:
>>
>> * build mahout from trunk
>>
>> * write a program to convert your data to mahout Vectors.  You can base
>> this on one of the Drivers in the mahout.utils.vectors package (which
>> seem designed to work locally).  For bigger datasets you'll probably
>> need to  write a simple map reduce job, more like
>> mahout.clustering.syntheticcontrol.canopy.InputDriver.  In either event
>> your Vectors need to end up on the dfs.
>>
>
> Yeah, they are designed for local so far, but we should work to extend
> them.  I think as Mahout matures, this problem will become less and less.
>  Ultimately, I'd like to see utilities that simply ingest whatever is up on
> HDFS (office docs, PDFs, mail, etc.) and just works, but that is a _long_
> way off, unless someone wants to help drive that.
>
> Those kinds of utilities would be great contributions from someone looking
> to get started contributing.  As I see it, we could leverage Apache Tika
> with a M/R job to produce the appropriate kinds of things for our various
> algorithms.
>
>
>> * run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
>> something like:
>>  hadoop jar mahout-core-0.2-SNAPSHOT.job
>> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
>> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
>> -x <maxIters>
>>
>> * possibly fix the problem described here
>> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run
>> -of-KMeans-td24505889.html (solution is at the bottom of the page)
>>
>> * get all the output files locally
>>
>> * convert the output to text format with
>> org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer to
>> do this on the cluster, but the code seems to expect local files.  If
>> you set the name field in your input Vectors in the conversion step to a
>> suitable ID, then the final output can be a set of cluster centroids,
>> each followed by the list of Vector IDs in the corresponding cluster.
>>
>> Hope this is useful.
>>
>> More importantly, if anything here is very wrong then please can a
>> mahout person correct me!
>>
>
> Looks good to me.  Suggestions/patches are welcome!
>
>


-- 
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
E-mail: aleksander.stensby@integrasco.com
Tel.: +47 41 22 82 72
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail

Re: Some basic introductory questions

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 17, 2009, at 6:24 AM, Levy, Mark wrote:

> Hi Aleksander,
>
> I've also been learning how to run mahout's clustering and LDA on our
> cluster.
>
> For k-means, the following series of steps has worked for me:
>
> * build mahout from trunk
>
> * write a program to convert your data to mahout Vectors.  You can  
> base
> this on one of the Drivers in the mahout.utils.vectors package (which
> seem designed to work locally).  For bigger datasets you'll probably
> need to  write a simple map reduce job, more like
> mahout.clustering.syntheticcontrol.canopy.InputDriver.  In either  
> event
> your Vectors need to end up on the dfs.

Yeah, they are designed for local so far, but we should work to extend  
them.  I think as Mahout matures, this problem will become less and  
less.  Ultimately, I'd like to see utilities that simply ingest  
whatever is up on HDFS (office docs, PDFs, mail, etc.) and just works,  
but that is a _long_ way off, unless someone wants to help drive that.

Those kinds of utilities would be great contributions from someone  
looking to get started contributing.  As I see it, we could leverage  
Apache Tika with a M/R job to produce the appropriate kinds of things  
for our various algorithms.

>
> * run clustering with  
> org.apache.mahout.clustering.kmeans.KMeansDriver,
> something like:
>   hadoop jar mahout-core-0.2-SNAPSHOT.job
> org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/ 
> dir
> -c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
> -x <maxIters>
>
> * possibly fix the problem described here
> http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run
> -of-KMeans-td24505889.html (solution is at the bottom of the page)
>
> * get all the output files locally
>
> * convert the output to text format with
> org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer  
> to
> do this on the cluster, but the code seems to expect local files.  If
> you set the name field in your input Vectors in the conversion step  
> to a
> suitable ID, then the final output can be a set of cluster centroids,
> each followed by the list of Vector IDs in the corresponding cluster.
>
> Hope this is useful.
>
> More importantly, if anything here is very wrong then please can a
> mahout person correct me!

Looks good to me.  Suggestions/patches are welcome!


RE: Some basic introductory questions

Posted by "Levy, Mark" <ma...@last.fm>.
Hi Aleksander,

I've also been learning how to run mahout's clustering and LDA on our
cluster.

For k-means, the following series of steps has worked for me:

* build mahout from trunk

* write a program to convert your data to mahout Vectors.  You can base
this on one of the Drivers in the mahout.utils.vectors package (which
seem designed to work locally).  For bigger datasets you'll probably
need to  write a simple map reduce job, more like
mahout.clustering.syntheticcontrol.canopy.InputDriver.  In either event
your Vectors need to end up on the dfs.

* run clustering with org.apache.mahout.clustering.kmeans.KMeansDriver,
something like:
   hadoop jar mahout-core-0.2-SNAPSHOT.job
org.apache.mahout.clustering.kmeans.KMeansDriver -i /dfs/input/data/dir
-c /dfs/initial/rand/centroids/dir -o /dfs/output/dir -k <numClusters>
-x <maxIters>

* possibly fix the problem described here
http://www.nabble.com/ClassNotFoundException-with-pseudo-distributed-run-of-KMeans-td24505889.html
(solution is at the bottom of the page)

* get all the output files locally

* convert the output to text format with
org.apache.mahout.utils.clustering.ClusterDumper.  It might be nicer to
do this on the cluster, but the code seems to expect local files.  If
you set the name field in your input Vectors in the conversion step to a
suitable ID, then the final output can be a set of cluster centroids,
each followed by the list of Vector IDs in the corresponding cluster.
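
The conversion step above might look something like this in outline (plain
Java with hypothetical names; the Mahout-specific part, wrapping each record
in a Vector named with its ID and serializing it onto the dfs, is omitted
here because it depends on your Mahout version):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy sketch of the "convert your data to Vectors" step: parse CSV rows of
// the form "id,v1,v2,...,vn" into (id, values) records. Keeping the ID with
// each record is what later lets ClusterDumper list Vector IDs per cluster.
public class CsvToPoints {

    // Hypothetical record type; not a Mahout class.
    static class Point {
        final String id;
        final double[] values;
        Point(String id, double[] values) { this.id = id; this.values = values; }
    }

    static List<Point> parse(List<String> rows) {
        List<Point> points = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(",");
            double[] values = new double[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                values[i - 1] = Double.parseDouble(fields[i]);
            }
            points.add(new Point(fields[0], values));
        }
        return points;
    }

    public static void main(String[] args) {
        for (Point p : parse(List.of("doc1,1.0,2.0", "doc2,3.5,0.5"))) {
            System.out.println(p.id + " -> " + Arrays.toString(p.values));
        }
    }
}
```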

Hope this is useful.  

More importantly, if anything here is very wrong then please can a
mahout person correct me!  

Many thanks,

Mark

> -----Original Message-----
> From: Aleksander Stensby [mailto:aleksander.stensby@integrasco.com]
> Sent: 17 September 2009 12:32
> To: mahout-user@lucene.apache.org
> Subject: Re: Some basic introductory questions
> 
> Okay, thanks Isabel!
> That was what I thought, I just wanted to check if I had missed
> something
> important here:)
> 
> Cheers,
>  Aleksander
> 
> On Thu, Sep 17, 2009 at 11:23 AM, Isabel Drost <is...@apache.org>
> wrote:
> 
> > On Thu, 17 Sep 2009 09:36:50 +0200
> > Aleksander Stensby <al...@integrasco.com> wrote:
> >
> > > Or do I have to use the KMeansDriver.runJob and read input from
> > > serialized vectors files?
> >
> > I'd say this is the recommended way currently, though we are open to
> > changes to the API that would make your life easier.
> >
> > At least during experimentation phase, serializing the processed
> > vectors to disk has the advantage of being able to rerun clustering
> > with varied parameters (number of clusters, distance measure or even
> > try out one of the other algorithms).
> >
> > Isabel
> >
> 
> 
> 
> --
> Aleksander M. Stensby
> Lead Software Developer and System Architect
> Integrasco A/S
> www.integrasco.com
> http://twitter.com/Integrasco
> http://facebook.com/Integrasco
> 
> Please consider the environment before printing all or any of this e-
> mail

Re: Some basic introductory questions

Posted by Sean Owen <sr...@gmail.com>.
FWIW I do agree that, in the end, the project should be a little more
user-friendly. Right now you see it offers the raw machinery for
running these processes, without much abstraction on top. I think the
project can and will both unify how it exposes this machinery, and
work to offer higher-level wrappers on top. That is to say I don't
think you missed anything or asked a dumb question!

On Thu, Sep 17, 2009 at 12:31 PM, Aleksander Stensby
<al...@integrasco.com> wrote:
> Okay, thanks Isabel!
> That was what I thought, I just wanted to check if I had missed something
> important here:)

Re: Some basic introductory questions

Posted by Aleksander Stensby <al...@integrasco.com>.
Okay, thanks Isabel!
That was what I thought, I just wanted to check if I had missed something
important here:)

Cheers,
 Aleksander

On Thu, Sep 17, 2009 at 11:23 AM, Isabel Drost <is...@apache.org> wrote:

> On Thu, 17 Sep 2009 09:36:50 +0200
> Aleksander Stensby <al...@integrasco.com> wrote:
>
> > Or do I have to use the KMeansDriver.runJob and read input from
> > serialized vectors files?
>
> I'd say this is the recommended way currently, though we are open to
> changes to the API that would make your life easier.
>
> At least during experimentation phase, serializing the processed
> vectors to disk has the advantage of being able to rerun clustering
> with varied parameters (number of clusters, distance measure or even
> try out one of the other algorithms).
>
> Isabel
>



-- 
Aleksander M. Stensby
Lead Software Developer and System Architect
Integrasco A/S
www.integrasco.com
http://twitter.com/Integrasco
http://facebook.com/Integrasco

Please consider the environment before printing all or any of this e-mail

Re: Some basic introductory questions

Posted by Isabel Drost <is...@apache.org>.
On Thu, 17 Sep 2009 09:36:50 +0200
Aleksander Stensby <al...@integrasco.com> wrote:

> Or do I have to use the KMeansDriver.runJob and read input from
> serialized vectors files?

I'd say this is the recommended way currently, though we are open to
changes to the API that would make your life easier.

At least during experimentation phase, serializing the processed
vectors to disk has the advantage of being able to rerun clustering
with varied parameters (number of clusters, distance measure or even
try out one of the other algorithms).

Isabel
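
Rerunning against the same serialized vectors with varied parameters might
look like the following sweep (a dry-run sketch that only builds and echoes
the hadoop commands rather than invoking them; the paths are the placeholders
from Mark's example, and the -k values are arbitrary):

```shell
# Sweep several cluster counts over the same serialized input vectors.
# Dry run: construct each command and echo it instead of running hadoop.
INPUT=/dfs/input/data/dir
CENTROIDS=/dfs/initial/rand/centroids/dir
CMDS=""
for K in 10 20 50; do
  CMD="hadoop jar mahout-core-0.2-SNAPSHOT.job org.apache.mahout.clustering.kmeans.KMeansDriver -i $INPUT -c $CENTROIDS -o /dfs/output/k$K -k $K -x 10"
  CMDS="$CMDS$CMD
"
  echo "$CMD"
done
```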

Re: Some basic introductory questions

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 17, 2009, at 12:36 AM, Aleksander Stensby wrote:

> Hi all,
> I've been following the development of Mahout for quite a while now  
> and
> figured it was time for me to get my hands dirty:)
>
> I've gone through the examples and Grant's excellent IBM article  
> (great work
> on that Grant!).

Thanks!

> So, now I'm at the point where I want to figure out where I go next.
> Specifically, I'm a bit fuzzy about common practices when it comes to
> utilizing Mahout in my own applications...
>
> Case scenario:
> I have my own project, add the dependencies to Mahout (through  
> maven), and
> make my own little kMeans test class.
> I guess my question is a bit stupid, but how would you go about  
> using Mahout
> out of the box?
>
> Ideally (or maybe not?), I figured that I could just take care of  
> providing
> the Vectors -> push it into mahout and run the kMeans clustering...
> But when I started looking at the kMeans clustering example, I  
> notice that
> there is actually a lot of implementation in the example itself...  
> Is it
> really necessary for me to implement all of those methods in every  
> project
> where I want to do kMeans? Can't they be reused? The methods I talk  
> about
> are for instance:
>  static List<Canopy> populateCanopies(DistanceMeasure measure, List<Vector> points, double t1, double t2)

Yeah, this one is a bit weird here.

>  private static void referenceKmeans(List<Vector> points, List<List<Cluster>> clusters, DistanceMeasure measure, int maxIter)

I think that is for testing purposes, but don't have the code up at  
the mo'.

>  private static boolean iterateReference(List<Vector> points, List<Cluster> clusters, DistanceMeasure measure)
>
> In my narrow-minded head I would think that the input would be the
> List<Vector> and that the output would be a List<List<Cluster>> from some
> general kMeans method that did all the internals for me... Or am I missing
> something? Or do I have to use the KMeansDriver.runJob and read input from
> serialized vectors files?

I think the piece that is missing is that these algs are designed to scale
and use Hadoop.  Imagine passing around 5+ million dense vectors with large
cardinality.
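
For contrast with the Hadoop drivers, the in-memory List-in/clusters-out
shape Aleksander describes is only a few lines of plain Java. The following
is a toy single-machine sketch of roughly what the example's
iterateReference/referenceKmeans loop does (all names hypothetical, not
Mahout classes; it would not survive millions of high-cardinality vectors,
which is Grant's point):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy single-machine k-means: assign each point to its nearest centroid,
// recompute centroids as cluster means, repeat until assignments stabilize
// or maxIter is reached. Points and centroids are plain double[].
public class ToyKMeans {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    static int nearest(double[] point, List<double[]> centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.size(); c++) {
            double dist = squaredDistance(point, centroids.get(c));
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    // Returns the centroid index assigned to each point; mutates centroids.
    static int[] cluster(List<double[]> points, List<double[]> centroids, int maxIter) {
        int k = centroids.size();
        int dim = centroids.get(0).length;
        int[] assignment = new int[points.size()];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step.
            for (int p = 0; p < points.size(); p++) {
                int c = nearest(points.get(p), centroids);
                if (c != assignment[p]) { assignment[p] = c; changed = true; }
            }
            if (!changed && iter > 0) break;
            // Update step: each centroid becomes the mean of its members.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.size(); p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points.get(p)[d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave an empty centroid in place
                for (int d = 0; d < dim; d++) centroids.get(c)[d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>(List.of(
            new double[]{0, 0}, new double[]{0, 1},
            new double[]{10, 10}, new double[]{10, 11}));
        List<double[]> centroids = new ArrayList<>(List.of(
            new double[]{0, 0}, new double[]{10, 10}));
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
        // prints [0, 0, 1, 1]
    }
}
```

The KMeansDriver path does the same two steps per iteration, but as
map/reduce passes over vectors serialized on the dfs.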