Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/09/22 18:17:29 UTC

Mahout book

As I mentioned to some of you, there's a proposal to begin work on a
book on Mahout. It sounds early, but the publisher assures me it's
about the right time to begin, if we want a book out at roughly the
time '1.0' rolls out in a year or so. I've heard support for the idea,
and think it's a good thing.

I'm going to move forward drafting a proposal and draft outline of
such a thing. It seems so far I am the (only?) one interested in
significant work in writing such a thing, which is cool, so I can
drive this -- but I'd be concerned if it were just me speaking for the
project book. Hence:

- Who else might be interested in being a co-author and putting in
significant work?
- Would anyone care to read the proposal before I send it in?
- Would anyone help me, in the short term, draft an outline of the
content of the classification and clustering sections?

Sean

Re: Mahout book

Posted by Isabel Drost <is...@apache.org>.

On Tuesday 22 September 2009 18:17:29 Sean Owen wrote:
> - Who else might be interested in being a co-author and putting in
> significant work?
> - Would anyone care to read the proposal before I send it in?
> - Would anyone help me, in the short term, draft an outline of the
> content of the classification and clustering sections?

As indicated earlier: I would be happy to help wherever I can (proofreading,
helping draft the TOC, etc.). However, I cannot promise to have enough time to
be a co-author.

Isabel

-- 
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_  
 |,4-  ) )-,_..;\ (  `'-' 
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Mahout book

Posted by Grant Ingersoll <gs...@apache.org>.

On Sep 22, 2009, at 12:59 PM, Ted Dunning wrote:

> I would amend that (again) to clustering, classification and  
> recommendations
> at scale.  With Hadoop where necessary.

+1

>
> On Tue, Sep 22, 2009 at 9:48 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> I sense some consensus that Mahout v1.0 is primarily clustering,
>> classification and recommendations at scale using Hadoop.
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: Mahout book

Posted by Ted Dunning <te...@gmail.com>.

I think that there is a real need for a more general "Learning at Scale"
book, but I don't think that any of us here are really qualified to write
it.

On Tue, Sep 22, 2009 at 11:00 AM, Sean Owen <sr...@gmail.com> wrote:

> At least, I can't write that theoretical book, and at the
> moment, if there is a book to be written, and it seems I've got the
> most time to put into it, it would be more about how to use what
> Mahout v1.0 is in practice.
>
> But that opens the question -- should that be written? would it be
> better to consider a different style of book, later?
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Mahout book

Posted by Sean Owen <sr...@gmail.com>.

There is certainly no reason to make 'using Hadoop, and nothing else'
a long-term goal. I think there are many reasons to focus on Hadoop in
the short term. And I think this book is about the short term, Mahout
v1.0.

That is, I don't disagree -- there's every reason to state the
long-term goal of Mahout correctly, while saying that the book will be
talking about Mahout + Hadoop, because that's what Mahout v1.0 does,
and the book is a 1st edition about v1.0.

I suppose I should emphasize that I think the book ought to be a
cookbook (as Tanton just suggested) rather than a more theoretical
book about how these techniques must be approached differently at
scale. At least, I can't write that theoretical book, and at the
moment, if there is a book to be written, and it seems I've got the
most time to put into it, it would be more about how to use what
Mahout v1.0 is in practice.

But that opens the question -- should that be written? would it be
better to consider a different style of book, later?

On Tue, Sep 22, 2009 at 12:34 PM, Ted Dunning <te...@gmail.com> wrote:
> The difference being that we focus on scalable.  This might involve hadoop
> for some, all or none of the steps.
>
> My definition of scalable is "handles data as big as nearly anybody
> produces".  That may or may not require Hadoop to do.  Many on-line learning
> systems are so fast that a single machine can munch near google scale
> amounts of data in a few hours.  Many other algorithms might require Hadoop
> for an aggregation step, but nothing else.  Other algorithms might depend on
> a cluster of Lucene nodes.
>
> In any case, I think that the focus of Mahout should be scalable learning.
> Period.
>
> The methods used should be drawn from a useful toolkit which prominently
> includes Hadoop.  And Lucene.  And some linear algebra stuff.  And Taste.
>
> This leaves open whether the focus of the book should be scalable learning
> or whether it should be learning with Hadoop.

Re: Mahout book

Posted by Isabel Drost <is...@apache.org>.

On Tue, 22 Sep 2009 14:43:03 -0400
zaki rahaman <za...@gmail.com> wrote:

> Sounds good, I'd love to take a look at an outline. I too would love
> to see a cookbook style manual which focuses more on the details of
> implementation, how to optimize systems, best practices, etc. and
> fills in with some of the theory material where appropriate/needed.

Given the number of problems one might want to solve with Mahout: I
think for each task presented in the book we should also be able to
give guidelines on which constraints influence which exact algorithm
works best for a given problem setting.

Example: Currently we already have quite a few clustering algorithms.
Each has several knobs for parameter tuning. In addition data can be
prepared differently before running the algorithms. If I were a reader
of the book and a user of Mahout, I imagine I would love to learn some general
guidelines (if these exist) as to which algorithm with which settings
performs best for my problem setting. Or at least learn ways to find
those settings. I know that, at least to some extent, this is still an
open research question. But I am quite certain we have enough people
in our community to contribute best practices from various projects.

Isabel

Re: Mahout book

Posted by zaki rahaman <za...@gmail.com>.

Sean,

Sounds good, I'd love to take a look at an outline. I too would love to see
a cookbook style manual which focuses more on the details of implementation,
how to optimize systems, best practices, etc. and fills in with some of the
theory material where appropriate/needed. It wouldn't hurt to have a very
modular outline (a clustering 'module', one for recommenders, one for
classification), provide a bit of background on each (not more than say 4-5
pages worth honestly), walk through a basic example then deal with more
advanced examples/cases. And if time allows, we could add more
appendix-style chapters on background material. If not, references will do
just fine.

On Tue, Sep 22, 2009 at 2:02 PM, Sean Owen <sr...@gmail.com> wrote:

> That is indeed how I am positioning it in this draft  book proposal --
> it's for a 'Mahout in Action' book from Manning. They want to
> understand why this wouldn't be just another Collective Intelligence
> in Action (which I do think is quite a good book, at least, I learned
> a good deal about Lucene from it.)
>
> On Tue, Sep 22, 2009 at 12:45 PM, Tanton Gibbs <ta...@gmail.com>
> wrote:
> > I hope I'm one of the targeted audience members for the book.  I've
> > used hadoop, done clustering (not with Mahout), have read about
> > collaborative filtering, and plan on using Mahout in a business
> > intelligence setting in 1-2 years.  However, I've never used Mahout
> > itself.  What I would like to see is more of a cookbook style.  I want
> > to know the whys, not just the hows.  Why should I normalize data in a
> > certain way before clustering it, what happens if I don't, etc....
> > I've read Collective Intelligence in Action and found it pretty much
> > useless - I don't need another survey on the topic.  Instead, I want
> > to know the ins and outs of mahout so that when I go to sell it to
> > someone I can answer any and all of their questions around it.  If you
> > want an introductory chapter on the topics, that is fine, but keep
> > them short and point to other materials.
> >
> > Thanks!
> > Tanton
> >
>

-- 
Zaki Rahaman

Re: Mahout book

Posted by Lukáš Vlček <lu...@gmail.com>.

Hello Sean,
As a Mahout fan, I can help with charts, diagrams or schema pictures if
needed. Let's make this book look really good. Is it true that Manning is
forcing authors to use MS Word? Still, it should be possible to use PS, EPS
or maybe PDF for vector graphics, correct?

Anyway, I would love to take a look at the draft.

Regards,
Lukas


On Tue, Sep 22, 2009 at 8:02 PM, Sean Owen <sr...@gmail.com> wrote:

> That is indeed how I am positioning it in this draft  book proposal --
> it's for a 'Mahout in Action' book from Manning. They want to
> understand why this wouldn't be just another Collective Intelligence
> in Action (which I do think is quite a good book, at least, I learned
> a good deal about Lucene from it.)
>
> On Tue, Sep 22, 2009 at 12:45 PM, Tanton Gibbs <ta...@gmail.com>
> wrote:
> > I hope I'm one of the targeted audience members for the book.  I've
> > used hadoop, done clustering (not with Mahout), have read about
> > collaborative filtering, and plan on using Mahout in a business
> > intelligence setting in 1-2 years.  However, I've never used Mahout
> > itself.  What I would like to see is more of a cookbook style.  I want
> > to know the whys, not just the hows.  Why should I normalize data in a
> > certain way before clustering it, what happens if I don't, etc....
> > I've read Collective Intelligence in Action and found it pretty much
> > useless - I don't need another survey on the topic.  Instead, I want
> > to know the ins and outs of mahout so that when I go to sell it to
> > someone I can answer any and all of their questions around it.  If you
> > want an introductory chapter on the topics, that is fine, but keep
> > them short and point to other materials.
> >
> > Thanks!
> > Tanton
> >
>

Re: Mahout book

Posted by Sean Owen <sr...@gmail.com>.

That is indeed how I am positioning it in this draft book proposal --
it's for a 'Mahout in Action' book from Manning. They want to
understand why this wouldn't be just another Collective Intelligence
in Action (which I do think is quite a good book, at least, I learned
a good deal about Lucene from it.)

On Tue, Sep 22, 2009 at 12:45 PM, Tanton Gibbs <ta...@gmail.com> wrote:
> I hope I'm one of the targeted audience members for the book.  I've
> used hadoop, done clustering (not with Mahout), have read about
> collaborative filtering, and plan on using Mahout in a business
> intelligence setting in 1-2 years.  However, I've never used Mahout
> itself.  What I would like to see is more of a cookbook style.  I want
> to know the whys, not just the hows.  Why should I normalize data in a
> certain way before clustering it, what happens if I don't, etc....
> I've read Collective Intelligence in Action and found it pretty much
> useless - I don't need another survey on the topic.  Instead, I want
> to know the ins and outs of mahout so that when I go to sell it to
> someone I can answer any and all of their questions around it.  If you
> want an introductory chapter on the topics, that is fine, but keep
> them short and point to other materials.
>
> Thanks!
> Tanton
>

Re: Mahout book

Posted by Robin Anil <ro...@gmail.com>.

+1 for cookbook style. That's what I meant when I said tuning
CBayes/Bayes (there are around 4-5 knobs which you can modify to fit your
data perfectly).

On Tue, Sep 22, 2009 at 11:15 PM, Tanton Gibbs <ta...@gmail.com> wrote:

> I hope I'm one of the targeted audience members for the book.  I've
> used hadoop, done clustering (not with Mahout), have read about
> collaborative filtering, and plan on using Mahout in a business
> intelligence setting in 1-2 years.  However, I've never used Mahout
> itself.  What I would like to see is more of a cookbook style.  I want
> to know the whys, not just the hows.  Why should I normalize data in a
> certain way before clustering it, what happens if I don't, etc....
> I've read Collective Intelligence in Action and found it pretty much
> useless - I don't need another survey on the topic.  Instead, I want
> to know the ins and outs of mahout so that when I go to sell it to
> someone I can answer any and all of their questions around it.  If you
> want an introductory chapter on the topics, that is fine, but keep
> them short and point to other materials.
>
> Thanks!
> Tanton
>

Re: Mahout book

Posted by Tanton Gibbs <ta...@gmail.com>.

I hope I'm one of the targeted audience members for the book.  I've
used hadoop, done clustering (not with Mahout), have read about
collaborative filtering, and plan on using Mahout in a business
intelligence setting in 1-2 years.  However, I've never used Mahout
itself.  What I would like to see is more of a cookbook style.  I want
to know the whys, not just the hows.  Why should I normalize data in a
certain way before clustering it, what happens if I don't, etc....
I've read Collective Intelligence in Action and found it pretty much
useless - I don't need another survey on the topic.  Instead, I want
to know the ins and outs of mahout so that when I go to sell it to
someone I can answer any and all of their questions around it.  If you
want an introductory chapter on the topics, that is fine, but keep
them short and point to other materials.

Thanks!
Tanton

Re: Mahout book

Posted by Robin Anil <ro...@gmail.com>.

I could help out with the internals of CBayes/Bayes, FPGrowth (if it becomes
ready by then), and write-ups or how-tos on improving efficiency on different
datasets: how to understand your data and enable or disable various
parameters of CBayes/Bayes to fit non-text data, and sparse vs. dense
databases in frequent pattern mining.
Other than that, I could help out with any other write-ups on classification,
clustering, or pattern mining that you might need as introductions to the topic
at hand.


On Tue, Sep 22, 2009 at 11:04 PM, Ted Dunning <te...@gmail.com> wrote:

> The difference being that we focus on scalable.  This might involve hadoop
> for some, all or none of the steps.
>
> My definition of scalable is "handles data as big as nearly anybody
> produces".  That may or may not require Hadoop to do.  Many on-line
> learning
> systems are so fast that a single machine can munch near google scale
> amounts of data in a few hours.  Many other algorithms might require Hadoop
> for an aggregation step, but nothing else.  Other algorithms might depend
> on
> a cluster of Lucene nodes.
>
> In any case, I think that the focus of Mahout should be scalable learning.
> Period.
>
> The methods used should be drawn from a useful toolkit which prominently
> includes Hadoop.  And Lucene.  And some linear algebra stuff.  And Taste.
>
> This leaves open whether the focus of the book should be scalable learning
> or whether it should be learning with Hadoop.
>
> On Tue, Sep 22, 2009 at 10:18 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > The difference being, not emphasizing Hadoop? I understand that. I
> > also recall we'd agreed that we were not realistically considering any
> > other distributed processing framework in the near future, which I
> > took to mean before v1.0?
> >
> > On Tue, Sep 22, 2009 at 11:59 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > I would amend that (again) to clustering, classification and
> > recommendations
> > > at scale.  With Hadoop where necessary.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Mahout book

Posted by Ted Dunning <te...@gmail.com>.

The difference being that we focus on scalable.  This might involve hadoop
for some, all or none of the steps.

My definition of scalable is "handles data as big as nearly anybody
produces".  That may or may not require Hadoop to do.  Many on-line learning
systems are so fast that a single machine can munch near google scale
amounts of data in a few hours.  Many other algorithms might require Hadoop
for an aggregation step, but nothing else.  Other algorithms might depend on
a cluster of Lucene nodes.

In any case, I think that the focus of Mahout should be scalable learning.
Period.

The methods used should be drawn from a useful toolkit which prominently
includes Hadoop.  And Lucene.  And some linear algebra stuff.  And Taste.

This leaves open whether the focus of the book should be scalable learning
or whether it should be learning with Hadoop.

On Tue, Sep 22, 2009 at 10:18 AM, Sean Owen <sr...@gmail.com> wrote:

> The difference being, not emphasizing Hadoop? I understand that. I
> also recall we'd agreed that we were not realistically considering any
> other distributed processing framework in the near future, which I
> took to mean before v1.0?
>
> On Tue, Sep 22, 2009 at 11:59 AM, Ted Dunning <te...@gmail.com>
> wrote:
> > I would amend that (again) to clustering, classification and
> recommendations
> > at scale.  With Hadoop where necessary.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Mahout book

Posted by Sean Owen <sr...@gmail.com>.

The difference being, not emphasizing Hadoop? I understand that. I
also recall we'd agreed that we were not realistically considering any
other distributed processing framework in the near future, which I
took to mean before v1.0?

On Tue, Sep 22, 2009 at 11:59 AM, Ted Dunning <te...@gmail.com> wrote:
> I would amend that (again) to clustering, classification and recommendations
> at scale.  With Hadoop where necessary.

Re: Mahout book

Posted by Ted Dunning <te...@gmail.com>.

I would amend that (again) to clustering, classification and recommendations
at scale.  With Hadoop where necessary.

On Tue, Sep 22, 2009 at 9:48 AM, Sean Owen <sr...@gmail.com> wrote:

> I sense some consensus that Mahout v1.0 is primarily clustering,
> classification and recommendations at scale using Hadoop.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Mahout book

Posted by Sean Owen <sr...@gmail.com>.

Good, glad to hear there is interest in making this book happen. I
agree, the very first step I'm going through with the publisher is
trying to answer these questions.

I sense some consensus that Mahout v1.0 is primarily clustering,
classification and recommendations at scale using Hadoop. That's a
pretty good mission statement.

Should the book also be a tutorial on these techniques, and on Hadoop?
There are already books like Collective Intelligence in Action, and
Hadoop: The Definitive Guide. My sense is the book shouldn't try to
duplicate these, though it's unavoidable to cover these topics
partially. From an initial conversation, it seemed like the most
available and useful niche would be to focus specifically on Mahout
(of course) and focus on practice as opposed to theory. So rather than
explain Hadoop -- walk through all aspects of running a big clustering
job with Mahout on Hadoop.

I think the intended audience you mention is right. It's for people
that are either already experienced engineers, or already familiar
with these techniques, but not both. A practical cookbook fills in the
gaps for either group.

Zaki, why don't I forward you the outline I am writing for the
publisher when I finish it, shortly?

The parts that currently would need most help are sections on
clustering and classification. Obviously I can cover recommender
engines.

On Tue, Sep 22, 2009 at 11:35 AM, zaki rahaman <za...@gmail.com> wrote:
> I've been a longtime lurker and I'm still getting used to the ins and outs
> of using Mahout (I've made some hacks to source in my own environment and
> have done some testing, but nothing in production yet) but I'd love to help
> out on a book, maybe with some of the background material. Maybe I'm the
> only one who feels this way, but any Mahout book should have some basic
> introductory background material -- some discussion about machine learning
> (classification, clustering), high level overviews of algorithms, and maybe
> some case studies/examples (why use mahout vs. other tools?). And of course,
> the standard Intro chapter on MapReduce, HDFS, and the rest of the Hadoop
> environment (including deploying on EC2/S3). Again, it's probably best to
> sort out what does/doesn't belong, but first I think it would be a useful
> exercise to figure out who the intended audience really is. In my mind I
> would break it down into a few possibilities:
>
> 1. Java developers looking to incorporate ML algorithms into their existing
> projects/software.
> 2. People from more of an academic background well versed in ML, IR, NLP,
> etc. who are looking for an efficient and scalable software tool to use.
> 3. Devs from a non-Java environment (obv no one is going to write a
> beginner's Java guide, but highlighting parts of the API that may be able to
> interface with other tools -- I have a small library of python wrappers I
> use to set up and run some routine tasks)
>
> On Tue, Sep 22, 2009 at 12:17 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> As I mentioned to some of you, there's a proposal to begin work on a
>> book on Mahout. It sounds early, but the publisher assures me it's
>> about the right time to begin, if we want a book out at roughly the
>> time '1.0' rolls out in a year or so. I've heard support for the idea,
>> and think it's a good thing.
>>
>> I'm going to move forward drafting a proposal and draft outline of
>> such a thing. It seems so far I am the (only?) one interested in
>> significant work in writing such a thing, which is cool, so I can
>> drive this -- but I'd be concerned if it were just me speaking for the
>> project book. Hence:
>>
>> - Who else might be interested in being a co-author and putting in
>> significant work?
>> - Would anyone care to read the proposal before I send it in?
>> - Would anyone help me, in the short term, draft an outline of the
>> content of the classification and clustering sections?
>>
>> Sean
>>
>
>
>
> --
> Zaki Rahaman
>

Re: Mahout book

Posted by zaki rahaman <za...@gmail.com>.

I've been a longtime lurker and I'm still getting used to the ins and outs
of using Mahout (I've made some hacks to source in my own environment and
have done some testing, but nothing in production yet) but I'd love to help
out on a book, maybe with some of the background material. Maybe I'm the
only one who feels this way, but any Mahout book should have some basic
introductory background material -- some discussion about machine learning
(classification, clustering), high level overviews of algorithms, and maybe
some case studies/examples (why use mahout vs. other tools?). And of course,
the standard Intro chapter on MapReduce, HDFS, and the rest of the Hadoop
environment (including deploying on EC2/S3). Again, it's probably best to
sort out what does/doesn't belong, but first I think it would be a useful
exercise to figure out who the intended audience really is. In my mind I
would break it down into a few possibilities:

1. Java developers looking to incorporate ML algorithms into their existing
projects/software.
2. People from more of an academic background well versed in ML, IR, NLP,
etc. who are looking for an efficient and scalable software tool to use.
3. Devs from a non-Java environment (obv no one is going to write a
beginner's Java guide, but highlighting parts of the API that may be able to
interface with other tools -- I have a small library of python wrappers I
use to set up and run some routine tasks)

On Tue, Sep 22, 2009 at 12:17 PM, Sean Owen <sr...@gmail.com> wrote:

> As I mentioned to some of you, there's a proposal to begin work on a
> book on Mahout. It sounds early, but the publisher assures me it's
> about the right time to begin, if we want a book out at roughly the
> time '1.0' rolls out in a year or so. I've heard support for the idea,
> and think it's a good thing.
>
> I'm going to move forward drafting a proposal and draft outline of
> such a thing. It seems so far I am the (only?) one interested in
> significant work in writing such a thing, which is cool, so I can
> drive this -- but I'd be concerned if it were just me speaking for the
> project book. Hence:
>
> - Who else might be interested in being a co-author and putting in
> significant work?
> - Would anyone care to read the proposal before I send it in?
> - Would anyone help me, in the short term, draft an outline of the
> content of the classification and clustering sections?
>
> Sean
>

-- 
Zaki Rahaman