Posted to dev@mahout.apache.org by Joe Kumar <jo...@gmail.com> on 2010/08/13 06:01:32 UTC

Documentation / Help for Beginners

Hi all,

I am a beginner with Mahout and am trying to learn its architecture and how
it works, so that I can implement some ML algorithms for Mahout.
I have not been able to find any good documentation that explains the big
picture and the end-to-end flow of an algorithm (I have tried searching
Google, the mailing list, and the Mahout site). So I am thinking of writing
some documentation so that new developers can easily understand the
architecture and end-to-end flow and start designing and coding new algorithms.
Can someone please point me in the right direction as to where I can start
and what to refer to?

I am thinking of starting off with one classification algorithm (probably
Naive Bayes) and creating a template for the documentation like this:
1. Overview of the algorithm
2. Input data set (how to prepare and sample the data set)
3. Maybe a sequence diagram explaining how the code flow happens (or any
other way of representing this info?)
4. Output (how to read the output model and apply it to a real-world
classification problem)

If you have any quick pointers on the design of Naive Bayes, or any info you
want added to the document template, please let me know.
I would appreciate any guidance on this.
Goal: new developers can quickly ramp up and understand how an algorithm is
implemented, so they can re-use it effectively.

I understand that many are already mentoring for GSoC, but if someone has
time to mentor me in this effort, I'll be glad to submit a formal application
through http://community.apache.org/mentoringprogramme.html.

thanks,
Joe.

Re: Documentation / Help for Beginners

Posted by Sean Owen <sr...@gmail.com>.
That'd be great. I'd say you are welcome to begin working on this
within the wiki:
https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki

I can help with anything recommender related.
For others, I'd ask the apparent author or, if that doesn't work, ask the
mailing list.


Re: Documentation / Help for Beginners

Posted by Isabel Drost <is...@apache.org>.
On Fri, 13 Aug 2010 Joe Kumar <jo...@gmail.com> wrote:
> Once the example steps are cleaned out for the current version
> of Mahout, I'll start on each of quickstart/clustering ,
> quickstart/classifying and so on.

Thanks for taking up this work. As the project is moving towards its
next release, help with cleaning up and extending existing
documentation is more than welcome.


Isabel



Re: Documentation / Help for Beginners

Posted by Drew Farris <dr...@gmail.com>.
Joe,

Thanks for getting started on this work.

On Fri, Aug 13, 2010 at 8:38 AM, Joe Kumar <jo...@gmail.com> wrote:

>
> For wikipedia bayes example, I am assuming that we need to download data
> (like how we are doing for Twenty Newsgroup example). can someone plz
> reference me the link or the process of getting this data ?
>

see: http://en.wikipedia.org/wiki/Wikipedia:Database_download

The full link is:
http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

WikipediaXmlSplitter is capable of reading the bz2 format file directly.
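
As a rough illustration of what reading the compressed dump directly can look
like (this is not the WikipediaXmlSplitter code, only a minimal Python sketch
using the standard bz2 module; the function name count_pages is made up for
illustration), the file can be streamed and parsed without first
decompressing it to disk:

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages(path):
    """Count <page> elements in a bz2-compressed MediaWiki XML dump.

    The file is decompressed on the fly while it is parsed, so the
    uncompressed XML is never written to disk.
    """
    pages = 0
    with bz2.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Dump files put their elements in an XML namespace, so
            # compare only the local part of the tag name.
            if elem.tag.rsplit("}", 1)[-1] == "page":
                pages += 1
                # Drop the finished page's text and children to
                # limit memory use on large dumps.
                elem.clear()
    return pages
```

Calling count_pages("enwiki-latest-pages-articles.xml.bz2") would walk the
whole dump, which will take a while on the full file; a small hand-made
sample is a better first test.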

Drew

Re: Documentation / Help for Beginners

Posted by Joe Kumar <jo...@gmail.com>.
Thanks Sean. I'll check with you for questions regarding Recommenders.

Thanks for the pointer, Isabel. I'll probably start off with
https://cwiki.apache.org/MAHOUT/quickstart.html and make sure the examples
and steps mentioned there work well.
For example, the Wikipedia Bayes example references a build-deprecated.xml,
which I couldn't find anywhere.
Once the example steps are cleaned up for the current version of Mahout,
I'll start on each of quickstart/clustering, quickstart/classifying and so
on.

For the Wikipedia Bayes example, I am assuming that we need to download data
(like we do for the Twenty Newsgroups example). Can someone please point me
to the link or describe the process for getting this data?

thanks
Joe.

On Fri, Aug 13, 2010 at 5:30 AM, Isabel Drost <is...@apache.org> wrote:

> On Fri, 13 Aug 2010 Joe Kumar <jo...@gmail.com> wrote:
> > I am thinking of starting off with 1 classification (probably Naive
> > Bayes) and create a template for the documentation like
> > 1. Overview of the Algo
> > 2. I/P data set (how to prepare and sample data set)
> > 3. Maybe a sequence diagram explaining how the code flow happens (or
> > any other way of representing this info ??)
> > 4. O/P (how to read the o/p model and apply it for a real-world
> > classification problem)
>
> You might also want to have a look at our Quickstart and Algorithms
> pages in the wiki and potentially simply extend those:
>
> Quickstart:
> https://cwiki.apache.org/MAHOUT/quickstart.html
>
> Classification Overview:
> https://cwiki.apache.org/MAHOUT/classifyingyourdata.html
>
> Brief explanation of Naive Bayes including links to examples:
> https://cwiki.apache.org/MAHOUT/bayesian.html
>
> Isabel
>

Re: Documentation / Help for Beginners

Posted by Isabel Drost <is...@apache.org>.
On Fri, 13 Aug 2010 Joe Kumar <jo...@gmail.com> wrote:
> I am thinking of starting off with 1 classification (probably Naive
> Bayes) and create a template for the documentation like
> 1. Overview of the Algo
> 2. I/P data set (how to prepare and sample data set)
> 3. Maybe a sequence diagram explaining how the code flow happens (or
> any other way of representing this info ??)
> 4. O/P (how to read the o/p model and apply it for a real-world
> classification problem)

You might also want to have a look at our Quickstart and Algorithms
pages in the wiki and potentially simply extend those:

Quickstart: 
https://cwiki.apache.org/MAHOUT/quickstart.html

Classification Overview:
https://cwiki.apache.org/MAHOUT/classifyingyourdata.html

Brief explanation of Naive Bayes including links to examples:
https://cwiki.apache.org/MAHOUT/bayesian.html

Isabel