Posted to dev@mahout.apache.org by Aditya <ad...@gmail.com> on 2017/03/31 23:47:51 UTC

Few Questions regarding Mahout

Hi everyone,

I've been talking with Trevor over email and he shared some documents with
me. They contained content that he (along with a few others) was
developing to make Mahout easily accessible to newbies like myself.

I've gone through the planned blog posts titled "Why Mahout", "Getting
Started with Mahout", "Algorithms Framework" and "Building Apache Mahout
from Source" and I have to say, I've got a lot of questions. Since Trevor
is on vacation and the deadline for final proposal submission is fast
approaching, I thought I'd post my questions on the dev forum.

So here goes the big list of my questions. I hope those of you who were
/ are involved in the development of these blog posts will be able to help
me. Some of the questions are vague / abstract; I suggest you answer them
as if you were explaining them to a layman.

1. Could you explain the high-level structure of Mahout?

2. What plans are in the pipeline for Mahout's development in the coming
months?

3. How does contributing a new algorithm to Mahout work? When I was
reading the doc "Getting Started with Mahout", the example implemented
Ordinary Least Squares Regression in Samsara, Mahout's DSL.
I had something different in mind before reading the blog posts. I had
thought that I would be contributing a distributed algorithm to Mahout
from scratch, written in Scala, and making it available as a package that
Mahout users can import and use.

4. In general, is there a plan to contribute future algorithms using
Samsara only? That is, will all the algorithms that become part of Mahout
in the future be written in Samsara? If so, what will be the limitations
and advantages of this decision?

5. What are the building blocks of Mahout that enable distributed
processing? The blog post mentions the Distributed Row Matrix. Are there
any other distributed data structures available? If not, won't that limit
the algorithms that can become part of the Mahout framework in the future,
i.e. exclude algorithms that cannot be reduced to a linear algebra
problem?

6. What is expected of a newbie in the community? What is the learning
curve to become an active contributor to Mahout? Are there any specific
books / blog posts that I can read that will make the process easier?

7. Also, could you give me some background on how the development of
Mahout has been going? Not the motivation / inspiration that led to
Mahout's conception, but something like what work has gone on between the
previous release and the current release candidate.

8. What was the high-level motivation for developing Mahout's own DSL,
Samsara?

Regards,
Aditya

Re: Few Questions regarding Mahout

Posted by Trevor Grant <tr...@gmail.com>.
Good questions Aditya, and awesome response Dustin et al.

I'm back in, and trying to work my way through emails I missed while out.

The Meetup presentation referenced is available in full here:
https://github.com/rawkintrevo/presentations/blob/master/Mahout%20Whats%20Next%20DFW%20Meetup.pdf

Hopefully that will be a somewhat useful "structure" overview.

To all watching, the write-ups I have mentioned are a series of blog posts
I intend to push out ASAP, specifically aimed at new users (to Aditya's
point number 6).  At the moment they are incomplete, poorly edited,
unclear, and possibly incorrect in spots.  I promise to publish once they
are clean!

tg

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Mon, Apr 3, 2017 at 3:41 PM, dustin vanstee <du...@gmail.com>
wrote:

> Hi Aditya, I am new to the project myself so I can't comment on all your
> questions but here are a few comments I have for you ..
>
> 1. High level structure of Mahout
> Trevor gave a presentation at a meetup that had a nice architecture
> diagram that shows the layers.
>
> Mainly, it's about using the Samsara DSL to write backend-agnostic
> algorithms, and then letting Mahout do the mapping and optimizations for
> whichever backend you are using ...
>
> [image: Inline image 1]
>
> 3. How does contribution of a new algorithm work in Mahout? When I was
> reading the doc "Getting Started with Mahout" the example implemented the
> Ordinary Least Squares Regression in Samsara, Mahout's DSL.
> I had something different in my mind before reading the blog posts. I had
> thought that I would be contributing the distributed algorithm to Mahout
> from scratch, written in Scala and make it available as a package (which
> users can import and use) to users who use Mahout.
>
> I think the idea is to let the backend engine figure out how best to
> distribute the work.  That said, when writing a binding to a particular
> backend, a lot of work probably goes into finding the best way to
> represent a DRM.
>
> 4. In general, is there a plan to contribute the algorithms in future using
> Samsara only? If so, what will be the limitations and advantages of this
> decision? I mean, the algorithms that will be a part of Mahout in the
> future, is there a plan to write all of them in Samsara.
>
> I think that's where the sweet spot is ... backend-agnostic code.
>
>
> 6. What is expected of a newbie in the community? What is the learning
> curve to become an active contributor to Mahout? Are there any specific
> books / blog posts that I can read that will make the process easier?
>
> As a newbie, I think it's participating in the building/testing of code
> releases, and also working on some simple JIRAs.  Based on my experience,
> working on my first JIRA is helping me get more familiar with some small
> aspects of the overall project.  I think you will need to get good with
> IntelliJ to help you read/write/test code.  I perused Trevor's documents
> and all the write-ups on the Mahout website.  Beyond that, just trying
> things in code will help.
>
>
> Sorry, I don't have tons of answers myself, but this is what I have found
> out so far.  Hope that helps.

Re: Few Questions regarding Mahout

Posted by dustin vanstee <du...@gmail.com>.
Hi Aditya, I am new to the project myself so I can't comment on all your
questions but here are a few comments I have for you ..

1. High level structure of Mahout
Trevor gave a presentation at a meetup that had a nice architecture diagram
that shows the layers.

Mainly, it's about using the Samsara DSL to write backend-agnostic
algorithms, and then letting Mahout do the mapping and optimizations for
whichever backend you are using ...

[image: Inline image 1]

3. How does contribution of a new algorithm work in Mahout? When I was
reading the doc "Getting Started with Mahout" the example implemented the
Ordinary Least Squares Regression in Samsara, Mahout's DSL.
I had something different in my mind before reading the blog posts. I had
thought that I would be contributing the distributed algorithm to Mahout
from scratch, written in Scala and make it available as a package (which
users can import and use) to users who use Mahout.

I think the idea is to let the backend engine figure out how best to
distribute the work.  That said, when writing a binding to a particular
backend, a lot of work probably goes into finding the best way to
represent a DRM.

4. In general, is there a plan to contribute the algorithms in future using
Samsara only? If so, what will be the limitations and advantages of this
decision? I mean, the algorithms that will be a part of Mahout in the
future, is there a plan to write all of them in Samsara.

I think that's where the sweet spot is ... backend-agnostic code.


6. What is expected of a newbie in the community? What is the learning
curve to become an active contributor to Mahout? Are there any specific
books / blog posts that I can read that will make the process easier?

As a newbie, I think it's participating in the building/testing of code
releases, and also working on some simple JIRAs.  Based on my experience,
working on my first JIRA is helping me get more familiar with some small
aspects of the overall project.  I think you will need to get good with
IntelliJ to help you read/write/test code.  I perused Trevor's documents
and all the write-ups on the Mahout website.  Beyond that, just trying
things in code will help.


Sorry, I don't have tons of answers myself, but this is what I have found
out so far.  Hope that helps.



Re: Few Questions regarding Mahout

Posted by Andrew Palumbo <ap...@outlook.com>.
Hello, Aditya.


Welcome to Mahout!


My suggestion, while Trevor is on vacation, if you want to get a jump on your work and also an understanding of Mahout's distributed operations, would be to read:


1. http://mahout.apache.org/users/environment/in-core-reference.html

2. http://mahout.apache.org/users/environment/out-of-core-reference.html



And then pop open a shell or notebook and start tinkering with some algorithms that you're familiar with. This should give you some familiarity with what you can do, and a bit of a head start on developing once Trevor gets back to you.



Good luck, and nice to have you on board.


Andy





________________________________
From: Aditya <ad...@gmail.com>
Sent: Sunday, April 2, 2017 5:06:26 AM
To: dev@mahout.apache.org
Subject: Re: Few Questions regarding Mahout

Hello again,

I hope most of you had the time to read through the previous mail. It would
mean a lot if you could answer (at least in part) the above questions.

Thanks,
Aditya




Re: Few Questions regarding Mahout

Posted by Aditya <ad...@gmail.com>.
Hello again,

I hope most of you had the time to read through the previous mail. It would
mean a lot if you could answer (at least in part) the above questions.

Thanks,
Aditya


