Posted to dev@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2014/02/28 01:37:18 UTC

Mahout 1.0 goals

I would like to start a conversation about where we want Mahout to be for
1.0.  Let's suspend for the moment the question of how to achieve the
goals.  Instead, let's converge on what we really would like to have happen
and after that, let's talk about means that will get us there.

Here are some goals that I think would be good in the area of numerics,
classifiers and clustering:

- runs with or without Hadoop

- runs with or without map-reduce

- includes (at least) regularized generalized linear models, k-means,
random forest, distributed random forest, distributed neural networks

- reasonably competitive speed against other implementations including
GraphLab, MLlib and R.

- interactive model building

- models can be exported as code or data

- simple programming model

- programmable via Java or R

- runs clustered or not


What does everybody think?

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Thu, Feb 27, 2014 at 7:01 PM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

>
> And I'm not sure if this is what Dmitriy meant in his comments (3), but I'd
> love to be able to do Mathematica-style work in an interactive shell and/or
> symbolic system where I could do A*B' and it just worked.  That would crush
> everything on the market, though it could be a lot of work to build a DSL
> that supports it.
>

Actually, I've already had that in one incarnation and am now rewriting it for
the second time -- check out the WIP in
https://issues.apache.org/jira/browse/MAHOUT-1346 . My previous
implementation ended up not lending itself well to physical plan rewrites,
so I rolled it back and started over.

The DSL itself is not much of a problem. The optimizer rules are somewhat of
a problem. E.g. in SSVD there's a part that deals with skinny matrix
multiplication. There are also some decisions I am not sure about -- like
row keying in a Mahout DRM, which generally doesn't have to be a row
index and which doesn't play well with some of the rewrites -- but SSVD in
general allows non-integer row keying for the A and U matrices, so there are
questions of how to isolate those issues architecturally from pure linear
algebra on matrices, etc.
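
To make the "DSL is easy, optimizer rules are the hard part" point concrete,
here is a minimal sketch of a lazy matrix-expression DSL with two rewrite
rules. It is written in Python for brevity and is only an illustration: it is
not the MAHOUT-1346 code, and every name in it (Expr, Leaf, Transpose, Mul,
AtA, optimize, evaluate) is hypothetical. It also assumes plain integer row
positions, so it sidesteps the row-keying question raised above entirely.

```python
from dataclasses import dataclass


class Expr:
    """Base class for lazy matrix expressions (hypothetical names throughout).

    Operators build an expression tree; nothing is computed until evaluate().
    """

    def __mul__(self, other):
        return Mul(self, other)

    @property
    def t(self):
        return Transpose(self)


@dataclass(frozen=True)
class Leaf(Expr):
    name: str
    rows: tuple  # tuple of row tuples, e.g. ((1, 2), (3, 4))


@dataclass(frozen=True)
class Transpose(Expr):
    child: Expr


@dataclass(frozen=True)
class Mul(Expr):
    left: Expr
    right: Expr


@dataclass(frozen=True)
class AtA(Expr):
    """Fused operator standing in for X' * X (an assumed physical operator)."""
    child: Expr


def optimize(e):
    """Apply two illustrative rewrite rules, bottom-up."""
    if isinstance(e, Transpose):
        c = optimize(e.child)
        if isinstance(c, Transpose):  # rule 1: (X')' => X
            return c.child
        return Transpose(c)
    if isinstance(e, Mul):
        left, right = optimize(e.left), optimize(e.right)
        if isinstance(left, Transpose) and left.child == right:
            return AtA(right)  # rule 2: X' * X => AtA(X)
        return Mul(left, right)
    if isinstance(e, AtA):
        return AtA(optimize(e.child))
    return e


def evaluate(e):
    """Naive in-memory evaluation on lists of lists."""
    if isinstance(e, Leaf):
        return [list(r) for r in e.rows]
    if isinstance(e, Transpose):
        return [list(col) for col in zip(*evaluate(e.child))]
    if isinstance(e, Mul):
        a, b = evaluate(e.left), evaluate(e.right)
        b_cols = list(zip(*b))
        return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
                for row in a]
    if isinstance(e, AtA):
        a = evaluate(e.child)
        n = len(a[0])
        out = [[0] * n for _ in range(n)]
        for row in a:  # sum of outer products of rows; no transpose is built
            for i in range(n):
                for j in range(n):
                    out[i][j] += row[i] * row[j]
        return out
    raise TypeError(f"unknown expression node: {e!r}")


A = Leaf("A", ((1, 2), (3, 4), (5, 6)))
plan = optimize(A.t * A)   # rewritten to AtA(A)
print(plan)
print(evaluate(plan))      # [[35, 44], [44, 56]]
```

The user-facing part (A.t * A) is a few lines of operator overloading. The
rewrite rules are where the design decisions accumulate, since each rule has
to know when a cheaper physical operator is valid.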

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

Thanks for starting the conversation, Ted.  I'm relatively new to the
project though I've been using Mahout for a couple years in production, and
am happy to see things move forward in whatever way makes sense.

I think Mahout needs to ship a production-ready version if it's going to be
called 1.0; otherwise we ought to call the next release 0.10.

In that vein, I think Sean, Dmitriy, and Ted have some good points in that
Mahout is still a very rough draft.  I think we all have used some portion
of Mahout in production and are surprised when we find out how dodgy things
are in certain spots when we look around further after learning how our
favorite things work.

I'd like to see several of the things you mention, Ted, including
decoupling from Hadoop and map-reduce where possible, working on the speed
competition, exporting to PMML, and clarifying the programming approach.

And I'm not sure if this is what Dmitriy meant in his comments (3), but I'd
love to be able to do Mathematica-style work in an interactive shell and/or
symbolic system where I could do A*B' and it just worked.  That would crush
everything on the market, though it could be a lot of work to build a DSL
that supports it.

I also think Dmitriy's (5) for having up-front data assessment stuff is
really valuable.  I'm building things like that internally at work and I
can confirm that there is high demand for it.

Along with the up-front pipelining, something I'd like back in Mahout is a
feature that I think was in there and got removed: shipping results in a web
service, without writing your own.

So I'd like a free machine-learning library I can count on to make sense
when I use the Java/Scala API or command-line programs, take raw data and
do the necessary "first whack" at it, prepare vectors for jobs, run jobs,
and then build a jar file I can put into Jetty or Tomcat, and, for bonus
points, do that "real-time" solr-recommender-style recalculation and results
serving.

The end-to-end part is where I think Mahout could sprint to the front of the
pack and do well.

Best
Andrew

On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least) regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> GraphLab, MLlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>

Re: Mahout 1.0 goals

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Me 2.

Sent from my iPhone

> On Mar 3, 2014, at 8:44 AM, "Frank Scholten" <sc...@gmail.com> wrote:
> 
> Yes, count me in.
> 
>> On Mar 3, 2014, at 7:34, Sebastian Schelter <ss...@apache.org> wrote:
>> 
>> That's a very good idea. I'm happy to join!
>> 
>>> On 03/03/2014 07:31 AM, Ravi Mummulla wrote:
>>> Ted and others,
>>> Once we have enough thoughts on 1.0 on this thread, can we get together on
>>> Google Hangout and discuss the plan, prioritize the work, and talk
>>> about rough timeline for landing 1.0? We can then create JIRAs and go from
>>> there. If everyone agrees, any preferences on a rough hangout date?
>>> 
>>> Thanks.
>>> 
>>> 
>>>> On Sun, Mar 2, 2014 at 8:45 AM, Ted Dunning <te...@gmail.com> wrote:
>>>> 
>>>> Ravi,
>>>> 
>>>> Good points.
>>>> 
>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummulla@gmail.com
>>>>> wrote:
>>>> 
>>>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>>>> for instance)
>>>> 
>>>> There is a bit of demand for that.
>>>> 
>>>> - Faster time to first application (from discovery to first application
>>>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>>>> and reduce the friction for adoption?)
>>>> 
>>>> There is huge evidence that this is important.
>>>> 
>>>> 
>>>>> - Better documenting use cases with working samples/examples
>>>>> (Documentation
>>>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>>>> and
>>>>> there is too much focus on algorithms as opposed to use cases - this is
>>>> an
>>>>> adoption blocker)
>>>> 
>>>> This is also important.
>>>> 
>>>> 
>>>>> - Uniformity of the API set across all algorithms (are we providing the
>>>>> same experience across all APIs?)
>>>> 
>>>> And many people have been tripped up by this.
>>>> 
>>>> 
>>>>> - Measuring/publishing scalability metrics of various algorithms (why
>>>>> would
>>>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>>> 
>>>> I don't see this as important as some of your other points, but is still
>>>> useful.
>>

Re: Mahout 1.0 goals

Posted by Frank Scholten <sc...@gmail.com>.

Yes, count me in.

On Mar 3, 2014, at 7:34, Sebastian Schelter <ss...@apache.org> wrote:

> That's a very good idea. I'm happy to join!
> 
> On 03/03/2014 07:31 AM, Ravi Mummulla wrote:
>> Ted and others,
>> Once we have enough thoughts on 1.0 on this thread, can we get together on
>> Google Hangout and discuss the plan, prioritize the work, and talk
>> about rough timeline for landing 1.0? We can then create JIRAs and go from
>> there. If everyone agrees, any preferences on a rough hangout date?
>> 
>> Thanks.
>> 
>> 
>> On Sun, Mar 2, 2014 at 8:45 AM, Ted Dunning <te...@gmail.com> wrote:
>> 
>>> Ravi,
>>> 
>>> Good points.
>>> 
>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummulla@gmail.com
>>>> wrote:
>>> 
>>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>>> for instance)
>>> 
>>> There is a bit of demand for that.
>>> 
>>> - Faster time to first application (from discovery to first application
>>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>>> and reduce the friction for adoption?)
>>> 
>>> There is huge evidence that this is important.
>>> 
>>> 
>>>>  - Better documenting use cases with working samples/examples
>>>> (Documentation
>>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>>> and
>>>> there is too much focus on algorithms as opposed to use cases - this is
>>> an
>>>> adoption blocker)
>>> 
>>> This is also important.
>>> 
>>> 
>>>> - Uniformity of the API set across all algorithms (are we providing the
>>>> same experience across all APIs?)
>>> 
>>> And many people have been tripped up by this.
>>> 
>>> 
>>>>  - Measuring/publishing scalability metrics of various algorithms (why
>>>> would
>>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>> 
>>> I don't see this as important as some of your other points, but is still
>>> useful.
>

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

Yes.  Excellent idea. 

Sent from my iPhone

> On Mar 2, 2014, at 22:34, Sebastian Schelter <ss...@apache.org> wrote:
> 
> That's a very good idea. I'm happy to join!
> 
>> On 03/03/2014 07:31 AM, Ravi Mummulla wrote:
>> Ted and others,
>> Once we have enough thoughts on 1.0 on this thread, can we get together on
>> Google Hangout and discuss the plan, prioritize the work, and talk
>> about rough timeline for landing 1.0? We can then create JIRAs and go from
>> there. If everyone agrees, any preferences on a rough hangout date?
>> 
>> Thanks.
>> 
>> 
>>> On Sun, Mar 2, 2014 at 8:45 AM, Ted Dunning <te...@gmail.com> wrote:
>>> 
>>> Ravi,
>>> 
>>> Good points.
>>> 
>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummulla@gmail.com
>>>> wrote:
>>> 
>>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>>> for instance)
>>> 
>>> There is a bit of demand for that.
>>> 
>>> - Faster time to first application (from discovery to first application
>>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>>> and reduce the friction for adoption?)
>>> 
>>> There is huge evidence that this is important.
>>> 
>>> 
>>>>  - Better documenting use cases with working samples/examples
>>>> (Documentation
>>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>>> and
>>>> there is too much focus on algorithms as opposed to use cases - this is
>>> an
>>>> adoption blocker)
>>> 
>>> This is also important.
>>> 
>>> 
>>>> - Uniformity of the API set across all algorithms (are we providing the
>>>> same experience across all APIs?)
>>> 
>>> And many people have been tripped up by this.
>>> 
>>> 
>>>>  - Measuring/publishing scalability metrics of various algorithms (why
>>>> would
>>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>> 
>>> I don't see this as important as some of your other points, but is still
>>> useful.
>

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@apache.org>.

That's a very good idea. I'm happy to join!

On 03/03/2014 07:31 AM, Ravi Mummulla wrote:
> Ted and others,
> Once we have enough thoughts on 1.0 on this thread, can we get together on
> Google Hangout and discuss the plan, prioritize the work, and talk
> about rough timeline for landing 1.0? We can then create JIRAs and go from
> there. If everyone agrees, any preferences on a rough hangout date?
>
> Thanks.
>
>
> On Sun, Mar 2, 2014 at 8:45 AM, Ted Dunning <te...@gmail.com> wrote:
>
>> Ravi,
>>
>> Good points.
>>
>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummulla@gmail.com
>>> wrote:
>>
>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>> for instance)
>>>
>>
>> There is a bit of demand for that.
>>
>> - Faster time to first application (from discovery to first application
>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>> and reduce the friction for adoption?)
>>>
>>
>> There is huge evidence that this is important.
>>
>>
>>>   - Better documenting use cases with working samples/examples
>>> (Documentation
>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>> and
>>> there is too much focus on algorithms as opposed to use cases - this is
>> an
>>> adoption blocker)
>>>
>>
>> This is also important.
>>
>>
>>> - Uniformity of the API set across all algorithms (are we providing the
>>> same experience across all APIs?)
>>>
>>
>> And many people have been tripped up by this.
>>
>>
>>>   - Measuring/publishing scalability metrics of various algorithms (why
>>> would
>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>>
>>
>> I don't see this as important as some of your other points, but is still
>> useful.
>>
>
>
>

Re: Mahout 1.0 goals

Posted by Ravi Mummulla <ra...@gmail.com>.

Ted and others,
Once we have enough thoughts on 1.0 on this thread, can we get together on
Google Hangout and discuss the plan, prioritize the work, and talk
about rough timeline for landing 1.0? We can then create JIRAs and go from
there. If everyone agrees, any preferences on a rough hangout date?

Thanks.


On Sun, Mar 2, 2014 at 8:45 AM, Ted Dunning <te...@gmail.com> wrote:

> Ravi,
>
> Good points.
>
> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ravi.mummulla@gmail.com
> >wrote:
>
> > - Natively support Windows (guidance, etc. No documentation exists today,
> > for instance)
> >
>
> There is a bit of demand for that.
>
> - Faster time to first application (from discovery to first application
> > currently takes a non-trivial amount of effort; how can we lower the bar
> > and reduce the friction for adoption?)
> >
>
> There is huge evidence that this is important.
>
>
> >  - Better documenting use cases with working samples/examples
> > (Documentation
> > on https://mahout.apache.org/users/basics/algorithms.html is spread out
> > and
> > there is too much focus on algorithms as opposed to use cases - this is
> an
> > adoption blocker)
> >
>
> This is also important.
>
>
> > - Uniformity of the API set across all algorithms (are we providing the
> > same experience across all APIs?)
> >
>
> And many people have been tripped up by this.
>
>
> >  - Measuring/publishing scalability metrics of various algorithms (why
> > would
> > we want users to adopt Mahout vs. other frameworks for ML at scale?)
> >
>
> I don't see this as important as some of your other points, but is still
> useful.
>



-- 
Thanks.

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@apache.org>.

On 03/03/2014 01:26 AM, peng wrote:
> 1. Components are not interchangeable: e.g. the data and model
> presentation for single-node CF is vastly different from MR CF. New
> feature sometimes add backward-incompatible presentation. This
> drastically demoralized user seeking to integrate with it and expecting
> improvement.

That is a very important point. Could you create a jira ticket that 
describes which models for CF you would like to be interchangeable?

Thanks,
Sebastian

Re: Mahout 1.0 goals

Posted by peng <pc...@uowmail.edu.au>.

Hi Dr Dunning,

I'm reluctant to admit that my feeling is similar to that of many of Sean's
customers. As a user of Mahout and Lucene/Solr, I see a lot of
similarities in their cases:

Lucene: indexing takes text as sparse vectors and builds an inverted index
Mahout: training takes data as sparse vectors and builds a model

Lucene: the inverted index exists in memory/HDFS
Mahout: the model exists in memory/HDFS

Lucene: used by inputting text and returning matches with scores
Mahout: used by inputting test data and returning scores/labels

Lucene: model selection by comparing the ordinal ranking of scores with
ground truth
Mahout: model selection by comparing scores/labels with ground truth

Then Lucene/Solr/Elasticsearch evolved to become the most successful
flagship products (as buggy and incomplete as they are, they still gained
wide usage, which Mahout never achieved). Yet Mahout still looks like it was
assembled with glue and duct tape. The major difficulties I encountered
are:

1. Components are not interchangeable: e.g. the data and model
representation for single-node CF is vastly different from that of MR CF.
New features sometimes add backward-incompatible representations. This
drastically demoralizes users seeking to integrate with it and expecting
improvement.
2. Components have strong dependencies on others: e.g. cross-validation
of CF can only use the in-memory DataModel, which SlopeOneRecommender
cannot update properly (it's removed, but you get my point). Such design
issues never drew enough attention beyond a 'won't fix' resolution.
3. Many models can only be used internally and cannot be exported or
reused in other applications. This is true in Solr as well, but its
RESTful API is very universal and many ETL tools have been built for it.
In contrast, Mahout has a very steep learning curve for non-Java
developers.

It's not bad to see Mahout as a service on top of a library, if it
doesn't take too much effort.
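
As a rough illustration of points 1 and 3, one option is to make the model
itself plain, versioned data that any trainer can produce and any application
can load. The sketch below is in Python only for brevity and is not Mahout
code: the "item-cooccurrence" format name, the version field, and all
function names are hypothetical.

```python
import json
from collections import defaultdict
from itertools import combinations


def train_cooccurrence(interactions):
    """Single-node trainer: count how often two items share a user.

    A distributed trainer could emit the same dictionary layout, which is
    what would make the two implementations interchangeable.
    """
    by_user = defaultdict(set)
    for user, item in interactions:
        by_user[user].add(item)
    counts = defaultdict(dict)
    for items in by_user.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] = counts[a].get(b, 0) + 1
            counts[b][a] = counts[b].get(a, 0) + 1
    return {"format": "item-cooccurrence", "version": 1, "counts": dict(counts)}


def export_model(model):
    """Serialize the model as JSON so non-Java tools can read it."""
    return json.dumps(model, sort_keys=True)


def load_model(text):
    """Load a model, rejecting formats this reader does not understand."""
    model = json.loads(text)
    if model.get("format") != "item-cooccurrence" or model.get("version") != 1:
        raise ValueError("unsupported model format")
    return model


def recommend(model, seen, k=3):
    """Score unseen items by summed co-occurrence with the seen items."""
    scores = defaultdict(int)
    for item in seen:
        for other, n in model["counts"].get(item, {}).items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


interactions = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"),
                ("u2", "c"), ("u3", "b"), ("u3", "c")]
model = load_model(export_model(train_cooccurrence(interactions)))
print(recommend(model, {"a"}))  # ['b', 'c']
```

The explicit format and version fields are the part that addresses backward
compatibility: a reader can refuse or migrate an old model instead of
silently misreading it.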

Yours Peng

On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> Ravi,
>
> Good points.
>
> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com>wrote:
>
>> - Natively support Windows (guidance, etc. No documentation exists today,
>> for instance)
>>
>
> There is a bit of demand for that.
>
> - Faster time to first application (from discovery to first application
>> currently takes a non-trivial amount of effort; how can we lower the bar
>> and reduce the friction for adoption?)
>>
>
> There is huge evidence that this is important.
>
>
>>   - Better documenting use cases with working samples/examples
>> (Documentation
>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>> and
>> there is too much focus on algorithms as opposed to use cases - this is an
>> adoption blocker)
>>
>
> This is also important.
>
>
>> - Uniformity of the API set across all algorithms (are we providing the
>> same experience across all APIs?)
>>
>
> And many people have been tripped up by this.
>
>
>>   - Measuring/publishing scalability metrics of various algorithms (why
>> would
>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>
>
> I don't see this as important as some of your other points, but is still
> useful.
>

Re: Mahout 1.0 goals

Posted by Giorgio Zoppi <gi...@gmail.com>.

OK, Dr. Dunning. So put an appointment in Google Calendar, and add me. This
should be a non-working day, since we are volunteering.


2014-03-03 18:51 GMT+01:00 Ted Dunning <te...@gmail.com>:

> Happy to organize a google hangout.  That has the advantage of allowing
> more attendees and supporting YouTube archiving.
>
> Sent from my iPhone
>
> > On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
> >
> > Hello All,
> > Dr.Dunning could you set a meeting next Sat morning, so we can chat and
> > discuss by Skype improvements and what to do and identify volunteers and
> > tasks.
> > Best Regards,
> > Giorgio
> >
> >
> > 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> >
> >> Me three
> >>
> >>
> >>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> >>>
> >>> Ravi,
> >>>
> >>> Good points.
> >>>
> >>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
> ravi.mummulla@gmail.com>
> >>> wrote:
> >>>
> >>> - Natively support Windows (guidance, etc. No documentation exists
> today,
> >>>> for instance)
> >>> There is a bit of demand for that.
> >>>
> >>> - Faster time to first application (from discovery to first application
> >>>
> >>>> currently takes a non-trivial amount of effort; how can we lower the
> bar
> >>>> and reduce the friction for adoption?)
> >>> There is huge evidence that this is important.
> >>>
> >>>
> >>>   - Better documenting use cases with working samples/examples
> >>>> (Documentation
> >>>> on https://mahout.apache.org/users/basics/algorithms.html is spread
> out
> >>>> and
> >>>> there is too much focus on algorithms as opposed to use cases - this
> is
> >>>> an
> >>>> adoption blocker)
> >>> This is also important.
> >>>
> >>>
> >>> - Uniformity of the API set across all algorithms (are we providing the
> >>>> same experience across all APIs?)
> >>> And many people have been tripped up by this.
> >>>
> >>>
> >>>   - Measuring/publishing scalability metrics of various algorithms (why
> >>>> would
> >>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
> >>> I don't see this as important as some of your other points, but is
> still
> >>> useful.
> >
> >
> > --
> > Quiero ser el rayo de sol que cada día te despierta
> > para hacerte respirar y vivir en me.
> > "Favola -Moda".
>



-- 
Quiero ser el rayo de sol que cada día te despierta
para hacerte respirar y vivir en me.
"Favola -Moda".

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

Yeah, great


On Mon, Mar 3, 2014 at 11:01 AM, Suneel Marthi <su...@yahoo.com>wrote:

> IRC channel
>
> Sent from my iPhone
>
> > On Mar 3, 2014, at 1:58 PM, Suneel Marthi <su...@yahoo.com>
> wrote:
> >
> > There is an RISC channel for mahout on freenode that's still active .
> >
> > Sent from my iPhone
> >
> >> On Mar 3, 2014, at 1:46 PM, Andrew Musselman <
> andrew.musselman@gmail.com> wrote:
> >>
> >> How about reviving/advertising an IRC channel so people could hop on
> >> whenever they're free, see if that gains any momentum.
> >>
> >>
> >>> On Mon, Mar 3, 2014 at 10:38 AM, Ted Dunning <te...@gmail.com>
> wrote:
> >>>
> >>> We can have more than one hangout to cover multiple time zones/work
> >>> requirements.  Each meeting should forward notes to the mailing list.
> >>>
> >>>
> >>> On Mon, Mar 3, 2014 at 10:09 AM, Giorgio Zoppi <
> giorgio.zoppi@gmail.com
> >>>> wrote:
> >>>
> >>>> So Friday afterwork_
> >>>>
> >>>>
> >>>> 2014-03-03 18:56 GMT+01:00 Suneel Marthi <su...@yahoo.com>:
> >>>>
> >>>>> Grant had setup a Google Hangout for Mahout sometime last year before
> >>> 0.8
> >>>>> release.  I had one setup too for 0.9 release. I definitely wouldn't
> >>> want
> >>>>> to have a hangout on Saturday or weekend.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Monday, March 3, 2014 12:52 PM, Ted Dunning <
> ted.dunning@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> Happy to organize a google hangout.  That has the advantage of
> allowing
> >>>>> more attendees and supporting YouTube archiving.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>
> >>>>>> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>> Hello All,
> >>>>>> Dr.Dunning could you set a meeting next Sat morning, so we can chat
> >>> and
> >>>>>> discuss by Skype improvements and what to do and identify volunteers
> >>>> and
> >>>>>> tasks.
> >>>>>> Best Regards,
> >>>>>> Giorgio
> >>>>>>
> >>>>>>
> >>>>>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> >>>>>>
> >>>>>>> Me three
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> >>>>>>>>
> >>>>>>>> Ravi,
> >>>>>>>>
> >>>>>>>> Good points.
> >>>>>>>>
> >>>>>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
> >>>>> ravi.mummulla@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> - Natively support Windows (guidance, etc. No documentation exists
> >>>>> today,
> >>>>>>>>> for instance)
> >>>>>>>> There is a bit of demand for that.
> >>>>>>>>
> >>>>>>>> - Faster time to first application (from discovery to first
> >>>> application
> >>>>>>>>
> >>>>>>>>> currently takes a non-trivial amount of effort; how can we lower
> >>> the
> >>>>> bar
> >>>>>>>>> and reduce the friction for adoption?)
> >>>>>>>> There is huge evidence that this is important.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> - Better documenting use cases with working samples/examples
> >>>>>>>>> (Documentation
> >>>>>>>>> on https://mahout.apache.org/users/basics/algorithms.html is
> >>> spread
> >>>>> out
> >>>>>>>>> and
> >>>>>>>>> there is too much focus on algorithms as opposed to use cases -
> >>> this
> >>>>> is
> >>>>>>>>> an
> >>>>>>>>> adoption blocker)
> >>>>>>>> This is also important.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> - Uniformity of the API set across all algorithms (are we
> providing
> >>>> the
> >>>>>>>>> same experience across all APIs?)
> >>>>>>>> And many people have been tripped up by this.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> - Measuring/publishing scalability metrics of various algorithms
> >>>> (why
> >>>>>>>>> would
> >>>>>>>>> we want users to adopt Mahout vs. other frameworks for ML at
> >>> scale?)
> >>>>>>>> I don't see this as important as some of your other points, but is
> >>>>> still
> >>>>>>>> useful.
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Quiero ser el rayo de sol que cada día te despierta
> >>>>>> para hacerte respirar y vivir en me.
> >>>>>> "Favola -Moda".
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Quiero ser el rayo de sol que cada día te despierta
> >>>> para hacerte respirar y vivir en me.
> >>>> "Favola -Moda".
> >>>
>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

IRC channel

Sent from my iPhone

> On Mar 3, 2014, at 1:58 PM, Suneel Marthi <su...@yahoo.com> wrote:
> 
> There is an RISC channel for mahout on freenode that's still active .
> 
> Sent from my iPhone
> 
>> On Mar 3, 2014, at 1:46 PM, Andrew Musselman <an...@gmail.com> wrote:
>> 
>> How about reviving/advertising an IRC channel so people could hop on
>> whenever they're free, see if that gains any momentum.
>> 
>> 
>>> On Mon, Mar 3, 2014 at 10:38 AM, Ted Dunning <te...@gmail.com> wrote:
>>> 
>>> We can have more than one hangout to cover multiple time zones/work
>>> requirements.  Each meeting should forward notes to the mailing list.
>>> 
>>> 
>>> On Mon, Mar 3, 2014 at 10:09 AM, Giorgio Zoppi <giorgio.zoppi@gmail.com
>>>> wrote:
>>> 
>>>> So Friday afterwork_
>>>> 
>>>> 
>>>> 2014-03-03 18:56 GMT+01:00 Suneel Marthi <su...@yahoo.com>:
>>>> 
>>>>> Grant had setup a Google Hangout for Mahout sometime last year before
>>> 0.8
>>>>> release.  I had one setup too for 0.9 release. I definitely wouldn't
>>> want
>>>>> to have a hangout on Saturday or weekend.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> Happy to organize a google hangout.  That has the advantage of allowing
>>>>> more attendees and supporting YouTube archiving.
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>> 
>>>>>> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
>>>> wrote:
>>>>>> 
>>>>>> Hello All,
>>>>>> Dr.Dunning could you set a meeting next Sat morning, so we can chat
>>> and
>>>>>> discuss by Skype improvements and what to do and identify volunteers
>>>> and
>>>>>> tasks.
>>>>>> Best Regards,
>>>>>> Giorgio
>>>>>> 
>>>>>> 
>>>>>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
>>>>>> 
>>>>>>> Me three
>>>>>>> 
>>>>>>> 
>>>>>>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>>>>>>> 
>>>>>>>> Ravi,
>>>>>>>> 
>>>>>>>> Good points.
>>>>>>>> 
>>>>>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
>>>>> ravi.mummulla@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> - Natively support Windows (guidance, etc. No documentation exists
>>>>> today,
>>>>>>>>> for instance)
>>>>>>>> There is a bit of demand for that.
>>>>>>>> 
>>>>>>>> - Faster time to first application (from discovery to first
>>>> application
>>>>>>>> 
>>>>>>>>> currently takes a non-trivial amount of effort; how can we lower
>>> the
>>>>> bar
>>>>>>>>> and reduce the friction for adoption?)
>>>>>>>> There is huge evidence that this is important.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> - Better documenting use cases with working samples/examples
>>>>>>>>> (Documentation
>>>>>>>>> on https://mahout.apache.org/users/basics/algorithms.html is
>>> spread
>>>>> out
>>>>>>>>> and
>>>>>>>>> there is too much focus on algorithms as opposed to use cases -
>>> this
>>>>> is
>>>>>>>>> an
>>>>>>>>> adoption blocker)
>>>>>>>> This is also important.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> - Uniformity of the API set across all algorithms (are we providing
>>>> the
>>>>>>>>> same experience across all APIs?)
>>>>>>>> And many people have been tripped up by this.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> - Measuring/publishing scalability metrics of various algorithms
>>>> (why
>>>>>>>>> would
>>>>>>>>> we want users to adopt Mahout vs. other frameworks for ML at
>>> scale?)
>>>>>>>> I don't see this as important as some of your other points, but is
>>>>> still
>>>>>>>> useful.
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Quiero ser el rayo de sol que cada día te despierta
>>>>>> para hacerte respirar y vivir en me.
>>>>>> "Favola -Moda".
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Quiero ser el rayo de sol que cada día te despierta
>>>> para hacerte respirar y vivir en me.
>>>> "Favola -Moda".
>>>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

There is an RISC channel for mahout on freenode that's still active .

Sent from my iPhone

> On Mar 3, 2014, at 1:46 PM, Andrew Musselman <an...@gmail.com> wrote:
> 
> How about reviving/advertising an IRC channel so people could hop on
> whenever they're free, see if that gains any momentum.
> 
> 
>> On Mon, Mar 3, 2014 at 10:38 AM, Ted Dunning <te...@gmail.com> wrote:
>> 
>> We can have more than one hangout to cover multiple time zones/work
>> requirements.  Each meeting should forward notes to the mailing list.
>> 
>> 
>> On Mon, Mar 3, 2014 at 10:09 AM, Giorgio Zoppi <giorgio.zoppi@gmail.com
>>> wrote:
>> 
>>> So Friday afterwork_
>>> 
>>> 
>>> 2014-03-03 18:56 GMT+01:00 Suneel Marthi <su...@yahoo.com>:
>>> 
>>>> Grant had setup a Google Hangout for Mahout sometime last year before
>> 0.8
>>>> release.  I had one setup too for 0.9 release. I definitely wouldn't
>> want
>>>> to have a hangout on Saturday or weekend.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>> 
>>>> Happy to organize a google hangout.  That has the advantage of allowing
>>>> more attendees and supporting YouTube archiving.
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> 
>>>>> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
>>> wrote:
>>>>> 
>>>>> Hello All,
>>>>> Dr.Dunning could you set a meeting next Sat morning, so we can chat
>> and
>>>>> discuss by Skype improvements and what to do and identify volunteers
>>> and
>>>>> tasks.
>>>>> Best Regards,
>>>>> Giorgio
>>>>> 
>>>>> 
>>>>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
>>>>> 
>>>>>> Me three
>>>>>> 
>>>>>> 
>>>>>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>>>>>> 
>>>>>>> Ravi,
>>>>>>> 
>>>>>>> Good points.
>>>>>>> 
>>>>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
>>>> ravi.mummulla@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> - Natively support Windows (guidance, etc. No documentation exists
>>>> today,
>>>>>>>> for instance)
>>>>>>> There is a bit of demand for that.
>>>>>>> 
>>>>>>> - Faster time to first application (from discovery to first
>>> application
>>>>>>> 
>>>>>>>> currently takes a non-trivial amount of effort; how can we lower
>> the
>>>> bar
>>>>>>>> and reduce the friction for adoption?)
>>>>>>> There is huge evidence that this is important.
>>>>>>> 
>>>>>>> 
>>>>>>>  - Better documenting use cases with working samples/examples
>>>>>>>> (Documentation
>>>>>>>> on https://mahout.apache.org/users/basics/algorithms.html is
>> spread
>>>> out
>>>>>>>> and
>>>>>>>> there is too much focus on algorithms as opposed to use cases -
>> this
>>>> is
>>>>>>>> an
>>>>>>>> adoption blocker)
>>>>>>> This is also important.
>>>>>>> 
>>>>>>> 
>>>>>>> - Uniformity of the API set across all algorithms (are we providing
>>> the
>>>>>>>> same experience across all APIs?)
>>>>>>> And many people have been tripped up by this.
>>>>>>> 
>>>>>>> 
>>>>>>>  - Measuring/publishing scalability metrics of various algorithms
>>> (why
>>>>>>>> would
>>>>>>>> we want users to adopt Mahout vs. other frameworks for ML at
>> scale?)
>>>>>>> I don't see this as important as some of your other points, but is
>>>> still
>>>>>>> useful.
>>>>> 
>>>>> 
>>>>> --
>>>>> Quiero ser el rayo de sol que cada día te despierta
>>>>> para hacerte respirar y vivir en me.
>>>>> "Favola -Moda".
>>> 
>>> 
>>> 
>>> --
>>> Quiero ser el rayo de sol que cada día te despierta
>>> para hacerte respirar y vivir en me.
>>> "Favola -Moda".
>>

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

How about reviving/advertising an IRC channel so people could hop on
whenever they're free, and seeing if that gains any momentum?


On Mon, Mar 3, 2014 at 10:38 AM, Ted Dunning <te...@gmail.com> wrote:

> We can have more than one hangout to cover multiple time zones/work
> requirements.  Each meeting should forward notes to the mailing list.
>
>
> On Mon, Mar 3, 2014 at 10:09 AM, Giorgio Zoppi <giorgio.zoppi@gmail.com
> >wrote:
>
> > So Friday afterwork_
> >
> >
> > 2014-03-03 18:56 GMT+01:00 Suneel Marthi <su...@yahoo.com>:
> >
> > > Grant had setup a Google Hangout for Mahout sometime last year before
> 0.8
> > > release.  I had one setup too for 0.9 release. I definitely wouldn't
> want
> > > to have a hangout on Saturday or weekend.
> > >
> > >
> > >
> > >
> > >
> > > On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > Happy to organize a google hangout.  That has the advantage of allowing
> > > more attendees and supporting YouTube archiving.
> > >
> > > Sent from my iPhone
> > >
> > >
> > > > On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
> > wrote:
> > > >
> > > > Hello All,
> > > > Dr.Dunning could you set a meeting next Sat morning, so we can chat
> and
> > > > discuss by skype improvements and what to do and indentify volunteer
> > and
> > > > tasks.
> > > > Best Regards,
> > > > Giorgio
> > > >
> > > >
> > > > 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> > > >
> > > >> Me three
> > > >>
> > > >>
> > > >>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> > > >>>
> > > >>> Ravi,
> > > >>>
> > > >>> Good points.
> > > >>>
> > > >>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
> > > ravi.mummulla@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> - Natively support Windows (guidance, etc. No documentation exists
> > > today,
> > > >>>> for instance)
> > > >>> There is a bit of demand for that.
> > > >>>
> > > >>> - Faster time to first application (from discovery to first
> > application
> > > >>>
> > > >>>> currently takes a non-trivial amount of effort; how can we lower
> the
> > > bar
> > > >>>> and reduce the friction for adoption?)
> > > >>> There is huge evidence that this is important.
> > > >>>
> > > >>>
> > > >>>   - Better documenting use cases with working samples/examples
> > > >>>> (Documentation
> > > >>>> on https://mahout.apache.org/users/basics/algorithms.html is
> spread
> > > out
> > > >>>> and
> > > >>>> there is too much focus on algorithms as opposed to use cases -
> this
> > > is
> > > >>>> an
> > > >>>> adoption blocker)
> > > >>> This is also important.
> > > >>>
> > > >>>
> > > >>> - Uniformity of the API set across all algorithms (are we providing
> > the
> > > >>>> same experience across all APIs?)
> > > >>> And many people have been tripped up by this.
> > > >>>
> > > >>>
> > > >>>   - Measuring/publishing scalability metrics of various algorithms
> > (why
> > > >>>> would
> > > >>>> we want users to adopt Mahout vs. other frameworks for ML at
> scale?)
> > > >>> I don't see this as important as some of your other points, but is
> > > still
> > > >>> useful.
> > > >
> > > >
> > > > --
> > > > Quiero ser el rayo de sol que cada día te despierta
> > > > para hacerte respirar y vivir en me.
> > > > "Favola -Moda".
> > >
> >
> >
> >
> > --
> > Quiero ser el rayo de sol que cada día te despierta
> > para hacerte respirar y vivir en me.
> > "Favola -Moda".
> >
>

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

We can have more than one hangout to cover multiple time zones/work
requirements.  Each meeting should forward notes to the mailing list.


On Mon, Mar 3, 2014 at 10:09 AM, Giorgio Zoppi <gi...@gmail.com>wrote:

> So Friday afterwork_
>
>
> 2014-03-03 18:56 GMT+01:00 Suneel Marthi <su...@yahoo.com>:
>
> > Grant had setup a Google Hangout for Mahout sometime last year before 0.8
> > release.  I had one setup too for 0.9 release. I definitely wouldn't want
> > to have a hangout on Saturday or weekend.
> >
> >
> >
> >
> >
> > On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > Happy to organize a google hangout.  That has the advantage of allowing
> > more attendees and supporting YouTube archiving.
> >
> > Sent from my iPhone
> >
> >
> > > On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
> wrote:
> > >
> > > Hello All,
> > > Dr.Dunning could you set a meeting next Sat morning, so we can chat and
> > > discuss by skype improvements and what to do and indentify volunteer
> and
> > > tasks.
> > > Best Regards,
> > > Giorgio
> > >
> > >
> > > 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> > >
> > >> Me three
> > >>
> > >>
> > >>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> > >>>
> > >>> Ravi,
> > >>>
> > >>> Good points.
> > >>>
> > >>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
> > ravi.mummulla@gmail.com>
> > >>> wrote:
> > >>>
> > >>> - Natively support Windows (guidance, etc. No documentation exists
> > today,
> > >>>> for instance)
> > >>> There is a bit of demand for that.
> > >>>
> > >>> - Faster time to first application (from discovery to first
> application
> > >>>
> > >>>> currently takes a non-trivial amount of effort; how can we lower the
> > bar
> > >>>> and reduce the friction for adoption?)
> > >>> There is huge evidence that this is important.
> > >>>
> > >>>
> > >>>   - Better documenting use cases with working samples/examples
> > >>>> (Documentation
> > >>>> on https://mahout.apache.org/users/basics/algorithms.html is spread
> > out
> > >>>> and
> > >>>> there is too much focus on algorithms as opposed to use cases - this
> > is
> > >>>> an
> > >>>> adoption blocker)
> > >>> This is also important.
> > >>>
> > >>>
> > >>> - Uniformity of the API set across all algorithms (are we providing
> the
> > >>>> same experience across all APIs?)
> > >>> And many people have been tripped up by this.
> > >>>
> > >>>
> > >>>   - Measuring/publishing scalability metrics of various algorithms
> (why
> > >>>> would
> > >>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
> > >>> I don't see this as important as some of your other points, but is
> > still
> > >>> useful.
> > >
> > >
> > > --
> > > Quiero ser el rayo de sol que cada día te despierta
> > > para hacerte respirar y vivir en me.
> > > "Favola -Moda".
> >
>
>
>
> --
> Quiero ser el rayo de sol que cada día te despierta
> para hacerte respirar y vivir en me.
> "Favola -Moda".
>

Re: Mahout 1.0 goals

Posted by Giorgio Zoppi <gi...@gmail.com>.

So Friday after work?


2014-03-03 18:56 GMT+01:00 Suneel Marthi <su...@yahoo.com>:

> Grant had setup a Google Hangout for Mahout sometime last year before 0.8
> release.  I had one setup too for 0.9 release. I definitely wouldn't want
> to have a hangout on Saturday or weekend.
>
>
>
>
>
> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> Happy to organize a google hangout.  That has the advantage of allowing
> more attendees and supporting YouTube archiving.
>
> Sent from my iPhone
>
>
> > On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
> >
> > Hello All,
> > Dr.Dunning could you set a meeting next Sat morning, so we can chat and
> > discuss by skype improvements and what to do and indentify volunteer and
> > tasks.
> > Best Regards,
> > Giorgio
> >
> >
> > 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> >
> >> Me three
> >>
> >>
> >>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> >>>
> >>> Ravi,
> >>>
> >>> Good points.
> >>>
> >>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
> ravi.mummulla@gmail.com>
> >>> wrote:
> >>>
> >>> - Natively support Windows (guidance, etc. No documentation exists
> today,
> >>>> for instance)
> >>> There is a bit of demand for that.
> >>>
> >>> - Faster time to first application (from discovery to first application
> >>>
> >>>> currently takes a non-trivial amount of effort; how can we lower the
> bar
> >>>> and reduce the friction for adoption?)
> >>> There is huge evidence that this is important.
> >>>
> >>>
> >>>   - Better documenting use cases with working samples/examples
> >>>> (Documentation
> >>>> on https://mahout.apache.org/users/basics/algorithms.html is spread
> out
> >>>> and
> >>>> there is too much focus on algorithms as opposed to use cases - this
> is
> >>>> an
> >>>> adoption blocker)
> >>> This is also important.
> >>>
> >>>
> >>> - Uniformity of the API set across all algorithms (are we providing the
> >>>> same experience across all APIs?)
> >>> And many people have been tripped up by this.
> >>>
> >>>
> >>>   - Measuring/publishing scalability metrics of various algorithms (why
> >>>> would
> >>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
> >>> I don't see this as important as some of your other points, but is
> still
> >>> useful.
> >
> >
> > --
> > Quiero ser el rayo de sol que cada día te despierta
> > para hacerte respirar y vivir en me.
> > "Favola -Moda".
>



-- 
I want to be the ray of sunshine that wakes you every day,
to make you breathe and live in me.
"Favola -Moda".

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@googlemail.com>.

+1

On 03/08/2014 12:04 AM, Ted Dunning wrote:
> There was not yet a meeting.
>
> I owe the list a summary of what people said and some suggested
> roadmapping.  I will get to that on the weekend and we should be good for a
> hangout meeting sometime next week.
>
>
>
> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sx...@hotmail.com>wrote:
>
>> Hey Guys,Been trying to follow with the 1.0 goals , was there already a
>> meeting on what the initial plans are for development and notes from that,
>> I am particualrly interested in deep learning and service-izing mahout ,
>> let me know.
>> Thanks
>>
>>> From: ted.dunning@gmail.com
>>> Date: Tue, 4 Mar 2014 19:32:40 -0800
>>> Subject: Re: Mahout 1.0 goals
>>> To: dev@mahout.apache.org; ssc@apache.org
>>>
>>> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
>> wrote:
>>>
>>>> - AFAIK its also a problem to ship it license-wise as the required
>>>> libraries would not be Apache licensed
>>>>
>>>> See this discussion from the Spark community for details:
>>>>
>>>> https://github.com/apache/incubator-spark/pull/575
>>>>
>>>
>>> This is a real issue and getting a lot of time over on legal as well.
>>>
>>> A non-optional LGPL dependency doesn't fly at this time.
>>
>>
>

RE: Mahout 1.0 goals

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Agreed on many levels. One thing I was thinking about along these lines is how Mahout fits into custom recommendation engines that already exist. To step back a bit, the web app will have:
1) UI -- logic around serving up any data pipeline
2) Business logic in an MVC-like framework (Spring etc.)
3) Recommendations engine
4) Data store
5) Search

I see Mahout as populating 4 with a real-time view of recommendations (using Cascading or custom plugins underneath) and using Spark (or Storm) to serve these up. For number 3, Mahout can live underneath a more business-driven recommendations engine that ties into 2.

For 5, if the interfaces are defined correctly, we can potentially plug in any Lucene-like implementation under the hood. So the question in my mind becomes where and how Mahout can provide the most value, and what APIs need to be written for it to fit into one or more of the layers above.

I was reading this story on elasticsearch's site and it sparked some of the thoughts above:
http://www.elasticsearch.org/case-study/stumbleupon/

I'd love to volunteer to build something that does all of the above and showcases mahout's abilities if there's enough interest.

Regards


> Subject: Re: Mahout 1.0 goals
> From: pat@occamsmachete.com
> Date: Tue, 11 Mar 2014 09:55:04 -0700
> To: dev@mahout.apache.org
> 
> Doing an example site for the solr-recommender Ted and I were faced with same choices you mention below. He and I chose quite different architectures, either of which is perfectly good. 
> 
> I spent some time thinking about what the common integration points are for web apps. Solr supports a large community of web app integrators and works with about any data format and database out there. So in this special case virtually any wep app framework would have one or more methods for integrating with Solr.
> 
> Why not Mahout?
> 
> There at least two ends to web app integration, the input pipeline and serving the results. Not to mention  background potentially periodic model creation. The web app framework usually defines the way data is served (html, json, REST, the list of formats and protocols goes on) so let me put that aside for now. To me this points to getting data into mahout and out again. Ideally it should come in through an extremely flexible mechanism, which may also serve to get the data out.
> 
> Input and output is primarily about translating formats, Ids, and communicating with storage services (local fs, HDFS, S3, DB, …). I chose Cascading to process input in a mostly scalable way. Cascading does not yet have Schemas to support all the DBs so I build one for my DB (MongoDB) but it does support most file systems. There has been some talk in that community about adding Schemas for DBs, which is also possible to do yourself. It may be possible to create several of the more common pipelines all the way from reading data from a logfile, Cassandra, S3, etc through model creation to output to the web app’s primary store. This leaves it somewhat independent of the web app framework. If defined correctly if could have pluggable sink and source types and flexible format definitions.
> 
> Maybe there are better data pipeline frameworks than Cascading and making this work in 80% of use cases will be a fair amount of work but as long as Mahout has enough users it remains an important missing piece. 
> 
> I suspect that any reasonable attempt at this input to Mahout to datastore pipeline would be considered for inclusion or reference in Mahout-Examples.
> 
> 
> On Mar 8, 2014, at 2:31 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
> 
> Ok so the idea here is to tie and make some strategic partnerships with some other open source products and provide Mahout as one component of a web application, so the use cases for mahout will be partly driven by the use cases for the web application itself, so in a nutshell a web application requires: 1) search 2) recommendations 3) a primary data store.  The recommendations may be driven by the higher level use cases but the key piece here will be pushing mahout into delivering real time recommendations that someone can then perform searches over. One example might be to search for music recommendations like what spotify already does and perform term filters, term queries or other lucene based searches to deliver results.  Another might be to identify how recommendations fit into the rest endpoints or in the case of serviceizing mahout they can be rest endpoints.    I've been thinking about this for a while since lately I've seen a lot of discussions around mahout being hard to use or pick up and learn.   If there's enough interest I can go into more detail when we meet to discuss 1.0
> 
> > Date: Sat, 8 Mar 2014 11:44:53 +0100
> > From: ssc@apache.org
> > To: dev@mahout.apache.org
> > Subject: Re: Mahout 1.0 goals
> > 
> > Hm, can you elaborate more what you mean? IMHO Mahout is a library only, 
> > so we should not build a complete MVC application inside this project, I 
> > think this is something that people should build on top, like 
> > prediction.io .
> > 
> > --sebastian
> > 
> > 
> > On 03/08/2014 12:16 AM, Saikat Kanjilal wrote:
> >> I was also wondering if there'd be any interest in building a plugin to interface with elasticsearch and spring, so what I am thinking is an MVC type service that performs lucene like searches on recommendation algorithm data stored inside a low latency data store, I know/saw that  there was a discussion on a solr recommender on mahout and would be glad to help lead/build an elasticsearch version.
> >> 
> >>> From: ted.dunning@gmail.com
> >>> Date: Fri, 7 Mar 2014 15:04:42 -0800
> >>> Subject: Re: Mahout 1.0 goals
> >>> To: dev@mahout.apache.org
> >>> 
> >>> There was not yet a meeting.
> >>> 
> >>> I owe the list a summary of what people said and some suggested
> >>> roadmapping.  I will get to that on the weekend and we should be good for a
> >>> hangout meeting sometime next week.
> >>> 
> >>> 
> >>> 
> >>> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sx...@hotmail.com>wrote:
> >>> 
> >>>> Hey Guys,Been trying to follow with the 1.0 goals , was there already a
> >>>> meeting on what the initial plans are for development and notes from that,
> >>>> I am particualrly interested in deep learning and service-izing mahout ,
> >>>> let me know.
> >>>> Thanks
> >>>> 
> >>>>> From: ted.dunning@gmail.com
> >>>>> Date: Tue, 4 Mar 2014 19:32:40 -0800
> >>>>> Subject: Re: Mahout 1.0 goals
> >>>>> To: dev@mahout.apache.org; ssc@apache.org
> >>>>> 
> >>>>> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
> >>>> wrote:
> >>>>> 
> >>>>>> - AFAIK its also a problem to ship it license-wise as the required
> >>>>>> libraries would not be Apache licensed
> >>>>>> 
> >>>>>> See this discussion from the Spark community for details:
> >>>>>> 
> >>>>>> https://github.com/apache/incubator-spark/pull/575
> >>>>>> 
> >>>>> 
> >>>>> This is a real issue and getting a lot of time over on legal as well.
> >>>>> 
> >>>>> A non-optional LGPL dependency doesn't fly at this time.
> >>>> 
> >>>> 
> >> 
> > 
>

Re: Mahout 1.0 goals

Posted by Frank Scholten <fr...@frankscholten.nl>.

+1 for pluggable sink and source types.



On Tue, Mar 11, 2014 at 5:55 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Doing an example site for the solr-recommender Ted and I were faced with
> same choices you mention below. He and I chose quite different
> architectures, either of which is perfectly good.
>
> I spent some time thinking about what the common integration points are
> for web apps. Solr supports a large community of web app integrators and
> works with about any data format and database out there. So in this special
> case virtually any wep app framework would have one or more methods for
> integrating with Solr.
>
> Why not Mahout?
>
> There at least two ends to web app integration, the input pipeline and
> serving the results. Not to mention  background potentially periodic model
> creation. The web app framework usually defines the way data is served
> (html, json, REST, the list of formats and protocols goes on) so let me put
> that aside for now. To me this points to getting data into mahout and out
> again. Ideally it should come in through an extremely flexible mechanism,
> which may also serve to get the data out.
>
> Input and output is primarily about translating formats, Ids, and
> communicating with storage services (local fs, HDFS, S3, DB, ...). I chose
> Cascading to process input in a mostly scalable way. Cascading does not yet
> have Schemas to support all the DBs so I build one for my DB (MongoDB) but
> it does support most file systems. There has been some talk in that
> community about adding Schemas for DBs, which is also possible to do
> yourself. It may be possible to create several of the more common pipelines
> all the way from reading data from a logfile, Cassandra, S3, etc through
> model creation to output to the web app's primary store. This leaves it
> somewhat independent of the web app framework. If defined correctly if
> could have pluggable sink and source types and flexible format definitions.
>
> Maybe there are better data pipeline frameworks than Cascading and making
> this work in 80% of use cases will be a fair amount of work but as long as
> Mahout has enough users it remains an important missing piece.
>
> I suspect that any reasonable attempt at this input to Mahout to datastore
> pipeline would be considered for inclusion or reference in Mahout-Examples.
>
>
> On Mar 8, 2014, at 2:31 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:
>
> Ok so the idea here is to tie and make some strategic partnerships with
> some other open source products and provide Mahout as one component of a
> web application, so the use cases for mahout will be partly driven by the
> use cases for the web application itself, so in a nutshell a web
> application requires: 1) search 2) recommendations 3) a primary data store.
>  The recommendations may be driven by the higher level use cases but the
> key piece here will be pushing mahout into delivering real time
> recommendations that someone can then perform searches over. One example
> might be to search for music recommendations like what spotify already does
> and perform term filters, term queries or other lucene based searches to
> deliver results.  Another might be to identify how recommendations fit into
> the rest endpoints or in the case of serviceizing mahout they can be rest
> endpoints.    I've been thinking about this for a while since lately I've
> seen a lot of discussions around mahout being hard to use or pick up and
> learn.   If there's enough interest I can go into more detail when we meet
> to discuss 1.0
>
> > Date: Sat, 8 Mar 2014 11:44:53 +0100
> > From: ssc@apache.org
> > To: dev@mahout.apache.org
> > Subject: Re: Mahout 1.0 goals
> >
> > Hm, can you elaborate more what you mean? IMHO Mahout is a library only,
> > so we should not build a complete MVC application inside this project, I
> > think this is something that people should build on top, like
> > prediction.io .
> >
> > --sebastian
> >
> >
> > On 03/08/2014 12:16 AM, Saikat Kanjilal wrote:
> >> I was also wondering if there'd be any interest in building a plugin to
> interface with elasticsearch and spring, so what I am thinking is an MVC
> type service that performs lucene like searches on recommendation algorithm
> data stored inside a low latency data store, I know/saw that  there was a
> discussion on a solr recommender on mahout and would be glad to help
> lead/build an elasticsearch version.
> >>
> >>> From: ted.dunning@gmail.com
> >>> Date: Fri, 7 Mar 2014 15:04:42 -0800
> >>> Subject: Re: Mahout 1.0 goals
> >>> To: dev@mahout.apache.org
> >>>
> >>> There was not yet a meeting.
> >>>
> >>> I owe the list a summary of what people said and some suggested
> >>> roadmapping.  I will get to that on the weekend and we should be good
> for a
> >>> hangout meeting sometime next week.
> >>>
> >>>
> >>>
> >>> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sxk1969@hotmail.com
> >wrote:
> >>>
> >>>> Hey Guys,Been trying to follow with the 1.0 goals , was there already
> a
> >>>> meeting on what the initial plans are for development and notes from
> that,
> >>>> I am particualrly interested in deep learning and service-izing
> mahout ,
> >>>> let me know.
> >>>> Thanks
> >>>>
> >>>>> From: ted.dunning@gmail.com
> >>>>> Date: Tue, 4 Mar 2014 19:32:40 -0800
> >>>>> Subject: Re: Mahout 1.0 goals
> >>>>> To: dev@mahout.apache.org; ssc@apache.org
> >>>>>
> >>>>> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
> >>>> wrote:
> >>>>>
> >>>>>> - AFAIK its also a problem to ship it license-wise as the required
> >>>>>> libraries would not be Apache licensed
> >>>>>>
> >>>>>> See this discussion from the Spark community for details:
> >>>>>>
> >>>>>> https://github.com/apache/incubator-spark/pull/575
> >>>>>>
> >>>>>
> >>>>> This is a real issue and getting a lot of time over on legal as well.
> >>>>>
> >>>>> A non-optional LGPL dependency doesn't fly at this time.
> >>>>
> >>>>
> >>
> >>
> >
>
>
>

Re: Mahout 1.0 goals

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Doing an example site for the solr-recommender, Ted and I were faced with the same choices you mention below. He and I chose quite different architectures, either of which is perfectly good.

I spent some time thinking about what the common integration points are for web apps. Solr supports a large community of web app integrators and works with just about any data format and database out there. So in this special case virtually any web app framework would have one or more methods for integrating with Solr.

Why not Mahout?

There are at least two ends to web app integration: the input pipeline and serving the results, not to mention background, potentially periodic, model creation. The web app framework usually defines the way data is served (html, json, REST, the list of formats and protocols goes on), so let me put that aside for now. To me this points to getting data into Mahout and out again. Ideally it should come in through an extremely flexible mechanism, which may also serve to get the data out.

Input and output are primarily about translating formats and IDs, and communicating with storage services (local fs, HDFS, S3, DB, …). I chose Cascading to process input in a mostly scalable way. Cascading does not yet have Schemas to support all the DBs, so I built one for my DB (MongoDB), but it does support most file systems. There has been some talk in that community about adding Schemas for DBs, which is also possible to do yourself. It may be possible to create several of the more common pipelines all the way from reading data from a logfile, Cassandra, S3, etc. through model creation to output to the web app’s primary store. This leaves it somewhat independent of the web app framework. If defined correctly, it could have pluggable sink and source types and flexible format definitions.
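
To make the pluggable source/sink idea a bit more concrete, here is a rough sketch. It is only an illustration under stated assumptions: it is plain Python rather than Cascading or Mahout code, and every name in it (Source, Sink, DelimitedTextSource, DictSink, IdDictionary, run_pipeline) is hypothetical. The model step is a placeholder that just passes each user's items through; the point is only the shape of the pipeline, with ID translation on the way in and on the way out.

```python
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple

# A raw interaction as it might appear in a web app's log: (user, item).
Interaction = Tuple[str, str]


class Source(Protocol):
    """Hypothetical interface: anything that yields (user_id, item_id) pairs."""
    def read(self) -> Iterator[Interaction]: ...


class Sink(Protocol):
    """Hypothetical interface: anything that accepts per-user results."""
    def write(self, user_id: str, item_ids: List[str]) -> None: ...


class DelimitedTextSource:
    """Hypothetical source: parses delimited log lines held in memory."""
    def __init__(self, lines: Iterable[str], delimiter: str = ",") -> None:
        self.lines = lines
        self.delimiter = delimiter

    def read(self) -> Iterator[Interaction]:
        for line in self.lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            user, item = line.split(self.delimiter)[:2]
            yield user.strip(), item.strip()


class DictSink:
    """Hypothetical sink: collects results in a dict, standing in for a DB."""
    def __init__(self) -> None:
        self.rows: Dict[str, List[str]] = {}

    def write(self, user_id: str, item_ids: List[str]) -> None:
        self.rows[user_id] = item_ids


class IdDictionary:
    """Maps external string IDs to dense integer indices and back."""
    def __init__(self) -> None:
        self._to_index: Dict[str, int] = {}
        self._to_id: List[str] = []

    def index_of(self, external_id: str) -> int:
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index: int) -> str:
        return self._to_id[index]


def run_pipeline(source: Source, sink: Sink) -> None:
    users, items = IdDictionary(), IdDictionary()
    # Input side: translate external IDs to integer indices.
    by_user: Dict[int, List[int]] = {}
    for user, item in source.read():
        by_user.setdefault(users.index_of(user), []).append(items.index_of(item))
    # Placeholder for model creation: pass each user's de-duplicated
    # items through unchanged, in first-seen order.
    for user_index, item_indices in by_user.items():
        unique = list(dict.fromkeys(item_indices))
        # Output side: translate indices back to external IDs.
        sink.write(users.id_of(user_index), [items.id_of(i) for i in unique])
```

Swapping the in-memory source for one that reads HDFS or S3, or the dict sink for one that writes to MongoDB, would then mean adding another class with the same read or write method, without touching run_pipeline.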

Maybe there are better data pipeline frameworks than Cascading, and making this work in 80% of use cases will be a fair amount of work, but as long as Mahout has enough users it remains an important missing piece.

I suspect that any reasonable attempt at this input to Mahout to datastore pipeline would be considered for inclusion or reference in Mahout-Examples.

On Mar 8, 2014, at 2:31 PM, Saikat Kanjilal <sx...@hotmail.com> wrote:

Ok so the idea here is to tie and make some strategic partnerships with some other open source products and provide Mahout as one component of a web application, so the use cases for mahout will be partly driven by the use cases for the web application itself, so in a nutshell a web application requires: 1) search 2) recommendations 3) a primary data store.  The recommendations may be driven by the higher level use cases but the key piece here will be pushing mahout into delivering real time recommendations that someone can then perform searches over. One example might be to search for music recommendations like what spotify already does and perform term filters, term queries or other lucene based searches to deliver results.  Another might be to identify how recommendations fit into the rest endpoints or in the case of serviceizing mahout they can be rest endpoints.    I've been thinking about this for a while since lately I've seen a lot of discussions around mahout being hard to use or pick up and learn.   If there's enough interest I can go into more detail when we meet to discuss 1.0

> Date: Sat, 8 Mar 2014 11:44:53 +0100
> From: ssc@apache.org
> To: dev@mahout.apache.org
> Subject: Re: Mahout 1.0 goals
> 
> Hm, can you elaborate more what you mean? IMHO Mahout is a library only, 
> so we should not build a complete MVC application inside this project, I 
> think this is something that people should build on top, like 
> prediction.io .
> 
> --sebastian
> 
> 
> On 03/08/2014 12:16 AM, Saikat Kanjilal wrote:
>> I was also wondering if there'd be any interest in building a plugin to interface with elasticsearch and spring, so what I am thinking is an MVC type service that performs lucene like searches on recommendation algorithm data stored inside a low latency data store, I know/saw that  there was a discussion on a solr recommender on mahout and would be glad to help lead/build an elasticsearch version.
>> 
>>> From: ted.dunning@gmail.com
>>> Date: Fri, 7 Mar 2014 15:04:42 -0800
>>> Subject: Re: Mahout 1.0 goals
>>> To: dev@mahout.apache.org
>>> 
>>> There was not yet a meeting.
>>> 
>>> I owe the list a summary of what people said and some suggested
>>> roadmapping.  I will get to that on the weekend and we should be good for a
>>> hangout meeting sometime next week.
>>> 
>>> 
>>> 
>>> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sx...@hotmail.com>wrote:
>>> 
>>>> Hey Guys,Been trying to follow with the 1.0 goals , was there already a
>>>> meeting on what the initial plans are for development and notes from that,
>>>> I am particualrly interested in deep learning and service-izing mahout ,
>>>> let me know.
>>>> Thanks
>>>> 
>>>>> From: ted.dunning@gmail.com
>>>>> Date: Tue, 4 Mar 2014 19:32:40 -0800
>>>>> Subject: Re: Mahout 1.0 goals
>>>>> To: dev@mahout.apache.org; ssc@apache.org
>>>>> 
>>>>> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
>>>> wrote:
>>>>> 
>>>>>> - AFAIK its also a problem to ship it license-wise as the required
>>>>>> libraries would not be Apache licensed
>>>>>> 
>>>>>> See this discussion from the Spark community for details:
>>>>>> 
>>>>>> https://github.com/apache/incubator-spark/pull/575
>>>>>> 
>>>>> 
>>>>> This is a real issue and getting a lot of time over on legal as well.
>>>>> 
>>>>> A non-optional LGPL dependency doesn't fly at this time.
>>>> 
>>>> 
>> 
>

RE: Mahout 1.0 goals

Posted by Saikat Kanjilal <sx...@hotmail.com>.

OK, so the idea here is to form some strategic partnerships with other open source products and provide Mahout as one component of a web application, so the use cases for Mahout will be partly driven by the use cases for the web application itself. In a nutshell, a web application requires: 1) search 2) recommendations 3) a primary data store. The recommendations may be driven by the higher-level use cases, but the key piece here will be pushing Mahout into delivering real-time recommendations that someone can then perform searches over. One example might be to search for music recommendations, like what Spotify already does, and perform term filters, term queries or other Lucene-based searches to deliver results. Another might be to identify how recommendations fit into the REST endpoints or, in the case of service-izing Mahout, they can be REST endpoints. I've been thinking about this for a while, since lately I've seen a lot of discussions around Mahout being hard to use or pick up and learn. If there's enough interest I can go into more detail when we meet to discuss 1.0.

> Date: Sat, 8 Mar 2014 11:44:53 +0100
> From: ssc@apache.org
> To: dev@mahout.apache.org
> Subject: Re: Mahout 1.0 goals
> 
> Hm, can you elaborate more what you mean? IMHO Mahout is a library only, 
> so we should not build a complete MVC application inside this project, I 
> think this is something that people should build on top, like 
> prediction.io .
> 
> --sebastian
> 
> 
> On 03/08/2014 12:16 AM, Saikat Kanjilal wrote:
> > I was also wondering if there'd be any interest in building a plugin to interface with elasticsearch and spring, so what I am thinking is an MVC type service that performs lucene like searches on recommendation algorithm data stored inside a low latency data store, I know/saw that  there was a discussion on a solr recommender on mahout and would be glad to help lead/build an elasticsearch version.
> >
> >> From: ted.dunning@gmail.com
> >> Date: Fri, 7 Mar 2014 15:04:42 -0800
> >> Subject: Re: Mahout 1.0 goals
> >> To: dev@mahout.apache.org
> >>
> >> There was not yet a meeting.
> >>
> >> I owe the list a summary of what people said and some suggested
> >> roadmapping.  I will get to that on the weekend and we should be good for a
> >> hangout meeting sometime next week.
> >>
> >>
> >>
> >> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sx...@hotmail.com>wrote:
> >>
> >>> Hey Guys,Been trying to follow with the 1.0 goals , was there already a
> >>> meeting on what the initial plans are for development and notes from that,
> >>> I am particularly interested in deep learning and service-izing mahout,
> >>> let me know.
> >>> Thanks
> >>>
> >>>> From: ted.dunning@gmail.com
> >>>> Date: Tue, 4 Mar 2014 19:32:40 -0800
> >>>> Subject: Re: Mahout 1.0 goals
> >>>> To: dev@mahout.apache.org; ssc@apache.org
> >>>>
> >>>> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
> >>> wrote:
> >>>>
> >>>>> - AFAIK its also a problem to ship it license-wise as the required
> >>>>> libraries would not be Apache licensed
> >>>>>
> >>>>> See this discussion from the Spark community for details:
> >>>>>
> >>>>> https://github.com/apache/incubator-spark/pull/575
> >>>>>
> >>>>
> >>>> This is a real issue and getting a lot of time over on legal as well.
> >>>>
> >>>> A non-optional LGPL dependency doesn't fly at this time.
> >>>
> >>>
> >   		 	   		
> >
>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

On Saturday, March 8, 2014 5:41 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Ah, now back to freely babbling on the dev list.

Mahout wishlist:
1) scaling: I don’t get the need for R integration or running without hadoop or spark. You can run hadoop in local mode on your native file system even using a debugger--then run the exact same code on a cluster. If you don’t care about scaling there are plenty of great libs for R already, why worry about Mahout? One project I worked on started with the in-memory recommender but within months had hopelessly outgrown it. If there isn’t at least a path to scaling we would never have started with Mahout. Non-scalable code is fine and solves many applications but I hope it’s not the primary design point.
2) speed: see below. Hadoop now (speed means buying more computers), more Spark later (buy fewer computers).
3) ease of data input/output. The conversion of external ids into Mahout sequential integers is deceptively difficult and has to be re-created with every project. I’m trying to submit an example, which includes an input/output pipeline that is mostly scalable. It takes delimited logfiles with external ids, creates Mahout input, then takes the output of Mahout and converts back to external Ids. It is not worthy of core inclusion but is at least a prototype or example of how to do this.
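
The ID round trip described in item 3 can be sketched as follows. This is a minimal, single-machine illustration only, not the example pipeline mentioned above: the class name `IdDictionary`, its methods, and the two-column `user<delimiter>item` log layout are all assumptions made for the sketch. A scalable version would have to build and distribute the dictionaries across the cluster, which is the genuinely hard part.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/**
 * Hypothetical helper (not part of Mahout): a two-way mapping between
 * external string IDs and dense sequential int indexes.
 */
final class IdDictionary {
    private final Map<String, Integer> toIndex = new HashMap<>();
    private final List<String> toExternal = new ArrayList<>();

    /** Returns the index already assigned to this ID, or assigns the next free one. */
    int indexOf(String externalId) {
        Integer idx = toIndex.get(externalId);
        if (idx == null) {
            idx = toExternal.size();
            toIndex.put(externalId, idx);
            toExternal.add(externalId);
        }
        return idx;
    }

    /** Maps a sequential index back to the external ID it was created from. */
    String externalId(int index) {
        return toExternal.get(index);
    }

    /**
     * Converts one delimited log line of the assumed form "user<delimiter>item"
     * into a "userIndex,itemIndex" pair.
     */
    static String encodeLine(String line, String delimiter,
                             IdDictionary users, IdDictionary items) {
        String[] fields = line.split(Pattern.quote(delimiter));
        int user = users.indexOf(fields[0].trim());
        int item = items.indexOf(fields[1].trim());
        return user + "," + item;
    }
}
```

On the way in, every log line goes through `encodeLine`; on the way out, each index in the results is passed to `externalId` on the matching dictionary. Keeping separate dictionaries for users and items keeps both index spaces dense, at the cost of having to persist two mappings alongside the model.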

My $0.02 worth about the future of Mahout:
1) the future will be in moving lots of the current code to Spark and that may not be the end of it. If yet another faster platform emerges Mahout will have to go there too. If Mahout doesn’t move (pretty quickly) someone will fill the gap and Mahout will be left behind.
2) the future of Mahout is tied to big data, at least I hope so.

Ask yourself this: Is Mahout a sandbox for experimentation on cutting edge algorithms or is Mahout a scalable, performant ML library that is targeted for production environments?

>> Agree with the latter, and given that the future lies in moving existing implementations to Spark, all the more reason to make Mahout less of an experimental sandbox.

I hope most people think it is the latter.

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

Me too.

To answer the question:
>> Ask yourself this: Is Mahout a sandbox for experimentation on cutting edge
>> algorithms or is Mahout a scalable, performant ML library that is targeted
>> for production environments?


I think it is important to clean up a lot of wiring and user experience issues and make it production-ready, and have the sandbox too.

To make it more formal and try to prevent "sandbox creep" may mean putting new and experimental things into an internal incubator bucket wherever possible.

> On Mar 8, 2014, at 7:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> 
> very close to my position.
> 
> 
>> On Sat, Mar 8, 2014 at 2:40 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
>> 
>> Ah, now back to freely babbling on the dev list.
>> 
>> Mahout wishlist:
>> 1) scaling:  I don't get the need for R integration or running without
>> hadoop or spark. You can run hadoop in local mode on your native file
>> system even using a debugger--then run the exact same code on a cluster. If
>> you don't care about scaling there are plenty of great libs for R already,
>> why worry about Mahout? One project I worked on started with the in-memory
>> recommender but within months had hopelessly outgrown it. If there isn't at
>> least a path to scaling we would never have started with Mahout.
>> Non-scalable code is fine and solves many applications but I hope it's not
>> the primary design point.
>> 2) speed: read below, Hadoop now (speed means buying more computers) More
>> Spark later (buy less computers)
>> 3) ease of data input/output. The conversion of external ids into Mahout
>> sequential integers is deceptively difficult and has to be re-created with
>> every project. I'm trying to submit an example, which includes an
>> input/output pipeline that is mostly scalable. It takes delimited logfiles
>> with external ids, creates Mahout input, then takes the output of Mahout
>> and converts back to external Ids. It is not worthy of core inclusion but
>> is at least a prototype or example of how to do this.
>> 
>> My $0.02 worth about the future of Mahout:
>> 1) the future will be in moving lots of the current code to Spark and that
>> may not be the end of it. If yet another faster platform emerges Mahout
>> will have to go there too. If Mahout doesn't move (pretty quickly) someone
>> will fill the gap and Mahout will be left behind.
>> 2) the future of Mahout is tied to big data, at least I hope so.
>> 
>> Ask yourself this: Is Mahout a sandbox for experimentation on cutting edge
>> algorithms or is Mahout a scalable, performant ML library that is targeted
>> for production environments?
>> 
>> I hope most people think it is the latter.
>> 
>>

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

very close to my position.


On Sat, Mar 8, 2014 at 2:40 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

> Ah, now back to freely babbling on the dev list.
>
> Mahout wishlist:
> 1) scaling:  I don't get the need for R integration or running without
> hadoop or spark. You can run hadoop in local mode on your native file
> system even using a debugger--then run the exact same code on a cluster. If
> you don't care about scaling there are plenty of great libs for R already,
> why worry about Mahout? One project I worked on started with the in-memory
> recommender but within months had hopelessly outgrown it. If there isn't at
> least a path to scaling we would never have started with Mahout.
>  Non-scalable code is fine and solves many applications but I hope it's not
> the primary design point.
> 2) speed: read below, Hadoop now (speed means buying more computers) More
> Spark later (buy less computers)
> 3) ease of data input/output. The conversion of external ids into Mahout
> sequential integers is deceptively difficult and has to be re-created with
> every project. I'm trying to submit an example, which includes an
> input/output pipeline that is mostly scalable. It takes delimited logfiles
> with external ids, creates Mahout input, then takes the output of Mahout
> and converts back to external Ids. It is not worthy of core inclusion but
> is at least a prototype or example of how to do this.
>
> My $0.02 worth about the future of Mahout:
> 1) the future will be in moving lots of the current code to Spark and that
> may not be the end of it. If yet another faster platform emerges Mahout
> will have to go there too. If Mahout doesn't move (pretty quickly) someone
> will fill the gap and Mahout will be left behind.
> 2) the future of Mahout is tied to big data, at least I hope so.
>
> Ask yourself this: Is Mahout a sandbox for experimentation on cutting edge
> algorithms or is Mahout a scalable, performant ML library that is targeted
> for production environments?
>
> I hope most people think it is the latter.
>
>

Re: Mahout 1.0 goals

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Ah, now back to freely babbling on the dev list.

Ask yourself this: Is Mahout a sandbox for experimentation on cutting edge algorithms or is Mahout a scalable, performant ML library that is targeted for production environments?

I hope most people think it is the latter.

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@apache.org>.

Hm, can you elaborate more on what you mean? IMHO Mahout is a library only, 
so we should not build a complete MVC application inside this project. I 
think this is something that people should build on top, like 
prediction.io.

--sebastian


On 03/08/2014 12:16 AM, Saikat Kanjilal wrote:
> I was also wondering if there'd be any interest in building a plugin to interface with Elasticsearch and Spring. What I am thinking of is an MVC-type service that performs Lucene-like searches on recommendation algorithm data stored in a low-latency data store. I saw that there was a discussion about a Solr recommender on Mahout and would be glad to help lead/build an Elasticsearch version.
>
>> From: ted.dunning@gmail.com
>> Date: Fri, 7 Mar 2014 15:04:42 -0800
>> Subject: Re: Mahout 1.0 goals
>> To: dev@mahout.apache.org
>>
>> There was not yet a meeting.
>>
>> I owe the list a summary of what people said and some suggested
>> roadmapping.  I will get to that on the weekend and we should be good for a
>> hangout meeting sometime next week.
>>
>>
>>
>> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sx...@hotmail.com>wrote:
>>
>>> Hey Guys,Been trying to follow with the 1.0 goals , was there already a
>>> meeting on what the initial plans are for development and notes from that,
>>> I am particularly interested in deep learning and service-izing mahout,
>>> let me know.
>>> Thanks
>>>
>>>> From: ted.dunning@gmail.com
>>>> Date: Tue, 4 Mar 2014 19:32:40 -0800
>>>> Subject: Re: Mahout 1.0 goals
>>>> To: dev@mahout.apache.org; ssc@apache.org
>>>>
>>>> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
>>> wrote:
>>>>
>>>>> - AFAIK its also a problem to ship it license-wise as the required
>>>>> libraries would not be Apache licensed
>>>>>
>>>>> See this discussion from the Spark community for details:
>>>>>
>>>>> https://github.com/apache/incubator-spark/pull/575
>>>>>
>>>>
>>>> This is a real issue and getting a lot of time over on legal as well.
>>>>
>>>> A non-optional LGPL dependency doesn't fly at this time.
>>>
>>>
>   		 	   		
>

RE: Mahout 1.0 goals

Posted by Saikat Kanjilal <sx...@hotmail.com>.

I was also wondering if there'd be any interest in building a plugin to interface with Elasticsearch and Spring. What I am thinking of is an MVC-type service that performs Lucene-like searches on recommendation algorithm data stored in a low-latency data store. I saw that there was a discussion about a Solr recommender on Mahout and would be glad to help lead/build an Elasticsearch version.

> From: ted.dunning@gmail.com
> Date: Fri, 7 Mar 2014 15:04:42 -0800
> Subject: Re: Mahout 1.0 goals
> To: dev@mahout.apache.org
> 
> There was not yet a meeting.
> 
> I owe the list a summary of what people said and some suggested
> roadmapping.  I will get to that on the weekend and we should be good for a
> hangout meeting sometime next week.
> 
> 
> 
> On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sx...@hotmail.com>wrote:
> 
> > Hey Guys,Been trying to follow with the 1.0 goals , was there already a
> > meeting on what the initial plans are for development and notes from that,
> > I am particularly interested in deep learning and service-izing mahout,
> > let me know.
> > Thanks
> >
> > > From: ted.dunning@gmail.com
> > > Date: Tue, 4 Mar 2014 19:32:40 -0800
> > > Subject: Re: Mahout 1.0 goals
> > > To: dev@mahout.apache.org; ssc@apache.org
> > >
> > > On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
> > wrote:
> > >
> > > > - AFAIK its also a problem to ship it license-wise as the required
> > > > libraries would not be Apache licensed
> > > >
> > > > See this discussion from the Spark community for details:
> > > >
> > > > https://github.com/apache/incubator-spark/pull/575
> > > >
> > >
> > > This is a real issue and getting a lot of time over on legal as well.
> > >
> > > A non-optional LGPL dependency doesn't fly at this time.
> >
> >

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

There was not yet a meeting.

I owe the list a summary of what people said and some suggested
roadmapping.  I will get to that on the weekend and we should be good for a
hangout meeting sometime next week.



On Fri, Mar 7, 2014 at 10:35 AM, Saikat Kanjilal <sx...@hotmail.com>wrote:

> Hey Guys,Been trying to follow with the 1.0 goals , was there already a
> meeting on what the initial plans are for development and notes from that,
> I am particularly interested in deep learning and service-izing mahout,
> let me know.
> Thanks
>
> > From: ted.dunning@gmail.com
> > Date: Tue, 4 Mar 2014 19:32:40 -0800
> > Subject: Re: Mahout 1.0 goals
> > To: dev@mahout.apache.org; ssc@apache.org
> >
> > On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
> >
> > > - AFAIK its also a problem to ship it license-wise as the required
> > > libraries would not be Apache licensed
> > >
> > > See this discussion from the Spark community for details:
> > >
> > > https://github.com/apache/incubator-spark/pull/575
> > >
> >
> > This is a real issue and getting a lot of time over on legal as well.
> >
> > A non-optional LGPL dependency doesn't fly at this time.
>
>

RE: Mahout 1.0 goals

Posted by Saikat Kanjilal <sx...@hotmail.com>.

Hey guys, I've been trying to follow the 1.0 goals. Was there already a meeting on the initial development plans, and are there notes from it? I am particularly interested in deep learning and service-izing Mahout; let me know.
Thanks

> From: ted.dunning@gmail.com
> Date: Tue, 4 Mar 2014 19:32:40 -0800
> Subject: Re: Mahout 1.0 goals
> To: dev@mahout.apache.org; ssc@apache.org
> 
> On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org> wrote:
> 
> > - AFAIK its also a problem to ship it license-wise as the required
> > libraries would not be Apache licensed
> >
> > See this discussion from the Spark community for details:
> >
> > https://github.com/apache/incubator-spark/pull/575
> >
> 
> This is a real issue and getting a lot of time over on legal as well.
> 
> A non-optional LGPL dependency doesn't fly at this time.

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

On Tue, Mar 4, 2014 at 2:24 PM, Sebastian Schelter <ss...@apache.org> wrote:

> - AFAIK its also a problem to ship it license-wise as the required
> libraries would not be Apache licensed
>
> See this discussion from the Spark community for details:
>
> https://github.com/apache/incubator-spark/pull/575
>

This is a real issue and getting a lot of time over on legal as well.

A non-optional LGPL dependency doesn't fly at this time.

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@apache.org>.

JBlas gave roughly a 5x-7x speedup for solving the dense linear 
systems in ALS when I integrated it into a prototype of Mahout's ALS for 
a research paper.

There are some caveats with it unfortunately:

- it requires certain Fortran libs to be installed on the machines of 
the cluster

- its jar is really huge, so it would blow up the size of "uber-jars" 
built from Mahout

- AFAIK it's also a problem to ship it license-wise, as the required 
libraries would not be Apache licensed

See this discussion from the Spark community for details:

https://github.com/apache/incubator-spark/pull/575
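
For readers unfamiliar with the "dense linear systems in ALS" mentioned above: in the textbook ridge form of ALS, each user (or item) factor x comes from a small k x k symmetric positive definite system, (Y^T Y + lambda I) x = Y^T r, where the rows of Y are the factor vectors of the rated items and r holds the ratings. The pure-Java sketch below solves one such system with a Cholesky factorization. It is only an illustration of the workload a native BLAS/LAPACK binding would speed up. It is neither Mahout's nor JBlas's code, the names `AlsSolveSketch` and `solveUserFactor` are made up for this sketch, and Mahout's actual regularization may differ (for example, lambda scaled by the number of ratings).

```java
/** Hypothetical sketch of the per-user dense solve in ridge-regularized ALS. */
final class AlsSolveSketch {

    /**
     * Solves (Y^T Y + lambda * I) x = Y^T r for x.
     * y: n x k matrix whose rows are the factors of the n items one user rated.
     * r: the n ratings. lambda: ridge penalty, assumed > 0.
     */
    static double[] solveUserFactor(double[][] y, double[] r, double lambda) {
        int n = y.length;
        int k = y[0].length;
        double[][] a = new double[k][k]; // lower triangle of Y^T Y + lambda * I
        double[] b = new double[k];      // Y^T r

        for (int row = 0; row < n; row++) {
            for (int i = 0; i < k; i++) {
                b[i] += y[row][i] * r[row];
                for (int j = 0; j <= i; j++) {
                    a[i][j] += y[row][i] * y[row][j];
                }
            }
        }
        for (int i = 0; i < k; i++) {
            a[i][i] += lambda;
        }

        // In-place Cholesky: the lower triangle of a becomes L with A = L * L^T.
        for (int i = 0; i < k; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = a[i][j];
                for (int p = 0; p < j; p++) {
                    sum -= a[i][p] * a[j][p];
                }
                if (i == j) {
                    if (sum <= 0.0) {
                        throw new ArithmeticException("matrix is not positive definite");
                    }
                    a[i][i] = Math.sqrt(sum);
                } else {
                    a[i][j] = sum / a[j][j];
                }
            }
        }

        // Forward substitution: L z = b (z is stored in x).
        double[] x = new double[k];
        for (int i = 0; i < k; i++) {
            double sum = b[i];
            for (int p = 0; p < i; p++) {
                sum -= a[i][p] * x[p];
            }
            x[i] = sum / a[i][i];
        }
        // Back substitution: L^T x = z, overwriting z in place.
        for (int i = k - 1; i >= 0; i--) {
            double sum = x[i];
            for (int p = i + 1; p < k; p++) {
                sum -= a[p][i] * x[p];
            }
            x[i] = sum / a[i][i];
        }
        return x;
    }
}
```

One such solve costs on the order of k^3 operations and is repeated for every user and every item in each ALS iteration, so the total time is dominated by many small dense solves. That is the kind of kernel where an optimized native library can plausibly beat a straightforward JVM loop.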


Best,
Sebastian

On 03/04/2014 11:17 PM, Suneel Marthi wrote:
> There's JBlas, which is used by Spark, Deeplearning.org and other ML projects.  IIRC, there was some prototyping done in the past using JBlas for Mahout; Sebastian or Sean can better speak to that.  It definitely has better performance than Mahout-Math.
>
> Managing the native Fortran dependencies could be challenging with JBlas, not to mention that JBlas may not support sparse matrices (someone correct me here).
>
>
>
>
>
>
> On Tuesday, March 4, 2014 4:57 PM, Giorgio Zoppi <gi...@gmail.com> wrote:
>
> I would like to find some way of speed up matrix library, ie JNI+C++.
>
>
> 2014-03-04 22:53 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:
>
>> Yes, I like to work on standardizing the code around input formats.
>>
>>
>> On Mon, Mar 3, 2014 at 7:37 PM, Suneel Marthi <suneel_marthi@yahoo.com
>>> wrote:
>>
>>> To get things moving for 1.0:
>>>
>>>
>>> a) Address the 4 issues that Sean had raised - we have already started
>>> looking at Backlog and closing them, started looking at converting old
>>> MapReduce to newer MapReduce API.
>>>
>>>      If someone could start looking at standardizing the input/output
>>> formats across classifiers, clustering and recommenders that would be
>>> great.  Guess Frank S. has already started work in that direction.
>>>
>>> b)  Need a better and cleaner serialized form of Vectors to handle names
>>> and other kind'a stuff, this is gonna impact everything that's presently
>>> implemented.
>>>
>>> c)  Agree with ssc, to start looking at Spark-Mahout integration.
>>>
>>>
>>> d) Need volunteers to QA/address issues with the present
>>> classifiers/clustering algorithms. I personally can vouch for how
>>> disastrous it is to deploy any of Mahout's classifiers/clustering
>>> implementations in an Operations environment. A good example of that is
>>> Sean's recent patch for RDF.
>>>
>>> Naive Bayes code as it is now seems half-baked and is incomplete. Not
>>> every code path has been tested on Streaming KMeans.
>>>
>>> This should go some way in addressing the technical debt that's been
>> piled
>>> over the years.
>>>
>>>
>>>
>>>
>>>
>>> On Monday, March 3, 2014 1:05 PM, Sebastian Schelter <ss...@apache.org>
>>> wrote:
>>>
>>> I would like to discuss whether we should start to have some
>>> Spark-related code in Mahout.
>>>
>>> --sebastian
>>>
>>>
>>> On 03/03/2014 06:56 PM, Suneel Marthi wrote:
>>>> Grant had setup a Google Hangout for Mahout sometime last year before
>>> 0.8 release.  I had one setup too for 0.9 release. I definitely wouldn't
>>> want to have a hangout on Saturday or weekend.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>>
>>>> Happy to organize a google hangout.  That has the advantage of allowing
>>> more attendees and supporting YouTube archiving.
>>>>
>>>> Sent from my iPhone
>>>>
>>>>
>>>>> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
>> wrote:
>>>>>
>>>>> Hello All,
>>>>> Dr.Dunning could you set a meeting next Sat morning, so we can chat
>> and
>>>>> discuss by skype improvements and what to do and identify volunteer
>> and
>>>>> tasks.
>>>>> Best Regards,
>>>>> Giorgio
>>>>>
>>>>>
>>>>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
>>>>>
>>>>>> Me three
>>>>>>
>>>>>>
>>>>>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>>>>>>
>>>>>>> Ravi,
>>>>>>>
>>>>>>> Good points.
>>>>>>>
>>>>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
>>> ravi.mummulla@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> - Natively support Windows (guidance, etc. No documentation exists
>>> today,
>>>>>>>> for instance)
>>>>>>> There is a bit of demand for that.
>>>>>>>
>>>>>>> - Faster time to first application (from discovery to first
>>> application
>>>>>>>
>>>>>>>> currently takes a non-trivial amount of effort; how can we lower
>> the
>>> bar
>>>>>>>> and reduce the friction for adoption?)
>>>>>>> There is huge evidence that this is important.
>>>>>>>
>>>>>>>
>>>>>>>       - Better documenting use cases with working samples/examples
>>>>>>>> (Documentation
>>>>>>>> on https://mahout.apache.org/users/basics/algorithms.html is
>> spread
>>> out
>>>>>>>> and
>>>>>>>> there is too much focus on algorithms as opposed to use cases -
>> this
>>> is
>>>>>>>> an
>>>>>>>> adoption
>>>    blocker)
>>>>>>> This is also important.
>>>>>>>
>>>>>>>
>>>>>>> - Uniformity of the API set across all algorithms (are we providing
>>> the
>>>>>>>> same experience across all APIs?)
>>>>>>> And many people have been tripped up by this.
>>>>>>>
>>>>>>>
>>>>>>>       - Measuring/publishing scalability metrics of various algorithms
>>> (why
>>>>>>>> would
>>>>>>>> we want users to adopt Mahout vs. other frameworks for ML at
>> scale?)
>>>>>>> I don't see this as important as some of your other points, but is
>>> still
>>>>>>> useful.
>>>>>
>>>>>
>>>>> --
>>>>> Quiero ser el rayo de sol que cada día te despierta
>>>>> para hacerte respirar y vivir en me.
>>>>> "Favola -Moda".
>
>>>
>>
>
>
>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

There's JBlas, which is used by Spark, Deeplearning.org and other ML projects.  IIRC, there was some prototyping done in the past using JBlas for Mahout; Sebastian or Sean can better speak to that.  It definitely has better performance than Mahout-Math.

Managing the native Fortran dependencies could be challenging with JBlas, not to mention that JBlas may not support sparse matrices (someone correct me here).






On Tuesday, March 4, 2014 4:57 PM, Giorgio Zoppi <gi...@gmail.com> wrote:
 
I would like to find some way to speed up the matrix library, e.g. via JNI + C++.


2014-03-04 22:53 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:

> Yes, I like to work on standardizing the code around input formats.
>
>
> On Mon, Mar 3, 2014 at 7:37 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
>
> > To get things moving for 1.0:
> >
> >
> > a) Address the 4 issues that Sean had raised - we have already started
> > looking at Backlog and closing them, started looking at converting old
> > MapReduce to newer MapReduce API.
> >
> >    If someone could start looking at standardizing the input/output
> > formats across classifiers, clustering and recommenders that would be
> > great.  Guess Frank S. has already started work in that direction.
> >
> > b)  Need a better and cleaner serialized form of Vectors to handle names
> > and other kind'a stuff, this is gonna impact everything that's presently
> > implemented.
> >
> > c)  Agree with ssc, to start looking at Spark-Mahout integration.
> >
> >
> > d) Need volunteers to QA/address issues with the present
> > classifiers/clustering algorithms. I personally can vouch for how
> > disastrous it is to deploy any of Mahout's classifiers/clustering
> > implementations in an Operations environment. A good example of that is
> > Sean's recent patch for RDF.
> >
> > Naive Bayes code as it is now seems half-baked and is incomplete. Not
> > every code path has been tested on Streaming KMeans.
> >
> > This should go some way in addressing the technical debt that's been
> piled
> > over the years.
> >
> >
> >
> >
> >
> > On Monday, March 3, 2014 1:05 PM, Sebastian Schelter <ss...@apache.org>
> > wrote:
> >
> > I would like to discuss whether we should start to have some
> > Spark-related code in Mahout.
> >
> > --sebastian
> >
> >
> > On 03/03/2014 06:56 PM, Suneel Marthi wrote:
> > > Grant had setup a Google Hangout for Mahout sometime last year before
> > 0.8 release.  I had one setup too for 0.9 release. I definitely wouldn't
> > want to have a hangout on Saturday or weekend.
> > >
> > >
> > >
> > >
> > >
> > > On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > > Happy to organize a google hangout.  That has the advantage of allowing
> > more attendees and supporting YouTube archiving.
> > >
> > > Sent from my iPhone
> > >
> > >
> > >> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
> wrote:
> > >>
> > >> Hello All,
> > >> Dr.Dunning could you set a meeting next Sat morning, so we can chat
> and
> > >> discuss by skype improvements and what to do and identify volunteer
> and
> > >> tasks.
> > >> Best Regards,
> > >> Giorgio
> > >>
> > >>
> > >> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> > >>
> > >>> Me three
> > >>>
> > >>>
> > >>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> > >>>>
> > >>>> Ravi,
> > >>>>
> > >>>> Good points.
> > >>>>
> > >>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
> > ravi.mummulla@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>> - Natively support Windows (guidance, etc. No documentation exists
> > today,
> > >>>>> for instance)
> > >>>> There is a bit of demand for that.
> > >>>>
> > >>>> - Faster time to first application (from discovery to first
> > application
> > >>>>
> > >>>>> currently takes a non-trivial amount of effort; how can we lower
> the
> > bar
> > >>>>> and reduce the friction for adoption?)
> > >>>> There is huge evidence that this is important.
> > >>>>
> > >>>>
> > >>>>     - Better documenting use cases with working samples/examples
> > >>>>> (Documentation
> > >>>>> on https://mahout.apache.org/users/basics/algorithms.html is
> spread
> > out
> > >>>>> and
> > >>>>> there is too much focus on algorithms as opposed to use cases -
> this
> > is
> > >>>>> an
> > >>>>> adoption
> >  blocker)
> > >>>> This is also important.
> > >>>>
> > >>>>
> > >>>> - Uniformity of the API set across all algorithms (are we providing
> > the
> > >>>>> same experience across all APIs?)
> > >>>> And many people have been tripped up by this.
> > >>>>
> > >>>>
> > >>>>     - Measuring/publishing scalability metrics of various algorithms
> > (why
> > >>>>> would
> > >>>>> we want users to adopt Mahout vs. other frameworks for ML at
> scale?)
> > >>>> I don't see this as important as some of your other points, but is
> > still
> > >>>> useful.
> > >>
> > >>
> > >> --
> > >> Quiero ser el rayo de sol que cada día te despierta
> > >> para hacerte respirar y vivir en me.
> > >> "Favola -Moda".

> >
>



-- 
Quiero ser el rayo de sol que cada día te despierta
para hacerte respirar y vivir en me.
"Favola -Moda".

Re: Mahout 1.0 goals

Posted by Giorgio Zoppi <gi...@gmail.com>.

I would like to find some way to speed up the matrix library, e.g. via JNI + C++.


2014-03-04 22:53 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:

> Yes, I like to work on standardizing the code around input formats.
>
>
> On Mon, Mar 3, 2014 at 7:37 PM, Suneel Marthi <suneel_marthi@yahoo.com
> >wrote:
>
> > To get things moving for 1.0:
> >
> >
> > a) Address the 4 issues that Sean had raised - we have already started
> > looking at Backlog and closing them, started looking at converting old
> > MapReduce to newer MapReduce API.
> >
> >    If someone could start looking at standardizing the input/output
> > formats across classifiers, clustering and recommenders that would be
> > great.  Guess Frank S. has already started work in that direction.
> >
> > b)  Need a better and cleaner serialized form of Vectors to handle names
> > and other kind'a stuff, this is gonna impact everything that's presently
> > implemented.
> >
> > c)  Agree with ssc, to start looking at Spark-Mahout integration.
> >
> >
> > d) Need volunteers to QA/address issues with the present
> > classifiers/clustering algorithms. I personally can vouch for how
> > disastrous it is to deploy any of Mahout's classifiers/clustering
> > implementations in an Operations environment. A good example of that is
> > Sean's recent patch for RDF.
> >
> > Naive Bayes code as it is now seems half-baked and is incomplete. Not
> > every code path has been tested on Streaming KMeans.
> >
> > This should go some way in addressing the technical debt that's been
> piled
> > over the years.
> >
> >
> >
> >
> >
> > On Monday, March 3, 2014 1:05 PM, Sebastian Schelter <ss...@apache.org>
> > wrote:
> >
> > I would like to discuss whether we should start to have some
> > Spark-related code in Mahout.
> >
> > --sebastian
> >
> >
> > On 03/03/2014 06:56 PM, Suneel Marthi wrote:
> > > Grant had setup a Google Hangout for Mahout sometime last year before
> > 0.8 release.  I had one setup too for 0.9 release. I definitely wouldn't
> > want to have a hangout on Saturday or weekend.
> > >
> > >
> > >
> > >
> > >
> > > On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > >
> > > Happy to organize a google hangout.  That has the advantage of allowing
> > more attendees and supporting YouTube archiving.
> > >
> > > Sent from my iPhone
> > >
> > >
> > >> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com>
> wrote:
> > >>
> > >> Hello All,
> > >> Dr.Dunning could you set a meeting next Sat morning, so we can chat
> and
> > >> discuss by skype improvements and what to do and identify volunteer
> and
> > >> tasks.
> > >> Best Regards,
> > >> Giorgio
> > >>
> > >>
> > >> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> > >>
> > >>> Me three
> > >>>
> > >>>
> > >>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> > >>>>
> > >>>> Ravi,
> > >>>>
> > >>>> Good points.
> > >>>>
> > >>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <
> > ravi.mummulla@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>> - Natively support Windows (guidance, etc. No documentation exists
> > today,
> > >>>>> for instance)
> > >>>> There is a bit of demand for that.
> > >>>>
> > >>>> - Faster time to first application (from discovery to first
> > application
> > >>>>
> > >>>>> currently takes a non-trivial amount of effort; how can we lower
> the
> > bar
> > >>>>> and reduce the friction for adoption?)
> > >>>> There is huge evidence that this is important.
> > >>>>
> > >>>>
> > >>>>     - Better documenting use cases with working samples/examples
> > >>>>> (Documentation
> > >>>>> on https://mahout.apache.org/users/basics/algorithms.html is
> spread
> > out
> > >>>>> and
> > >>>>> there is too much focus on algorithms as opposed to use cases -
> this
> > is
> > >>>>> an
> > >>>>> adoption
> >  blocker)
> > >>>> This is also important.
> > >>>>
> > >>>>
> > >>>> - Uniformity of the API set across all algorithms (are we providing
> > the
> > >>>>> same experience across all APIs?)
> > >>>> And many people have been tripped up by this.
> > >>>>
> > >>>>
> > >>>>     - Measuring/publishing scalability metrics of various algorithms
> > (why
> > >>>>> would
> > >>>>> we want users to adopt Mahout vs. other frameworks for ML at
> scale?)
> > >>>> I don't see this as important as some of your other points, but is
> > still
> > >>>> useful.
> > >>
> > >>
> > >> --
> > >> Quiero ser el rayo de sol que cada día te despierta
> > >> para hacerte respirar y vivir en me.
> > >> "Favola -Moda".
> >
>



-- 
Quiero ser el rayo de sol que cada día te despierta
para hacerte respirar y vivir en me.
"Favola -Moda".

Re: Mahout 1.0 goals

Posted by Frank Scholten <fr...@frankscholten.nl>.

Yes, I'd like to work on standardizing the code around input formats.


On Mon, Mar 3, 2014 at 7:37 PM, Suneel Marthi <su...@yahoo.com>wrote:

> To get things moving for 1.0:
>
>
> a) Address the 4 issues that Sean had raised - we have already started
> looking at Backlog and closing them, started looking at converting old
> MapReduce to newer MapReduce API.
>
>    If someone could start looking at standardizing the input/output
> formats across classifiers, clustering and recommenders that would be
> great.  Guess Frank S. has already started work in that direction.
>
> b)  Need a better and cleaner serialized form of Vectors to handle names
> and other kind'a stuff, this is gonna impact everything that's presently
> implemented.
>
> c)  Agree with ssc, to start looking at Spark-Mahout integration.
>
>
> d) Need volunteers to QA/address issues with the present
> classifiers/clustering algorithms. I personally can vouch for how
> disastrous it is to deploy any of Mahout's classifiers/clustering
> implementations in an Operations environment. A good example of that is
> Sean's recent patch for RDF.
>
> Naive Bayes code as it is now seems half-baked and is incomplete. Not
> every code path has been tested on Streaming KMeans.
>
> This should go some way in addressing the technical debt that's been piled
> over the years.
>
>
>
>
>
> On Monday, March 3, 2014 1:05 PM, Sebastian Schelter <ss...@apache.org>
> wrote:
>
> I would like to discuss whether we should start to have some
> Spark-related code in Mahout.
>
> --sebastian
>
>
> On 03/03/2014 06:56 PM, Suneel Marthi wrote:
> > Grant had setup a Google Hangout for Mahout sometime last year before
> > 0.8 release.  I had one setup too for 0.9 release. I definitely wouldn't
> > want to have a hangout on Saturday or weekend.
> >
> >
> >
> >
> >
> > On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > Happy to organize a google hangout.  That has the advantage of allowing
> > more attendees and supporting YouTube archiving.
> >
> > Sent from my iPhone
> >
> >
> >> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
> >>
> >> Hello All,
> >> Dr.Dunning could you set a meeting next Sat morning, so we can chat and
> >> discuss by skype improvements and what to do and indentify volunteer and
> >> tasks.
> >> Best Regards,
> >> Giorgio
> >>
> >>
> >> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> >>
> >>> Me three
> >>>
> >>>
> >>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> >>>>
> >>>> Ravi,
> >>>>
> >>>> Good points.
> >>>>
> >>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla
> >>>> <ravi.mummulla@gmail.com> wrote:
> >>>>
> >>>>> - Natively support Windows (guidance, etc. No documentation exists
> >>>>> today, for instance)
> >>>> There is a bit of demand for that.
> >>>>
> >>>>> - Faster time to first application (from discovery to first
> >>>>> application currently takes a non-trivial amount of effort; how can
> >>>>> we lower the bar and reduce the friction for adoption?)
> >>>> There is huge evidence that this is important.
> >>>>
> >>>>
> >>>>> - Better documenting use cases with working samples/examples
> >>>>> (Documentation on
> >>>>> https://mahout.apache.org/users/basics/algorithms.html is spread
> >>>>> out and there is too much focus on algorithms as opposed to use
> >>>>> cases - this is an adoption blocker)
> >>>> This is also important.
> >>>>
> >>>>
> >>>>> - Uniformity of the API set across all algorithms (are we providing
> >>>>> the same experience across all APIs?)
> >>>> And many people have been tripped up by this.
> >>>>
> >>>>
> >>>>> - Measuring/publishing scalability metrics of various algorithms
> >>>>> (why would we want users to adopt Mahout vs. other frameworks for
> >>>>> ML at scale?)
> >>>> I don't see this as important as some of your other points, but is
> >>>> still useful.
> >>
> >>
> >> --
> >> Quiero ser el rayo de sol que cada día te despierta
> >> para hacerte respirar y vivir en me.
> >> "Favola -Moda".
>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

To get things moving for 1.0:

a) Address the 4 issues that Sean raised. We have already started going through the backlog and closing issues, and have started converting code from the old MapReduce API to the newer one.

   If someone could start looking at standardizing the input/output formats across classifiers, clustering and recommenders that would be great.  Guess Frank S. has already started work in that direction.

b)  We need a better, cleaner serialized form of Vectors that can handle names and the like. This is going to impact everything that's presently implemented.

c)  Agree with ssc, to start looking at Spark-Mahout integration. 

d) Need volunteers to QA/address issues with the present classifiers/clustering algorithms. I personally can vouch for how disastrous it is to deploy any of Mahout's classifiers/clustering implementations in an Operations environment. A good example of that is Sean's recent patch for RDF.

Naive Bayes code as it is now seems half-baked and is incomplete. Not every code path has been tested on Streaming KMeans.

This should go some way toward addressing the technical debt that's piled up over the years.

On Monday, March 3, 2014 1:05 PM, Sebastian Schelter <ss...@apache.org> wrote:

I would like to discuss whether we should start to have some 
Spark-related code in Mahout.

--sebastian

On 03/03/2014 06:56 PM, Suneel Marthi wrote:
> Grant had setup a Google Hangout for Mahout sometime last year before 0.8 release.  I had one setup too for 0.9 release. I definitely wouldn't want to have a hangout on Saturday or weekend.
>
>
>
>
>
> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Happy to organize a google hangout.  That has the advantage of allowing more attendees and supporting YouTube archiving.
>
> Sent from my iPhone
>
>
>> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
>>
>> Hello All,
>> Dr.Dunning could you set a meeting next Sat morning, so we can chat and
>> discuss by skype improvements and what to do and indentify volunteer and
>> tasks.
>> Best Regards,
>> Giorgio
>>
>>
>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
>>
>>> Me three
>>>
>>>
>>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>>>
>>>> Ravi,
>>>>
>>>> Good points.
>>>>
>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com>
>>>> wrote:
>>>>
>>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>>>> for instance)
>>>> There is a bit of demand for that.
>>>>
>>>> - Faster time to first application (from discovery to first application
>>>>
>>>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>>>> and reduce the friction for adoption?)
>>>> There is huge evidence that this is important.
>>>>
>>>>
>>>>     - Better documenting use cases with working samples/examples
>>>>> (Documentation
>>>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>>>> and
>>>>> there is too much focus on algorithms as opposed to use cases - this is
>>>>> an
>>>>> adoption blocker)
>>>> This is also important.
>>>>
>>>>
>>>> - Uniformity of the API set across all algorithms (are we providing the
>>>>> same experience across all APIs?)
>>>> And many people have been tripped up by this.
>>>>
>>>>
>>>>     - Measuring/publishing scalability metrics of various algorithms (why
>>>>> would
>>>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>>> I don't see this as important as some of your other points, but is still
>>>> useful.
>>
>>
>> --
>> Quiero ser el rayo de sol que cada día te despierta
>> para hacerte respirar y vivir en me.
>> "Favola -Moda".

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

I will probably also put out a distributed QR (just for completeness), as
currently solved for MR SSVD. But we know that the actual SSVD can avoid
this, and it will in the new version, just like in the in-core version.

There are still gaps in the optimizer (i.e. the optimizer has holes for
some algorithms, and when it chooses one of them at action time, an
UnsupportedOperationException is thrown, even though the expression
formally compiles). E.g. there's a gap for big, graph or no-graph A'A
algorithms. I also have not investigated GraphX-backed implementations yet;
I was just trying to get to a minimum viable product. But it is enough now
to script out distributed SSVD and weighted ALS.
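
A minimal sketch of how such a gap can arise, in plain Java with made-up
names (this is not the Mahout DSL or its optimizer, and the column cutoff is
an assumption for illustration only). Building the logical expression for
A'A always succeeds; the failure only shows up when a physical operator has
to be picked at action time.

```java
// Hypothetical names throughout; this is not Mahout's DSL or optimizer.
class LazyPlanSketch {

    /** Logical operators: building them never touches any data. */
    interface Expr {}

    static final class Leaf implements Expr {
        final String name;
        final long cols;
        Leaf(String name, long cols) { this.name = name; this.cols = cols; }
    }

    static final class Transpose implements Expr {
        final Expr child;
        Transpose(Expr child) { this.child = child; }
    }

    static final class Times implements Expr {
        final Expr left, right;
        Times(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    /** Assumed cutoff above which the "skinny" A'A operator no longer applies. */
    static final long MAX_SKINNY_COLS = 1_000;

    /** Physical planning; runs only when an action asks for a result. */
    static String plan(Expr e) {
        if (e instanceof Leaf) {
            return "scan(" + ((Leaf) e).name + ")";
        }
        if (e instanceof Times) {
            Times t = (Times) e;
            // Recognize the A'A pattern: transpose of a leaf times that same leaf.
            if (t.left instanceof Transpose
                    && ((Transpose) t.left).child == t.right
                    && t.right instanceof Leaf) {
                Leaf a = (Leaf) t.right;
                if (a.cols <= MAX_SKINNY_COLS) {
                    return "skinnyAtA(" + a.name + ")";
                }
                // The gap: a well-formed expression with no operator to run it.
                throw new UnsupportedOperationException(
                        "no physical operator for A'A with " + a.cols + " columns");
            }
        }
        throw new UnsupportedOperationException("no rewrite rule for this expression");
    }

    public static void main(String[] args) {
        Leaf small = new Leaf("A", 50);
        Leaf big = new Leaf("B", 500_000);

        // Both logical plans build without complaint.
        Expr smallAtA = new Times(new Transpose(small), small);
        Expr bigAtA = new Times(new Transpose(big), big);

        System.out.println(plan(smallAtA));
        try {
            plan(bigAtA);
        } catch (UnsupportedOperationException ex) {
            System.out.println("failed at action time: " + ex.getMessage());
        }
    }
}
```

The point of the sketch is only the timing: constructing bigAtA raises no
error, and the exception appears when plan() is finally called.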


On Tue, Mar 4, 2014 at 9:59 AM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Yes. I am pretty close to do fairly big commits in linalg department
> there. (distributed dsl expression optimizer and scripted-out SSVD).
>
> We also possibly may want to think about scala script engine to run 3rd
> party mahout-math scripts or interactive sessions.
>
> -d
>
>
> On Mon, Mar 3, 2014 at 10:02 AM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> I would like to discuss whether we should start to have some
>> Spark-related code in Mahout.
>>
>> --sebastian
>>
>>
>> On 03/03/2014 06:56 PM, Suneel Marthi wrote:
>>
>>> Grant had setup a Google Hangout for Mahout sometime last year before
>>> 0.8 release.  I had one setup too for 0.9 release. I definitely wouldn't
>>> want to have a hangout on Saturday or weekend.
>>>
>>>
>>>
>>>
>>>
>>> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>
>>> Happy to organize a google hangout.  That has the advantage of allowing
>>> more attendees and supporting YouTube archiving.
>>>
>>> Sent from my iPhone
>>>
>>>
>>>  On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
>>>>
>>>> Hello All,
>>>> Dr.Dunning could you set a meeting next Sat morning, so we can chat and
>>>> discuss by skype improvements and what to do and indentify volunteer and
>>>> tasks.
>>>> Best Regards,
>>>> Giorgio
>>>>
>>>>
>>>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
>>>>
>>>>  Me three
>>>>>
>>>>>
>>>>>  On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>>>>>
>>>>>> Ravi,
>>>>>>
>>>>>> Good points.
>>>>>>
>>>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla
>>>>>> <ravi.mummulla@gmail.com> wrote:
>>>>>>
>>>>>>> - Natively support Windows (guidance, etc. No documentation exists
>>>>>>> today, for instance)
>>>>>>>
>>>>>> There is a bit of demand for that.
>>>>>>
>>>>>>> - Faster time to first application (from discovery to first
>>>>>>> application currently takes a non-trivial amount of effort; how can
>>>>>>> we lower the bar and reduce the friction for adoption?)
>>>>>>>
>>>>>> There is huge evidence that this is important.
>>>>>>
>>>>>>
>>>>>>> - Better documenting use cases with working samples/examples
>>>>>>> (Documentation on
>>>>>>> https://mahout.apache.org/users/basics/algorithms.html is spread
>>>>>>> out and there is too much focus on algorithms as opposed to use
>>>>>>> cases - this is an adoption blocker)
>>>>>>>
>>>>>> This is also important.
>>>>>>
>>>>>>
>>>>>>> - Uniformity of the API set across all algorithms (are we providing
>>>>>>> the same experience across all APIs?)
>>>>>>>
>>>>>> And many people have been tripped up by this.
>>>>>>
>>>>>>
>>>>>>> - Measuring/publishing scalability metrics of various algorithms
>>>>>>> (why would we want users to adopt Mahout vs. other frameworks for
>>>>>>> ML at scale?)
>>>>>>>
>>>>>> I don't see this as important as some of your other points, but is
>>>>>> still useful.
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Quiero ser el rayo de sol que cada día te despierta
>>>> para hacerte respirar y vivir en me.
>>>> "Favola -Moda".
>>>>
>>>
>>
>

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

Yes. I am pretty close to making some fairly big commits in the linalg
department there (distributed DSL expression optimizer and scripted-out SSVD).

We may also want to think about a Scala script engine to run 3rd party
mahout-math scripts or interactive sessions.

-d


On Mon, Mar 3, 2014 at 10:02 AM, Sebastian Schelter <ss...@apache.org> wrote:

> I would like to discuss whether we should start to have some Spark-related
> code in Mahout.
>
> --sebastian
>
>
> On 03/03/2014 06:56 PM, Suneel Marthi wrote:
>
>> Grant had setup a Google Hangout for Mahout sometime last year before 0.8
>> release.  I had one setup too for 0.9 release. I definitely wouldn't want
>> to have a hangout on Saturday or weekend.
>>
>>
>>
>>
>>
>> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> Happy to organize a google hangout.  That has the advantage of allowing
>> more attendees and supporting YouTube archiving.
>>
>> Sent from my iPhone
>>
>>
>>  On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
>>>
>>> Hello All,
>>> Dr.Dunning could you set a meeting next Sat morning, so we can chat and
>>> discuss by skype improvements and what to do and indentify volunteer and
>>> tasks.
>>> Best Regards,
>>> Giorgio
>>>
>>>
>>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
>>>
>>>  Me three
>>>>
>>>>
>>>>  On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>>>>
>>>>> Ravi,
>>>>>
>>>>> Good points.
>>>>>
>>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla
>>>>> <ravi.mummulla@gmail.com> wrote:
>>>>>
>>>>>> - Natively support Windows (guidance, etc. No documentation exists
>>>>>> today, for instance)
>>>>>>
>>>>> There is a bit of demand for that.
>>>>>
>>>>>> - Faster time to first application (from discovery to first
>>>>>> application currently takes a non-trivial amount of effort; how can
>>>>>> we lower the bar and reduce the friction for adoption?)
>>>>>>
>>>>> There is huge evidence that this is important.
>>>>>
>>>>>
>>>>>> - Better documenting use cases with working samples/examples
>>>>>> (Documentation on
>>>>>> https://mahout.apache.org/users/basics/algorithms.html is spread
>>>>>> out and there is too much focus on algorithms as opposed to use
>>>>>> cases - this is an adoption blocker)
>>>>>>
>>>>> This is also important.
>>>>>
>>>>>
>>>>>> - Uniformity of the API set across all algorithms (are we providing
>>>>>> the same experience across all APIs?)
>>>>>>
>>>>> And many people have been tripped up by this.
>>>>>
>>>>>
>>>>>> - Measuring/publishing scalability metrics of various algorithms
>>>>>> (why would we want users to adopt Mahout vs. other frameworks for
>>>>>> ML at scale?)
>>>>>>
>>>>> I don't see this as important as some of your other points, but is
>>>>> still useful.
>>>>>
>>>>
>>>
>>> --
>>> Quiero ser el rayo de sol que cada día te despierta
>>> para hacerte respirar y vivir en me.
>>> "Favola -Moda".
>>>
>>
>

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@apache.org>.

I would like to discuss whether we should start to have some 
Spark-related code in Mahout.

--sebastian

On 03/03/2014 06:56 PM, Suneel Marthi wrote:
> Grant had setup a Google Hangout for Mahout sometime last year before 0.8 release.  I had one setup too for 0.9 release. I definitely wouldn't want to have a hangout on Saturday or weekend.
>
>
>
>
>
> On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com> wrote:
>
> Happy to organize a google hangout.  That has the advantage of allowing more attendees and supporting YouTube archiving.
>
> Sent from my iPhone
>
>
>> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
>>
>> Hello All,
>> Dr.Dunning could you set a meeting next Sat morning, so we can chat and
>> discuss by skype improvements and what to do and indentify volunteer and
>> tasks.
>> Best Regards,
>> Giorgio
>>
>>
>> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
>>
>>> Me three
>>>
>>>
>>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>>>
>>>> Ravi,
>>>>
>>>> Good points.
>>>>
>>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com>
>>>> wrote:
>>>>
>>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>>>> for instance)
>>>> There is a bit of demand for that.
>>>>
>>>> - Faster time to first application (from discovery to first application
>>>>
>>>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>>>> and reduce the friction for adoption?)
>>>> There is huge evidence that this is important.
>>>>
>>>>
>>>>     - Better documenting use cases with working samples/examples
>>>>> (Documentation
>>>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>>>> and
>>>>> there is too much focus on algorithms as opposed to use cases - this is
>>>>> an
>>>>> adoption blocker)
>>>> This is also important.
>>>>
>>>>
>>>> - Uniformity of the API set across all algorithms (are we providing the
>>>>> same experience across all APIs?)
>>>> And many people have been tripped up by this.
>>>>
>>>>
>>>>     - Measuring/publishing scalability metrics of various algorithms (why
>>>>> would
>>>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>>> I don't see this as important as some of your other points, but is still
>>>> useful.
>>
>>
>> --
>> Quiero ser el rayo de sol que cada día te despierta
>> para hacerte respirar y vivir en me.
>> "Favola -Moda".

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

Grant had set up a Google Hangout for Mahout sometime last year before the 0.8 release. I had one set up too for the 0.9 release. I definitely wouldn't want to have a hangout on Saturday or over the weekend.





On Monday, March 3, 2014 12:52 PM, Ted Dunning <te...@gmail.com> wrote:
 
Happy to organize a google hangout.  That has the advantage of allowing more attendees and supporting YouTube archiving. 

Sent from my iPhone


> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
> 
> Hello All,
> Dr.Dunning could you set a meeting next Sat morning, so we can chat and
> discuss by skype improvements and what to do and indentify volunteer and
> tasks.
> Best Regards,
> Giorgio
> 
> 
> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> 
>> Me three
>> 
>> 
>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>> 
>>> Ravi,
>>> 
>>> Good points.
>>> 
>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com>
>>> wrote:
>>> 
>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>>> for instance)
>>> There is a bit of demand for that.
>>> 
>>> - Faster time to first application (from discovery to first application
>>> 
>>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>>> and reduce the friction for adoption?)
>>> There is huge evidence that this is important.
>>> 
>>> 
>>>   - Better documenting use cases with working samples/examples
>>>> (Documentation
>>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>>> and
>>>> there is too much focus on algorithms as opposed to use cases - this is
>>>> an
>>>> adoption blocker)
>>> This is also important.
>>> 
>>> 
>>> - Uniformity of the API set across all algorithms (are we providing the
>>>> same experience across all APIs?)
>>> And many people have been tripped up by this.
>>> 
>>> 
>>>   - Measuring/publishing scalability metrics of various algorithms (why
>>>> would
>>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>> I don't see this as important as some of your other points, but is still
>>> useful.
> 
> 
> -- 
> Quiero ser el rayo de sol que cada día te despierta
> para hacerte respirar y vivir en me.
> "Favola -Moda".

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

Happy to organize a Google Hangout. That has the advantage of allowing more attendees and supporting YouTube archiving.

Sent from my iPhone

> On Mar 3, 2014, at 9:34, Giorgio Zoppi <gi...@gmail.com> wrote:
> 
> Hello All,
> Dr.Dunning could you set a meeting next Sat morning, so we can chat and
> discuss by skype improvements and what to do and indentify volunteer and
> tasks.
> Best Regards,
> Giorgio
> 
> 
> 2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:
> 
>> Me three
>> 
>> 
>>> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>>> 
>>> Ravi,
>>> 
>>> Good points.
>>> 
>>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com>
>>> wrote:
>>> 
>>> - Natively support Windows (guidance, etc. No documentation exists today,
>>>> for instance)
>>> There is a bit of demand for that.
>>> 
>>> - Faster time to first application (from discovery to first application
>>> 
>>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>>> and reduce the friction for adoption?)
>>> There is huge evidence that this is important.
>>> 
>>> 
>>>   - Better documenting use cases with working samples/examples
>>>> (Documentation
>>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>>> and
>>>> there is too much focus on algorithms as opposed to use cases - this is
>>>> an
>>>> adoption blocker)
>>> This is also important.
>>> 
>>> 
>>> - Uniformity of the API set across all algorithms (are we providing the
>>>> same experience across all APIs?)
>>> And many people have been tripped up by this.
>>> 
>>> 
>>>   - Measuring/publishing scalability metrics of various algorithms (why
>>>> would
>>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>> I don't see this as important as some of your other points, but is still
>>> useful.
> 
> 
> -- 
> Quiero ser el rayo de sol que cada día te despierta
> para hacerte respirar y vivir en me.
> "Favola -Moda".

Re: Mahout 1.0 goals

Posted by Giorgio Zoppi <gi...@gmail.com>.

Hello All,
Dr. Dunning, could you set up a meeting next Saturday morning, so we can chat
over Skype to discuss improvements and what to do, and identify volunteers
and tasks?
Best Regards,
Giorgio


2014-03-03 18:30 GMT+01:00 peng <pc...@uowmail.edu.au>:

> Me three
>
>
> On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
>
>> Ravi,
>>
>> Good points.
>>
>> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com>
>> wrote:
>>
>>  - Natively support Windows (guidance, etc. No documentation exists today,
>>> for instance)
>>>
>>>
>> There is a bit of demand for that.
>>
>> - Faster time to first application (from discovery to first application
>>
>>> currently takes a non-trivial amount of effort; how can we lower the bar
>>> and reduce the friction for adoption?)
>>>
>>>
>> There is huge evidence that this is important.
>>
>>
>>    - Better documenting use cases with working samples/examples
>>> (Documentation
>>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>>> and
>>> there is too much focus on algorithms as opposed to use cases - this is
>>> an
>>> adoption blocker)
>>>
>>>
>> This is also important.
>>
>>
>>  - Uniformity of the API set across all algorithms (are we providing the
>>> same experience across all APIs?)
>>>
>>>
>> And many people have been tripped up by this.
>>
>>
>>    - Measuring/publishing scalability metrics of various algorithms (why
>>> would
>>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>>
>>>
>> I don't see this as important as some of your other points, but is still
>> useful.
>>
>>


-- 
I want to be the ray of sunlight that wakes you every day
to make you breathe and live in me.
"Favola -Moda".

Re: Mahout 1.0 goals

Posted by peng <pc...@uowmail.edu.au>.

Me three

On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:
> Ravi,
>
> Good points.
>
> On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com> wrote:
>
>> - Natively support Windows (guidance, etc. No documentation exists today,
>> for instance)
>>
>
> There is a bit of demand for that.
>
> - Faster time to first application (from discovery to first application
>> currently takes a non-trivial amount of effort; how can we lower the bar
>> and reduce the friction for adoption?)
>>
>
> There is huge evidence that this is important.
>
>
>>   - Better documenting use cases with working samples/examples
>> (Documentation
>> on https://mahout.apache.org/users/basics/algorithms.html is spread out
>> and
>> there is too much focus on algorithms as opposed to use cases - this is an
>> adoption blocker)
>>
>
> This is also important.
>
>
>> - Uniformity of the API set across all algorithms (are we providing the
>> same experience across all APIs?)
>>
>
> And many people have been tripped up by this.
>
>
>>   - Measuring/publishing scalability metrics of various algorithms (why
>> would
>> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>>
>
> I don't see this as important as some of your other points, but is still
> useful.
>

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

Ravi,

Good points.

On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla <ra...@gmail.com> wrote:

> - Natively support Windows (guidance, etc. No documentation exists today,
> for instance)
>

There is a bit of demand for that.

> - Faster time to first application (from discovery to first application
> currently takes a non-trivial amount of effort; how can we lower the bar
> and reduce the friction for adoption?)
>

There is huge evidence that this is important.


>  - Better documenting use cases with working samples/examples
> (Documentation
> on https://mahout.apache.org/users/basics/algorithms.html is spread out
> and
> there is too much focus on algorithms as opposed to use cases - this is an
> adoption blocker)
>

This is also important.


> - Uniformity of the API set across all algorithms (are we providing the
> same experience across all APIs?)
>

And many people have been tripped up by this.


>  - Measuring/publishing scalability metrics of various algorithms (why
> would
> we want users to adopt Mahout vs. other frameworks for ML at scale?)
>

I don't see this as being as important as some of your other points, but it
is still useful.

Re: Mahout 1.0 goals

Posted by Ravi Mummulla <ra...@gmail.com>.

May I suggest that we bucketize the 1.0 scope into various themes:
- Enhanced first experience / user experience
- Improving the existing framework
- Extending the existing framework (enabling new scenarios/use cases, etc.)

That said, has anyone been thinking about the following?
- Natively support Windows (guidance, etc. No documentation exists today,
for instance)
- Faster time to first application (from discovery to first application
currently takes a non-trivial amount of effort; how can we lower the bar
and reduce the friction for adoption?)
- Better documenting use cases with working samples/examples (Documentation
on https://mahout.apache.org/users/basics/algorithms.html is spread out and
there is too much focus on algorithms as opposed to use cases - this is an
adoption blocker)
- Uniformity of the API set across all algorithms (are we providing the
same experience across all APIs?)
- Measuring/publishing scalability metrics of various algorithms (why would
we want users to adopt Mahout vs. other frameworks for ML at scale?)

Thanks.


On Sat, Mar 1, 2014 at 8:55 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> On Sat, Mar 1, 2014 at 5:05 AM, Sebastian Schelter <ss...@apache.org> wrote:
>
> >
> >
> > I must say that I think that the architecture of Oryx is really what I
> > would envision for Mahout. Provide a computation layer for training
> > models and a serving layer with a REST API or Solr for deploying them.
>
>
> I am dubious about the designation of Mahout as a service (of any kind). It
> should be easy to embed and customize, either online or offline. But
> service... I am more along the lines of scikit-learn here. The use case
> patterns (at least in my case) are hard to fit into a rigid black box.
> Looking back (say homonym filtering) I couldn't have done it with a black
> box. I'd leave it to infrastructure engineers to put it into an ad-hoc
> service.
>



-- 
Thanks.

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Sat, Mar 1, 2014 at 5:05 AM, Sebastian Schelter <ss...@apache.org> wrote:

>
>
> I must say that I think that the architecture of Oryx is really what I
> would envision for Mahout. Provide a computation layer for training models
> and a serving layer with a REST API or Solr for deploying them.

I am dubious about the designation of Mahout as a service (of any kind). It
should be easy to embed and customize, either online or offline. But
service... I am more along the lines of scikit-learn here. The use case
patterns (at least in my case) are hard to fit into a rigid black box.
Looking back (say homonym filtering) I couldn't have done it with a black
box. I'd leave it to infrastructure engineers to put it into an ad-hoc service.

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

Great step, thanks Frank

> On Mar 1, 2014, at 10:29 AM, Frank Scholten <fr...@frankscholten.nl> wrote:
> 
> I got inspired by the discussion so I took a first step in reducing Hadoop
> dependencies in the naive bayes code.
> 
> See my Github branch:
> https://github.com/frankscholten/mahout/tree/naivebayes-modelrepository
> 
> I introduced a repository class for reading and writing the NaiveBayesModel
> to and from HDFS.
> 
> Turns out we store the model in 2 ways: in a HDFS folder structure and in
> an HDFS file. The code I added makes this explicit.
> 
> In this branch NaiveBayesModel only depends on Vector, Matrix and
> Preconditions but no longer on Hadoop.
> 
> If we apply this approach to the other models in Mahout we could get
> rid of a lot of Hadoop dependencies.
> 
> Frank
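
A minimal sketch of the repository idea quoted above, in plain Java. All
names here (ToyModel, ModelStore and its two implementations) are
hypothetical and are not taken from the branch; the model is a stand-in that
holds only an array of weights. The model class has no I/O types in its API,
and each storage backend sits behind one small interface, so an HDFS-backed
store would be one more implementation next to the two shown.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Mahout under these names.
class ModelStoreSketch {

    /** A model that is plain data: no Hadoop (or any I/O) types in its API. */
    static final class ToyModel {
        final double[] weights;
        ToyModel(double[] weights) { this.weights = weights.clone(); }
    }

    /** Persistence lives behind an interface, one implementation per backend. */
    interface ModelStore {
        void save(String name, ToyModel model) throws IOException;
        ToyModel load(String name) throws IOException;
    }

    /** Local-filesystem backend. An HDFS backend would be a sibling class. */
    static final class LocalFileStore implements ModelStore {
        private final Path dir;
        LocalFileStore(Path dir) { this.dir = dir; }

        @Override
        public void save(String name, ToyModel model) throws IOException {
            try (DataOutputStream out =
                     new DataOutputStream(Files.newOutputStream(dir.resolve(name)))) {
                out.writeInt(model.weights.length);
                for (double w : model.weights) {
                    out.writeDouble(w);
                }
            }
        }

        @Override
        public ToyModel load(String name) throws IOException {
            try (DataInputStream in =
                     new DataInputStream(Files.newInputStream(dir.resolve(name)))) {
                double[] weights = new double[in.readInt()];
                for (int i = 0; i < weights.length; i++) {
                    weights[i] = in.readDouble();
                }
                return new ToyModel(weights);
            }
        }
    }

    /** In-memory backend, handy for unit tests that should not touch disk. */
    static final class InMemoryStore implements ModelStore {
        private final Map<String, double[]> models = new HashMap<>();

        @Override
        public void save(String name, ToyModel model) {
            models.put(name, model.weights.clone());
        }

        @Override
        public ToyModel load(String name) throws IOException {
            double[] weights = models.get(name);
            if (weights == null) {
                throw new IOException("no model named " + name);
            }
            return new ToyModel(weights);
        }
    }

    public static void main(String[] args) throws IOException {
        ToyModel model = new ToyModel(new double[] {0.25, -1.5, 3.0});
        ModelStore[] stores = {
            new LocalFileStore(Files.createTempDirectory("models")),
            new InMemoryStore()
        };
        // The calling code is identical for every backend.
        for (ModelStore store : stores) {
            store.save("nb.bin", model);
            System.out.println(Arrays.toString(store.load("nb.bin").weights));
        }
    }
}
```

The design choice being illustrated is that training and scoring code only
ever see ToyModel, while the decision about where bytes live is made once,
by whoever constructs the store.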
> 
> 
> On Sat, Mar 1, 2014 at 5:32 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
> 
>> On Sat, Mar 1, 2014 at 2:05 PM, Sebastian Schelter <ss...@apache.org> wrote:
>> 
>>> Hi,
>>> 
>>> I think this is an important discussion to have and it's good that we have
>>> it. I wish I could say different, but I encountered a lot of the
>>> impressions that Sean mentioned. To be honest, I don't see Mahout being
>>> ready to move to 1.0 in its current state.
>>> 
>>> I still see our main problem in failing to provide viable documentation
>>> and guidance to users. We cleaned up the wiki, but this is only a first
>>> step. I feel that it is extremely hard for people to use a majority of our
>>> algorithms, except if they do understand the mathematical details and are
>>> willing to dig through the source code. I think Mahout contains a lot of
>>> "hidden gems" that make it unique (e.g. Cooccurrence Analysis with
>>> RowSimilarityJob, LDA with CVB, SSVD+PCA) but for the majority of users
>>> these gems are out of reach.
>>> 
>>> Another important aspect is that machine learning on MapReduce will
>>> vanish very soon and there's no vision to move Mahout to more suitable
>>> platforms yet.
>> 
>> 
>> Before we can even work on supporting other platforms we have to handle
>> the Hadoop dependencies in the codebase. Perhaps we can start to slowly but
>> surely reduce the dependencies on Hadoop or at least contain them by adding
>> more abstraction. Only MR code should be using the Hadoop API IMO.
>> 
>> For example, many classes depend on Hadoop for serializing and
>> deserializing models. Perhaps we can make it so a model can be written to
>> or read from some model interface, which can have implementations for HDFS,
>> the local filesystem or perhaps even a remote API. Take NeuralNetwork for
>> instance. It has dependencies on Hadoop but only for reading and writing
>> the model to and from HDFS.
>> 
>> 
>>> I think our lack of documentation causes a lack of users which stalls the
>>> development and, together with the emergence of other platforms like Spark,
>>> makes it hard for us to attract new people.
>> 
>> 
>> Here is a radical idea: how about creating reference documentation, i.e. a
>> single PDF or HTML? This can be generated using Maven docbook. If the docs
>> are part of the code and generated, users can contribute patches to the
>> documentation because it sits along the source code. We might even be able
>> to generate algorithm characteristics (sequential, MR) from the source code
>> using a script, perhaps through annotations. We move the current Wiki docs
>> inside the project and create Wiki pages only for logistical project
>> information about Mahout and Apache.
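
One way the annotation idea quoted above could work, as a minimal sketch in
plain Java. The @Algorithm annotation, the Execution enum and the two toy
classes are all hypothetical; nothing like them is claimed to exist in
Mahout. The annotation is retained at runtime so that a small generator can
read it through reflection and emit one row of a reference table per class.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Locale;

// Hypothetical sketch; the annotation and the classes below are made up.
class AlgorithmDocSketch {

    enum Execution { SEQUENTIAL, MAPREDUCE }

    /** Kept at runtime so a doc generator can read it via reflection. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Algorithm {
        String name();
        Execution[] execution();
    }

    @Algorithm(name = "Toy k-means",
               execution = {Execution.SEQUENTIAL, Execution.MAPREDUCE})
    static class ToyKMeans {}

    @Algorithm(name = "Toy SGD classifier", execution = Execution.SEQUENTIAL)
    static class ToySgdClassifier {}

    /** Renders one row of a plain-text reference table for an annotated class. */
    static String describe(Class<?> type) {
        Algorithm info = type.getAnnotation(Algorithm.class);
        if (info == null) {
            throw new IllegalArgumentException(type.getName() + " is not annotated");
        }
        StringBuilder modes = new StringBuilder();
        for (Execution mode : info.execution()) {
            if (modes.length() > 0) {
                modes.append(", ");
            }
            modes.append(mode.name().toLowerCase(Locale.ROOT));
        }
        return info.name() + " | " + modes;
    }

    public static void main(String[] args) {
        // A real generator would scan the classpath; the list is hard-coded here.
        Class<?>[] algorithms = {ToyKMeans.class, ToySgdClassifier.class};
        for (Class<?> type : algorithms) {
            System.out.println(describe(type));
        }
    }
}
```

Because the characteristics sit next to the code they describe, a patch that
adds a MapReduce variant of an algorithm would update the generated docs in
the same change.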
>> 
>> Let me know what you think. I can make tickets for these two issues if
>> there is enough interest.
>> 
>> 
>>> 
>>> I must say that I think that the architecture of Oryx is really what I
>>> would envision for Mahout. Provide a computation layer for training models
>>> and a serving layer with a REST API or Solr for deploying them. And then
>>> abstract the training in the computation layer to enable training
>>> in-memory, with Hadoop, Spark, Stratosphere, you name it. I was very
>>> emotional when he had the discussion after Oryx was announced as a separate
>>> project because I felt that this is what Mahout should have become.
>> 
>> If Mahout has a well designed Java API, a REST layer can be added easily
>> via other frameworks.
>> 
>> Frank
>> 
>> 
>>> Just my 2 cents,
>>> Sebastian
>>> 
>>> 
>>>> On 02/28/2014 10:56 AM, Sean Owen wrote:
>>>> 
>>>> OK, your defeatism is my realism. Why has Negative Nancy intruded on
>>>> this conversation?
>>>> 
>>>> I have a view into many large Hadoop users. The feedback from the
>>>> minority that have tried Mahout is that it is inconsistent/unfinished
>>>> ("a confederation of unrelated grad-school projects" as one put it),
>>>> buggy, and hard to use except as a few copied snippets of code. Ouch!
>>>> 
>>>> Only a handful that I'm aware of actually use it. Internally, there is
>>>> a perception that there is no community attention to most of the code
>>>> (see JIRA backlog). As a result -- software problems, community
>>>> issues, little demand -- it is almost certainly not going to be in our
>>>> next major packaging release, and was almost not in the current
>>>> forthcoming one.
>>>> 
>>>> Your Reality May Vary. This seems like yellow-flag territory for an
>>>> Apache project though, if this is representative of a wider reality.
>>>> So a conversation about whole other projects' worth of new
>>>> functionality feels quite disconnected -- red-flag territory.
>>>> 
>>>> To be constructive, here are four items that seem more important for
>>>> something like "1.0.0" and are even a lot less work:
>>>> 
>>>> - Use Hadoop .mapreduce API consistently
>>>> - Standardize input output formats of all jobs
>>>> - Remove use of deprecated code
>>>> - Clear even a third of the open JIRA backlog
>>>> 
>>>> (I still think it's fine to make different projects for quite
>>>> different ideas. Hadoop has another ML project, and is about to have
>>>> another other ML project. These good ideas might well better belong
>>>> there. Here, I think there is a big need for shoring up if it's even
>>>> going to survive to 1.0.)
>>>> 
>>>> On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:
>>>> 
>>>> I think each of several
>>>>> other of these points are probably on their own several times the
>>>>> amount of
>>>>> work that has been put into this project over the past year so I'm
>>>>> wondering if this close to realistic as a to do list for 1.0 of this
>>>>> project.
>>>> That is means.  I think that everything on this list is possible in
>>>> relatively short order, but let's talk goals for a bit.
>>>> 
>>>> What is missing here?  What really doesn't matter?
>>

Re: Mahout 1.0 goals

Posted by Frank Scholten <fr...@frankscholten.nl>.

I got inspired by the discussion so I took a first step in reducing Hadoop
dependencies in the naive bayes code.

See my Github branch:
https://github.com/frankscholten/mahout/tree/naivebayes-modelrepository

I introduced a repository class for reading and writing the NaiveBayesModel
to and from HDFS.

Turns out we store the model in 2 ways: in an HDFS folder structure and in
an HDFS file. The code I added makes this explicit.

In this branch NaiveBayesModel only depends on Vector, Matrix and
Preconditions but no longer on Hadoop.

If we apply this approach to the other models in Mahout we could get
rid of a lot of Hadoop dependencies.
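
A rough sketch of the repository idea, not the code from the branch: the
model class holds only plain data, and a separate repository class owns all
I/O and names the two storage layouts explicitly. ToyBayesModel,
ToyBayesModelRepository, and the file names are hypothetical, and the local
filesystem (java.nio) stands in for HDFS so the example needs no Hadoop
dependency.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Toy stand-in for a model: plain arrays, no storage-related imports.
final class ToyBayesModel {
  final double[] labelWeights;
  final double[] featureWeights;

  ToyBayesModel(double[] labelWeights, double[] featureWeights) {
    this.labelWeights = labelWeights.clone();
    this.featureWeights = featureWeights.clone();
  }
}

// Hypothetical repository: all I/O lives here, one method pair per layout.
final class ToyBayesModelRepository {
  private ToyBayesModelRepository() {}

  // Layout 1: the whole model in a single file.
  static void writeToFile(ToyBayesModel m, Path file) throws IOException {
    try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
      writeArray(out, m.labelWeights);
      writeArray(out, m.featureWeights);
    }
  }

  static ToyBayesModel readFromFile(Path file) throws IOException {
    try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
      double[] labels = readArray(in);
      double[] features = readArray(in);
      return new ToyBayesModel(labels, features);
    }
  }

  // Layout 2: a folder with one file per model component.
  static void writeToFolder(ToyBayesModel m, Path dir) throws IOException {
    Files.createDirectories(dir);
    writeArrayFile(dir.resolve("labelWeights"), m.labelWeights);
    writeArrayFile(dir.resolve("featureWeights"), m.featureWeights);
  }

  static ToyBayesModel readFromFolder(Path dir) throws IOException {
    double[] labels = readArrayFile(dir.resolve("labelWeights"));
    double[] features = readArrayFile(dir.resolve("featureWeights"));
    return new ToyBayesModel(labels, features);
  }

  private static void writeArrayFile(Path file, double[] values) throws IOException {
    try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
      writeArray(out, values);
    }
  }

  private static double[] readArrayFile(Path file) throws IOException {
    try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
      return readArray(in);
    }
  }

  // Length-prefixed encoding of a double array.
  private static void writeArray(DataOutputStream out, double[] values) throws IOException {
    out.writeInt(values.length);
    for (double v : values) {
      out.writeDouble(v);
    }
  }

  private static double[] readArray(DataInputStream in) throws IOException {
    double[] values = new double[in.readInt()];
    for (int i = 0; i < values.length; i++) {
      values[i] = in.readDouble();
    }
    return values;
  }
}
```

With this split, swapping the local filesystem for HDFS would only touch the
repository class; the model itself stays free of storage concerns.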

Frank


On Sat, Mar 1, 2014 at 5:32 PM, Frank Scholten <fr...@frankscholten.nl> wrote:

> On Sat, Mar 1, 2014 at 2:05 PM, Sebastian Schelter <ss...@apache.org> wrote:
>
>> Hi,
>>
>> I think this is an important discussion to have and its good that we have
>> it. I wish I could say different, but I encountered a lot of the
>> impressions that Sean mentioned. To be honest, I don't see Mahout being
>> ready to move to 1.0 in its current state.
>>
>> I still see our main problem in failing to provide viable documentation
>> and guidance to users. We cleaned up the wiki, but this is only a first
>> step. I feel that it is extremely hard for people to use a majority of our
>> algorithms, except if they do understand the mathematical details and are
>> willing to dig through the source code. I think Mahout contains a lot of
>> "hidden gems" that make it unique (e.g. Cooccurrence Analysis with
>> RowSimilarityJob, LDA with CVB, SSVD+PCA) but for the majority of users
>> these gems are out of reach.
>>
>> Another important aspect is that machine learning on MapReduce will
>> vanish very soon and there's no vision to move Mahout to more suitable
>> platforms yet.
>>
>
>
> Before we can even work on supporting other platforms we have to handle
> the Hadoop dependencies in the codebase. Perhaps we can start to slowly but
> surely reduce the dependencies on Hadoop or at least contain them by adding
> more abstraction. Only MR code should be using the Hadoop API IMO.
>
> For example, many classes depend on Hadoop for serializing and
> deserializing models. Perhaps we can make it so a model can be written to
> or read from some model interface, which can have implementations for HDFS,
> the local filesystem or perhaps even a remote API. Take NeuralNetwork for
> instance. It has dependencies on Hadoop but only for reading and writing
> the model to and from HDFS.
>
>
>> I think our lack of documentation causes a lack of users which stalls the
>> development and, together with the emergence of other platforms like Spark,
>> makes it hard for us to attract new people.
>>
>
>
> Here is a radical idea: how about creating reference documentation, i.e. a
> single PDF or HTML? This can be generated using Maven docbook. If the docs
> are part of the code and generated, users can contribute patches to the
> documentation because it sits along the source code. We might even be able
> to generate algorithm characteristics (sequential, MR) from the source code
> using a script, perhaps through annotations. We move the current Wiki docs
> inside the project and create Wiki pages only for logistical project
> information about Mahout and Apache.
>
> Let me know what you think. I can make tickets for these two issues of
> there is enough interest.
>
>
>>
>> I must say that I think that the architecture of Oryx is really what I
>> would envision for Mahout. Provide a computation layer for training models
>> and a serving layer with a REST API or Solr for deploying them. And then
>> abstract the training in the computation layer to enable training
>> in-memory, with Hadoop, Spark, Stratosphere, you name it. I was very
>> emotional when he had the discussion after Oryx was announced as a separate
>> project because I felt that this is what Mahout should have become.
>>
>
> If Mahout has a well designed Java API, a REST layer can be added easily
> via other frameworks.
>
> Frank
>
>
>> Just my 2 cents,
>> Sebastian
>>
>>
>> On 02/28/2014 10:56 AM, Sean Owen wrote:
>>
>>> OK, your defeatism is my realism. Why has Negative Nancy intruded on
>>> this conversation?
>>>
>>> I have a view into many large Hadoop users. The feedback from the
>>> minority that have tried Mahout is that it is inconsistent/unfinished
>>> ("a confederation of unrelated grad-school projects" as one put it),
>>> buggy, and hard to use except as a few copied snippets of code. Ouch!
>>>
>>> Only a handful that I'm aware of actually use it. Internally, there is
>>> a perception that there is no community attention to most of the code
>>> (see JIRA backlog). As a result -- software problems, community
>>> issues, little demand -- it is almost certainly not going to be in our
>>> next major packaging release, and was almost not in the current
>>> forthcoming one.
>>>
>>> Your Reality May Vary. This seems like yellow-flag territory for an
>>> Apache project though, if this is representative of a wider reality.
>>> So a conversation about whole other projects' worth of new
>>> functionality feels quite disconnected -- red-flag territory.
>>>
>>> To be constructive, here are four items that seem more important for
>>> something like "1.0.0" and are even a lot less work:
>>>
>>> - Use Hadoop .mapreduce API consistently
>>> - Standardize input output formats of all jobs
>>> - Remove use of deprecated code
>>> - Clear even a third of the open JIRA backlog
>>>
>>> (I still think it's fine to make different projects for quite
>>> different ideas. Hadoop has another ML project, and is about to have
>>> another other ML project. These good ideas might well better belong
>>> there. Here, I think there is a big need for shoring up if it's even
>>> going to survive to 1.0.)
>>>
>>> On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>  I think each of several
>>>> other of these points are probably on their own several times the
>>>> amount of
>>>> work that has been put into this project over the past year so I'm
>>>> wondering if this close to realistic as a to do list for 1.0 of this
>>>> project.
>>>>
>>>>
>>> That is means.  I think that everything on this list is possible in
>>> relatively short order, but let's talk goals for a bit.
>>>
>>> What is missing here?  What really doesn't matter?
>>>
>>>
>>
>

Re: Mahout 1.0 goals

Posted by Frank Scholten <fr...@frankscholten.nl>.

On Sat, Mar 1, 2014 at 2:05 PM, Sebastian Schelter <ss...@apache.org> wrote:

> Hi,
>
> I think this is an important discussion to have and its good that we have
> it. I wish I could say different, but I encountered a lot of the
> impressions that Sean mentioned. To be honest, I don't see Mahout being
> ready to move to 1.0 in its current state.
>
> I still see our main problem in failing to provide viable documentation
> and guidance to users. We cleaned up the wiki, but this is only a first
> step. I feel that it is extremely hard for people to use a majority of our
> algorithms, except if they do understand the mathematical details and are
> willing to dig through the source code. I think Mahout contains a lot of
> "hidden gems" that make it unique (e.g. Cooccurrence Analysis with
> RowSimilarityJob, LDA with CVB, SSVD+PCA) but for the majority of users
> these gems are out of reach.
>
> Another important aspect is that machine learning on MapReduce will vanish
> very soon and there's no vision to move Mahout to more suitable platforms
> yet.
>


Before we can even work on supporting other platforms we have to handle the
Hadoop dependencies in the codebase. Perhaps we can start to slowly but
surely reduce the dependencies on Hadoop or at least contain them by adding
more abstraction. Only MR code should be using the Hadoop API IMO.

For example, many classes depend on Hadoop for serializing and
deserializing models. Perhaps we can make it so a model can be written to
or read from some model interface, which can have implementations for HDFS,
the local filesystem or perhaps even a remote API. Take NeuralNetwork for
instance. It has dependencies on Hadoop but only for reading and writing
the model to and from HDFS.
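
A minimal sketch of what such a model interface could look like, in plain
Java with no Hadoop imports. ModelStore, LocalFileModelStore,
InMemoryModelStore, and LinearModel are hypothetical names, not existing
Mahout classes; an HDFS-backed implementation is left out and would be the
only class that needs the Hadoop API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction: models see streams, never a concrete storage API.
interface ModelStore {
  OutputStream openForWrite(String name) throws IOException;

  InputStream openForRead(String name) throws IOException;
}

// Local-filesystem implementation backed by java.nio.
final class LocalFileModelStore implements ModelStore {
  private final Path baseDir;

  LocalFileModelStore(Path baseDir) {
    this.baseDir = baseDir;
  }

  @Override
  public OutputStream openForWrite(String name) throws IOException {
    Files.createDirectories(baseDir);
    return Files.newOutputStream(baseDir.resolve(name));
  }

  @Override
  public InputStream openForRead(String name) throws IOException {
    return Files.newInputStream(baseDir.resolve(name));
  }
}

// In-memory implementation, handy for unit tests without any filesystem.
final class InMemoryModelStore implements ModelStore {
  private final Map<String, ByteArrayOutputStream> blobs = new HashMap<>();

  @Override
  public OutputStream openForWrite(String name) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    blobs.put(name, out);
    return out;
  }

  @Override
  public InputStream openForRead(String name) throws IOException {
    ByteArrayOutputStream out = blobs.get(name);
    if (out == null) {
      throw new IOException("no such model: " + name);
    }
    return new ByteArrayInputStream(out.toByteArray());
  }
}

// Toy model (a weight vector) that serializes itself through any ModelStore.
final class LinearModel {
  final double[] weights;

  LinearModel(double[] weights) {
    this.weights = weights.clone();
  }

  void save(ModelStore store, String name) throws IOException {
    try (DataOutputStream out = new DataOutputStream(store.openForWrite(name))) {
      out.writeInt(weights.length);
      for (double w : weights) {
        out.writeDouble(w);
      }
    }
  }

  static LinearModel load(ModelStore store, String name) throws IOException {
    try (DataInputStream in = new DataInputStream(store.openForRead(name))) {
      double[] w = new double[in.readInt()];
      for (int i = 0; i < w.length; i++) {
        w[i] = in.readDouble();
      }
      return new LinearModel(w);
    }
  }
}
```

The same save/load code then works against either store, e.g.
new LinearModel(w).save(new InMemoryModelStore(), "toy") in a test and a
LocalFileModelStore in a standalone run.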


> I think our lack of documentation causes a lack of users which stalls the
> development and, together with the emergence of other platforms like Spark,
> makes it hard for us to attract new people.
>


Here is a radical idea: how about creating reference documentation, i.e. a
single PDF or HTML? This can be generated using Maven docbook. If the docs
are part of the code and generated, users can contribute patches to the
documentation because it sits along the source code. We might even be able
to generate algorithm characteristics (sequential, MR) from the source code
using a script, perhaps through annotations. We move the current Wiki docs
inside the project and create Wiki pages only for logistical project
information about Mahout and Apache.
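
To illustrate the annotation idea, here is a small sketch, assuming a
runtime-retained annotation that a doc-generation script reads via
reflection. AlgorithmInfo, ExecutionMode, AlgorithmDocGenerator, and the two
example classes are hypothetical, and the modes attached to them are
illustrative only, not a statement about what Mahout actually implements.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Locale;

// Hypothetical execution characteristics an algorithm can declare.
enum ExecutionMode { SEQUENTIAL, MAPREDUCE }

// Hypothetical annotation; RUNTIME retention makes it visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface AlgorithmInfo {
  String name();

  ExecutionMode[] modes();
}

// Placeholder classes standing in for real algorithm drivers.
@AlgorithmInfo(name = "k-means", modes = {ExecutionMode.SEQUENTIAL, ExecutionMode.MAPREDUCE})
final class KMeansExample {}

@AlgorithmInfo(name = "Naive Bayes", modes = {ExecutionMode.MAPREDUCE})
final class NaiveBayesExample {}

// Emits one documentation line per annotated class; unannotated classes are skipped.
final class AlgorithmDocGenerator {
  private AlgorithmDocGenerator() {}

  static String describe(Class<?>... classes) {
    StringBuilder sb = new StringBuilder();
    for (Class<?> c : classes) {
      AlgorithmInfo info = c.getAnnotation(AlgorithmInfo.class);
      if (info == null) {
        continue;
      }
      sb.append(info.name()).append(": ");
      ExecutionMode[] modes = info.modes();
      for (int i = 0; i < modes.length; i++) {
        if (i > 0) {
          sb.append(", ");
        }
        sb.append(modes[i].name().toLowerCase(Locale.ROOT));
      }
      sb.append('\n');
    }
    return sb.toString();
  }
}
```

Calling AlgorithmDocGenerator.describe(KMeansExample.class,
NaiveBayesExample.class) yields the lines "k-means: sequential, mapreduce"
and "Naive Bayes: mapreduce", which a build step could write into the
generated reference docs.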

Let me know what you think. I can make tickets for these two issues if
there is enough interest.


>
> I must say that I think that the architecture of Oryx is really what I
> would envision for Mahout. Provide a computation layer for training models
> and a serving layer with a REST API or Solr for deploying them. And then
> abstract the training in the computation layer to enable training
> in-memory, with Hadoop, Spark, Stratosphere, you name it. I was very
> emotional when he had the discussion after Oryx was announced as a separate
> project because I felt that this is what Mahout should have become.
>

If Mahout has a well designed Java API, a REST layer can be added easily
via other frameworks.

Frank


> Just my 2 cents,
> Sebastian
>
>
> On 02/28/2014 10:56 AM, Sean Owen wrote:
>
>> OK, your defeatism is my realism. Why has Negative Nancy intruded on
>> this conversation?
>>
>> I have a view into many large Hadoop users. The feedback from the
>> minority that have tried Mahout is that it is inconsistent/unfinished
>> ("a confederation of unrelated grad-school projects" as one put it),
>> buggy, and hard to use except as a few copied snippets of code. Ouch!
>>
>> Only a handful that I'm aware of actually use it. Internally, there is
>> a perception that there is no community attention to most of the code
>> (see JIRA backlog). As a result -- software problems, community
>> issues, little demand -- it is almost certainly not going to be in our
>> next major packaging release, and was almost not in the current
>> forthcoming one.
>>
>> Your Reality May Vary. This seems like yellow-flag territory for an
>> Apache project though, if this is representative of a wider reality.
>> So a conversation about whole other projects' worth of new
>> functionality feels quite disconnected -- red-flag territory.
>>
>> To be constructive, here are four items that seem more important for
>> something like "1.0.0" and are even a lot less work:
>>
>> - Use Hadoop .mapreduce API consistently
>> - Standardize input output formats of all jobs
>> - Remove use of deprecated code
>> - Clear even a third of the open JIRA backlog
>>
>> (I still think it's fine to make different projects for quite
>> different ideas. Hadoop has another ML project, and is about to have
>> another other ML project. These good ideas might well better belong
>> there. Here, I think there is a big need for shoring up if it's even
>> going to survive to 1.0.)
>>
>> On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:
>>
>>  I think each of several
>>> other of these points are probably on their own several times the amount
>>> of
>>> work that has been put into this project over the past year so I'm
>>> wondering if this close to realistic as a to do list for 1.0 of this
>>> project.
>>>
>>>
>> That is means.  I think that everything on this list is possible in
>> relatively short order, but let's talk goals for a bit.
>>
>> What is missing here?  What really doesn't matter?
>>
>>
>

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@apache.org>.

Hi,

I think this is an important discussion to have and it's good that we
have it. I wish I could say different, but I encountered a lot of the 
impressions that Sean mentioned. To be honest, I don't see Mahout being 
ready to move to 1.0 in its current state.

I still see our main problem in failing to provide viable documentation 
and guidance to users. We cleaned up the wiki, but this is only a first 
step. I feel that it is extremely hard for people to use a majority of 
our algorithms, except if they do understand the mathematical details 
and are willing to dig through the source code. I think Mahout contains 
a lot of "hidden gems" that make it unique (e.g. Cooccurrence Analysis 
with RowSimilarityJob, LDA with CVB, SSVD+PCA) but for the majority of 
users these gems are out of reach.

Another important aspect is that machine learning on MapReduce will 
vanish very soon and there's no vision to move Mahout to more suitable 
platforms yet.

I think our lack of documentation causes a lack of users which stalls 
the development and, together with the emergence of other platforms like 
Spark, makes it hard for us to attract new people.

I must say that I think that the architecture of Oryx is really what I 
would envision for Mahout. Provide a computation layer for training 
models and a serving layer with a REST API or Solr for deploying them. 
And then abstract the training in the computation layer to enable 
training in-memory, with Hadoop, Spark, Stratosphere, you name it. I was 
very emotional when we had the discussion after Oryx was announced as a
separate project because I felt that this is what Mahout should have become.

Just my 2 cents,
Sebastian

On 02/28/2014 10:56 AM, Sean Owen wrote:
> OK, your defeatism is my realism. Why has Negative Nancy intruded on
> this conversation?
>
> I have a view into many large Hadoop users. The feedback from the
> minority that have tried Mahout is that it is inconsistent/unfinished
> ("a confederation of unrelated grad-school projects" as one put it),
> buggy, and hard to use except as a few copied snippets of code. Ouch!
>
> Only a handful that I'm aware of actually use it. Internally, there is
> a perception that there is no community attention to most of the code
> (see JIRA backlog). As a result -- software problems, community
> issues, little demand -- it is almost certainly not going to be in our
> next major packaging release, and was almost not in the current
> forthcoming one.
>
> Your Reality May Vary. This seems like yellow-flag territory for an
> Apache project though, if this is representative of a wider reality.
> So a conversation about whole other projects' worth of new
> functionality feels quite disconnected -- red-flag territory.
>
> To be constructive, here are four items that seem more important for
> something like "1.0.0" and are even a lot less work:
>
> - Use Hadoop .mapreduce API consistently
> - Standardize input output formats of all jobs
> - Remove use of deprecated code
> - Clear even a third of the open JIRA backlog
>
> (I still think it's fine to make different projects for quite
> different ideas. Hadoop has another ML project, and is about to have
> another other ML project. These good ideas might well better belong
> there. Here, I think there is a big need for shoring up if it's even
> going to survive to 1.0.)
>
> On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> I think each of several
>> other of these points are probably on their own several times the amount of
>> work that has been put into this project over the past year so I'm
>> wondering if this close to realistic as a to do list for 1.0 of this
>> project.
>>
>
> That is means.  I think that everything on this list is possible in
> relatively short order, but let's talk goals for a bit.
>
> What is missing here?  What really doesn't matter?
>

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

On Fri, Feb 28, 2014 at 12:30 PM, Andrew Musselman <
andrew.musselman@gmail.com> wrote:

> > Like i said, i believe the future is in moving ahead, build on strengths
> > and finding unique proposition. I agree with the above in a sense  that
> > out-of-core stuff that runs over MR could use some unification. I know you
> > have done a lot in that department and I assume since you are writing to
> > dev list, you are looking to help with that going forward. Cause if  not...
> > the dev lists are not exactly created to be an open forum for just giving
> > lectures.
> >
>
> Can we agree that before we put an integer version on Mahout that it needs
> some tender-loving care, and that we can still have high hopes?


+10 from me.

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

Sounds good to me.

On Fri, Feb 28, 2014 at 1:21 PM, Suneel Marthi <su...@yahoo.com> wrote:

> First steps towards the "loving care" (in my view) :-
>
> a) Address the issues that Sean's brought up. I wasn't aware of (i) in that
> list else I would have ensured that they were addressed in 0.9.
>
> b) Most of the backlog JIRAs (about 28 of them today) go all the way back
> to the initial stages of Mahout's evolution (pre 0.5).  Some of them may
> just have to be closed and resolved as "Will not do" or "Times Immemorial".
>
> c) Fix algorithms that presently have half-baked code in them like Naive
> Bayes classifier (why is the thetaSummer commented out - either we don't
> need it or does it need fixing?),  Streaming KMeans - lacks adequate test
> coverage and still fails along the different paths and the same goes for
> other clustering algorithms too.
>
>
>
>
>
>
>
> On Friday, February 28, 2014 3:30 PM, Andrew Musselman <
> andrew.musselman@gmail.com> wrote:
>
> >
> > >
> > > To be constructive, here are four items that seem more important for
> > > something like "1.0.0" and are even a lot less work:
> > >
> > > - Use Hadoop .mapreduce API consistently
> > > - Standardize input output formats of all jobs
> > > - Remove use of deprecated code
> > > - Clear even a third of the open JIRA backlog
> > >
> >
> > Like i said, i believe the future is in moving ahead, build on strengths
> > and finding unique proposition. I agree with the above in a sense  that
> > out-of-core stuff that runs over MR could use some unification. I know you
> > have done a lot in that department and I assume since you are writing to
> > dev list, you are looking to help with that going forward. Cause if  not...
> > the dev lists are not exactly created to be an open forum for just giving
> > lectures.
> >
>
> Can we agree that before we put an integer version on Mahout that it needs
> some tender-loving care, and that we can still have high hopes?
>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

First steps towards the "loving care" (in my view) :-

a) Address the issues that Sean's brought up. I wasn't aware of (i) in that list else I would have ensured that they were addressed in 0.9.

b) Most of the backlog JIRAs (about 28 of them today) go all the way back to the initial stages of Mahout's evolution (pre 0.5).  Some of them may just have to be closed and resolved as "Will not do" or "Times Immemorial".

c) Fix algorithms that presently have half-baked code in them like Naive Bayes classifier (why is the thetaSummer commented out - either we don't need it or does it need fixing?),  Streaming KMeans - lacks adequate test coverage and still fails along the different paths and the same goes for other clustering algorithms too.

On Friday, February 28, 2014 3:30 PM, Andrew Musselman <an...@gmail.com> wrote:

>
> >
> > To be constructive, here are four items that seem more important for
> > something like "1.0.0" and are even a lot less work:
> >
> > - Use Hadoop .mapreduce API consistently
> > - Standardize input output formats of all jobs
> > - Remove use of deprecated code
> > - Clear even a third of the open JIRA backlog
> >
>
> Like i said, i believe the future is in moving ahead, build on strengths
> and finding unique proposition. I agree with the above in a sense  that
> out-of-core stuff that runs over MR could use some unification. I know you
> have done a lot in that department and I assume since you are writing to
> dev list, you are looking to help with that going forward. Cause if  not...
> the dev lists are not exactly created to be an open forum for just giving
> lectures.
>

Can we agree that before we put an integer version on Mahout that it needs
some tender-loving care, and that we can still have high hopes?

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

>
> >
> > To be constructive, here are four items that seem more important for
> > something like "1.0.0" and are even a lot less work:
> >
> > - Use Hadoop .mapreduce API consistently
> > - Standardize input output formats of all jobs
> > - Remove use of deprecated code
> > - Clear even a third of the open JIRA backlog
> >
>
> Like i said, i believe the future is in moving ahead, build on strengths
> and finding unique proposition. I agree with the above in a sense  that
> out-of-core stuff that runs over MR could use some unification. I know you
> have done a lot in that department and I assume since you are writing to
> dev list, you are looking to help with that going forward. Cause if  not...
> the dev lists are not exactly created to be an open forum for just giving
> lectures.
>

Can we agree that before we put an integer version on Mahout that it needs
some tender-loving care, and that we can still have high hopes?

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Fri, Feb 28, 2014 at 1:56 AM, Sean Owen <sr...@gmail.com> wrote:

> OK, your defeatism is my realism. Why has Negative Nancy intruded on
> this conversation?
>
>

> Your Reality May Vary. This seems like yellow-flag territory for an
> Apache project though, if this is representative of a wider reality.
> So a conversation about whole other projects' worth of new
> functionality feels quite disconnected -- red-flag territory.
>

Indeed it may.

(1) As far back as i could recollect tracking Mahout PMC, it has always
generally proclaimed that Mahout is about ML at scale. It was specifically
emphasized it was not about running ML on Hadoop. This ML coupling to MR
and Hadoop in particular seems to exist just in your head, but nobody
else's I talked to. Mahout has never been that dogmatic.

(2) Technology changes, and weaknesses of Mahout for most part stem from
heavily relying on aging approaches, and no amount of cleanup is going to
address that. Sad truth is that java, MR in general and Hadoop in
particular are increasingly poor fit for modern day ML. As a function of
it, I believe future holds that MR-based processing will gradually decay,
as well as direct java use for ML math. Like i said, the only thing that
Mahout is still being used for is the uniqueness of its goods (i.e. you
can't do it any other way today for free), not necessarily because of its
underpinnings. But hey, I can say the same thing about say R any day. And I
keep using both R and Mahout for that very reason.

(3) I view any project community as an evolutionary process. The 1.0
milestones IMO are pretty ephemeral if we measure them from maturity point
of view. I might very successfully argue (and ops in my last two companies
wholeheartedly agree with me) that e.g. CDH3 was much closer to "1.0" than
CDH4. Bottom line, it is neverending story. The fluff dies off, the golden
nuggets survive and evolve, even if thru other projects. Dust to dust and
so forth. No drama here whatsoever.

So is community. It ebbs and goes away. The hype is a strong motivating
force there. Look how long EJB delusion lasted. And any reasonable computer
scientist would take say Ceph over CDH any time of day thru independent
benchmarks for performance and usability, yet it is not happening en masse.
Hype is a strong force.

Another thing is... Mahout is not a good source for PhD dissertations. So
no university will ever help with it. We have to get by with what we have.
On a good day somebody would bring in a uniquely viable solution, and in
the end of the day that's the only thing that keeps things moving.
Differentiation in problem coverage. So no drama here either.

>
> To be constructive, here are four items that seem more important for
> something like "1.0.0" and are even a lot less work:
>
> - Use Hadoop .mapreduce API consistently
> - Standardize input output formats of all jobs
> - Remove use of deprecated code
> - Clear even a third of the open JIRA backlog
>

Like i said, i believe the future is in moving ahead, build on strengths
and finding unique proposition. I agree with the above in a sense  that
out-of-core stuff that runs over MR could use some unification. I know you
have done a lot in that department and I assume since you are writing to
dev list, you are looking to help with that going forward. Cause if  not...
the dev lists are not exactly created to be an open forum for just giving
lectures.

Re: Mahout 1.0 goals

Posted by Sean Owen <sr...@gmail.com>.

OK, your defeatism is my realism. Why has Negative Nancy intruded on
this conversation?

I have a view into many large Hadoop users. The feedback from the
minority that have tried Mahout is that it is inconsistent/unfinished
("a confederation of unrelated grad-school projects" as one put it),
buggy, and hard to use except as a few copied snippets of code. Ouch!

Only a handful that I'm aware of actually use it. Internally, there is
a perception that there is no community attention to most of the code
(see JIRA backlog). As a result -- software problems, community
issues, little demand -- it is almost certainly not going to be in our
next major packaging release, and was almost not in the current
forthcoming one.

Your Reality May Vary. This seems like yellow-flag territory for an
Apache project though, if this is representative of a wider reality.
So a conversation about whole other projects' worth of new
functionality feels quite disconnected -- red-flag territory.

To be constructive, here are four items that seem more important for
something like "1.0.0" and are even a lot less work:

- Use Hadoop .mapreduce API consistently
- Standardize input output formats of all jobs
- Remove use of deprecated code
- Clear even a third of the open JIRA backlog

(I still think it's fine to make different projects for quite
different ideas. Hadoop has another ML project, and is about to have
another other ML project. These good ideas might well better belong
there. Here, I think there is a big need for shoring up if it's even
going to survive to 1.0.)

On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:

> I think each of several
> other of these points are probably on their own several times the amount of
> work that has been put into this project over the past year so I'm
> wondering if this close to realistic as a to do list for 1.0 of this
> project.
>

That is means.  I think that everything on this list is possible in
relatively short order, but let's talk goals for a bit.

What is missing here?  What really doesn't matter?

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:

> I think each of several
> other of these points are probably on their own several times the amount of
> work that has been put into this project over the past year so I'm
> wondering if this close to realistic as a to do list for 1.0 of this
> project.
>

That is means.  I think that everything on this list is possible in
relatively short order, but let's talk goals for a bit.

What is missing here?  What really doesn't matter?

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sr...@gmail.com> wrote:

> And whether the goal here should look more like polish
> up and maintain.
>

That sounds like defeatism to me.  I think that new things are quite
possible here.

Re: Mahout 1.0 goals

Posted by Sean Owen <sr...@gmail.com>.

Yes. Wasn't questioning the part about algorithms. I think each of several
other of these points are probably on their own several times the amount of
work that has been put into this project over the past year so I'm
wondering if this is close to realistic as a to do list for 1.0 of this
project.

As a design brief for a new project, for sure. Spark or similar is kind of
half of these things already and could use work on adding things like model
import export. This is hijacking your point but wanted to agree with the
ideas and wonder out loud whether a lot of this effort belongs elsewhere
in the Apache tent. And whether the goal here should look more like polish
up and maintain.

On Feb 28, 2014 1:16 AM, "Ted Dunning" <te...@gmail.com> wrote:

> Well, Mahout has had (kinda sorta awful) classifiers and clustering from
> day one.  It isn't like the only goal is recommendations.
>
> The non-MR, non-Hadoop comments are really more user centric requirements
> than implementations.  It is important that users be able to start without
> a cluster and move relatively transparently into a fully scaled solution.
>
> Moreover, the Hadoop-tied map-reduce implementations that we have had up to
> now have been disastrously complex.  We really need something better.
>
>
>
>
> On Thu, Feb 27, 2014 at 5:11 PM, Sean Owen <sr...@gmail.com> wrote:
>
> > This sounds good, but sounds like a whole different project or projects.
> > For example where does R appear from, what non-MR implementations, etc,
> > what is the no Hadoop implementation?
> > On Feb 28, 2014 12:38 AM, "Ted Dunning" <te...@gmail.com> wrote:
> >
> > > I would like to start a conversation about where we want Mahout to be
> for
> > > 1.0.  Let's suspend for the moment the question of how to achieve the
> > > goals.  Instead, let's converge on what we really would like to have
> > happen
> > > and after that, let's talk about means that will get us there.
> > >
> > > Here are some goals that I think would be good in the area of numerics,
> > > classifiers and clustering:
> > >
> > > - runs with or without Hadoop
> > >
> > > - runs with or without map-reduce
> > >
> > > - includes (at least), regularized generalized linear models, k-means,
> > > random forest, distributed random forest, distributed neural networks
> > >
> > > - reasonably competitive speed against other implementations including
> > > graphlab, mlib and R.
> > >
> > > - interactive model building
> > >
> > > - models can be exported as code or data
> > >
> > > - simple programming model
> > >
> > > - programmable via Java or R
> > >
> > > - runs clustered or not
> > >
> > >
> > > What does everybody think?
> > >
> >
>

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

Well, Mahout has had (kinda sorta awful) classifiers and clustering from
day one.  It isn't like the only goal is recommendations.

The non-MR, non-Hadoop comments are really more user centric requirements
than implementations.  It is important that users be able to start without
a cluster and move relatively transparently into a fully scaled solution.

Moreover, the Hadoop-tied map-reduce implementations that we have had up to
now have been disastrously complex.  We really need something better.

On Thu, Feb 27, 2014 at 5:11 PM, Sean Owen <sr...@gmail.com> wrote:

> This sounds good, but sounds like a whole different project or projects.
> For example where does R appear from, what non-MR implementations, etc,
> what is the no Hadoop implementation?
> On Feb 28, 2014 12:38 AM, "Ted Dunning" <te...@gmail.com> wrote:
>
> > I would like to start a conversation about where we want Mahout to be for
> > 1.0.  Let's suspend for the moment the question of how to achieve the
> > goals.  Instead, let's converge on what we really would like to have
> happen
> > and after that, let's talk about means that will get us there.
> >
> > Here are some goals that I think would be good in the area of numerics,
> > classifiers and clustering:
> >
> > - runs with or without Hadoop
> >
> > - runs with or without map-reduce
> >
> > - includes (at least), regularized generalized linear models, k-means,
> > random forest, distributed random forest, distributed neural networks
> >
> > - reasonably competitive speed against other implementations including
> > graphlab, mlib and R.
> >
> > - interactive model building
> >
> > - models can be exported as code or data
> >
> > - simple programming model
> >
> > - programmable via Java or R
> >
> > - runs clustered or not
> >
> >
> > What does everybody think?
> >
>

Re: Mahout 1.0 goals

Posted by Sean Owen <sr...@gmail.com>.

This sounds good, but sounds like a whole different project or projects.
For example where does R appear from, what non-MR implementations, etc,
what is the no Hadoop implementation?
On Feb 28, 2014 12:38 AM, "Ted Dunning" <te...@gmail.com> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

Any word on this?

> On Mar 4, 2014, at 3:45 PM, Suneel Marthi <su...@yahoo.com> wrote:
> 
> I believe there was an announcement that went out last month about Apache SF embracing github. - http://jaxenter.com/apache-ups-github-integration-potential-49460.html
> 
> Guess this is more of an INFRA task than anything we need to do (like the recent setting up of svnpubsub for future releases).
> 
> I can create an INFRA jira and wait for INFRA to respond.
> 
> 
> 
> 
> On Tuesday, March 4, 2014 6:03 PM, Andrew Musselman <an...@gmail.com> wrote:
> 
> One of my big wishlist items is to move Mahout to Github for workflow and
> community features.
> 
> I remember there being discussion a while back but is there any way to move
> our Subversion repo to an Apache Git repo?
> 
> 
>> On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com> wrote:
>> 
>> I would like to start a conversation about where we want Mahout to be for
>> 1.0.  Let's suspend for the moment the question of how to achieve the
>> goals.  Instead, let's converge on what we really would like to have happen
>> and after that, let's talk about means that will get us there.
>> 
>> Here are some goals that I think would be good in the area of numerics,
>> classifiers and clustering:
>> 
>> - runs with or without Hadoop
>> 
>> - runs with or without map-reduce
>> 
>> - includes (at least), regularized generalized linear models, k-means,
>> random forest, distributed random forest, distributed neural networks
>> 
>> - reasonably competitive speed against other implementations including
>> graphlab, mlib and R.
>> 
>> - interactive model building
>> 
>> - models can be exported as code or data
>> 
>> - simple programming model
>> 
>> - programmable via Java or R
>> 
>> - runs clustered or not
>> 
>> 
>> What does everybody think?

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

Whatever works


On Tue, Mar 4, 2014 at 3:45 PM, Suneel Marthi <su...@yahoo.com> wrote:

> I believe there was an announcement that went out last month about Apache
> SF embracing github. -
> http://jaxenter.com/apache-ups-github-integration-potential-49460.html
>
> Guess this is more of an INFRA task than anything we need to do (like the
> recent setting up of svnpubsub for future releases).
>
> I can create an INFRA jira and wait for INFRA to respond.
>
>
>
>
> On Tuesday, March 4, 2014 6:03 PM, Andrew Musselman <
> andrew.musselman@gmail.com> wrote:
>
> One of my big wishlist items is to move Mahout to Github for workflow and
> community features.
>
> I remember there being discussion a while back but is there any way to move
> our Subversion repo to an Apache Git repo?
>
>
> On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I would like to start a conversation about where we want Mahout to be for
> > 1.0.  Let's suspend for the moment the question of how to achieve the
> > goals.  Instead, let's converge on what we really would like to have
> happen
> > and after that, let's talk about means that will get us there.
> >
> > Here are some goals that I think would be good in the area of numerics,
> > classifiers and clustering:
> >
> > - runs with or without Hadoop
> >
> > - runs with or without map-reduce
> >
> > - includes (at least), regularized generalized linear models, k-means,
> > random forest, distributed random forest, distributed neural networks
> >
> > - reasonably competitive speed against other implementations including
> > graphlab, mlib and R.
> >
> > - interactive model building
> >
> > - models can be exported as code or data
> >
> > - simple programming model
> >
> > - programmable via Java or R
> >
> > - runs clustered or not
> >
> >
> > What does everybody think?
> >
>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

I believe there was an announcement that went out last month about Apache SF embracing github. - http://jaxenter.com/apache-ups-github-integration-potential-49460.html

Guess this is more of an INFRA task than anything we need to do (like the recent setting up of svnpubsub for future releases).

I can create an INFRA jira and wait for INFRA to respond.

On Tuesday, March 4, 2014 6:03 PM, Andrew Musselman <an...@gmail.com> wrote:

One of my big wishlist items is to move Mahout to Github for workflow and
community features.

I remember there being discussion a while back but is there any way to move
our Subversion repo to an Apache Git repo?

On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

One of my big wishlist items is to move Mahout to Github for workflow and
community features.

I remember there being discussion a while back but is there any way to move
our Subversion repo to an Apache Git repo?


On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>

Re: Mahout 1.0 goals

Posted by Frank Scholten <fr...@frankscholten.nl>.

On Fri, Feb 28, 2014 at 1:37 AM, Ted Dunning <te...@gmail.com> wrote:

> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:

> - simple programming model
>

+1

>
> - programmable via Java or R
>

+1

>
> - runs clustered or not
>

I think both.

>
>
> What does everybody think?
>

Good thread. Some of the comments are a bit above my head when it comes to
specific topics yet here are my 2 cents.

I come from the perspective of a Java developer who likes to add text
clustering, classification and recommendation algorithms to an existing
application and data, whether it's smallish data from a SQL database or
larger amounts of data that requires distributed computing.

So ideally I would like to see

1. A JavaBeans API for every algorithm.
2. A unified way to vectorize data, no matter where it comes from (SQL
   database or NoSQL store, filesystem, Lucene index, etc.).
3. The option to use Hadoop or some other distributed computing
   framework to scale out.

I have some ideas on these topics but maybe that's better for another
thread.
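
The three wishes above could look roughly like the following. This is only a
minimal sketch under stated assumptions: BeanApiSketch, Vectorizer and
KMeansBean are hypothetical names that do not exist in Mahout, and the
in-memory k-means is a stand-in for whatever engine (Hadoop or otherwise)
would sit behind the same bean.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch only; none of these types are part of Mahout. */
public class BeanApiSketch {

    /** Wish 2: one way to turn any kind of record into a feature vector. */
    public interface Vectorizer<T> {
        double[] vectorize(T record);
    }

    /** Wish 1: an algorithm configured through plain getters and setters. */
    public static class KMeansBean {
        private int k = 2;
        private int maxIterations = 10;

        public int getK() { return k; }
        public void setK(int k) { this.k = k; }
        public int getMaxIterations() { return maxIterations; }
        public void setMaxIterations(int n) { this.maxIterations = n; }

        /** Runs Lloyd's k-means in memory and returns the k centroids. */
        public <T> double[][] fit(List<T> records, Vectorizer<T> vectorizer) {
            List<double[]> points = new ArrayList<>();
            for (T record : records) points.add(vectorizer.vectorize(record));
            if (points.size() < k) {
                throw new IllegalArgumentException("need at least k records");
            }
            int dim = points.get(0).length;
            // Naive initialisation: the first k points become the centroids.
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) centroids[c] = points.get(c).clone();

            for (int iter = 0; iter < maxIterations; iter++) {
                double[][] sums = new double[k][dim];
                int[] counts = new int[k];
                for (double[] p : points) {
                    int best = nearest(centroids, p);
                    counts[best]++;
                    for (int d = 0; d < dim; d++) sums[best][d] += p[d];
                }
                for (int c = 0; c < k; c++) {
                    if (counts[c] == 0) continue; // empty cluster keeps its centroid
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
            return centroids;
        }

        /** Index of the centroid with the smallest squared Euclidean distance. */
        private static int nearest(double[][] centroids, double[] p) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dist = 0;
                for (int d = 0; d < p.length; d++) {
                    double diff = p[d] - centroids[c][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            return best;
        }
    }

    public static void main(String[] args) {
        KMeansBean kmeans = new KMeansBean();
        kmeans.setK(2);
        kmeans.setMaxIterations(5);
        // The records could equally come from SQL rows or a Lucene index;
        // only the Vectorizer would change.
        List<String> rows = Arrays.asList("0,0", "10,10", "0,2", "10,12");
        Vectorizer<String> csv = row -> {
            String[] parts = row.split(",");
            double[] v = new double[parts.length];
            for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
            return v;
        };
        for (double[] c : kmeans.fit(rows, csv)) System.out.println(Arrays.toString(c));
    }
}
```

The point of the split is that the data source only shows up in the
Vectorizer, while a scaled-out backend (wish 3) could be a second
implementation behind the same setters and the same fit call.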

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@googlemail.com>.

Hi Giorgio,

A good first step would be to explore the current API. Could you create a
write-up for our wiki on how to use a clustering/classification algorithm of
your choice? A small example should be sufficient. This could be used as a
basis for discussing API changes.
On 03.03.2014 16:29, "Giorgio Zoppi" <gi...@gmail.com> wrote:

> I would like to help in the api creation. How do I start for being
> productive with mahout?
> Best Regards,
> Giorgio
>
>
> 2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:
>
> > I would like to start a conversation about where we want Mahout to be for
> > 1.0.  Let's suspend for the moment the question of how to achieve the
> > goals.  Instead, let's converge on what we really would like to have
> happen
> > and after that, let's talk about means that will get us there.
> >
> > Here are some goals that I think would be good in the area of numerics,
> > classifiers and clustering:
> >
> > - runs with or without Hadoop
> >
> > - runs with or without map-reduce
> >
> > - includes (at least), regularized generalized linear models, k-means,
> > random forest, distributed random forest, distributed neural networks
> >
> > - reasonably competitive speed against other implementations including
> > graphlab, mlib and R.
> >
> > - interactive model building
> >
> > - models can be exported as code or data
> >
> > - simple programming model
> >
> > - programmable via Java or R
> >
> > - runs clustered or not
> >
> >
> > What does everybody think?
> >
>
>
>
> --
> Quiero ser el rayo de sol que cada día te despierta
> para hacerte respirar y vivir en me.
> "Favola -Moda".
>

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@apache.org>.

That is a nice start. Could you evolve it into a short article that
describes how to run this code on a real example?

Best,
Sebastian

On 03/04/2014 05:05 PM, Yexi Jiang wrote:
> Sebastian,
>
> In one of my recent projects, I used the Naive Bayes for classification, so
> I gave a write-up on this algorithm. You can find the document at
>
> https://docs.google.com/document/d/1h7N0GmIKe-KG64uulPMPzkp00nowM2-HDQ48c4PIhbc/edit?usp=sharing
> .
>
> Feedback is welcome.
>
>
> 2014-03-04 3:57 GMT-05:00 Sebastian Schelter <ss...@googlemail.com>:
>
>> Yexi, could you do a small write-up, analogous to what I proposed for
>> Giorgio? Make sure to pick a different algorithm though.
>>
>> --sebastian
>> On 03.03.2014 16:54, "Yexi Jiang" <ye...@gmail.com> wrote:
>>
>>> I'm also happy to help.
>>>
>>>
>>> 2014-03-03 10:29 GMT-05:00 Giorgio Zoppi <gi...@gmail.com>:
>>>
>>>> I would like to help in the api creation. How do I start for being
>>>> productive with mahout?
>>>> Best Regards,
>>>> Giorgio
>>>>
>>>>
>>>> 2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:
>>>>
>>>>> I would like to start a conversation about where we want Mahout to be
>>> for
>>>>> 1.0.  Let's suspend for the moment the question of how to achieve the
>>>>> goals.  Instead, let's converge on what we really would like to have
>>>> happen
>>>>> and after that, let's talk about means that will get us there.
>>>>>
>>>>> Here are some goals that I think would be good in the area of
>> numerics,
>>>>> classifiers and clustering:
>>>>>
>>>>> - runs with or without Hadoop
>>>>>
>>>>> - runs with or without map-reduce
>>>>>
>>>>> - includes (at least), regularized generalized linear models,
>> k-means,
>>>>> random forest, distributed random forest, distributed neural networks
>>>>>
>>>>> - reasonably competitive speed against other implementations
>> including
>>>>> graphlab, mlib and R.
>>>>>
>>>>> - interactive model building
>>>>>
>>>>> - models can be exported as code or data
>>>>>
>>>>> - simple programming model
>>>>>
>>>>> - programmable via Java or R
>>>>>
>>>>> - runs clustered or not
>>>>>
>>>>>
>>>>> What does everybody think?
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Quiero ser el rayo de sol que cada día te despierta
>>>> para hacerte respirar y vivir en me.
>>>> "Favola -Moda".
>>>>
>>>
>>>
>>>
>>> --
>>> ------
>>> Yexi Jiang,
>>> ECS 251,  yjian004@cs.fiu.edu
>>> School of Computer and Information Science,
>>> Florida International University
>>> Homepage: http://users.cis.fiu.edu/~yjian004/
>>>
>>
>
>
>

Re: Mahout 1.0 goals

Posted by Yexi Jiang <ye...@gmail.com>.

Sebastian,

In one of my recent projects, I used Naive Bayes for classification, so
I did a write-up on this algorithm. You can find the document at

https://docs.google.com/document/d/1h7N0GmIKe-KG64uulPMPzkp00nowM2-HDQ48c4PIhbc/edit?usp=sharing
.

Feedback is welcome.


2014-03-04 3:57 GMT-05:00 Sebastian Schelter <ss...@googlemail.com>:

> Yexi, could you do a small write-up, analogous to what I proposed for
> Giorgio? Make sure to pick a different algorithm though.
>
> --sebastian
> On 03.03.2014 16:54, "Yexi Jiang" <ye...@gmail.com> wrote:
>
> > I'm also happy to help.
> >
> >
> > 2014-03-03 10:29 GMT-05:00 Giorgio Zoppi <gi...@gmail.com>:
> >
> > > I would like to help in the api creation. How do I start for being
> > > productive with mahout?
> > > Best Regards,
> > > Giorgio
> > >
> > >
> > > 2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:
> > >
> > > > I would like to start a conversation about where we want Mahout to be
> > for
> > > > 1.0.  Let's suspend for the moment the question of how to achieve the
> > > > goals.  Instead, let's converge on what we really would like to have
> > > happen
> > > > and after that, let's talk about means that will get us there.
> > > >
> > > > Here are some goals that I think would be good in the area of
> numerics,
> > > > classifiers and clustering:
> > > >
> > > > - runs with or without Hadoop
> > > >
> > > > - runs with or without map-reduce
> > > >
> > > > - includes (at least), regularized generalized linear models,
> k-means,
> > > > random forest, distributed random forest, distributed neural networks
> > > >
> > > > - reasonably competitive speed against other implementations
> including
> > > > graphlab, mlib and R.
> > > >
> > > > - interactive model building
> > > >
> > > > - models can be exported as code or data
> > > >
> > > > - simple programming model
> > > >
> > > > - programmable via Java or R
> > > >
> > > > - runs clustered or not
> > > >
> > > >
> > > > What does everybody think?
> > > >
> > >
> > >
> > >
> > > --
> > > Quiero ser el rayo de sol que cada día te despierta
> > > para hacerte respirar y vivir en me.
> > > "Favola -Moda".
> > >
> >
> >
> >
> > --
> > ------
> > Yexi Jiang,
> > ECS 251,  yjian004@cs.fiu.edu
> > School of Computer and Information Science,
> > Florida International University
> > Homepage: http://users.cis.fiu.edu/~yjian004/
> >
>



-- 
------
Yexi Jiang,
ECS 251,  yjian004@cs.fiu.edu
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/

Re: Mahout 1.0 goals

Posted by Sebastian Schelter <ss...@googlemail.com>.

Yexi, could you do a small write-up, analogous to what I proposed for
Giorgio? Make sure to pick a different algorithm though.

--sebastian
On 03.03.2014 16:54, "Yexi Jiang" <ye...@gmail.com> wrote:

> I'm also happy to help.
>
>
> 2014-03-03 10:29 GMT-05:00 Giorgio Zoppi <gi...@gmail.com>:
>
> > I would like to help in the api creation. How do I start for being
> > productive with mahout?
> > Best Regards,
> > Giorgio
> >
> >
> > 2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:
> >
> > > I would like to start a conversation about where we want Mahout to be
> for
> > > 1.0.  Let's suspend for the moment the question of how to achieve the
> > > goals.  Instead, let's converge on what we really would like to have
> > happen
> > > and after that, let's talk about means that will get us there.
> > >
> > > Here are some goals that I think would be good in the area of numerics,
> > > classifiers and clustering:
> > >
> > > - runs with or without Hadoop
> > >
> > > - runs with or without map-reduce
> > >
> > > - includes (at least), regularized generalized linear models, k-means,
> > > random forest, distributed random forest, distributed neural networks
> > >
> > > - reasonably competitive speed against other implementations including
> > > graphlab, mlib and R.
> > >
> > > - interactive model building
> > >
> > > - models can be exported as code or data
> > >
> > > - simple programming model
> > >
> > > - programmable via Java or R
> > >
> > > - runs clustered or not
> > >
> > >
> > > What does everybody think?
> > >
> >
> >
> >
> > --
> > Quiero ser el rayo de sol que cada día te despierta
> > para hacerte respirar y vivir en me.
> > "Favola -Moda".
> >
>
>
>
> --
> ------
> Yexi Jiang,
> ECS 251,  yjian004@cs.fiu.edu
> School of Computer and Information Science,
> Florida International University
> Homepage: http://users.cis.fiu.edu/~yjian004/
>

Re: Mahout 1.0 goals

Posted by Yexi Jiang <ye...@gmail.com>.

I'm also happy to help.


2014-03-03 10:29 GMT-05:00 Giorgio Zoppi <gi...@gmail.com>:

> I would like to help in the api creation. How do I start for being
> productive with mahout?
> Best Regards,
> Giorgio
>
>
> 2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:
>
> > I would like to start a conversation about where we want Mahout to be for
> > 1.0.  Let's suspend for the moment the question of how to achieve the
> > goals.  Instead, let's converge on what we really would like to have
> happen
> > and after that, let's talk about means that will get us there.
> >
> > Here are some goals that I think would be good in the area of numerics,
> > classifiers and clustering:
> >
> > - runs with or without Hadoop
> >
> > - runs with or without map-reduce
> >
> > - includes (at least), regularized generalized linear models, k-means,
> > random forest, distributed random forest, distributed neural networks
> >
> > - reasonably competitive speed against other implementations including
> > graphlab, mlib and R.
> >
> > - interactive model building
> >
> > - models can be exported as code or data
> >
> > - simple programming model
> >
> > - programmable via Java or R
> >
> > - runs clustered or not
> >
> >
> > What does everybody think?
> >
>
>
>
> --
> Quiero ser el rayo de sol que cada día te despierta
> para hacerte respirar y vivir en me.
> "Favola -Moda".
>



-- 
------
Yexi Jiang,
ECS 251,  yjian004@cs.fiu.edu
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/

Re: Mahout 1.0 goals

Posted by Giorgio Zoppi <gi...@gmail.com>.

I would like to help with the API creation. How do I get started and become
productive with Mahout?
Best Regards,
Giorgio


2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>



-- 
Quiero ser el rayo de sol que cada día te despierta
para hacerte respirar y vivir en me.
"Favola -Moda".

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

Not really.  

> On Mar 12, 2014, at 11:26, Giorgio Zoppi <gi...@gmail.com> wrote:
> 
> Hello Ted
> is there a C api for Mahout runtime?
> Giorgio.
> 
> 
> 2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:
> 
>> I would like to start a conversation about where we want Mahout to be for
>> 1.0.  Let's suspend for the moment the question of how to achieve the
>> goals.  Instead, let's converge on what we really would like to have happen
>> and after that, let's talk about means that will get us there.
>> 
>> Here are some goals that I think would be good in the area of numerics,
>> classifiers and clustering:
>> 
>> - runs with or without Hadoop
>> 
>> - runs with or without map-reduce
>> 
>> - includes (at least), regularized generalized linear models, k-means,
>> random forest, distributed random forest, distributed neural networks
>> 
>> - reasonably competitive speed against other implementations including
>> graphlab, mlib and R.
>> 
>> - interactive model building
>> 
>> - models can be exported as code or data
>> 
>> - simple programming model
>> 
>> - programmable via Java or R
>> 
>> - runs clustered or not
>> 
>> 
>> What does everybody think?
>> 
> 
> 
> 
> -- 
> Quiero ser el rayo de sol que cada día te despierta
> para hacerte respirar y vivir en me.
> "Favola -Moda".

Re: Mahout 1.0 goals

Posted by Giorgio Zoppi <gi...@gmail.com>.

Hello Ted,
is there a C API for the Mahout runtime?
Giorgio.


2014-02-28 1:37 GMT+01:00 Ted Dunning <te...@gmail.com>:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>



-- 
Quiero ser el rayo de sol que cada día te despierta
para hacerte respirar y vivir en me.
"Favola -Moda".

Re: Mahout 1.0 goals

Posted by Andrew Musselman <an...@gmail.com>.

I agree with b) and c); haven't used seq2sparse enough to grok a).


On Thu, Feb 27, 2014 at 6:30 PM, Suneel Marthi <su...@yahoo.com> wrote:

>
> Yesterday saw the announcement of http://deeplearning4j.org, which offers
> various neural network implementations on Hadoop 2/JBlas and had been
> talked about in one of the other discussion threads on this mailing list.
> Do we want to duplicate a similar effort in Mahout?
>
> In addition to what Dmitriy's already outlined below, I may add that one
> of the bottlenecks (in my experience) in Mahout's processing pipeline is
> 'seq2sparse'.
>
>  a) Optimize seq2sparse to handle incremental dictionary tokens
>      - support for Deterministic Finite Automata to speed up text processing
>      - not using StringTuples so much in the tokenization (may result in
>        some speedup)
>      - explore using Lucene 4.7 in-memory term dictionaries; this may
>        improve the performance substantially
>
>      Even better, why not use Lucene indices themselves as document
>      repositories, as opposed to what's being done now?
>
> b) Stabilize the existing Clustering algorithms - except for Simple KMeans
> the others have issues once we deviate from the 'Happy Sunday Path'
> implementation and lack adequate test coverage.
> c) RESTful interfaces for invoking classifiers/clustering.
>
>
>
>
>
>
> On Thursday, February 27, 2014 9:10 PM, Dmitriy Lyubimov <
> dlieu.7@gmail.com> wrote:
>
> If we approach this from a purely "marketing" standpoint, I would look at it
> from two angles: why Mahout is used, and why it is not used.
>
> Mahout is not used because it is a collection of methods that are fairly
> non-uniform in their API, especially the embedded API, and it generally gives
> zero encouragement to build on top of it or to incorporate it into larger
> customizable models. I.e. it lacks the semantic explicitness needed for quick
> prototyping, and stitching things together is next to impossible.
>
>
> Yet Mahout is used in spite of the above because it has some pretty unique
> solvers in the areas of linear algebra and topical text analysis. But I
> would dare say it is not used because of, e.g., GLM regressions.
>
> I personally also use Mahout in preference to something like Breeze because
> it has had sparse linear algebra support, both in-core and out-of-core, from
> the very beginning, and it fits naturally, unlike any other package I have
> ever looked at, R included, by the way.
>
> But I find myself heavily disassembling Mahout into nuts and bolts rather
> than using it exactly the way e.g. MIA prescribes.
>
> Bottom line: the primary issues, preliminarily, are ease of use,
> embedding/scripting, ease of customization, and uniformity of APIs.
>
> (1) Take the semantic explicitness and scripting issue. Well, I guess that's
> where the R part comes from, not because we just want to run R. I would
> clear this up right away -- I don't support any sort of R integration. And not
> really for lack of trying -- I have created a few R front ends for a
> bunch of distributed applications, and also created projects that run R in
> the backend (I wrote CrunchR more than a year ago, which is the same thing for
> Crunch as what SparkR is for Spark; and yet another MR framework running R
> in the backend; and I also tried to run things with HadoopR). I have developed
> a pretty strong opinion that R just doesn't mix with distributed
> frameworks, mostly because of the performance penalties (if you lose
> $5 per day in performance on a single machine it may be OK, but on 100
> machines you lose $500 a day -- and mid-size companies in my experience
> are not susceptible to the "let's solve it at any HW cost" doctrine, much as
> it is generally believed to be the other way around).
>
> Anyway, on the R topic, I don't see it as a solution for any sort of
> semantically explicit driver and customizer technology. There's neither
> demand for it nor willingness among corporate bosses to go that route. I have
> grown pretty opinionated on that issue.
>
> But you don't need R to address semantic explicitness, customization and
> ease of integration/scripting. Pragmatically, I see Scala and a carefully
> crafted Scala DSL as the underlying mechanism for achieving this. Also,
> internally I use Scala scripting a lot, and it is really easy to build a shell
> interpreter for it (just like Spark builds a customized shell), so one
> doesn't even necessarily need to compile these things.
>
> Bottom line: ideally, a distributed solver implementation should look more
> like Matlab than Java. And I would measure that goal along the lines of
> Evan Sparks' talks (i.e. in lines of code and explicitness needed to script
> out a well-known method).
>
> See, you forced my hand to discuss solutions ("how")  :)
>
>
> (2) On the issue of minimally supported algorithms. Again, I would not see
> MLlib as a prototype there. Given enough semantic explicitness, virtually
> any data scientist would script out ALS in their sleep. And every second
> one would script out weighted ALS (so-called "implicit feedback"). I view
> those algorithms not as a goal but rather as a guinea pig for validating
> the semantic value of the ML environment and its APIs. I would port stronger
> solvers into the new semantic ML environment over Spark rather than trying
> to cover the very "basics".
>
> Pragmatically i would say it would be interesting and pragmatical (for me)
> to have LDA/LSA/sparse PCA solvers ported. I would also port all clustering
> we have (albeit may be not exactly following the methodology).
>
> I would be also interested in giving foundation for customized hierarchical
> solutions along the lines of RLFM with various customizations including in
> particular temporal weighing of inference and customized inference of
> informative priors there. Computational Bayesian methods along the lines of
> MCEM and MCMC are said to provide a very accurate solutions here.The latter
> class of models IMO are much more interesting for practitioners of
> recommendations than pure rigid uncustomizable ALS class of models, weighed
> or not. At least Deepak Agarwal sounds very convincing in his talks.
>
>
> (3) on the issue of performance, i guess by using Spark bindings dsl you
> can't do any worse than mllib. Perhaps we could include also support for
> Dense JBlas matrices under hood of Matrix API if of interested. Also i am
> hearing using GPU libraries lately is becoming also very popular for
> performance reasons, up to 300x lin alg speed ups are reported. There are
> some fancy thoughts about cost-based optimization of algeraic expressions
> for distributed pipelines, but for the first start I will do just very
> simple physical plan substitutions (something like if i directly see A'A as
> a part of expression, or if A'B' product has small geometry then of course
> i'd rather do (BA)' etc.
>
> But it has potential to do more while retaining absolute degree of manually
> forced execution (thru forced checkpoints). It's just i would stop what i
> pragmatically need to script out distributed SSVD at this point.
>
> (4) but in general i would say the scope of your issues sounds like
> something that would close a gap between 0.5 and 1.0 rather than 0.9 and
> 1.0.
> -d
>
>
>
> On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I would like to start a conversation about where we want Mahout to be for
> > 1.0.  Let's suspend for the moment the question of how to achieve the
> > goals.  Instead, let's converge on what we really would like to have
> happen
> > and after that, let's talk about means that will get us there.
> >
> > Here are some goals that I think would be good in the area of numerics,
> > classifiers and clustering:
> >
> > - runs with or without Hadoop
> >
> > - runs with or without map-reduce
> >
> > - includes (at least), regularized generalized linear models, k-means,
> > random forest, distributed random forest, distributed neural networks
> >
> > - reasonably competitive speed against other implementations including
> > graphlab, mlib and R.
> >
> > - interactive model building
> >
> > - models can be exported as code or data
> >
> > - simple programming model
> >
> > - programmable via Java or R
> >
> > - runs clustered or not
> >
> >
> > What does everybody think?
> >
>

Re: Mahout 1.0 goals

Posted by Suneel Marthi <su...@yahoo.com>.

http://deeplearning4j.org was announced yesterday; it provides various neural network implementations on Hadoop 2/JBlas and had been talked about in one of the other discussion threads on this mailing list. Do we want to duplicate a similar effort in Mahout?

In addition to what Dmitriy has already outlined below, I would add that one of the bottlenecks (in my experience) in Mahout's processing pipeline is 'seq2sparse'.

 a) Optimize seq2sparse to handle incremental dictionary tokens
     - Support for Deterministic Finite Automaton to speed up text processing.
     - Not using StringTuples so much in the tokenization (may result in some speedup).
     - Explore using Lucene 4.7 in-memory term dictionaries; this may improve the performance substantially.

     Even better, why not use Lucene indices themselves as document repositories, as opposed to what's being done now?

 b) Stabilize the existing Clustering algorithms - except for Simple KMeans, the others have issues once we deviate from the 'Happy Sunday Path' implementation, and they lack adequate test coverage.
 c) RESTful interfaces for invoking classifiers/clustering.
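
To make the "incremental dictionary" idea in (a) a bit more concrete, here is a minimal sketch in Python. It is not how seq2sparse works today; the class IncrementalDictionary and its methods are hypothetical names made up for illustration. The only point is that terms seen in earlier batches keep their ids, so a new batch of documents can be vectorized without rebuilding the dictionary.

```python
class IncrementalDictionary:
    """Hypothetical term dictionary that grows as new tokens arrive."""

    def __init__(self):
        self.term_to_id = {}

    def id_for(self, term):
        # Existing terms keep their id; unseen terms get the next free id.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.term_to_id)
        return self.term_to_id[term]

    def vectorize(self, tokens):
        """Return a sparse term-frequency vector as {term_id: count}."""
        vector = {}
        for token in tokens:
            term_id = self.id_for(token)
            vector[term_id] = vector.get(term_id, 0) + 1
        return vector


dictionary = IncrementalDictionary()
first_batch = dictionary.vectorize("mahout runs on hadoop".split())
# The second batch reuses ids 0..2 and only adds one new id for "spark".
second_batch = dictionary.vectorize("mahout runs on spark".split())
```

A real implementation would also have to persist the dictionary and deal with document-frequency statistics changing over time, which this sketch ignores.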

On Thursday, February 27, 2014 9:10 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

If we approach this form purely "marketing" standpoint, i would look at it
from two points: why is Mahout used, and why it is not used.

Mahout is not used because it is a collection of methods that are fairly
non-uniform in their api, especially embedded api, and generaly has zero
encouragement to be developed on top on and incorporated in yet larger
customizable models. I.e. it lacks semantic explicitness of quick
prototyping, and stitching things together is next to impossible.

Yet Mahout is used in spite of the above because it has some pretty unique
solvers in the area of linear algebra and text topical analysis. But I
would dare to say not e.g. because of GLM regressions.

I personally also use Mahout e.g. in favor of something like breeze because
it has sparse linalg support, both in-core and out-of-core, from the very
beginning and it fits naturally unlike in any other package i ever looked
at, R including btw.

But i find myself heavily disassembling Mahouts into parts and bolts rather
than exactly how e.g. MIA prescribes it.

Bottom line here, preliminarily primary issues are ease of use,
embedment/scripting, ease of customization, uniformity of apis.

(1) Take semantic explicitness and scripting issue. Well i guess that's
where the R part comes from, not because we just want to run R. I would
clear it right away -- i don't support any sort of R integration. And not
really because of lack of trying -- I have created a few R front ends for a
bunch of distributed applications, and also created projects that run R in
the backend (I wrote CrunchR more than year ago which is the same thing for
Crunch as what SparkR is for Spark; and yet-another MR framework running R
in backend; and also tried to run things with HadoopR). And have developed
a pretty strong opinion that R just doesn't mix with distributed
frameworks, mostly because of the performance penalities (and if you loose
$5 per day in performance on a single machine it may be ok, but in 100
machines one loses $500 a day -- and mid size companies in my experience
are not succeptible to 'let's solve it at any HW cost" doctrine, much as
it is generally believed the other way around.

Anyway, on R toptic i don't see it as a solution for any sort of
semantically explicit driver and customizer technology. There's neither
demand nor willingness of corporate bosses to go that route. I grew pretty
opinionated on that issue.

But you don't need R to address semantical explicitness, customization and
ease of integration/scripting. Pragmatically, i see scala and carefully
crafted scala dsl as the underlying mechanism for achieving this. Also,
internally i use scala scripting a lot and it is really easy to build shell
interpreter for it (just like spark builds a customized shell), so one
doesn't even need to compile these things necessarily.

Bottom line, ideally distributed solver implementation should look more
like matlab than java. And I would measure that goal along the lines of
Evan Sparks' talks (i.e. in lines of code and explicitness needed to script
out a well known method).

See, you forced my hand to discuss solutions ("how")  :)

(2) on the issue of minimally supported algorithms. Again, i would not see
mlib as a prototype there.Given enough semantical explicitness, virtually
any data scientist would script out ALS in their sleep. And every second
one would script out weighted ALS (so called "implicit feedback). I view
those algorithms not as a goal but rather as a guinea pig for validating
semantical value of ML environment and apis. I would port stronger solvers
into the new semantic ML environment over Spark rather than trying to cover
the very "basics".

Pragmatically i would say it would be interesting and pragmatical (for me)
to have LDA/LSA/sparse PCA solvers ported. I would also port all clustering
we have (albeit may be not exactly following the methodology).

I would be also interested in giving foundation for customized hierarchical
solutions along the lines of RLFM with various customizations including in
particular temporal weighing of inference and customized inference of
informative priors there. Computational Bayesian methods along the lines of
MCEM and MCMC are said to provide a very accurate solutions here.The latter
class of models IMO are much more interesting for practitioners of
recommendations than pure rigid uncustomizable ALS class of models, weighed
or not. At least Deepak Agarwal sounds very convincing in his talks.

(3) on the issue of performance, i guess by using Spark bindings dsl you
can't do any worse than mllib. Perhaps we could include also support for
Dense JBlas matrices under hood of Matrix API if of interested. Also i am
hearing using GPU libraries lately is becoming also very popular for
performance reasons, up to 300x lin alg speed ups are reported. There are
some fancy thoughts about cost-based optimization of algeraic expressions
for distributed pipelines, but for the first start I will do just very
simple physical plan substitutions (something like if i directly see A'A as
a part of expression, or if A'B' product has small geometry then of course
i'd rather do (BA)' etc.

But it has potential to do more while retaining absolute degree of manually
forced execution (thru forced checkpoints). It's just i would stop what i
pragmatically need to script out distributed SSVD at this point.

(4) but in general i would say the scope of your issues sounds like
something that would close a gap between 0.5 and 1.0 rather than 0.9 and
1.0.
-d

On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>

Re: Mahout 1.0 goals

Posted by Ted Dunning <te...@gmail.com>.

Yes.  This is a big and important addition.


On Thu, Feb 27, 2014 at 6:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> (5) Another thing i would suggest is to look at feature prep
> standartization -- outlier detection, scaling, hash-tricking etc. etc.
> Again, with abilities to customize, or it would be useless.
>
>
> On Thu, Feb 27, 2014 at 6:08 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> >
> >
> > If we approach this form purely "marketing" standpoint, i would look at
> it
> > from two points: why is Mahout used, and why it is not used.
> >
> > Mahout is not used because it is a collection of methods that are fairly
> > non-uniform in their api, especially embedded api, and generaly has zero
> > encouragement to be developed on top on and incorporated in yet larger
> > customizable models. I.e. it lacks semantic explicitness of quick
> > prototyping, and stitching things together is next to impossible.
> >
> >
> > Yet Mahout is used in spite of the above because it has some pretty
> unique
> > solvers in the area of linear algebra and text topical analysis. But I
> > would dare to say not e.g. because of GLM regressions.
> >
> > I personally also use Mahout e.g. in favor of something like breeze
> > because it has sparse linalg support, both in-core and out-of-core, from
> > the very beginning and it fits naturally unlike in any other package i
> ever
> > looked at, R including btw.
> >
> > But i find myself heavily disassembling Mahouts into parts and bolts
> > rather than exactly how e.g. MIA prescribes it.
> >
> > Bottom line here, preliminarily primary issues are ease of use,
> > embedment/scripting, ease of customization, uniformity of apis.
> >
> > (1) Take semantic explicitness and scripting issue. Well i guess that's
> > where the R part comes from, not because we just want to run R. I would
> > clear it right away -- i don't support any sort of R integration. And not
> > really because of lack of trying -- I have created a few R front ends
> for a
> > bunch of distributed applications, and also created projects that run R
> in
> > the backend (I wrote CrunchR more than year ago which is the same thing
> for
> > Crunch as what SparkR is for Spark; and yet-another MR framework running
> R
> > in backend; and also tried to run things with HadoopR). And have
> developed
> > a pretty strong opinion that R just doesn't mix with distributed
> > frameworks, mostly because of the performance penalities (and if you
> loose
> > $5 per day in performance on a single machine it may be ok, but in 100
> > machines one loses $500 a day -- and mid size companies in my experience
> >  are not succeptible to 'let's solve it at any HW cost" doctrine, much as
> > it is generally believed the other way around.
> >
> > Anyway, on R toptic i don't see it as a solution for any sort of
> > semantically explicit driver and customizer technology. There's neither
> > demand nor willingness of corporate bosses to go that route. I grew
> pretty
> > opinionated on that issue.
> >
> > But you don't need R to address semantical explicitness, customization
> and
> > ease of integration/scripting. Pragmatically, i see scala and carefully
> > crafted scala dsl as the underlying mechanism for achieving this. Also,
> > internally i use scala scripting a lot and it is really easy to build
> shell
> > interpreter for it (just like spark builds a customized shell), so one
> > doesn't even need to compile these things necessarily.
> >
> > Bottom line, ideally distributed solver implementation should look more
> > like matlab than java. And I would measure that goal along the lines of
> > Evan Sparks' talks (i.e. in lines of code and explicitness needed to
> script
> > out a well known method).
> >
> > See, you forced my hand to discuss solutions ("how")  :)
> >
> >
> > (2) on the issue of minimally supported algorithms. Again, i would not
> see
> > mlib as a prototype there.Given enough semantical explicitness, virtually
> > any data scientist would script out ALS in their sleep. And every second
> > one would script out weighted ALS (so called "implicit feedback). I view
> > those algorithms not as a goal but rather as a guinea pig for validating
> > semantical value of ML environment and apis. I would port stronger
> solvers
> > into the new semantic ML environment over Spark rather than trying to
> cover
> > the very "basics".
> >
> > Pragmatically i would say it would be interesting and pragmatical (for
> me)
> > to have LDA/LSA/sparse PCA solvers ported. I would also port all
> clustering
> > we have (albeit may be not exactly following the methodology).
> >
> > I would be also interested in giving foundation for customized
> > hierarchical solutions along the lines of RLFM with various
> customizations
> > including in particular temporal weighing of inference and customized
> > inference of informative priors there. Computational Bayesian methods
> along
> > the lines of MCEM and MCMC are said to provide a very accurate solutions
> > here.The latter class of models IMO are much more interesting for
> > practitioners of recommendations than pure rigid uncustomizable ALS class
> > of models, weighed or not. At least Deepak Agarwal sounds very convincing
> > in his talks.
> >
> >
> > (3) on the issue of performance, i guess by using Spark bindings dsl you
> > can't do any worse than mllib. Perhaps we could include also support for
> > Dense JBlas matrices under hood of Matrix API if of interested. Also i am
> > hearing using GPU libraries lately is becoming also very popular for
> > performance reasons, up to 300x lin alg speed ups are reported. There are
> > some fancy thoughts about cost-based optimization of algeraic expressions
> > for distributed pipelines, but for the first start I will do just very
> > simple physical plan substitutions (something like if i directly see A'A
> as
> > a part of expression, or if A'B' product has small geometry then of
> course
> > i'd rather do (BA)' etc.
> >
> > But it has potential to do more while retaining absolute degree of
> > manually forced execution (thru forced checkpoints). It's just i would
> stop
> > what i pragmatically need to script out distributed SSVD at this point.
> >
> > (4) but in general i would say the scope of your issues sounds like
> > something that would close a gap between 0.5 and 1.0 rather than 0.9 and
> > 1.0.
> > -d
> >
> >
> >
> > On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <ted.dunning@gmail.com
> >wrote:
> >
> >> I would like to start a conversation about where we want Mahout to be
> for
> >> 1.0.  Let's suspend for the moment the question of how to achieve the
> >> goals.  Instead, let's converge on what we really would like to have
> >> happen
> >> and after that, let's talk about means that will get us there.
> >>
> >> Here are some goals that I think would be good in the area of numerics,
> >> classifiers and clustering:
> >>
> >> - runs with or without Hadoop
> >>
> >> - runs with or without map-reduce
> >>
> >> - includes (at least), regularized generalized linear models, k-means,
> >> random forest, distributed random forest, distributed neural networks
> >>
> >> - reasonably competitive speed against other implementations including
> >> graphlab, mlib and R.
> >>
> >> - interactive model building
> >>
> >> - models can be exported as code or data
> >>
> >> - simple programming model
> >>
> >> - programmable via Java or R
> >>
> >> - runs clustered or not
> >>
> >>
> >> What does everybody think?
> >>
> >
> >
>

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

(5) Another thing I would suggest is to look at feature prep
standardization -- outlier detection, scaling, hash-tricking, etc.
Again, with the ability to customize, or it would be useless.
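
As a rough illustration of what customizable feature prep could look like, here is a small Python sketch of two of the steps named above (the hashing trick and scaling), composed through a tiny pipeline helper. All function names (hash_features, standardize, make_pipeline) are made up for this example and do not refer to any existing Mahout API; crc32 is used only because it is stable across runs.

```python
import zlib


def hash_features(tokens, num_buckets):
    """Hashing trick: map tokens into a fixed-size count vector."""
    vector = [0.0] * num_buckets
    for token in tokens:
        # crc32 is an illustrative choice of a deterministic hash.
        bucket = zlib.crc32(token.encode("utf-8")) % num_buckets
        vector[bucket] += 1.0
    return vector


def standardize(column):
    """Scale values to zero mean and unit (population) variance."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = variance ** 0.5
    if std == 0.0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def make_pipeline(*steps):
    """Compose prep steps so users can swap or reorder them."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


prep = make_pipeline(lambda tokens: hash_features(tokens, 8), standardize)
features = prep(["mahout", "spark", "mahout"])
```

The customization point is the list of steps passed to make_pipeline: an outlier filter or a different scaler would simply be another function in that list.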


On Thu, Feb 27, 2014 at 6:08 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

>
>
> If we approach this form purely "marketing" standpoint, i would look at it
> from two points: why is Mahout used, and why it is not used.
>
> Mahout is not used because it is a collection of methods that are fairly
> non-uniform in their api, especially embedded api, and generaly has zero
> encouragement to be developed on top on and incorporated in yet larger
> customizable models. I.e. it lacks semantic explicitness of quick
> prototyping, and stitching things together is next to impossible.
>
>
> Yet Mahout is used in spite of the above because it has some pretty unique
> solvers in the area of linear algebra and text topical analysis. But I
> would dare to say not e.g. because of GLM regressions.
>
> I personally also use Mahout e.g. in favor of something like breeze
> because it has sparse linalg support, both in-core and out-of-core, from
> the very beginning and it fits naturally unlike in any other package i ever
> looked at, R including btw.
>
> But i find myself heavily disassembling Mahouts into parts and bolts
> rather than exactly how e.g. MIA prescribes it.
>
> Bottom line here, preliminarily primary issues are ease of use,
> embedment/scripting, ease of customization, uniformity of apis.
>
> (1) Take semantic explicitness and scripting issue. Well i guess that's
> where the R part comes from, not because we just want to run R. I would
> clear it right away -- i don't support any sort of R integration. And not
> really because of lack of trying -- I have created a few R front ends for a
> bunch of distributed applications, and also created projects that run R in
> the backend (I wrote CrunchR more than year ago which is the same thing for
> Crunch as what SparkR is for Spark; and yet-another MR framework running R
> in backend; and also tried to run things with HadoopR). And have developed
> a pretty strong opinion that R just doesn't mix with distributed
> frameworks, mostly because of the performance penalities (and if you loose
> $5 per day in performance on a single machine it may be ok, but in 100
> machines one loses $500 a day -- and mid size companies in my experience
>  are not succeptible to 'let's solve it at any HW cost" doctrine, much as
> it is generally believed the other way around.
>
> Anyway, on R toptic i don't see it as a solution for any sort of
> semantically explicit driver and customizer technology. There's neither
> demand nor willingness of corporate bosses to go that route. I grew pretty
> opinionated on that issue.
>
> But you don't need R to address semantical explicitness, customization and
> ease of integration/scripting. Pragmatically, i see scala and carefully
> crafted scala dsl as the underlying mechanism for achieving this. Also,
> internally i use scala scripting a lot and it is really easy to build shell
> interpreter for it (just like spark builds a customized shell), so one
> doesn't even need to compile these things necessarily.
>
> Bottom line, ideally distributed solver implementation should look more
> like matlab than java. And I would measure that goal along the lines of
> Evan Sparks' talks (i.e. in lines of code and explicitness needed to script
> out a well known method).
>
> See, you forced my hand to discuss solutions ("how")  :)
>
>
> (2) on the issue of minimally supported algorithms. Again, i would not see
> mlib as a prototype there.Given enough semantical explicitness, virtually
> any data scientist would script out ALS in their sleep. And every second
> one would script out weighted ALS (so called "implicit feedback). I view
> those algorithms not as a goal but rather as a guinea pig for validating
> semantical value of ML environment and apis. I would port stronger solvers
> into the new semantic ML environment over Spark rather than trying to cover
> the very "basics".
>
> Pragmatically i would say it would be interesting and pragmatical (for me)
> to have LDA/LSA/sparse PCA solvers ported. I would also port all clustering
> we have (albeit may be not exactly following the methodology).
>
> I would be also interested in giving foundation for customized
> hierarchical solutions along the lines of RLFM with various customizations
> including in particular temporal weighing of inference and customized
> inference of informative priors there. Computational Bayesian methods along
> the lines of MCEM and MCMC are said to provide a very accurate solutions
> here.The latter class of models IMO are much more interesting for
> practitioners of recommendations than pure rigid uncustomizable ALS class
> of models, weighed or not. At least Deepak Agarwal sounds very convincing
> in his talks.
>
>
> (3) on the issue of performance, i guess by using Spark bindings dsl you
> can't do any worse than mllib. Perhaps we could include also support for
> Dense JBlas matrices under hood of Matrix API if of interested. Also i am
> hearing using GPU libraries lately is becoming also very popular for
> performance reasons, up to 300x lin alg speed ups are reported. There are
> some fancy thoughts about cost-based optimization of algeraic expressions
> for distributed pipelines, but for the first start I will do just very
> simple physical plan substitutions (something like if i directly see A'A as
> a part of expression, or if A'B' product has small geometry then of course
> i'd rather do (BA)' etc.
>
> But it has potential to do more while retaining absolute degree of
> manually forced execution (thru forced checkpoints). It's just i would stop
> what i pragmatically need to script out distributed SSVD at this point.
>
> (4) but in general i would say the scope of your issues sounds like
> something that would close a gap between 0.5 and 1.0 rather than 0.9 and
> 1.0.
> -d
>
>
>
> On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> I would like to start a conversation about where we want Mahout to be for
>> 1.0.  Let's suspend for the moment the question of how to achieve the
>> goals.  Instead, let's converge on what we really would like to have
>> happen
>> and after that, let's talk about means that will get us there.
>>
>> Here are some goals that I think would be good in the area of numerics,
>> classifiers and clustering:
>>
>> - runs with or without Hadoop
>>
>> - runs with or without map-reduce
>>
>> - includes (at least), regularized generalized linear models, k-means,
>> random forest, distributed random forest, distributed neural networks
>>
>> - reasonably competitive speed against other implementations including
>> graphlab, mlib and R.
>>
>> - interactive model building
>>
>> - models can be exported as code or data
>>
>> - simple programming model
>>
>> - programmable via Java or R
>>
>> - runs clustered or not
>>
>>
>> What does everybody think?
>>
>
>

Re: Mahout 1.0 goals

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

If we approach this from a purely "marketing" standpoint, I would look at it
from two points: why Mahout is used, and why it is not used.

Mahout is not used because it is a collection of methods that are fairly
non-uniform in their API, especially the embedded API, and it generally gives
zero encouragement to be built on top of and incorporated into yet larger
customizable models. I.e. it lacks the semantic explicitness needed for quick
prototyping, and stitching things together is next to impossible.

Yet Mahout is used in spite of the above because it has some pretty unique
solvers in the area of linear algebra and text topic analysis. But I
would dare say it is not used because of, e.g., GLM regressions.

I personally also use Mahout in favor of something like Breeze because
it has had sparse linalg support, both in-core and out-of-core, from the very
beginning, and it fits naturally, unlike in any other package I have ever
looked at, R included btw.

But I find myself heavily disassembling Mahout into nuts and bolts rather
than using it exactly how e.g. MIA prescribes.

Bottom line here: preliminarily, the primary issues are ease of use,
embedding/scripting, ease of customization, and uniformity of APIs.

(1) Take the semantic explicitness and scripting issue. Well, I guess that's
where the R part comes from, not because we just want to run R. I would
clear it up right away -- I don't support any sort of R integration. And not
really for lack of trying -- I have created a few R front ends for a
bunch of distributed applications, and also created projects that run R in
the backend (I wrote CrunchR more than a year ago, which is the same thing
for Crunch as what SparkR is for Spark; and yet another MR framework running
R in the backend; and I also tried to run things with HadoopR). And I have
developed a pretty strong opinion that R just doesn't mix with distributed
frameworks, mostly because of the performance penalties (if you lose
$5 per day in performance on a single machine it may be ok, but on 100
machines one loses $500 a day -- and mid-size companies, in my experience,
are not susceptible to the "let's solve it at any HW cost" doctrine, much as
it is generally believed to be the other way around).

Anyway, on the R topic, I don't see it as a solution for any sort of
semantically explicit driver and customizer technology. There's neither
demand nor willingness of corporate bosses to go that route. I grew pretty
opinionated on that issue.

But you don't need R to address semantic explicitness, customization and
ease of integration/scripting. Pragmatically, I see Scala and a carefully
crafted Scala DSL as the underlying mechanism for achieving this. Also,
internally I use Scala scripting a lot and it is really easy to build a shell
interpreter for it (just like Spark builds a customized shell), so one
doesn't even necessarily need to compile these things.

Bottom line: ideally, a distributed solver implementation should look more
like Matlab than Java. And I would measure that goal along the lines of
Evan Sparks' talks (i.e. in lines of code and explicitness needed to script
out a well-known method).

See, you forced my hand to discuss solutions ("how")  :)

(2) On the issue of minimally supported algorithms. Again, I would not see
MLlib as a prototype there. Given enough semantic explicitness, virtually
any data scientist would script out ALS in their sleep. And every second
one would script out weighted ALS (so-called "implicit feedback"). I view
those algorithms not as a goal but rather as a guinea pig for validating
the semantic value of the ML environment and APIs. I would port stronger
solvers into the new semantic ML environment over Spark rather than trying
to cover the very "basics".
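
To give a feel for how little code ALS needs once the linear algebra is easy to write, here is a minimal, single-machine sketch in plain Python. It is deliberately restricted to a rank-1 factorization so that each alternating step is a one-dimensional ridge regression with a closed-form solution; the function name als_rank1, its parameters and the toy ratings matrix are assumptions made for this illustration, and nothing here is Mahout or MLlib code.

```python
def als_rank1(ratings, iterations=100, lam=0.0):
    """Alternating least squares for a rank-1 factorization R ~ u v'.

    ratings is a list of rows; None marks an unobserved entry.
    lam is the ridge (L2) regularization weight.
    """
    num_rows, num_cols = len(ratings), len(ratings[0])
    u = [1.0] * num_rows
    v = [1.0] * num_cols
    for _ in range(iterations):
        # Fix v and solve for each u[i] over the observed entries of row i.
        for i in range(num_rows):
            observed = [j for j in range(num_cols) if ratings[i][j] is not None]
            numerator = sum(ratings[i][j] * v[j] for j in observed)
            denominator = sum(v[j] ** 2 for j in observed) + lam
            u[i] = numerator / denominator if denominator else 0.0
        # Fix u and solve for each v[j] over the observed entries of column j.
        for j in range(num_cols):
            observed = [i for i in range(num_rows) if ratings[i][j] is not None]
            numerator = sum(ratings[i][j] * u[i] for i in observed)
            denominator = sum(u[i] ** 2 for i in observed) + lam
            v[j] = numerator / denominator if denominator else 0.0
    return u, v


# Toy data: the observed entries come from the outer product of
# [2, 1, 3] and [1, 2, 3], with two entries held out.
ratings = [[2.0, 4.0, None],
           [1.0, 2.0, 3.0],
           [3.0, None, 9.0]]
u, v = als_rank1(ratings)
predicted_missing = u[0] * v[2]  # estimate for the held-out entry in row 0
```

A general rank-k version replaces the scalar divisions with small k-by-k linear solves, and the weighted ("implicit feedback") variant adds per-entry confidence weights to the same two loops.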

Pragmatically, I would say it would be interesting (for me)
to have LDA/LSA/sparse PCA solvers ported. I would also port all the
clustering we have (albeit maybe not exactly following the methodology).

I would also be interested in laying a foundation for customized hierarchical
solutions along the lines of RLFM with various customizations, including in
particular temporal weighting of inference and customized inference of
informative priors there. Computational Bayesian methods along the lines of
MCEM and MCMC are said to provide very accurate solutions here. The latter
class of models IMO is much more interesting for practitioners of
recommendations than the pure, rigid, uncustomizable ALS class of models,
weighted or not. At least Deepak Agarwal sounds very convincing in his talks.

(3) On the issue of performance, I guess by using the Spark bindings DSL you
can't do any worse than MLlib. Perhaps we could also include support for
dense JBlas matrices under the hood of the Matrix API if there is interest.
Also, I am hearing that using GPU libraries is lately becoming very popular
for performance reasons; up to 300x lin alg speedups are reported. There are
some fancy thoughts about cost-based optimization of algebraic expressions
for distributed pipelines, but for a start I will do just very
simple physical plan substitutions (something like: if I directly see A'A as
a part of an expression, or if an A'B' product has small geometry, then of
course I'd rather do (BA)', etc.).
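
The A'B' substitution can be sketched in a few lines of Python. This is only an illustration of the idea, not the optimizer being written for MAHOUT-1346: the tuple-based expression encoding and the names rewrite and evaluate are invented here, and the rule is applied unconditionally, whereas a real optimizer would first check the geometry of the operands. The rewrite relies on the identity A'B' = (BA)', which replaces two transposes with one.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


# Logical plan encoding (hypothetical): a string names a matrix,
# ("t", x) is a transpose and ("mul", x, y) is a product.
def is_transpose(expr):
    return isinstance(expr, tuple) and expr[0] == "t"


def rewrite(expr):
    """Replace every A'B' found in the plan with (BA)'."""
    if isinstance(expr, tuple) and expr[0] == "mul":
        left, right = rewrite(expr[1]), rewrite(expr[2])
        if is_transpose(left) and is_transpose(right):
            return ("t", ("mul", right[1], left[1]))
        return ("mul", left, right)
    if is_transpose(expr):
        return ("t", rewrite(expr[1]))
    return expr


def evaluate(expr, env):
    if isinstance(expr, str):
        return env[expr]
    if expr[0] == "t":
        return transpose(evaluate(expr[1], env))
    return matmul(evaluate(expr[1], env), evaluate(expr[2], env))


env = {"A": [[1, 2], [3, 4], [5, 6]],   # 3 x 2
       "B": [[1, 0, 2], [0, 1, 1]]}     # 2 x 3
plan = ("mul", ("t", "A"), ("t", "B"))
optimized = rewrite(plan)
same_result = evaluate(plan, env) == evaluate(optimized, env)
```

The same pattern-matching structure would host other substitutions, such as recognizing A'A and dispatching it to a specialized operator instead of a general transpose followed by a multiply.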

But it has the potential to do more while retaining full manual control
over execution (through forced checkpoints). It's just that I would stop at
what I pragmatically need to script out distributed SSVD at this point.

(4) But in general I would say the scope of your issues sounds like
something that would close a gap between 0.5 and 1.0 rather than 0.9 and
1.0.
-d

On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <te...@gmail.com> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
>
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
>
> - runs with or without Hadoop
>
> - runs with or without map-reduce
>
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
>
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
>
> - interactive model building
>
> - models can be exported as code or data
>
> - simple programming model
>
> - programmable via Java or R
>
> - runs clustered or not
>
>
> What does everybody think?
>