Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2011/12/22 21:03:42 UTC

Maturity level annotations

Hi,

what happened to these annotations to mark maturity level? Did we ever
commit those?

thank you.

Re: Maturity level annotations

Posted by Grant Ingersoll <gs...@apache.org>.
We just use @lucene.experimental (or something like that)
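
For reference, a minimal sketch of what that looks like in practice; the class name below is made up for illustration, and the tag itself is meaningful only to the javadoc tooling:

    /**
     * Illustrative driver class (not an existing Mahout class).
     *
     * @lucene.experimental This API is experimental and may change in
     *                      incompatible ways in any release.
     */
    public final class ExampleExperimentalJob {
    }

For such a tag to render (rather than just trigger warnings) in the generated javadocs, it also has to be registered under the maven-javadoc-plugin's <tags> configuration (tag name, placement, head text), which may be the piece that was missing in MAHOUT-831.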

On Dec 22, 2011, at 3:54 PM, Dmitriy Lyubimov wrote:

> Well it looks like lucene people were talking about custom javadoc
> tags, not annotations.
> 
> I did a brief scan and it looks like it would require a specific
> doclet developed to handle annotations. The documentation is not terribly
> clear about which of the standard doclets to subclass.
> 
> just a custom javadoc tag would be easy, though it would be visible
> to the javadoc tool only (and not even IDEs, so it's typo-prone). Maybe that's
> what we need, but the trend in the rest of hadoop world seems to be to
> use annotation-driven markers in this case.
> 
> On Thu, Dec 22, 2011 at 12:35 PM, Sebastian Schelter <ss...@apache.org> wrote:
>> Yes. Could be due to my lacking maven skills :)
>> 
>> 
>> On 22.12.2011 21:33, Dmitriy Lyubimov wrote:
>>> you mean you couldn't make them come up in javadocs?
>>> 
>>> On Thu, Dec 22, 2011 at 12:25 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>>> There is still a ticket open for those ->
>>>> https://issues.apache.org/jira/browse/MAHOUT-831. I tried to integrate
>>>> the javadoc "annotations" as proposed by the lucene guys, but for some
>>>> reason I didn't get them working. It would be great if someone could help here.
>>>> 
>>>> --sebastian
>>>> 
>>>> On 22.12.2011 21:03, Dmitriy Lyubimov wrote:
>>>>> Hi,
>>>>> 
>>>>> what happened to these annotations to mark maturity level? Did we ever
>>>>> commit those?
>>>>> 
>>>>> thank you.
>>>> 
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com




Re: Maturity level annotations

Posted by Isabel Drost <is...@apache.org>.
On 28.12.2011 Lance Norskog wrote:
> Or you can take a small set of good data and generate variations to
> get a big set with the same distribution curves.

... and motivate users to evaluate upcoming releases against their setup to spot 
regressions that slipped through performance tests.


Isabel

Re: Maturity level annotations

Posted by Lance Norskog <go...@gmail.com>.
Or you can take a small set of good data and generate variations to
get a big set with the same distribution curves.

On Wed, Dec 28, 2011 at 10:47 AM, Ted Dunning <te...@gmail.com> wrote:
> I have nearly given up on getting publicly available large data sets and
> have started to specify synthetic datasets for development projects.   The
> key is to build reasonably realistic generation algorithms and for that
> there are always some serious difficulties.
>
> For simple scaling tests, however, synthetic data is often just the ticket.
>  You still need some sophistication about the data, but it doesn't take
> much.  For k-means clustering of text documents, for instance, you can use
> re-sample from real text to generate new text with desired properties or
> you can define an LDA-like generator to generate data with known clustering
> properties.  Similarly, to test scaling of classification algorithms, it is
> easy to generate text-like data with known properties.
>
> The primary virtues of synthetic data are that a synthetic data set is easy
> to carry around and it can be any size at all.
>
> As an example of a potential pitfall, I wrote tests for the sequential
> version of the SSVD codes by building low rank matrices and testing the
> reconstruction error.  This is a fine test for correctness and some scaling
> attributes, but it ignores the truncation error that Radim was fulminating
> about recently.  It would be good to additionally explore large matrices
> that are more realistic because they are generated as count data from a
> model that has a realistic spectrum.
>
> On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gs...@apache.org>wrote:
>
>> To me, the big thing we continue to be missing is the ability for those of
>> us working on the project to reliably test the algorithms at scale.  For
>> instance, I've seen hints of several places where our clustering algorithms
>> don't appear to scale very well (which are all M/R -- K-Means does scale)
>> and it isn't clear to me whether it is our implementation, Hadoop, or
>> simply that the data set isn't big enough or the combination of all three.
>>  To see this in action, try out the ASF email archive up on Amazon with 10,
>> 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet,
>> etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it all
>> many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big
>> enough (~100GB), but in some regards that's silly since the whole point is
>> supposed to be it scales.  Or perhaps my tests were flawed.  Either way, it
>> seems like it is an area we need to focus on more.
>>
>> Of course, the hard part with all of this is debugging where the
>> bottlenecks are.  In the end, we need to figure out how to reliably get
>> compute time available for testing along with a real data sets that we can
>> use to validate scalability.
>>
>>
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>>
>> > On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <tc...@cloudera.com> wrote:
>> >
>> >> ...
>> >>
>> >> They discover Mahout, which does specifically bill itself as scalable
>> >> (from http://mahout.apache.org, in some of the largest letters: "What
>> >> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>> >> is to build scalable machine learning libraries.").  They sniff check
>> >> it by massaging some moderately-sized data set into the same format as
>> >> an example from the wiki and they fail to get a result - often because
>> >> their problem has some very different properties (more classes, much
>> >> larger feature space, etc.) and the implementation has some limitation
>> >> that they trip over.
>> >>
>> >
>> > I have worked with users of Mahout who had 10^9 possible features and
>> > others who are classifying
>> > into 60,000 categories.
>> >
>> > Neither of these implementations uses Naive Bayes.  Both work very well.
>> >
>> > They will usually try one of the simplest methods available under the
>> >> assumption "well, if this doesn't scale well, the more complex methods
>> >> are surely no better".
>> >
>> >
>> > Silly assumption.
>> >
>> >
>> >> This may not be entirely fair, but since the
>> >> docs they're encountering on the main website and wiki don't warn them
>> >> that certain implementations don't necessarily scale in different
>> >> ways, it's certainly not unreasonable.
>> >
>> >
>> > Well, it is actually silly.
>> >
>> > Clearly the docs can be better.  Clearly the code quality can be better
>> > especially in terms of nuking capabilities that have not found an
>> audience.
>> > But clearly also just trying one technique without asking anybody what
>> the
>> > limitations are isn't going to work as an evaluation technique.  This is
>> > exactly analogous to somebody finding that a matrix in R doesn't do what
>> a
>> > data frame is supposed to do.  It doesn't and you aren't going to find
>> out
>> > why or how from the documentation very quickly.
>> >
>> > In both cases of investigating Mahout or investigating R you will find
>> out
>> > plenty if you ask somebody who knows what they are talking about.
>> >
>> > They're at best going to
>> >> conclude the scalability will be hit-and-miss when a simple method
>> >> doesn't work.  Perhaps they'll check in again in 6-12 months.
>> >>
>> >
>> > Maybe so.  Maybe not.  I have little sympathy with people who make
>> > scatter-shot decisions like this.
>> >
>> >
>> >> ...
>> >> I see your analogy to R or sciPy - and I don't disagree.  But those
>> >> projects do not put scaling front and center; if Mahout is going to
>> >> keep scalability as a "headline feature" (which I would like to see!),
>> >> I think prominently acknowledging how different methods fail to scale
>> >> would really help its credibility.  For what it's worth, of the people
>> >> I know who've tried Mahout 100% of them were using R and/or sciPy
>> >> already, but were curious about Mahout specifically for better
>> >> scalability.
>> >>
>> >
>> > Did they ask on the mailing list?
>> >
>> >
>> >> I'm not sure where this information is best placed - it would be great
>> >> to see it on the Wiki along with the examples, at least.
>> >
>> >
>> > Sounds OK.  Maybe we should put it in the book.
>> >
>> > (oh... wait, we already did that)
>> >
>> >
>> >> It would be
>> >> awesome to see warnings at runtime ("Warning: You just trained a model
>> >> that you cannot load without at least 20GB of RAM"), but I'm not sure
>> >> how realistic that is.
>> >
>> >
>> > I think it is fine that loading the model fails with a fine error message
>> > but putting yellow warning tape all over the user's keyboard isn't going
>> to
>> > help anything.
>> >
>> >
>> >> I would like it to be easier to determine, at some very high level, why
>> >> something didn't work when an experiment fails.  Ideally, without
>> having to
>> >> dive into the code at all.
>> >>
>> >
>> > How about you ask an expert?
>> >
>> > That really is easier.  It helps the community to hear about what other
>> > people need and it helps the new user to hear what other people have
>> done.
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>>
>>
>>



-- 
Lance Norskog
goksron@gmail.com

Re: Maturity level annotations

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 28, 2011, at 1:47 PM, Ted Dunning wrote:

> I have nearly given up on getting publicly available large data sets and
> have started to specify synthetic datasets for development projects.   The
> key is to build reasonably realistic generation algorithms and for that
> there are always some serious difficulties.

Yeah, I agree. 

Still, 7M+ real emails seems like it should be an interesting size for us while not being overwhelming.  Of course, that only solves half of the problem.  We still need access to a cluster so we can regularly run experiments.

> 
> For simple scaling tests, however, synthetic data is often just the ticket.
> You still need some sophistication about the data, but it doesn't take
> much.  For k-means clustering of text documents, for instance, you can use
> re-sample from real text to generate new text with desired properties or
> you can define an LDA-like generator to generate data with known clustering
> properties.  Similarly, to test scaling of classification algorithms, it is
> easy to generate text-like data with known properties.

I still like our idea of a "good fake data" project.  Or at least a Util in Mahout.
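
As one possible shape for that util (a sketch only; the class name is made up and nothing here is existing Mahout code), the re-sampling variant is almost trivial: bootstrap new "documents" from a real corpus, so the synthetic set keeps the original term distribution but can be made arbitrarily large:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /** Generates fake documents by sampling words (with replacement) from a real corpus. */
    public class ResamplingTextGenerator {
      private final List<String> pool = new ArrayList<String>();
      private final Random rng;

      public ResamplingTextGenerator(Iterable<String> realDocuments, long seed) {
        for (String doc : realDocuments) {
          for (String token : doc.split("\\s+")) {
            pool.add(token);          // a flat word pool preserves the corpus term frequencies
          }
        }
        this.rng = new Random(seed);
      }

      /** Returns a new "document" of the requested length drawn from the pool. */
      public String nextDocument(int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
          sb.append(pool.get(rng.nextInt(pool.size()))).append(' ');
        }
        return sb.toString().trim();
      }
    }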

> 
> The primary virtues of synthetic data are that a synthetic data set is easy
> to carry around and it can be any size at all.
> 
> As an example of a potential pitfall, I wrote tests for the sequential
> version of the SSVD codes by building low rank matrices and testing the
> reconstruction error.  This is a fine test for correctness and some scaling
> attributes, but it ignores the truncation error that Radim was fulminating
> about recently.  It would be good to additionally explore large matrices
> that are more realistic because they are generated as count data from a
> model that has a realistic spectrum.
> 
> On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gs...@apache.org>wrote:
> 
>> To me, the big thing we continue to be missing is the ability for those of
>> us working on the project to reliably test the algorithms at scale.  For
>> instance, I've seen hints of several places where our clustering algorithms
>> don't appear to scale very well (which are all M/R -- K-Means does scale)
>> and it isn't clear to me whether it is our implementation, Hadoop, or
>> simply that the data set isn't big enough or the combination of all three.
>> To see this in action, try out the ASF email archive up on Amazon with 10,
>> 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet,
>> etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it all
>> many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big
>> enough (~100GB), but in some regards that's silly since the whole point is
>> supposed to be it scales.  Or perhaps my tests were flawed.  Either way, it
>> seems like it is an area we need to focus on more.
>> 
>> Of course, the hard part with all of this is debugging where the
>> bottlenecks are.  In the end, we need to figure out how to reliably get
>> compute time available for testing along with a real data sets that we can
>> use to validate scalability.
>> 
>> 
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>> 
>>> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <tc...@cloudera.com> wrote:
>>> 
>>>> ...
>>>> 
>>>> They discover Mahout, which does specifically bill itself as scalable
>>>> (from http://mahout.apache.org, in some of the largest letters: "What
>>>> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>>>> is to build scalable machine learning libraries.").  They sniff check
>>>> it by massaging some moderately-sized data set into the same format as
>>>> an example from the wiki and they fail to get a result - often because
>>>> their problem has some very different properties (more classes, much
>>>> larger feature space, etc.) and the implementation has some limitation
>>>> that they trip over.
>>>> 
>>> 
>>> I have worked with users of Mahout who had 10^9 possible features and
>>> others who are classifying
>>> into 60,000 categories.
>>> 
>>> Neither of these implementations uses Naive Bayes.  Both work very well.
>>> 
>>> They will usually try one of the simplest methods available under the
>>>> assumption "well, if this doesn't scale well, the more complex methods
>>>> are surely no better".
>>> 
>>> 
>>> Silly assumption.
>>> 
>>> 
>>>> This may not be entirely fair, but since the
>>>> docs they're encountering on the main website and wiki don't warn them
>>>> that certain implementations don't necessarily scale in different
>>>> ways, it's certainly not unreasonable.
>>> 
>>> 
>>> Well, it is actually silly.
>>> 
>>> Clearly the docs can be better.  Clearly the code quality can be better
>>> especially in terms of nuking capabilities that have not found an
>> audience.
>>> But clearly also just trying one technique without asking anybody what
>> the
>>> limitations are isn't going to work as an evaluation technique.  This is
>>> exactly analogous to somebody finding that a matrix in R doesn't do what
>> a
>>> data frame is supposed to do.  It doesn't and you aren't going to find
>> out
>>> why or how from the documentation very quickly.
>>> 
>>> In both cases of investigating Mahout or investigating R you will find
>> out
>>> plenty if you ask somebody who knows what they are talking about.
>>> 
>>> They're at best going to
>>>> conclude the scalability will be hit-and-miss when a simple method
>>>> doesn't work.  Perhaps they'll check in again in 6-12 months.
>>>> 
>>> 
>>> Maybe so.  Maybe not.  I have little sympathy with people who make
>>> scatter-shot decisions like this.
>>> 
>>> 
>>>> ...
>>>> I see your analogy to R or sciPy - and I don't disagree.  But those
>>>> projects do not put scaling front and center; if Mahout is going to
>>>> keep scalability as a "headline feature" (which I would like to see!),
>>>> I think prominently acknowledging how different methods fail to scale
>>>> would really help its credibility.  For what it's worth, of the people
>>>> I know who've tried Mahout 100% of them were using R and/or sciPy
>>>> already, but were curious about Mahout specifically for better
>>>> scalability.
>>>> 
>>> 
>>> Did they ask on the mailing list?
>>> 
>>> 
>>>> I'm not sure where this information is best placed - it would be great
>>>> to see it on the Wiki along with the examples, at least.
>>> 
>>> 
>>> Sounds OK.  Maybe we should put it in the book.
>>> 
>>> (oh... wait, we already did that)
>>> 
>>> 
>>>> It would be
>>>> awesome to see warnings at runtime ("Warning: You just trained a model
>>>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>>>> how realistic that is.
>>> 
>>> 
>>> I think it is fine that loading the model fails with a fine error message
>>> but putting yellow warning tape all over the user's keyboard isn't going
>> to
>>> help anything.
>>> 
>>> 
>>>> I would like it to be easier to determine, at some very high level, why
>>>> something didn't work when an experiment fails.  Ideally, without
>> having to
>>>> dive into the code at all.
>>>> 
>>> 
>>> How about you ask an expert?
>>> 
>>> That really is easier.  It helps the community to hear about what other
>>> people need and it helps the new user to hear what other people have
>> done.
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> 
>> 
>> 
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com




Re: Maturity level annotations

Posted by Ted Dunning <te...@gmail.com>.
I have nearly given up on getting publicly available large data sets and
have started to specify synthetic datasets for development projects.   The
key is to build reasonably realistic generation algorithms and for that
there are always some serious difficulties.

For simple scaling tests, however, synthetic data is often just the ticket.
 You still need some sophistication about the data, but it doesn't take
much.  For k-means clustering of text documents, for instance, you can
re-sample from real text to generate new text with desired properties, or
you can define an LDA-like generator to generate data with known clustering
properties.  Similarly, to test scaling of classification algorithms, it is
easy to generate text-like data with known properties.
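
To make the LDA-like idea concrete (a simplified sketch; the class is illustrative, not an existing Mahout generator), each cluster gets its own slice of the vocabulary with boosted probability, so the true labels are known in advance and a clustering run can be scored against them:

    import java.util.Random;

    /** Draws documents from k distinct word distributions so cluster membership is known a priori. */
    public class ClusteredTextGenerator {
      private final double[][] wordProbs;        // wordProbs[cluster][word], each row sums to 1
      private final Random rng = new Random(42); // fixed seed keeps runs reproducible

      public ClusteredTextGenerator(int clusters, int vocabularySize) {
        wordProbs = new double[clusters][vocabularySize];
        for (int c = 0; c < clusters; c++) {
          double sum = 0;
          for (int w = 0; w < vocabularySize; w++) {
            // cluster c strongly prefers its own slice of the vocabulary
            wordProbs[c][w] = (w % clusters == c) ? 10.0 : 1.0;
            sum += wordProbs[c][w];
          }
          for (int w = 0; w < vocabularySize; w++) {
            wordProbs[c][w] /= sum;
          }
        }
      }

      /** Samples word ids for one document known to belong to the given cluster. */
      public int[] sampleDocument(int cluster, int length) {
        int[] words = new int[length];
        for (int i = 0; i < length; i++) {
          double u = rng.nextDouble();
          int w = 0;
          // inverse-CDF sampling over the cluster's word distribution
          while (w < wordProbs[cluster].length - 1 && (u -= wordProbs[cluster][w]) > 0) {
            w++;
          }
          words[i] = w;
        }
        return words;
      }
    }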

The primary virtues of synthetic data are that a synthetic data set is easy
to carry around and it can be any size at all.

As an example of a potential pitfall, I wrote tests for the sequential
version of the SSVD codes by building low rank matrices and testing the
reconstruction error.  This is a fine test for correctness and some scaling
attributes, but it ignores the truncation error that Radim was fulminating
about recently.  It would be good to additionally explore large matrices
that are more realistic because they are generated as count data from a
model that has a realistic spectrum.
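
For what it's worth, a plain-Java sketch of that style of test data (no Mahout classes assumed): build A = U * V' with a known small rank, hand it to the decomposition under test, and check that the Frobenius norm of the difference is near zero:

    import java.util.Random;

    public final class LowRankTestData {
      private LowRankTestData() { }

      /** Builds an m x n matrix of rank (at most) k as U * V' with Gaussian factors. */
      public static double[][] lowRankMatrix(int m, int n, int k, long seed) {
        Random rng = new Random(seed);
        double[][] u = new double[m][k];
        double[][] v = new double[n][k];
        for (double[] row : u) { for (int j = 0; j < k; j++) { row[j] = rng.nextGaussian(); } }
        for (double[] row : v) { for (int j = 0; j < k; j++) { row[j] = rng.nextGaussian(); } }
        double[][] a = new double[m][n];
        for (int i = 0; i < m; i++) {
          for (int j = 0; j < n; j++) {
            for (int x = 0; x < k; x++) {
              a[i][j] += u[i][x] * v[j][x];
            }
          }
        }
        return a;
      }

      /** Frobenius norm of (a - b); near zero means the rank-k reconstruction is essentially exact. */
      public static double reconstructionError(double[][] a, double[][] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
          for (int j = 0; j < a[i].length; j++) {
            double d = a[i][j] - b[i][j];
            sum += d * d;
          }
        }
        return Math.sqrt(sum);
      }
    }

(As noted above, this exercises correctness but says nothing about the truncation error you see on realistic, full-spectrum count data.)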

On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gs...@apache.org>wrote:

> To me, the big thing we continue to be missing is the ability for those of
> us working on the project to reliably test the algorithms at scale.  For
> instance, I've seen hints of several places where our clustering algorithms
> don't appear to scale very well (which are all M/R -- K-Means does scale)
> and it isn't clear to me whether it is our implementation, Hadoop, or
> simply that the data set isn't big enough or the combination of all three.
>  To see this in action, try out the ASF email archive up on Amazon with 10,
> 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet,
> etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it all
> many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big
> enough (~100GB), but in some regards that's silly since the whole point is
> supposed to be it scales.  Or perhaps my tests were flawed.  Either way, it
> seems like it is an area we need to focus on more.
>
> Of course, the hard part with all of this is debugging where the
> bottlenecks are.  In the end, we need to figure out how to reliably get
> compute time available for testing along with a real data sets that we can
> use to validate scalability.
>
>
> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>
> > On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <tc...@cloudera.com> wrote:
> >
> >> ...
> >>
> >> They discover Mahout, which does specifically bill itself as scalable
> >> (from http://mahout.apache.org, in some of the largest letters: "What
> >> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
> >> is to build scalable machine learning libraries.").  They sniff check
> >> it by massaging some moderately-sized data set into the same format as
> >> an example from the wiki and they fail to get a result - often because
> >> their problem has some very different properties (more classes, much
> >> larger feature space, etc.) and the implementation has some limitation
> >> that they trip over.
> >>
> >
> > I have worked with users of Mahout who had 10^9 possible features and
> > others who are classifying
> > into 60,000 categories.
> >
> > Neither of these implementations uses Naive Bayes.  Both work very well.
> >
> > They will usually try one of the simplest methods available under the
> >> assumption "well, if this doesn't scale well, the more complex methods
> >> are surely no better".
> >
> >
> > Silly assumption.
> >
> >
> >> This may not be entirely fair, but since the
> >> docs they're encountering on the main website and wiki don't warn them
> >> that certain implementations don't necessarily scale in different
> >> ways, it's certainly not unreasonable.
> >
> >
> > Well, it is actually silly.
> >
> > Clearly the docs can be better.  Clearly the code quality can be better
> > especially in terms of nuking capabilities that have not found an
> audience.
> > But clearly also just trying one technique without asking anybody what
> the
> > limitations are isn't going to work as an evaluation technique.  This is
> > exactly analogous to somebody finding that a matrix in R doesn't do what
> a
> > data frame is supposed to do.  It doesn't and you aren't going to find
> out
> > why or how from the documentation very quickly.
> >
> > In both cases of investigating Mahout or investigating R you will find
> out
> > plenty if you ask somebody who knows what they are talking about.
> >
> > They're at best going to
> >> conclude the scalability will be hit-and-miss when a simple method
> >> doesn't work.  Perhaps they'll check in again in 6-12 months.
> >>
> >
> > Maybe so.  Maybe not.  I have little sympathy with people who make
> > scatter-shot decisions like this.
> >
> >
> >> ...
> >> I see your analogy to R or sciPy - and I don't disagree.  But those
> >> projects do not put scaling front and center; if Mahout is going to
> >> keep scalability as a "headline feature" (which I would like to see!),
> >> I think prominently acknowledging how different methods fail to scale
> >> would really help its credibility.  For what it's worth, of the people
> >> I know who've tried Mahout 100% of them were using R and/or sciPy
> >> already, but were curious about Mahout specifically for better
> >> scalability.
> >>
> >
> > Did they ask on the mailing list?
> >
> >
> >> I'm not sure where this information is best placed - it would be great
> >> to see it on the Wiki along with the examples, at least.
> >
> >
> > Sounds OK.  Maybe we should put it in the book.
> >
> > (oh... wait, we already did that)
> >
> >
> >> It would be
> >> awesome to see warnings at runtime ("Warning: You just trained a model
> >> that you cannot load without at least 20GB of RAM"), but I'm not sure
> >> how realistic that is.
> >
> >
> > I think it is fine that loading the model fails with a fine error message
> > but putting yellow warning tape all over the user's keyboard isn't going
> to
> > help anything.
> >
> >
> >> I would like it to be easier to determine, at some very high level, why
> >> something didn't work when an experiment fails.  Ideally, without
> having to
> >> dive into the code at all.
> >>
> >
> > How about you ask an expert?
> >
> > That really is easier.  It helps the community to hear about what other
> > people need and it helps the new user to hear what other people have
> done.
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
>
>

Re: Maturity level annotations

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 28, 2011, at 7:28 PM, Jeff Eastman wrote:

> This is something that I'm enthusiastic about investigating right now. I'm heartened that K-Means seems to scale well in your tests and I think I've just improved Dirichlet a lot.

I suspect we found out why before; at least for Dirichlet, it was due to the choice of some parameters.

> I'd like to test it again with your data. FuzzyK is problematic as its clusters always end up with dense vectors for center and radius. I think it will always be a hog. 100GB is not a huge data set and it should sing on a 10-node cluster. Even without MapR <grin>.
> 
> I think improving our predictability at scale is a great goal for 1.0. Getting started would be a great goal for 0.7.

+1


> Jeff
> 
> On 12/28/11 11:35 AM, Grant Ingersoll wrote:
>> To me, the big thing we continue to be missing is the ability for those of us working on the project to reliably test the algorithms at scale.  For instance, I've seen hints of several places where our clustering algorithms don't appear to scale very well (which are all M/R -- K-Means does scale) and it isn't clear to me whether it is our implementation, Hadoop, or simply that the data set isn't big enough or the combination of all three.  To see this in action, try out the ASF email archive up on Amazon with 10, 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet, etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it all many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big enough (~100GB), but in some regards that's silly since the whole point is supposed to be it scales.  Or perhaps my tests were flawed.  Either way, it seems like it is an area we need to focus on more.
>> 
>> Of course, the hard part with all of this is debugging where the bottlenecks are.  In the end, we need to figure out how to reliably get compute time available for testing along with a real data sets that we can use to validate scalability.
>> 
>> 
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>> 
>>> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce<tc...@cloudera.com>  wrote:
>>> 
>>>> ...
>>>> 
>>>> They discover Mahout, which does specifically bill itself as scalable
>>>> (from http://mahout.apache.org, in some of the largest letters: "What
>>>> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>>>> is to build scalable machine learning libraries.").  They sniff check
>>>> it by massaging some moderately-sized data set into the same format as
>>>> an example from the wiki and they fail to get a result - often because
>>>> their problem has some very different properties (more classes, much
>>>> larger feature space, etc.) and the implementation has some limitation
>>>> that they trip over.
>>>> 
>>> I have worked with users of Mahout who had 10^9 possible features and
>>> others who are classifying
>>> into 60,000 categories.
>>> 
>>> Neither of these implementations uses Naive Bayes.  Both work very well.
>>> 
>>> They will usually try one of the simplest methods available under the
>>>> assumption "well, if this doesn't scale well, the more complex methods
>>>> are surely no better".
>>> 
>>> Silly assumption.
>>> 
>>> 
>>>> This may not be entirely fair, but since the
>>>> docs they're encountering on the main website and wiki don't warn them
>>>> that certain implementations don't necessarily scale in different
>>>> ways, it's certainly not unreasonable.
>>> 
>>> Well, it is actually silly.
>>> 
>>> Clearly the docs can be better.  Clearly the code quality can be better
>>> especially in terms of nuking capabilities that have not found an audience.
>>> But clearly also just trying one technique without asking anybody what the
>>> limitations are isn't going to work as an evaluation technique.  This is
>>> exactly analogous to somebody finding that a matrix in R doesn't do what a
>>> data frame is supposed to do.  It doesn't and you aren't going to find out
>>> why or how from the documentation very quickly.
>>> 
>>> In both cases of investigating Mahout or investigating R you will find out
>>> plenty if you ask somebody who knows what they are talking about.
>>> 
>>> They're at best going to
>>>> conclude the scalability will be hit-and-miss when a simple method
>>>> doesn't work.  Perhaps they'll check in again in 6-12 months.
>>>> 
>>> Maybe so.  Maybe not.  I have little sympathy with people who make
>>> scatter-shot decisions like this.
>>> 
>>> 
>>>> ...
>>>> I see your analogy to R or sciPy - and I don't disagree.  But those
>>>> projects do not put scaling front and center; if Mahout is going to
>>>> keep scalability as a "headline feature" (which I would like to see!),
>>>> I think prominently acknowledging how different methods fail to scale
>>>> would really help its credibility.  For what it's worth, of the people
>>>> I know who've tried Mahout 100% of them were using R and/or sciPy
>>>> already, but were curious about Mahout specifically for better
>>>> scalability.
>>>> 
>>> Did they ask on the mailing list?
>>> 
>>> 
>>>> I'm not sure where this information is best placed - it would be great
>>>> to see it on the Wiki along with the examples, at least.
>>> 
>>> Sounds OK.  Maybe we should put it in the book.
>>> 
>>> (oh... wait, we already did that)
>>> 
>>> 
>>>> It would be
>>>> awesome to see warnings at runtime ("Warning: You just trained a model
>>>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>>>> how realistic that is.
>>> 
>>> I think it is fine that loading the model fails with a fine error message
>>> but putting yellow warning tape all over the user's keyboard isn't going to
>>> help anything.
>>> 
>>> 
>>>> I would like it to be easier to determine, at some very high level, why
>>>> something didn't work when an experiment fails.  Ideally, without having to
>>>> dive into the code at all.
>>>> 
>>> How about you ask an expert?
>>> 
>>> That really is easier.  It helps the community to hear about what other
>>> people need and it helps the new user to hear what other people have done.
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> 
>> 
>> 
>> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com




Re: Maturity level annotations

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
This is something that I'm enthusiastic about investigating right now. 
I'm heartened that K-Means seems to scale well in your tests and I think 
I've just improved Dirichlet a lot. I'd like to test it again with your 
data. FuzzyK is problematic as its clusters always end up with dense 
vectors for center and radius. I think it will always be a hog. 100GB is 
not a huge data set and it should sing on a 10-node cluster. Even 
without MapR <grin>.

I think improving our predictability at scale is a great goal for 1.0. 
Getting started would be a great goal for 0.7.
Jeff

On 12/28/11 11:35 AM, Grant Ingersoll wrote:
> To me, the big thing we continue to be missing is the ability for those of us working on the project to reliably test the algorithms at scale.  For instance, I've seen hints of several places where our clustering algorithms don't appear to scale very well (which are all M/R -- K-Means does scale) and it isn't clear to me whether it is our implementation, Hadoop, or simply that the data set isn't big enough or the combination of all three.  To see this in action, try out the ASF email archive up on Amazon with 10, 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet, etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it all many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big enough (~100GB), but in some regards that's silly since the whole point is supposed to be it scales.  Or perhaps my tests were flawed.  Either way, it seems like it is an area we need to focus on more.
>
> Of course, the hard part with all of this is debugging where the bottlenecks are.  In the end, we need to figure out how to reliably get compute time available for testing along with a real data sets that we can use to validate scalability.
>
>
> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>
>> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce<tc...@cloudera.com>  wrote:
>>
>>> ...
>>>
>>> They discover Mahout, which does specifically bill itself as scalable
>>> (from http://mahout.apache.org, in some of the largest letters: "What
>>> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>>> is to build scalable machine learning libraries.").  They sniff check
>>> it by massaging some moderately-sized data set into the same format as
>>> an example from the wiki and they fail to get a result - often because
>>> their problem has some very different properties (more classes, much
>>> larger feature space, etc.) and the implementation has some limitation
>>> that they trip over.
>>>
>> I have worked with users of Mahout who had 10^9 possible features and
>> others who are classifying
>> into 60,000 categories.
>>
>> Neither of these implementations uses Naive Bayes.  Both work very well.
>>
>> They will usually try one of the simplest methods available under the
>>> assumption "well, if this doesn't scale well, the more complex methods
>>> are surely no better".
>>
>> Silly assumption.
>>
>>
>>> This may not be entirely fair, but since the
>>> docs they're encountering on the main website and wiki don't warn them
>>> that certain implementations don't necessarily scale in different
>>> ways, it's certainly not unreasonable.
>>
>> Well, it is actually silly.
>>
>> Clearly the docs can be better.  Clearly the code quality can be better
>> especially in terms of nuking capabilities that have not found an audience.
>> But clearly also just trying one technique without asking anybody what the
>> limitations are isn't going to work as an evaluation technique.  This is
>> exactly analogous to somebody finding that a matrix in R doesn't do what a
>> data frame is supposed to do.  It doesn't and you aren't going to find out
>> why or how from the documentation very quickly.
>>
>> In both cases of investigating Mahout or investigating R you will find out
>> plenty if you ask somebody who knows what they are talking about.
>>
>> They're at best going to
>>> conclude the scalability will be hit-and-miss when a simple method
>>> doesn't work.  Perhaps they'll check in again in 6-12 months.
>>>
>> Maybe so.  Maybe not.  I have little sympathy with people who make
>> scatter-shot decisions like this.
>>
>>
>>> ...
>>> I see your analogy to R or sciPy - and I don't disagree.  But those
>>> projects do not put scaling front and center; if Mahout is going to
>>> keep scalability as a "headline feature" (which I would like to see!),
>>> I think prominently acknowledging how different methods fail to scale
>>> would really help its credibility.  For what it's worth, of the people
>>> I know who've tried Mahout 100% of them were using R and/or sciPy
>>> already, but were curious about Mahout specifically for better
>>> scalability.
>>>
>> Did they ask on the mailing list?
>>
>>
>>> I'm not sure where this information is best placed - it would be great
>>> to see it on the Wiki along with the examples, at least.
>>
>> Sounds OK.  Maybe we should put it in the book.
>>
>> (oh... wait, we already did that)
>>
>>
>>> It would be
>>> awesome to see warnings at runtime ("Warning: You just trained a model
>>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>>> how realistic that is.
>>
>> I think it is fine that loading the model fails with a fine error message
>> but putting yellow warning tape all over the user's keyboard isn't going to
>> help anything.
>>
>>
>>> I would like it to be easier to determine, at some very high level, why
>>> something didn't work when an experiment fails.  Ideally, without having to
>>> dive into the code at all.
>>>
>> How about you ask an expert?
>>
>> That really is easier.  It helps the community to hear about what other
>> people need and it helps the new user to hear what other people have done.
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>
>
>


Re: Maturity level annotations

Posted by Grant Ingersoll <gs...@apache.org>.
To me, the big thing we continue to be missing is the ability for those of us working on the project to reliably test the algorithms at scale.  For instance, I've seen hints of several places where our clustering algorithms don't appear to scale very well (which are all M/R -- K-Means does scale) and it isn't clear to me whether it is our implementation, Hadoop, or simply that the data set isn't big enough or the combination of all three.  To see this in action, try out the ASF email archive up on Amazon with 10, 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet, etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it is all many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big enough (~100GB), but in some regards that's silly since the whole point is supposed to be that it scales.  Or perhaps my tests were flawed.  Either way, it seems like it is an area we need to focus on more.

Of course, the hard part with all of this is debugging where the bottlenecks are.  In the end, we need to figure out how to reliably get compute time available for testing along with real data sets that we can use to validate scalability.


On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:

> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <tc...@cloudera.com> wrote:
> 
>> ...
>> 
>> They discover Mahout, which does specifically bill itself as scalable
>> (from http://mahout.apache.org, in some of the largest letters: "What
>> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>> is to build scalable machine learning libraries.").  They sniff check
>> it by massaging some moderately-sized data set into the same format as
>> an example from the wiki and they fail to get a result - often because
>> their problem has some very different properties (more classes, much
>> larger feature space, etc.) and the implementation has some limitation
>> that they trip over.
>> 
> 
> I have worked with users of Mahout who had 10^9 possible features and
> others who are classifying
> into 60,000 categories.
> 
> Neither of these implementations uses Naive Bayes.  Both work very well.
> 
> They will usually try one of the simplest methods available under the
>> assumption "well, if this doesn't scale well, the more complex methods
>> are surely no better".
> 
> 
> Silly assumption.
> 
> 
>> This may not be entirely fair, but since the
>> docs they're encountering on the main website and wiki don't warn them
>> that certain implementations don't necessarily scale in different
>> ways, it's certainly not unreasonable.
> 
> 
> Well, it is actually silly.
> 
> Clearly the docs can be better.  Clearly the code quality can be better
> especially in terms of nuking capabilities that have not found an audience.
> But clearly also just trying one technique without asking anybody what the
> limitations are isn't going to work as an evaluation technique.  This is
> exactly analogous to somebody finding that a matrix in R doesn't do what a
> data frame is supposed to do.  It doesn't and you aren't going to find out
> why or how from the documentation very quickly.
> 
> In both cases of investigating Mahout or investigating R you will find out
> plenty if you ask somebody who knows what they are talking about.
> 
> They're at best going to
>> conclude the scalability will be hit-and-miss when a simple method
>> doesn't work.  Perhaps they'll check in again in 6-12 months.
>> 
> 
> Maybe so.  Maybe not.  I have little sympathy with people who make
> scatter-shot decisions like this.
> 
> 
>> ...
>> I see your analogy to R or sciPy - and I don't disagree.  But those
>> projects do not put scaling front and center; if Mahout is going to
>> keep scalability as a "headline feature" (which I would like to see!),
>> I think prominently acknowledging how different methods fail to scale
>> would really help its credibility.  For what it's worth, of the people
>> I know who've tried Mahout 100% of them were using R and/or sciPy
>> already, but were curious about Mahout specifically for better
>> scalability.
>> 
> 
> Did they ask on the mailing list?
> 
> 
>> I'm not sure where this information is best placed - it would be great
>> to see it on the Wiki along with the examples, at least.
> 
> 
> Sounds OK.  Maybe we should put it in the book.
> 
> (oh... wait, we already did that)
> 
> 
>> It would be
>> awesome to see warnings at runtime ("Warning: You just trained a model
>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>> how realistic that is.
> 
> 
> I think it is fine that loading the model fails with a fine error message
> but putting yellow warning tape all over the user's keyboard isn't going to
> help anything.
> 
> 
>> I would like it to be easier to determine, at some very high level, why
>> something didn't work when an experiment fails.  Ideally, without having to
>> dive into the code at all.
>> 
> 
> How about you ask an expert?
> 
> That really is easier.  It helps the community to hear about what other
> people need and it helps the new user to hear what other people have done.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com




Re: Maturity level annotations

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <tc...@cloudera.com> wrote:

> ...
>
> They discover Mahout, which does specifically bill itself as scalable
> (from http://mahout.apache.org, in some of the largest letters: "What
> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
> is to build scalable machine learning libraries.").  They sniff check
> it by massaging some moderately-sized data set into the same format as
> an example from the wiki and they fail to get a result - often because
> their problem has some very different properties (more classes, much
> larger feature space, etc.) and the implementation has some limitation
> that they trip over.
>

I have worked with users of Mahout who had 10^9 possible features and
others who are classifying
into 60,000 categories.

Neither of these implementations uses Naive Bayes.  Both work very well.

They will usually try one of the simplest methods available under the
> assumption "well, if this doesn't scale well, the more complex methods
> are surely no better".


Silly assumption.


> This may not be entirely fair, but since the
> docs they're encountering on the main website and wiki don't warn them
> that certain implementations don't necessarily scale in different
> ways, it's certainly not unreasonable.


Well, it is actually silly.

Clearly the docs can be better.  Clearly the code quality can be better
especially in terms of nuking capabilities that have not found an audience.
 But clearly also just trying one technique without asking anybody what the
limitations are isn't going to work as an evaluation technique.  This is
exactly analogous to somebody finding that a matrix in R doesn't do what a
data frame is supposed to do.  It doesn't and you aren't going to find out
why or how from the documentation very quickly.

In both cases of investigating Mahout or investigating R you will find out
plenty if you ask somebody who knows what they are talking about.

They're at best going to
> conclude the scalability will be hit-and-miss when a simple method
> doesn't work.  Perhaps they'll check in again in 6-12 months.
>

Maybe so.  Maybe not.  I have little sympathy with people who make
scatter-shot decisions like this.


> ...
> I see your analogy to R or sciPy - and I don't disagree.  But those
> projects do not put scaling front and center; if Mahout is going to
> keep scalability as a "headline feature" (which I would like to see!),
> I think prominently acknowledging how different methods fail to scale
> would really help its credibility.  For what it's worth, of the people
> I know who've tried Mahout 100% of them were using R and/or sciPy
> already, but were curious about Mahout specifically for better
> scalability.
>

Did they ask on the mailing list?


> I'm not sure where this information is best placed - it would be great
> to see it on the Wiki along with the examples, at least.


Sounds OK.  Maybe we should put it in the book.

(oh... wait, we already did that)


> It would be
> awesome to see warnings at runtime ("Warning: You just trained a model
> that you cannot load without at least 20GB of RAM"), but I'm not sure
> how realistic that is.


I think it is fine that loading the model fails with a fine error message
but putting yellow warning tape all over the user's keyboard isn't going to
help anything.


> I would like it to be easier to determine, at some very high level, why
> something didn't work when an experiment fails.  Ideally, without having to
> dive into the code at all.
>

How about you ask an expert?

That really is easier.  It helps the community to hear about what other
people need and it helps the new user to hear what other people have done.

Re: Maturity level annotations

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Tom,

Thanks for your input. I have nothing to argue with, but I think the
project can use the help of the people who are kicking the tires, if
they make the problems they hit (in particular, scale problems) known
to the list.

> They discover Mahout, which does specifically bill itself as scalable
> (from http://mahout.apache.org, in some of the largest letters: "What
> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
> is to build scalable machine learning libraries.").  They sniff check
> it by massaging some moderately-sized data set into the same format as
> an example from the wiki and they fail to get a result - often because
> their problem has some very different properties (more classes, much
> larger feature space, etc.) and the implementation has some limitation
> that they trip over.

I would go out on a limb and say no single person knows exactly the
limitations of _all_ currently existing contributions (it's a
community after all, not a vendorized product), and in a few cases I
suspect no proper scale experiment was ever set up (I mean on clusters
of thousands of nodes; it's kind of hard to fund that on an ongoing
basis), so only an approximation is known. But a contribution is not
necessarily rejected just because of that. We'll have to work to
gather this information on the wiki. I think the "Mahout in Action"
book, among other things, represents an attempt to focus on what is
proven and stable and has known limits.

Part of the difficulty of approximating performance is that in a few
cases the run time is super-linear in the input size, and it is hard
to see exactly when Hadoop I/O or GC is going to start acting up.

BTW, if you have concrete experimental data showing limitations of the
methods mentioned on the wiki, please don't hesitate to share it; it
will be taken with great appreciation. There are people who are eager
to make improvements when such room for improvement becomes apparent
from benchmarks.

But conducting and submitting benchmarks is the key, IMO. I don't think
there's any other way to work the kinks out than to address them based
on problem reports.

> I'm not sure where this information is best placed - it would be great
> to see it on the Wiki along with the examples, at least.  It would be

I think the wiki is the place.

> awesome to see warnings at runtime ("Warning: You just trained a model
> that you cannot load without at least 20GB of RAM"), but I'm not sure
> how realistic that is.  I would like it to be easier to determine, at
> some very high level, why something didn't work when an experiment
> fails.  Ideally, without having to dive into the code at all.

People who work with MR are routinely accustomed to looking at job
counters to see an estimate of sizes (that's what I do). I see the
value in creating some custom counters in certain cases and reporting
them; that's reasonable, I guess. Similar to what Pig does. But at
this point I don't see a direct link from this kind of functionality
to annotations. I think that is what an "Improvement" JIRA request is
for, on a case-by-case basis.
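
A minimal sketch of what such a custom counter could look like in a new-API Hadoop mapper; the class and counter names are made up, while Context.getCounter(Enum) and Counter.increment(long) are the standard Hadoop mechanism:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InstrumentedMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      /** Hypothetical diagnostic counters; they show up in the job's counter output. */
      enum SizeCounters { INPUT_VECTORS, NON_ZERO_ELEMENTS }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        context.getCounter(SizeCounters.INPUT_VECTORS).increment(1);
        // stand-in: a real job would parse the vector, count its non-zeros, and do the real work
        context.getCounter(SizeCounters.NON_ZERO_ELEMENTS).increment(value.getLength());
      }
    }

Counters like these are reported back to the client and in the job tracker UI, which is roughly the Pig-style reporting mentioned above.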


Thank you.

-Dmitriy

Re: Maturity level annotations

Posted by Tom Pierce <tc...@cloudera.com>.
The users I'm talking about are often quite advanced in many ways -
familiar with R, SAS, etc., capable of coding up their own
implementations based on papers, etc.  They don't know Mahout, they
aren't eager to study a new API out of curiosity, but they would like
to find a suite of super-scalable (in terms of parallelized effort and
data size) ML tools.

They discover Mahout, which does specifically bill itself as scalable
(from http://mahout.apache.org, in some of the largest letters: "What
is Apache Mahout?  The Apache Mahout™ machine learning library's goal
is to build scalable machine learning libraries.").  They sniff check
it by massaging some moderately-sized data set into the same format as
an example from the wiki and they fail to get a result - often because
their problem has some very different properties (more classes, much
larger feature space, etc.) and the implementation has some limitation
that they trip over.

They will usually try one of the simplest methods available under the
assumption "well, if this doesn't scale well, the more complex methods
are surely no better".  This may not be entirely fair, but since the
docs they're encountering on the main website and wiki don't warn them
that certain implementations don't necessarily scale in different
ways, it's certainly not unreasonable.  They're at best going to
conclude the scalability will be hit-and-miss when a simple method
doesn't work.  Perhaps they'll check in again in 6-12 months.

In truth, most of these users would probably never use Mahout's NB
trainer "for real" - most would write their own that required no
interim data transformation from their existing feature space, since
that is often easier than productionalizing the conversion.  However,
they will use it as the tryout method - and they think they're really
giving the project the best possible chance to "shine" - because they
haven't even begun to consider quality/stability of models yet.

I see your analogy to R or sciPy - and I don't disagree.  But those
projects do not put scaling front and center; if Mahout is going to
keep scalability as a "headline feature" (which I would like to see!),
I think prominently acknowledging how different methods fail to scale
would really help its credibility.  For what it's worth, of the people
I know who've tried Mahout 100% of them were using R and/or sciPy
already, but were curious about Mahout specifically for better
scalability.

I'm not sure where this information is best placed - it would be great
to see it on the Wiki along with the examples, at least.  It would be
awesome to see warnings at runtime ("Warning: You just trained a model
that you cannot load without at least 20GB of RAM"), but I'm not sure
how realistic that is.  I would like it to be easier to determine, at
some very high level, why something didn't work when an experiment
fails.  Ideally, without having to dive into the code at all.

-tom

On Tue, Dec 27, 2011 at 5:14 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> On Tue, Dec 27, 2011 at 2:13 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> Yes, i think this one is in terms of documentation.
>
> I meant, this patch one is going in in terms of its effects for API
> and their docs.
>
>>
>> Wiki technically doesn't require annotation to be useful in describing
>> method use though.
>>
>> No plans for command line as of the moment as far as i know. What you
>> would suggest people should see there in addition to what they cannot
>> see on wiki?
>>
>>>
>>> When you're just trying out a package - especially one where a prime
>>> benefit you're hoping for is scalability - and you hit an unadvertised
>>> limit in scaling, there's a strong tendency to write off the entire
>>> project as "not quite ready". Especially when you don't have a lot of
>>> time dig into code to understand problems.
>>>
>>
>> I am not sure about this. Mahout is very much like R or sciPy, i.e. a
>> data representation framework that glues a collection of methods
>> ranging widely in their performance (and, in this case, yes, maturity,
>> that's why it is not a 1.0 project yet). I see what you are saying but
>> in the same time I also cannot figure why would anybody be tempted to
>> write off an R as a whole just because some of its numerous packages
>> provides an implementation that scales less or less accurate than
>> other implementations in R.
>>
>> Also as far as i understand advices against Naive Bayes are generally
>> not due to quality of its implementation in Mahout but are rather
>> based on characteristics of this method as opposed to SGD and the
>> stated problem. NB is easy to implement and that's why it's popular,
>> but not because it is a swiss army knife. Therefore, they generally
>> would be true Mahout or not.
>>
>> -D

Re: Maturity level annotations

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
On Tue, Dec 27, 2011 at 2:13 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> Yes, i think this one is in terms of documentation.

I meant, this patch is going in in terms of its effects on the API
and its docs.

>
> Wiki technically doesn't require annotation to be useful in describing
> method use though.
>
> No plans for command line as of the moment as far as i know. What you
> would suggest people should see there in addition to what they cannot
> see on wiki?
>
>>
>> When you're just trying out a package - especially one where a prime
>> benefit you're hoping for is scalability - and you hit an unadvertised
>> limit in scaling, there's a strong tendency to write off the entire
>> project as "not quite ready". Especially when you don't have a lot of
>> time dig into code to understand problems.
>>
>
> I am not sure about this. Mahout is very much like R or sciPy, i.e. a
> data representation framework that glues a collection of methods
> ranging widely in their performance (and, in this case, yes, maturity,
> that's why it is not a 1.0 project yet). I see what you are saying but
> in the same time I also cannot figure why would anybody be tempted to
> write off an R as a whole just because some of its numerous packages
> provides an implementation that scales less or less accurate than
> other implementations in R.
>
> Also as far as i understand advices against Naive Bayes are generally
> not due to quality of its implementation in Mahout but are rather
> based on characteristics of this method as opposed to SGD and the
> stated problem. NB is easy to implement and that's why it's popular,
> but not because it is a swiss army knife. Therefore, they generally
> would be true Mahout or not.
>
> -D

Re: Maturity level annotations

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes, I think this one is in terms of documentation.

The wiki technically doesn't require annotations to be useful for describing
method use, though.

No plans for the command line at the moment, as far as I know. What
would you suggest people should see there that they cannot see on the
wiki?

>
> When you're just trying out a package - especially one where a prime
> benefit you're hoping for is scalability - and you hit an unadvertised
> limit in scaling, there's a strong tendency to write off the entire
> project as "not quite ready". Especially when you don't have a lot of
> time dig into code to understand problems.
>

I am not sure about this. Mahout is very much like R or sciPy, i.e. a
data representation framework that glues together a collection of
methods ranging widely in their performance (and, in this case, yes,
maturity; that's why it is not a 1.0 project yet). I see what you are
saying, but at the same time I cannot figure out why anybody would be
tempted to write off R as a whole just because one of its numerous
packages provides an implementation that scales less well or is less
accurate than other implementations in R.

Also, as far as I understand, advice against Naive Bayes is generally
not due to the quality of its implementation in Mahout but rather
based on the characteristics of the method itself, as opposed to SGD,
for the stated problem. NB is easy to implement and that's why it's
popular, not because it is a swiss army knife. Therefore, such advice
would generally hold true whether Mahout is involved or not.

-D

Re: Maturity level annotations

Posted by Tom Pierce <tc...@cloudera.com>.
Is there a plan to bubble these annotations out further?  Say to the
wiki or as command-line feedback?

I think it would be really helpful (and promote uptake of Mahout) to
have metadata and prominent documentation that describes the general
scaling/stability properties of the different methods.  I know of a
few places where Mahout's been rejected after a quick sniff check
because, say, Naive Bayes couldn't be used beyond a few dozen classes.

I have seen this crop up on the list, too, and the response tends to
be something along the lines of "You probably don't really want to use
NB anyway, and it might be better to try an SGD-based classifier".
That's probably good advice, but a lot of the time people are
specifically running a simple, well-understood method to sniff-check
Mahout.

When you're just trying out a package - especially one where a prime
benefit you're hoping for is scalability - and you hit an unadvertised
limit in scaling, there's a strong tendency to write off the entire
project as "not quite ready", especially when you don't have a lot of
time to dig into the code to understand problems.

-tom

On Thu, Dec 22, 2011 at 5:48 PM, Ted Dunning <te...@gmail.com> wrote:
> Hmm... this looks promising:
>
> http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/annotation/Documented.html
>
> See the documentation section here:
> http://docs.oracle.com/javase/tutorial/java/javaOO/annotations.html
>
> On Thu, Dec 22, 2011 at 2:43 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> I think annotations are significantly better.  Integrating annotations
>> into javadoc isn't impossible, whereas integrating from javadoc markup
>> back to annotations is.
>>
>> Interestingly, the javadoc tool documentation tends to recommend an
>> annotation *and* a javadoc tag.  That does make the integration simple.
>>
>>
>> On Thu, Dec 22, 2011 at 12:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> A custom javadoc tag alone would be easy, but it would be visible
>>> to the javadoc tool only (and not even to IDEs, so typo-prone). Maybe
>>> that's what we need, but the trend in the rest of the hadoop world
>>> seems to be to use annotation-driven markers in this case.
>>>
>>
>>

Re: Maturity level annotations

Posted by Ted Dunning <te...@gmail.com>.
Hmm... this looks promising:

http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/annotation/Documented.html

See the documentation section here:
http://docs.oracle.com/javase/tutorial/java/javaOO/annotations.html
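
For what it's worth, a minimal sketch of how that could look (the package
name, annotation name, and levels are all hypothetical, not existing Mahout
code; @Documented is what makes the standard javadoc tool include the
annotation in the generated API docs):

package org.apache.mahout.common;   // hypothetical package

import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Marks how mature a public class or method is considered to be. */
@Documented
@Retention(RetentionPolicy.CLASS)   // kept in bytecode, not needed at runtime
@Target({ElementType.TYPE, ElementType.METHOD})
public @interface MaturityLevel {

  /** Coarse maturity buckets; the exact set is of course up for discussion. */
  enum Level { EXPERIMENTAL, BETA, STABLE }

  Level value();
}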

On Thu, Dec 22, 2011 at 2:43 PM, Ted Dunning <te...@gmail.com> wrote:

> I think annotations are significantly better.  Integrating annotations
> into javadoc isn't impossible, whereas integrating from javadoc markup
> back to annotations is.
>
> Interestingly, the javadoc tool documentation tends to recommend an
> annotation *and* a javadoc tag.  That does make the integration simple.
>
>
> On Thu, Dec 22, 2011 at 12:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> A custom javadoc tag alone would be easy, but it would be visible
>> to the javadoc tool only (and not even to IDEs, so typo-prone). Maybe
>> that's what we need, but the trend in the rest of the hadoop world
>> seems to be to use annotation-driven markers in this case.
>>
>
>

Re: Maturity level annotations

Posted by Ted Dunning <te...@gmail.com>.
I think annotations are significantly better.  Integrating annotations into
javadoc isn't impossible, whereas integrating from javadoc markup back to
annotations is.

Interestingly, the javadoc tool documentation tends to recommend an
annotation *and* a javadoc tag.  That does make the integration simple.
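
As a rough illustration, carrying both markers on a class might look like
the following (the @MaturityLevel annotation is the hypothetical sketch
from elsewhere in this thread, and the @mahout.experimental tag name is
made up as well):

// Assumes the hypothetical MaturityLevel annotation is visible here
// (same package or imported); nothing below is existing Mahout code.
// A typo in the annotation is a compile error; a typo in the tag is not.

/**
 * Trains a classifier whose API may still change.
 *
 * @mahout.experimental subject to change before 1.0
 */
@MaturityLevel(MaturityLevel.Level.EXPERIMENTAL)
public class ExperimentalTrainer {
  // ...
}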

On Thu, Dec 22, 2011 at 12:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> A custom javadoc tag alone would be easy, but it would be visible
> to the javadoc tool only (and not even to IDEs, so typo-prone). Maybe
> that's what we need, but the trend in the rest of the hadoop world
> seems to be to use annotation-driven markers in this case.
>

Re: Maturity level annotations

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Well, it looks like the lucene people were talking about custom javadoc
tags, not annotations.

I did a brief scan, and it looks like it would require a specific
doclet to be developed to handle annotations. The documentation is not
terribly clear about which of the standard doclets to subclass.

A custom javadoc tag alone would be easy, but it would be visible
to the javadoc tool only (and not even to IDEs, so typo-prone). Maybe
that's what we need, but the trend in the rest of the hadoop world
seems to be to use annotation-driven markers in this case.
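
To make the typo risk concrete, here is a rough sketch of what a plain
custom tag would look like and how it gets registered with the javadoc
tool (the tag name and class are made up, not an existing Mahout
convention):

// The tag is just text inside a doc comment: the compiler and most IDEs
// ignore it, and the javadoc tool only renders it when it is registered,
// e.g. via the standard -tag option:
//
//   javadoc -tag mahout.experimental:a:"Maturity (experimental):" ...
//
// A misspelled tag is typically just a javadoc warning, never a compile
// error, which is the typo-proneness mentioned above.

/**
 * Example class carrying only a javadoc-level maturity marker.
 *
 * @mahout.experimental may change incompatibly between releases
 */
public class TaggedExample {
}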

On Thu, Dec 22, 2011 at 12:35 PM, Sebastian Schelter <ss...@apache.org> wrote:
> Yes. Could be due to my lacking maven skills :)
>
>
> On 22.12.2011 21:33, Dmitriy Lyubimov wrote:
>> you mean you couldn't make them come up in javadocs?
>>
>> On Thu, Dec 22, 2011 at 12:25 PM, Sebastian Schelter <ss...@apache.org> wrote:
>>> There is still a ticket open for those ->
>>> https://issues.apache.org/jira/browse/MAHOUT-831. I tried to integrate
>>> the javadoc "annotations" like proposed by the lucene guys, but for some
>>> reason I didn't get them working. Would be great if someone could help here.
>>>
>>> --sebastian
>>>
>>> On 22.12.2011 21:03, Dmitriy Lyubimov wrote:
>>>> Hi,
>>>>
>>>> what happened to these annotations to mark maturity level? Did we ever
>>>> commit those?
>>>>
>>>> thank you.
>>>
>

Re: Maturity level annotations

Posted by Sebastian Schelter <ss...@apache.org>.
Yes. Could be due to my lacking maven skills :)


On 22.12.2011 21:33, Dmitriy Lyubimov wrote:
> you mean you couldn't make them come up in javadocs?
> 
> On Thu, Dec 22, 2011 at 12:25 PM, Sebastian Schelter <ss...@apache.org> wrote:
>> There is still a ticket open for those ->
>> https://issues.apache.org/jira/browse/MAHOUT-831. I tried to integrate
>> the javadoc "annotations" like proposed by the lucene guys, but for some
>> reason I didn't get them working. Would be great if someone could help here.
>>
>> --sebastian
>>
>> On 22.12.2011 21:03, Dmitriy Lyubimov wrote:
>>> Hi,
>>>
>>> what happened to these annotations to mark maturity level? Did we ever
>>> commit those?
>>>
>>> thank you.
>>


Re: Maturity level annotations

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
you mean you couldn't make them come up in javadocs?

On Thu, Dec 22, 2011 at 12:25 PM, Sebastian Schelter <ss...@apache.org> wrote:
> There is still a ticket open for those ->
> https://issues.apache.org/jira/browse/MAHOUT-831. I tried to integrate
> the javadoc "annotations" like proposed by the lucene guys, but for some
> reason I didn't get them working. Would be great if someone could help here.
>
> --sebastian
>
> On 22.12.2011 21:03, Dmitriy Lyubimov wrote:
>> Hi,
>>
>> what happened to these annotations to mark maturity level? Did we ever
>> commit those?
>>
>> thank you.
>

Re: Maturity level annotations

Posted by Sebastian Schelter <ss...@apache.org>.
There is still a ticket open for those ->
https://issues.apache.org/jira/browse/MAHOUT-831. I tried to integrate
the javadoc "annotations" like proposed by the lucene guys, but for some
reason I didn't get them working. Would be great if someone could help here.

--sebastian

On 22.12.2011 21:03, Dmitriy Lyubimov wrote:
> Hi,
> 
> what happened to these annotations to mark maturity level? Did we ever
> commit those?
> 
> thank you.