Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/03/05 18:59:27 UTC

Re: Next release

Seems like the top list needs a response too.

Agree about similarity but a completely different method is needed for cosine and the other actual distance measures. The way the old Hadoop code did it is more appropriate. I’ll put it on my list.
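
To make the distinction concrete, here is a rough Python sketch. It is not Mahout code and the function names are made up for illustration. The assumption behind it is that LLR only needs the four cooccurrence counts for a pair of items, so it falls out of a cooccurrence computation, while cosine needs the actual vector values and their norms.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # Log-likelihood ratio for a 2x2 contingency table:
    #   k11 = both items seen together, k12 = only item A,
    #   k21 = only item B,             k22 = neither
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point round-off
    return max(0.0, 2.0 * (row + col - mat))

def cosine(a, b):
    # Cosine similarity needs the full vectors, not just counts
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With these definitions, llr(1, 1, 1, 1) comes out as 0 (no association) and cosine([1, 0], [0, 1]) is 0 (orthogonal vectors). Mahout's own implementations may differ in detail.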


> On Mar 5, 2015, at 9:46 AM, Andrew Musselman <an...@gmail.com> wrote:
> 
> Agree with Suneel's comments.
> 
> So you're proposing these four things for 0.10, right?  I'm good with these.
> 
> 1) mrlegacy & scala dependency reduction and possible split
> 2) sync with most widely used Spark version (implies frequent releases to stay synced with big distros I suspect)
> 3) the release build is completely broken. No artifacts are created for scala, spark, or h2o. No hosted scaladocs are created afaik.
> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything like what Mahout is today.
> 
> 
> On Thu, Mar 5, 2015 at 9:31 AM, Suneel Marthi <suneel_marthi@yahoo.com> wrote:
> 
> Agree with most of the points outlined below; next steps would be to work towards 0.10.
> 
>> From: Pat Ferrel <pat@occamsmachete.com>
>> To: Suneel Marthi <suneel_marthi@yahoo.com>; ap.dev <ap.dev@outlook.com>; Andrew Musselman <andrew.musselman@gmail.com>
>> Sent: Thursday, March 5, 2015 12:11 PM
>> Subject: Next release
>> 
>> I’d send this to @dev if it won’t turn into a public argument. Maybe leave out the wishlist?
>> 
>> Hopefully people will chime in with opinions or status but here’s what it looks like to me:
>> 
>> 1) The DSL needs the mrlegacy pruning that is ready but held up by external issues. This would be required if we do a project split. Also the external deps have been reduced to nearly the minimum and are written to a smallish jar in the spark module. It is possible to do more fine grained class-level shading but not sure it’s needed.
>> 2) significant DSL additions are held up by external issues but there is already SSVD, PCA, QR and pretty mature linear algebra ops.
>> 3) similarity, item (column) and row, seems to be fine with LLR only, and is therefore mainly for recommender use cases.
> It would be nice to generalize this to be able to use any similarity measure before the next release.
> 
>> 4) Naive Bayes: only a partial pipeline for text classification is implemented in Scala, but NB itself is working; TF-IDF is in progress
>> 5) There is some distributed aggregation work that is waiting in a PR and seems to be stalled. I’d vote to see this included.
>> 
> +1
> 
>> What is a minimum release?
>> 
>> Sort of an odd question without a clear idea of what Mahout is. I see its future as a scalable R-like environment integrated with Scala and distributed computation engines like Spark. Put another way it is a distributed optimized linear algebra environment and library with some important higher level algorithms. It is general where things like MLlib do not attempt to be.
>> 
>> When would you use Mahout vs MLlib or H2O? If you need deep learning, look at H2O; if you need KMeans, look at MLlib; if you require or want to mix in a general linear algebra engine, look at Mahout’s DSL, since it plays well with MLlib and to some degree H2O.
>> 
>> What is a minimum release given the above definition?
>> 
>> Seems like polishing up the 5 things mentioned above along with:
>> 1) mrlegacy & scala dependency reduction and possible split
>> 2) sync with most widely used Spark version (implies frequent releases to stay synced with big distros I suspect)
>> 3) the release build is completely broken. No artifacts are created for scala, spark, or h2o. No hosted scaladocs are created afaik.
>> 4) commitment to revamping the Mahout docs. They look more like 0.9+ than anything like what Mahout is today.
>> 
>> Not sure we should go down this rat hole right now so feel free to ignore this but my intermediate term and post release wishlist is:
>> 
>> 1) more stats and polish to the shell (savable workspaces, etc)
>> 2) some helpers/conversions to make accessing MLlib easier. For instance a few lines of code would make KMeans usable with DRMs 
>> 3) a lightweight package formalization for adding new contributor based high level algorithms—maybe along the lines of Examples which pull in code from github and include their own build mechanism.
> +1
>> 4) finish the text pipeline
> +1, would explore the new text processing features available in Lucene 5. Please don't go by how MLlib does this
>> 5) integrate Spark dataframes with DRMs and IndexedDatasets
> +1
>> 6) retire sequence files in favor of PMML, JSON (SchemaRDD/Dataframes), CSV, whatever. They are only needed as input and output now, not for intermediate results, so why keep sequence files when supporting IO to other tools like Hive, Spark SQL, Solr/ES and others is more important?
>> 
> +100, sequencefiles have been Mahout's nemesis all along