Posted to user@mahout.apache.org by Julien Nioche <li...@gmail.com> on 2008/08/28 10:11:12 UTC

Input format [Re: Taste Vs Weka]

Hi,

Talking about IO formats, I've been looking at the source code to see if
there was a way to (de)serialize a matrix to a file system but could not
find anything. I was thinking about implementing a method to load a matrix
from a sparse format such as the one described at
http://math.nist.gov/MatrixMarket/formats.html. Would that be of interest?
Is there already something similar which I haven't spotted?
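
To make the idea concrete, here is a minimal sketch of what such a loader
could look like in Java. It only handles the "coordinate real general"
variant of the Matrix Market format (a header line, optional % comments, a
"rows cols nonzeros" size line, then one-based "row col value" entries).
MatrixMarketSketch and SparseMatrix are hypothetical names made up for this
example; they are not existing Mahout classes, and a real implementation
would presumably fill one of Mahout's own matrix types instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MatrixMarketSketch {

    /** Hypothetical holder (not a Mahout class): dimensions plus the stored cells. */
    static final class SparseMatrix {
        final int rows;
        final int cols;
        private final Map<Long, Double> cells = new HashMap<>();

        SparseMatrix(int rows, int cols) {
            this.rows = rows;
            this.cols = cols;
        }

        void set(int row, int col, double value) {
            // Flatten the zero-based (row, col) pair into a single map key.
            cells.put((long) row * cols + col, value);
        }

        /** Zero-based lookup; cells that were never stored read as 0.0. */
        double get(int row, int col) {
            Double v = cells.get((long) row * cols + col);
            return v == null ? 0.0 : v;
        }

        int nonZeroCount() {
            return cells.size();
        }
    }

    /** Reads a "coordinate real general" Matrix Market stream; other variants are rejected. */
    static SparseMatrix read(BufferedReader in) throws IOException {
        String header = in.readLine();
        if (header == null || !header.trim().toLowerCase(Locale.ROOT)
                .matches("%%matrixmarket\\s+matrix\\s+coordinate\\s+real\\s+general")) {
            throw new IOException("Unsupported or missing header: " + header);
        }
        SparseMatrix matrix = null;
        int declared = 0;
        int seen = 0;
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and comments
            }
            String[] t = line.split("\\s+");
            if (t.length != 3) {
                throw new IOException("Expected three fields, got: " + line);
            }
            if (matrix == null) {
                // First data line is the size line: rows, columns, entry count.
                matrix = new SparseMatrix(Integer.parseInt(t[0]), Integer.parseInt(t[1]));
                declared = Integer.parseInt(t[2]);
            } else {
                // Entries use one-based indices in the file; store them zero-based.
                int row = Integer.parseInt(t[0]) - 1;
                int col = Integer.parseInt(t[1]) - 1;
                if (row < 0 || row >= matrix.rows || col < 0 || col >= matrix.cols) {
                    throw new IOException("Index out of range: " + line);
                }
                matrix.set(row, col, Double.parseDouble(t[2]));
                seen++;
            }
        }
        if (matrix == null) {
            throw new IOException("Missing size line");
        }
        if (seen != declared) {
            throw new IOException("Expected " + declared + " entries, found " + seen);
        }
        return matrix;
    }

    public static void main(String[] args) throws IOException {
        String sample = "%%MatrixMarket matrix coordinate real general\n"
                + "% a 3x4 matrix with two stored entries\n"
                + "3 4 2\n"
                + "1 1 2.5\n"
                + "3 4 -1.0\n";
        SparseMatrix m = read(new BufferedReader(new StringReader(sample)));
        System.out.println(m.rows + "x" + m.cols + ", " + m.nonZeroCount() + " stored entries");
    }
}
```

Writing a matrix back out would be the mirror image: emit the header, the
size line, then one line per stored cell with the indices shifted back to
one-based.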

About Mahout / Weka: I completely agree with what Grant said (especially
about keeping it lean). The same could apply to other data mining frameworks
such as RapidMiner (ex-Yale). One could easily integrate Mahout into these
tools as a plugin to benefit from their GUIs and other functionality if
needed.

Julien

2008/8/27 Grant Ingersoll <gs...@apache.org>

>
> On Aug 27, 2008, at 8:33 AM, Richard Tomsett wrote:
>
>> There's quite a good description of WEKA and its capabilities on the
>> course page for a module I took this year:
>> http://www.inf.ed.ac.uk/teaching/courses/dme/html/software2.html
>>
>> It's more a general suite of data-mining tools rather than a tool to
>> address a specific task like Taste (plus it's obviously not implemented for
>> parallel processing which could be problematic for scaling up). From the
>> link above:
>>
>>  * *Advantages*: The obvious advantage of a package like Weka is that
>>    *a whole range of data preparation, feature selection and data
>>    mining algorithms are integrated*. This means that only one data
>>    format is needed, and trying out and comparing different
>>    approaches becomes really easy. The package also comes with *a
>>    GUI*, which should make it easier to use.
>>
>
> Yeah, it would be good for Mahout to adopt an approach for either
> translating from ARFF to our format, or just use ARFF or whatever else Weka
> does, but I don't want it to preclude us from innovating where we need to
> innovate.
>
>
>
>>
>>  * *Disadvantages*: Probably the most important disadvantage of data
>>    mining suites like this is that *they do not implement the newest
>>    techniques*. For example the MLP implemented has a very basic
>>    training algorithm (backprop with momentum), and the SVM only uses
>>    polynomial kernels, and does not support numeric estimation. ...
>>    *A third possible problem is scaling*. For difficult tasks on
>>    large datasets, the running time can become quite long, and java
>>    sometimes gives an OutOfMemory error. This problem can be reduced
>>    by using the '-mx/x/' option when calling java, where /x/ is
>>    memory size (eg '50m'). For large datasets it will always be
>>    necessary to reduce the size to be able to work within reasonable
>>    time limits. A fourth problem is that *the GUI does not implement
>>    all the possible options*. Things that could be very useful, like
>>    scoring of a test set, are not provided in the GUI, but can be
>>    called from the command line interface. So sometimes it will be
>>    necessary to switch between GUI and command line. Finally, *the
>>    data preparation and visualisation techniques offered might not be
>>    enough*. Most of them are very useful, but I think in most data
>>    mining tasks you will need more to get to know the data well and
>>    to get it in the right format.
>>
>>
> From a Mahout view, we are very much aiming at addressing the scaling
> issue.  As for the GUI, I think that will always be a "contrib" for Mahout,
> if one ever exists.  My personal goal for Mahout is to keep it lean and
> easily usable in a wide variety of applications.  Just as Lucene has made
> search a commodity in many ways, I think Mahout could enable ML to be a
> commodity in 5 years.
>
> Also, a glaring difference between the two is Weka is GPL.  I'll leave it
> to you to read all the discussions on ASL vs. GPL and do not want to start
> that discussion here, as there is no point.
>
> Last, I imagine we will all coexist nicely.  Weka will be useful for many
> tasks, and Mahout will be useful for many tasks and there will certainly be
> overlap.
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com