Posted to user@mahout.apache.org by HorstItUpright <ho...@gmail.com> on 2011/11/22 17:03:40 UTC

Which input formats to use for classifying WEKA's ARFF format?

Hello,

I am currently working on classification algorithms with Mahout. The
first part is to evaluate several different approaches already
available.

As far as I know, Mahout provides two Bayes algorithms and a Random
Forest (which is, for whatever reason, called Decision Forest [not
wrong, I know, but confusing and inconsistent with the docs, I think]).

It appears to me (and I've also taken a look at the code) that none
of these approaches can handle the MVC format (which is the result of
parsing the WEKA ARFF files with the arff-vector converter). The
DF is even more special and requires the UCI format.

My question now is: am I overlooking something? Is there a way to
convert the MVC files on the fly into the proper formats for the
algorithms?
I had expected that algorithms that have been part of Mahout for quite
a few revisions would accept more or less any Mahout input data, or at
least output some useful error messages.
The Bayes algorithms, for example, run with the input data, but print
a lot of strange output to the console during processing and do not
give any usable results.

Am I right that I need to convert my ARFF or MVC files to the
UCI format or the "Bayes format" (the one used in the 20news example)?

PS: I am using the latest checkout as well as the "official" 0.5 release.

Best regards,
Martin

Re: Which input formats to use for classifying WEKA's ARFF format?

Posted by Isabel Drost <is...@apache.org>.
On 22.11.2011 HorstItUpright wrote:
> As far as I know, Mahout provides two Bayes algorithms and a Random
> Forest (which is, for whatever reason, called Decision Forest [not
> wrong, I know, but confusing and inconsistent with the docs, I think]).

+ logistic regression (to be found in the sgd package)


> It appears to me (and I've also taken a look at the code) that none
> of these approaches can handle the MVC format (which is the result of
> parsing the WEKA ARFF files with the arff-vector converter).

I am not too familiar with the MVC format - is that an intermediate file format
used by WEKA after parsing ARFF?

> The DF is even more special and requires the UCI format.

DF?


> My question now is: am I overlooking something? Is there a way to
> convert the MVC files on the fly into the proper formats for the
> algorithms?

All algorithms in Mahout are implemented to accept vectors as their input
format. So in order to plug in whatever input format (or database, NoSQL
store, or whichever other data source you might have), all you have to do is
provide glue code that converts your data into Mahout vectors.
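
As a rough illustration of what such glue code has to do, here is a minimal
sketch that reads the @data section of an ARFF file into plain lists of
floats. It is written in Python only to keep it short; it is not Mahout code,
and the function name parse_arff_numeric is made up for this example. It
assumes numeric attributes only, no missing values ("?"), no sparse rows and
no quoted attribute names. Real glue code for Mahout would be written in Java
and would wrap each row in a Mahout vector (for example a DenseVector built
from a double array).

```python
def parse_arff_numeric(text):
    """Return (attribute_names, rows) for a simple, all-numeric ARFF document.

    Hypothetical helper for illustration only. Assumes dense rows,
    numeric attributes, no missing values and no quoted names.
    """
    names = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and ARFF comments (lines starting with '%').
        if not line or line.startswith("%"):
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # "@attribute <name> <type>": keep only the name.
                names.append(line.split()[1])
            elif lower.startswith("@data"):
                in_data = True
            # Other header lines (such as @relation) are ignored.
            continue
        # Every line after @data is one comma-separated instance.
        rows.append([float(value) for value in line.split(",")])
    return names, rows
```

Each inner list corresponds to one instance and has one entry per
@attribute line, which is the shape a dense vector needs.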

Having said that, there is limited support for ARFF in Mahout already. To my
knowledge it is not feature complete - any help with spotting missing features
and fixing them is highly welcome.


> The Bayes algorithms, for example, run with the input data, but print
> a lot of strange output to the console during processing and do not
> give any usable results.

Any help with improving logging to make the project easier to use is very 
welcome. Would be great if you could put up a JIRA issue and attach a patch to 
change the code to better match your expectations to get that discussion 
started.


Cheers,
Isabel