Posted to user@mahout.apache.org by Martin Häger <ma...@byburt.com> on 2010/02/08 13:54:37 UTC

Classifying general Attribute-Relation data using Mahout

Hi,

We're experimenting a bit with Weka and Mahout. Our input data is a
relation in ARFF format (see attached data.training.arff), and we'd
like to classify it using Mahout. However, it seems (to us, at first)
that the Mahout classifier.bayes.interfaces.Algorithm interface is
centered around documents of text, and not general attribute data.
Thus, running the classifier causes our ARFF data to be interpreted as
a document of words, with not very useful results (see attached
mahout.log).

With Weka, we're able to get the results we want (see attached weka.log).

Any suggestions for how to get this working?

Thanks!

Re: Classifying general Attribute-Relation data using Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On Feb 9, 2010, at 12:43 PM, Robin Anil wrote:

> Oops. The ARFF driver writes only vectors, not the tab-separated format that
> the Bayes classifier reads. I will try to add that as a flag.
> 
> @Grant: For batch classification, yes, we can go with vectors. But I don't see
> how we can classify documents on the fly if the dictionary can't fit in
> memory. Maybe randomizers can help. We will have to wait for that.

I know Lucene is slower, but I still think that is the way to go.  We can discuss that over on dev.

Re: Classifying general Attribute-Relation data using Mahout

Posted by Robin Anil <ro...@gmail.com>.
Oops. The ARFF driver writes only vectors, not the tab-separated format that
the Bayes classifier reads. I will try to add that as a flag.
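
Until such a flag exists, the conversion can be done outside Mahout. The following is a minimal sketch, not Mahout's ARFF driver: it assumes the tab-separated format is one instance per line, with the class label, a tab, and then space-separated tokens (the thread only says "tab separated"). Encoding each attribute as a `name=value` token is an illustrative choice, and the attribute names in the sample are hypothetical.

```python
def parse_arff(text):
    """Parse a minimal ARFF document into (attribute_names, rows).

    Handles only unquoted names and comma-separated data rows;
    quoted values and sparse rows are not supported in this sketch.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


def to_label_tab_tokens(names, rows, class_attr):
    """Emit 'label<TAB>name=value name=value ...' for every row."""
    ci = names.index(class_attr)
    lines = []
    for row in rows:
        tokens = [f"{n}={v}" for i, (n, v) in enumerate(zip(names, row)) if i != ci]
        lines.append(row[ci] + "\t" + " ".join(tokens))
    return lines


# Hypothetical sample data, not the attachment from this thread.
SAMPLE = """@relation demo
@attribute browser {firefox,safari}
@attribute clicks numeric
@attribute class {valid,invalid}
@data
firefox,3,valid
safari,0,invalid
"""

if __name__ == "__main__":
    attr_names, data_rows = parse_arff(SAMPLE)
    for out in to_label_tab_tokens(attr_names, data_rows, "class"):
        print(out)
```

Treating each attribute/value pair as a single token keeps a word-oriented classifier from splitting the raw ARFF row into unrelated words.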

@Grant: For batch classification, yes, we can go with vectors. But I don't see
how we can classify documents on the fly if the dictionary can't fit in
memory. Maybe randomizers can help. We will have to wait for that.
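
If "randomizers" here means hashed feature encoding (an assumption; the thread does not define the term), the idea can be sketched as follows. Each token is hashed straight to a vector index, so no token-to-index dictionary has to be held in memory. The function name and bucket count are hypothetical, and this is not Mahout code.

```python
import zlib


def hashed_vector(tokens, num_buckets=16):
    """Map tokens to a fixed-size count vector without a dictionary.

    The index of each token comes from a stable hash (CRC32), so the
    same token always lands in the same bucket. Distinct tokens may
    collide; that is the price of not storing a dictionary.
    """
    vec = [0.0] * num_buckets
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1.0
    return vec


if __name__ == "__main__":
    print(hashed_vector(["browser=firefox", "clicks=3"]))
```

A larger `num_buckets` lowers the collision rate at the cost of a longer vector.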

@Ted: Waiting to pounce upon the randomizers :)


Robin

On Tue, Feb 9, 2010 at 9:08 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> I think we still need to get our Bayes stuff to run off of Vectors instead
> of text, then it should be easy to go from ARFF to Vector format and then
> run all of the Mahout tools.
>
> -Grant

Re: Classifying general Attribute-Relation data using Mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On Feb 8, 2010, at 7:54 AM, Martin Häger wrote:

> Hi,
> 
> We're experimenting a bit with Weka and Mahout. Our input data is a
> relation in ARFF format (see attached data.training.arff), and we'd
> like to classify it using Mahout. However, it seems (to us, at first)
> that the Mahout classifier.bayes.interfaces.Algorithm interface is
> centered around documents of text, and not general attribute data.
> Thus, running the classifier causes our ARFF data to be interpreted as
> a document of words, with not very useful results (see attached
> mahout.log).

I think we still need to get our Bayes stuff to run off of Vectors instead of text; then it should be easy to go from ARFF to Vector format and run all of the Mahout tools.

-Grant
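
The ARFF-to-vector step mentioned above can be illustrated with a small standalone sketch. It is not Mahout's ARFF driver; it assumes numeric attributes are kept as floats and nominal attributes are encoded as the index of the value in the declared value list. All names and the sample data are hypothetical.

```python
import re

# Matches lines such as "@attribute clicks numeric" or "@attribute class {a,b}".
ATTR = re.compile(r"@attribute\s+(\S+)\s+(.+)", re.IGNORECASE)


def encode(nominal, value):
    """Nominal values become their index in the declared list; others become floats."""
    return float(nominal.index(value)) if nominal is not None else float(value)


def arff_to_vectors(text):
    """Return (attribute_names, vectors) for a minimal dense ARFF document."""
    attrs = []      # list of (name, nominal_values or None)
    vectors = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            vectors.append([encode(spec, v) for (_, spec), v in zip(attrs, values)])
        elif line.lower().startswith("@data"):
            in_data = True
        else:
            m = ATTR.match(line)
            if m:
                name, kind = m.group(1), m.group(2).strip()
                if kind.startswith("{"):
                    nominal = [v.strip() for v in kind.strip("{}").split(",")]
                    attrs.append((name, nominal))
                else:
                    attrs.append((name, None))
    return [n for n, _ in attrs], vectors


# Hypothetical sample data, not the attachment from this thread.
ARFF_SAMPLE = """@relation demo
@attribute browser {firefox,safari}
@attribute clicks numeric
@attribute class {valid,invalid}
@data
firefox,3,valid
safari,0,invalid
"""

if __name__ == "__main__":
    print(arff_to_vectors(ARFF_SAMPLE))
```

Index encoding imposes an artificial order on nominal values; a one-hot encoding would avoid that at the cost of wider vectors.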

Re: Classifying general Attribute-Relation data using Mahout

Posted by Martin Häger <ma...@byburt.com>.
I went ahead and attached everything I sent to Robin to MAHOUT-286.

2010/2/9 Robin Anil <ro...@gmail.com>:
> I have the data. I will upload shortly

Re: Classifying general Attribute-Relation data using Mahout

Posted by Robin Anil <ro...@gmail.com>.
I have the data. I will upload it shortly.


On Wed, Feb 10, 2010 at 12:10 AM, Ted Dunning <te...@gmail.com> wrote:

> Martin,
>
> I saw only one attachment here.  The other may have been stripped by the
> mailing list which prefers not to have attachments.
>
> I have filed an issue for this at
> https://issues.apache.org/jira/browse/MAHOUT-286
>
> Can you attach your data files there so that we can work on getting a
> better
> resolution for you?

Re: Classifying general Attribute-Relation data using Mahout

Posted by Ted Dunning <te...@gmail.com>.
Martin,

I saw only one attachment here.  The other may have been stripped by the
mailing list, which prefers not to have attachments.

I have filed an issue for this at
https://issues.apache.org/jira/browse/MAHOUT-286

Can you attach your data files there so that we can work on getting a better
resolution for you?

On Mon, Feb 8, 2010 at 5:35 AM, Martin Häger <ma...@byburt.com>wrote:

> Hi Robin,
>
> The attached data.arff contains the test data, data.training.arff
> contains the training data. We're running the svn trunk (r906954) of
> Mahout. The attached script run.sh shows how we run it.
> Should it be possible to run Mahout's NaiveBayes classifier on this
> data in this way or is it limited to text documents only?
>
> Side note: We're expecting Weka to report 100% incorrect
> classification since all test data belongs to the class "unknown",
> whereas the training data is either "valid" or "invalid" (in fact, the
> test data is the entire "invalid" set, so Weka manages to classify
> everything correctly). We're not yet sure what class to put on the
> test data, as we of course can't know anything about it (hence the
> "unknown").



-- 
Ted Dunning, CTO
DeepDyve

Re: Classifying general Attribute-Relation data using Mahout

Posted by Martin Häger <ma...@byburt.com>.
Hi Robin,

The attached data.arff contains the test data, and data.training.arff
contains the training data. We're running the svn trunk (r906954) of
Mahout. The attached script run.sh shows how we run it.
Should it be possible to run Mahout's NaiveBayes classifier on this
data in this way, or is it limited to text documents only?

Side note: We expect Weka to report 100% incorrect classification,
since every test instance carries the class "unknown" whereas the
training data is either "valid" or "invalid". (In fact, the test data
is the entire "invalid" set, so Weka actually manages to classify
everything correctly.) We're not yet sure what class to put on the
test data, as we of course can't know anything about it (hence the
"unknown").
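
The effect described in the side note can be reproduced with a few lines. This is only an illustration with hypothetical predictions, not output from Weka: scoring against the placeholder label "unknown" gives 0% accuracy even when every prediction matches the true class.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the reference label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical predictions for four test instances.
predicted = ["invalid"] * 4
placeholder_labels = ["unknown"] * 4   # labels as written in the test file
true_labels = ["invalid"] * 4          # what the instances really are

if __name__ == "__main__":
    print(accuracy(predicted, placeholder_labels))
    print(accuracy(predicted, true_labels))
```

An evaluation summary is therefore only meaningful when the test file carries the real class labels; with a placeholder class, the individual predictions are the useful output.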

2010/2/8 Robin Anil <ro...@gmail.com>:
> Can you send the train and test data to me. Are you using 0.2 release or the
> trunk?
>
> Seems model wasnt built as there was an error Exception in thread "main"
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/tmp/hadoop/model/trainer-termDocCount
> Input path does not exist: file:/tmp/hadoop/model/trainer-wordFreq
> Input path does not exist: file:/tmp/hadoop/model/trainer-featureCount
>
> So there is no point running the classifier
>
> Weka also seems not to be doing good either.

Re: Classifying general Attribute-Relation data using Mahout

Posted by Robin Anil <ro...@gmail.com>.
Can you send the train and test data to me? Are you using the 0.2 release or
the trunk?

It seems the model wasn't built, as there was an error:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/tmp/hadoop/model/trainer-termDocCount
Input path does not exist: file:/tmp/hadoop/model/trainer-wordFreq
Input path does not exist: file:/tmp/hadoop/model/trainer-featureCount

So there is no point in running the classifier.

Weka doesn't seem to be doing well either.
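
A check like the following could catch the missing-model situation before the classifier is launched. It is a minimal sketch that assumes the model lives on the local filesystem (the `file:` paths in the error suggest this) and that the three directory names from the error message are the trainer outputs to look for; the function name is hypothetical.

```python
import os

# Directory names taken from the error message in this thread.
EXPECTED_OUTPUTS = ("trainer-termDocCount", "trainer-wordFreq", "trainer-featureCount")


def missing_model_outputs(model_dir, expected=EXPECTED_OUTPUTS):
    """Return the expected trainer outputs that do not exist under model_dir."""
    return [
        name for name in expected
        if not os.path.exists(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    missing = missing_model_outputs("/tmp/hadoop/model")
    if missing:
        print("Training did not finish; missing:", ", ".join(missing))
    else:
        print("Model outputs found.")
```

For a model stored on HDFS, the same check would have to go through the Hadoop filesystem tools rather than `os.path`.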



On Mon, Feb 8, 2010 at 6:24 PM, Martin Häger <ma...@byburt.com>wrote:

> Hi,
>
> We're experimenting a bit with Weka and Mahout. Our input data is a
> relation in ARFF format (see attached data.training.arff), and we'd
> like to classify it using Mahout. However, it seems (to us, at first)
> that the Mahout classifier.bayes.interfaces.Algorithm interface is
> centered around documents of text, and not general attribute data.
> Thus, running the classifier causes our ARFF data to be interpreted as
> a document of words, with not very useful results (see attached
> mahout.log).
>
> With Weka, we're able to get the results we want (see attached weka.log).
>
> Any suggestions for how to get this working?
>
> Thanks!
>