Posted to user@mahout.apache.org by Verachten Bruno <Br...@atos.net> on 2012/04/12 18:25:15 UTC

Classification: using the Java API always returns the same category

Hi,

I use Mahout 0.5 with Hadoop 1.0.1.
I have a model for four categories that I got with:
mahout trainclassifier -i train -o model  -type cbayes -ng 10 -source hdfs
When I test it with some test data, I get a consistent result:
mahout testclassifier -d test -m  model -type cbayes -ng 10 -source hdfs
[...]
Correctly Classified Instances          :         75       90.3614%
Incorrectly Classified Instances        :          8        9.6386%

But when I classify some test data through the Java API, I always get the same category.
I'm sure it's all my fault, but I just can't see what I got wrong.
Here is the code:
BayesParameters params = new BayesParameters();
params.setGramSize(10);
params.set("verbose", "true");
params.set("classifierType", "cbayes");
params.set("defaultCat", "OTHER");
params.set("encoding", "UTF-8");
params.set("alpha_i", "1.0");
params.set("basePath", (new File("d:\\model")).getAbsolutePath());
Datastore datastore = new InMemoryBayesDatastore(params);
[...]
try {
  algorithm.initialize(datastore);
  ClassifierContext classifier = new ClassifierContext(algorithm, datastore);
  ClassifierResult result = classifier.classifyDocument(
      new String[] { "MyStringToCategorize" }, "defaultLabel");
  System.out.println(result.getLabel());
}[...]

Can someone help?

Thanks.

Bruno Verachten


This e-mail and the documents attached are confidential and intended solely for the addressee; it may also be privileged. If you receive this e-mail in error, please notify the sender immediately and destroy it. As its integrity cannot be secured on the Internet, the Atos liability cannot be triggered for the message content. Although the sender endeavours to maintain a computer virus-free network, the sender does not warrant that this transmission is virus-free and will not be liable for any damages resulting from any virus transmitted.

RE: Classification: using the Java API always returns the same category

Posted by Verachten Bruno <Br...@atos.net>.
Hi,

> If tweaking the algorithm and parameters does not help, I suggest taking a long hard look at your data.
> a. How many examples do you have of category3? Are they the vast majority?
It's the smaller set.

> b. Does category3 data overwhelm the other data? Recently, I tried to classify texts into 20 categories.
> The text documents from several categories were significantly longer
> (x100) than those of the other categories,
> so they dominated the classifier.
I see. I don't think that's what's happening with my data. The texts are always in the same size range (100-300 characters).

Well, I summarized the test results with a bigger test set, and got 84% success, which seems quite good to me:
Found 12800 good guesses
Found 2430 bad guesses
Found 84 % good guesses

So... the "always returns the same category" was just bad luck when choosing my sample.
Sorry for the fuss.

Kind regards,
Bruno Verachten
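
The percentages quoted in this thread can be re-derived from the raw counts. The snippet below is only a minimal sketch: `GuessSummary` and `accuracyPercent` are hypothetical names chosen for illustration and are not part of Mahout. It divides the good guesses by the total number of guesses, using the counts from the message above (12800 good, 2430 bad) and from the testclassifier run in the first message (75 correct, 8 incorrect).

```java
import java.util.Locale;

// Hypothetical helper, not part of Mahout: recomputes an accuracy percentage
// from the good/bad guess counts quoted in the thread.
final class GuessSummary {

    /** Returns the percentage of good guesses among all guesses. */
    static double accuracyPercent(long good, long bad) {
        long total = good + bad;
        if (total == 0) {
            throw new IllegalArgumentException("no guesses to summarize");
        }
        return 100.0 * good / total;
    }

    public static void main(String[] args) {
        // Counts from the bigger test set reported in the message above.
        System.out.printf(Locale.ROOT, "bigger test set: %.2f %%%n",
                accuracyPercent(12800, 2430));
        // Counts from the testclassifier run in the first message of the thread.
        System.out.printf(Locale.ROOT, "testclassifier:  %.2f %%%n",
                accuracyPercent(75, 8));
    }
}
```

With these inputs the two figures come out at roughly 84.04 % and 90.36 %, in line with the "84 %" and "90.3614%" reported in the thread.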



Re: Classification: using the Java API always returns the same category

Posted by Yuval Feinstein <yu...@citypath.com>.
If tweaking the algorithm and parameters does not help, I suggest taking a
long hard look at your data.
a. How many examples do you have of category3? Are they the vast majority?
b. Does category3 data overwhelm the other data? Recently, I tried to
classify texts into 20 categories.
The text documents from several categories were significantly longer
(x100) than those of the other categories,
so they dominated the classifier.
Good luck,
Yuval
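
Questions a and b above can be answered with a quick tally over the training set. The following is a rough sketch under stated assumptions: the training examples are assumed to be available as (label, text) pairs, and `CategoryBalance` with its `summarize` method is a hypothetical helper, not a Mahout class. It counts the documents per category and accumulates their total length, so a category that dominates by count or by document size stands out.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Mahout: per-category document counts and sizes.
final class CategoryBalance {

    /**
     * Each input row is {label, text}. Returns, per label, a two-element array:
     * [0] = number of documents, [1] = total number of characters.
     */
    static Map<String, long[]> summarize(String[][] labeledDocs) {
        Map<String, long[]> stats = new LinkedHashMap<>();
        for (String[] doc : labeledDocs) {
            long[] s = stats.computeIfAbsent(doc[0], label -> new long[2]);
            s[0]++;
            s[1] += doc[1].length();
        }
        return stats;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        String[][] docs = {
            {"category1", "a short text"},
            {"category1", "another short text"},
            {"category3", "one much longer text that could dominate the others"},
        };
        summarize(docs).forEach((label, s) ->
            System.out.printf(Locale.ROOT, "%s: %d docs, mean length %.1f chars%n",
                    label, s[0], (double) s[1] / s[0]));
    }
}
```

A category whose document count or mean length is far larger than the others would be a candidate explanation for the behaviour described in point b.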



RE: Classification: using the Java API always returns the same category

Posted by Verachten Bruno <Br...@atos.net>.
> This shows that category3 is being selected for your input string. I don't see any apparent problems.
The problem is that category3 is always selected, whatever the input string is...

> Can you try running over the training data to see whether the model predicts correctly in your API version, just as a sanity check? Again, please send the logs of the run.
Will do, thanks.

Bruno Verachten



Re: Classification: using the Java API always returns the same category

Posted by Robin Anil <ro...@gmail.com>.
This shows that category3 is being selected for your input string. I don't
see any apparent problems. Can you try running over the training data to
see whether the model predicts correctly in your API version, just as a sanity
check? Again, please send the logs of the run.
------
Robin Anil
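
The sanity check suggested above could look roughly like the sketch below. It is an illustration only: the classifier is represented by a plain `Function<String, String>` stand-in (in real code it would wrap whatever call returns the predicted label), and `SanityCheck`, `predictedCounts`, and `countCorrect` are hypothetical names, not Mahout APIs. Tallying the predicted labels over labelled data shows at a glance whether a single category is returned for every input.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical harness, not part of Mahout: runs a classifier over labelled
// rows of the form {expectedLabel, text} and summarizes its predictions.
final class SanityCheck {

    /** Counts how many times each label was predicted. */
    static Map<String, Integer> predictedCounts(String[][] labeledDocs,
                                                Function<String, String> classify) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] doc : labeledDocs) {
            counts.merge(classify.apply(doc[1]), 1, Integer::sum);
        }
        return counts;
    }

    /** Counts the rows whose predicted label equals the expected label. */
    static int countCorrect(String[][] labeledDocs, Function<String, String> classify) {
        int correct = 0;
        for (String[] doc : labeledDocs) {
            if (doc[0].equals(classify.apply(doc[1]))) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Made-up rows and a deliberately stuck stand-in classifier.
        String[][] docs = {
            {"category1", "first text"},
            {"category3", "second text"},
            {"category4", "third text"},
        };
        Function<String, String> stuck = text -> "category3";

        Map<String, Integer> counts = predictedCounts(docs, stuck);
        System.out.println("Predicted label counts: " + counts);
        System.out.println("Correct: " + countCorrect(docs, stuck) + " of " + docs.length);
        if (counts.size() == 1 && docs.length > 1) {
            System.out.println("Warning: every input received the same label");
        }
    }
}
```

Running the same tally over a larger sample also helps to tell a genuinely stuck classifier apart from an unlucky choice of test strings.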



RE: Classification: using the Java API always returns the same category

Posted by Verachten Bruno <Br...@atos.net>.
Here you are:
17:48:12.402 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 50000 feature weights
17:48:12.678 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 100000 feature weights
17:48:13.187 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 150000 feature weights
17:48:13.280 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 200000 feature weights
17:48:13.375 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 250000 feature weights
17:48:13.550 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 300000 feature weights
17:48:13.646 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 350000 feature weights
17:48:13.760 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 400000 feature weights
17:48:14.403 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 450000 feature weights
17:48:14.669 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 500000 feature weights
17:48:14.769 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 550000 feature weights
17:48:14.865 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 600000 feature weights
17:48:14.960 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - 1059783.4438602575
17:48:17.946 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category1 -8606715.675788553 8842462.841037087 -0.9733391963883126
17:48:17.947 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category2 -8842462.841037087 8842462.841037087 -1.0
17:48:17.947 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category3 -8839755.207854107 8842462.841037087 -0.9996937919636582
17:48:17.947 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category4 -8800100.24475343 8842462.841037087 -0.9952091858291949
17:48:18.151 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 50000 feature weights
17:48:19.042 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 100000 feature weights
17:48:19.159 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 150000 feature weights
17:48:19.281 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 200000 feature weights
17:48:19.501 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 250000 feature weights
17:48:19.632 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 300000 feature weights
17:48:19.774 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 350000 feature weights
17:48:19.953 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 400000 feature weights
17:48:20.086 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 450000 feature weights
17:48:20.213 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 500000 feature weights
17:48:20.357 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 550000 feature weights
17:48:20.483 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - Read 600000 feature weights
17:48:20.599 [main]            INFO  o.a.m.c.b.SequenceFileModelReader - 1059783.4438602575
17:48:22.612 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category1 -8606715.675788553 8842462.841037087 -0.9733391963883126
17:48:22.612 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category2 -8842462.841037087 8842462.841037087 -1.0
17:48:22.612 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category3 -8839755.207854107 8842462.841037087 -0.9996937919636582
17:48:22.613 [main]            INFO  o.a.m.c.bayes.InMemoryBayesDatastore - category4 -8800100.24475343 8842462.841037087 -0.9952091858291949
ClassifierResult{category='category3', score=13.380882224510643}
Category3
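
The log does not label the three numbers printed on each InMemoryBayesDatastore line, so the following reading is an assumption rather than documented behaviour: in every line the third number appears to equal the first divided by the second. The sketch below (`WeightRatio` is a hypothetical name, not a Mahout class) recomputes that ratio from the values in the log so the relationship can be checked.

```java
import java.util.Locale;

// Hypothetical check, not part of Mahout: tests the assumption that the third
// number on each InMemoryBayesDatastore log line is the first divided by the second.
final class WeightRatio {

    static double ratio(double first, double second) {
        return first / second;
    }

    public static void main(String[] args) {
        String[] categories = {"category1", "category2", "category3", "category4"};
        // First number on each log line above.
        double[] first = {
            -8606715.675788553, -8842462.841037087,
            -8839755.207854107, -8800100.24475343
        };
        // Second number, identical on all four log lines.
        double second = 8842462.841037087;

        for (int i = 0; i < categories.length; i++) {
            System.out.printf(Locale.ROOT, "%s %.10f%n",
                    categories[i], ratio(first[i], second));
        }
    }
}
```

The printed ratios can be compared with the third column of the log (-0.9733391963883126, -1.0, -0.9996937919636582, and -0.9952091858291949).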

Thanks,
Bruno Verachten



Re: Classification: using the Java API always returns the same category

Posted by Robin Anil <ro...@gmail.com>.
Can you print the logs when you run your code?
------
Robin Anil

