Posted to user@mahout.apache.org by vybe3142 <vy...@gmail.com> on 2013/02/01 22:29:00 UTC

How to classify an individual file after training

1. Index the training data that I've pre-classified manually, then perform
training and testing. Everything works fine up to this point.
/home/me/data/reuters-21578-example
├── chrysler (dir with training files)
├── cocoa (dir with training files)
├── egypt (dir with training files)
└── england (dir with training files)


mahout seqdirectory -i /home/me/data/reuters-21578-example -o reuters-out-seqdir -c UTF-8 -chunk 5
mahout seq2sparse -i reuters-out-seqdir/ -o reuters-out-seqdir-sparse -lnorm -nv -wt tfidf
mahout split -i reuters-out-seqdir-sparse/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

mahout trainnb -i train-vectors -el -o model -li labelindex -ow
mahout testnb -i test-vectors -m model -l labelindex -ow -o testing
This seems to work (judging by the confusion matrix), even though these are
plain old text snippets as opposed to newsgroup text articles.
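
The directory names in the tree above serve as the class labels. As a minimal sketch of how a label could be recovered from a document key, assume each key has the form /<label>/<file name>, mirroring the directory layout. That key layout is an assumption made for illustration and is not confirmed anywhere in this thread; the class and method names are hypothetical as well.

```java
final class LabelFromKey {
    // Assumption: keys look like "/<label>/<file name>", e.g. "/chrysler/doc1.txt".
    // The label is then the first path component after the leading slash.
    static String labelOf(String key) {
        String[] parts = key.split("/");
        if (parts.length < 3 || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Unexpected key layout: " + key);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        // Hypothetical keys matching the directory tree in the post.
        System.out.println(labelOf("/chrysler/doc1.txt"));
        System.out.println(labelOf("/cocoa/doc7.txt"));
    }
}
```

If the real keys are laid out differently, only labelOf would need to change.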

2. At this point, I want to classify individual files that are not part of
the training set. I've tried a number of things, none of which work.
For example, I invoke main() in TestNewsGroups.java with the args

--input /home/me/data/reuters-21578 --model /home/me/test/mahout/quickstart-classifier/model/naiveBayesModel.bin

and end up with this exception:
Exception in thread "main" java.io.UTFDataFormatException: malformed input around byte 5
	at java.io.DataInputStream.readUTF(DataInputStream.java:617)
	at java.io.DataInputStream.readUTF(DataInputStream.java:547)
	at org.apache.mahout.classifier.sgd.PolymorphicWritable.read(PolymorphicWritable.java:41)
	at org.apache.mahout.classifier.sgd.ModelSerializer.readBinary(ModelSerializer.java:69)
	at com.memonews.mahout.sentiment.TestNewsGroups.run(TestNewsGroups.java:67)
	at com.memonews.mahout.sentiment.TestNewsGroups.main(TestNewsGroups.java:59)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Any idea what I can do to fix this? Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-classifyan-individual-file-after-training-tp4038036.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: How to classify an individual file after training

Posted by "Vinay B," <vy...@gmail.com>.
That's exactly what I was trying to do by running TestNewsGroups.java, as
I explained in my last post.
Here's the code again, with the stack trace. I'm doing something wrong
while loading the model (I can't load the Naive Bayes model; see the
code).

Thanks

https://gist.github.com/anonymous/4720473


Exception in thread "main" java.io.UTFDataFormatException: malformed input around byte 5
	at java.io.DataInputStream.readUTF(DataInputStream.java:617)
	at java.io.DataInputStream.readUTF(DataInputStream.java:547)
	at org.apache.mahout.classifier.sgd.PolymorphicWritable.read(PolymorphicWritable.java:41)
	at org.apache.mahout.classifier.sgd.ModelSerializer.readBinary(ModelSerializer.java:69)
	at com.memonews.mahout.sentiment.TestNewsGroups.run(TestNewsGroups.java:69)
	at com.memonews.mahout.sentiment.TestNewsGroups.main(TestNewsGroups.java:60)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
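
One possible reading of this trace, offered as an assumption rather than a confirmed diagnosis: the file is being read by ModelSerializer.readBinary from the sgd package, which (per the trace) starts by calling DataInputStream.readUTF, while naiveBayesModel.bin was produced by the naive Bayes trainer and may not begin with a string written by writeUTF. The sketch below uses only the JDK and made-up bytes to show that readUTF fails with UTFDataFormatException when the bytes it is handed are not valid modified UTF-8. It does not touch Mahout, and the class name ReadUtfMismatch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

final class ReadUtfMismatch {
    // Returns true if readUTF rejects the bytes as malformed modified UTF-8.
    static boolean failsAsUtf(byte[] bytes) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        try {
            in.readUTF();
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            // Any other I/O problem (e.g. truncated input) is not what this sketch is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Bytes produced by writeUTF: a 2-byte length followed by the encoded string.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeUTF("some.class.Name");
        System.out.println("written by writeUTF fails: " + failsAsUtf(buf.toByteArray()));

        // Made-up bytes standing in for a file in some other binary format:
        // a plausible length prefix (6), five ASCII bytes, then 0xFF,
        // which cannot start a character in modified UTF-8.
        byte[] other = {0x00, 0x06, 'a', 'b', 'c', 'd', 'e', (byte) 0xFF};
        System.out.println("other binary data fails: " + failsAsUtf(other));
    }
}
```

If that reading is right, the model would have to be loaded with the reader that matches the trainer that wrote it; the exact call is not shown in this thread.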

On Fri, Feb 1, 2013 at 4:48 PM, Sarang Deshpande <sa...@shopzilla.com> wrote:

> You will need to write custom code to convert text from the file into
> vectors and then to use these vectors to talk to the pre-built model.
>
> ~Sarang

RE: How to classify an individual file after training

Posted by Sarang Deshpande <sa...@Shopzilla.com>.
You will need to write custom code to convert text from the file into vectors and then to use these vectors to talk to the pre-built model.

~Sarang
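
The two steps described above (text to vector, then vector to model) can be sketched in plain Java. This is only an illustration under stated assumptions and uses no Mahout classes: the dictionary (term to index), the per-label weight arrays, and the names ClassifySketch, vectorize and classify are all hypothetical. It uses raw term counts; a real setup would need to apply the same tokenization, dictionary and weighting that were used at training time.

```java
import java.util.HashMap;
import java.util.Map;

final class ClassifySketch {
    // Step 1: turn text into a sparse term-count vector over a fixed dictionary.
    static Map<Integer, Double> vectorize(String text, Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new HashMap<Integer, Double>();
        for (String token : text.toLowerCase().split("\\W+")) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // terms not seen during training are dropped
            }
            Double old = vector.get(index);
            vector.put(index, old == null ? 1.0 : old + 1.0);
        }
        return vector;
    }

    // Step 2: score the vector against each label's weights and keep the best label.
    static String classify(Map<Integer, Double> vector, Map<String, double[]> weightsByLabel) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> label : weightsByLabel.entrySet()) {
            double score = 0.0;
            for (Map.Entry<Integer, Double> feature : vector.entrySet()) {
                score += feature.getValue() * label.getValue()[feature.getKey()];
            }
            if (score > bestScore) {
                bestScore = score;
                best = label.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy dictionary and weights, invented for this example.
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        dictionary.put("cocoa", 0);
        dictionary.put("car", 1);
        Map<String, double[]> weights = new HashMap<String, double[]>();
        weights.put("cocoa", new double[] {2.0, 0.1});
        weights.put("chrysler", new double[] {0.1, 2.0});

        Map<Integer, Double> vector = vectorize("Cocoa prices rose", dictionary);
        System.out.println(classify(vector, weights));
    }
}
```

The point of the split is that the vectorizing step must reuse the training-time dictionary, so that index i means the same term for the new file as it did for the model.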
