Posted to users@opennlp.apache.org by karthick r <ka...@gmail.com> on 2014/04/14 18:02:23 UTC
Train maxent classifier

Hi,
[Project stack: Java, OpenNLP, Elasticsearch (repository for tweets),
twitter4j to read data from Twitter.]

I intend to use the maxent classifier to classify tweets. I understand that
the initial step is to train the model. From the documentation I found that
there is a GISTrainer-based train method for this. I have managed to put
together a simple piece of code which uses OpenNLP's maxent classifier to
train the model and predict the outcome.

I have used two files, positive.txt and negative.txt, to train the model.

Contents of positive.txt

positive This is good
positive This is the best
positive This is fantastic
positive This is super
positive This is fine
positive This is nice

Contents of negative.txt

negative This is bad
negative This is ugly
negative This is the worst
negative This is worse
negative This sucks
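
To make the format I am assuming concrete: each line starts with the
category, followed by the text, separated by whitespace. The sketch below
only illustrates that reading. TrainingLineSketch and parse are names I made
up for this message, not OpenNLP classes, and the real data stream may parse
lines differently.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: splits one training line into
// its category and the whitespace-separated tokens that follow it.
class TrainingLineSketch {

    final String category;
    final String[] tokens;

    TrainingLineSketch(String category, String[] tokens) {
        this.category = category;
        this.tokens = tokens;
    }

    // Assumes the first whitespace-delimited field is the category and
    // everything after it is the document text.
    static TrainingLineSketch parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                    "Expected a category followed by text: " + line);
        }
        return new TrainingLineSketch(
                parts[0], Arrays.copyOfRange(parts, 1, parts.length));
    }

    public static void main(String[] args) {
        TrainingLineSketch sample = parse("positive This is good");
        System.out.println(sample.category + " -> " + Arrays.toString(sample.tokens));
    }
}
```

Because the split is on any run of whitespace, a tab or a space after the
category would be treated the same way in this sketch.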

And the Java methods below train the model and print the predicted outcome.


@Override
public void trainDataset(String source, String destination) throws Exception {
    // Picks up both positive.txt and negative.txt from the source directory.
    File[] inputFiles = FileUtil.buildFileList(new File(source));
    File modelFile = new File(destination);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
    int cutoff = 5;
    int iterations = 100;
    BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
    DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff, iterations, bowfg);
    // Close the output stream once the model has been written.
    try (FileOutputStream out = new FileOutputStream(modelFile)) {
        model.serialize(out);
    }
}

@Override
public void predict(String text, String modelFile) {
    // try-with-resources closes the model stream on every exit path.
    try (InputStream modelStream = new FileInputStream(modelFile)) {
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(text);
        DoccatModel model = new DoccatModel(modelStream);
        BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
        DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
        double[] probs = categorizer.categorize(tokens);
        if (null != probs && probs.length > 0) {
            for (int i = 0; i < probs.length; i++) {
                System.out.println("double[] probs index  " + i + " value " + probs[i]);
            }
        }
        String label = categorizer.getBestCategory(probs);
        System.out.println("label " + label);
        int bestIndex = categorizer.getIndex(label);
        System.out.println("bestIndex " + bestIndex);
        double score = probs[bestIndex];
        System.out.println("score " + score);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void main(String[] args) {
    try {
        String outputModelPath =
                "/home/**/sd-sentiment-analysis/models/trainPostive";
        String source =
                "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
        MaximunEntropyClassifier me = new MaximunEntropyClassifier();
        me.trainDataset(source, outputModelPath);
        me.predict("This is bad", outputModelPath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
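
One thing I am unsure about is the cutoff of 5 on such a small dataset. If,
as I understand it, the trainer drops any feature seen fewer than cutoff
times (this is my assumption, I have not confirmed it in the OpenNLP source),
then very few words would survive. The sketch below is plain Java, not
OpenNLP code; CutoffSketch and survivingWords are hypothetical names used
only to count words in my two training files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not OpenNLP code: counts how often each word
// occurs across the training sentences and keeps those seen at least
// `cutoff` times.
class CutoffSketch {

    static List<String> survivingWords(String[] sentences, int cutoff) {
        // LinkedHashMap keeps words in first-seen order for readable output.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.trim().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The text part of every line in positive.txt and negative.txt.
        String[] sentences = {
            "This is good", "This is the best", "This is fantastic",
            "This is super", "This is fine", "This is nice",
            "This is bad", "This is ugly", "This is the worst",
            "This is worse", "This sucks"
        };
        System.out.println(survivingWords(sentences, 5));
    }
}
```

With my eleven sentences and a cutoff of 5, only "This" and "is" reach the
threshold, and neither word separates the two categories. If my reading of
the cutoff is right, I may need a lower cutoff or much more data.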



I have the following questions.

1) How do we train a model iteratively? Also, how do we add new
sentences/words to the model? Is there a specific format for the data
file? I found that the file needs to have a minimum of two words separated
by a tab. Is my understanding valid?
2) Are there any publicly available datasets that I can use to train the
model? I found some sources for movie reviews. The project I'm working on
involves not just movie reviews but also other things such as product
reviews, brand sentiment, etc.
3) Is there a working example somewhere publicly available? I couldn't find
the documentation for maxent.

Please help me out. I am kind of blocked on this.