Posted to user@mahout.apache.org by Chris Schilling <ch...@gmail.com> on 2010/12/15 00:40:31 UTC

feature vector encoding in Mahout

Hello,

After going through the newest chapters in MIA (very helpful btw), I have a few questions that I think I know the answer to, but just wanted to get some reinforcement. 

Let's say that I have a list of documents and my own pipeline for feature extraction.  So, for each document I have a list of keywords (and multi-keyword phrases) and corresponding weights.  So each document is now just a list of keyword phrases and weights, e.g.

doc1:
phrase1   wt1
phrase2   wt2
phrase3   wt3
...

I would like to use Mahout to train document classifiers using the phrases and weights in these files.

Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, it looks like I can just use the encoder class for these phrases and weights.  Something like this:

RecordValueEncoder encoder = 
	new StaticWordValueEncoder("variable-name");
for (DataRecord ex: trainingData) {
	Vector v = new RandomAccessSparseVector(10000);
	String word = ex.get("variable-name");
	encoder.addToVector(word, v); 
}
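
Spelled out a bit more, here is roughly what I have in mind (just a sketch; MyDocument and getPhraseWeights() are stand-ins for my own pipeline's types, and I am guessing at the encoder package since it seems to have moved between releases):

	// package guesses: the math classes are in org.apache.mahout.math,
	// the encoders are (I think) in org.apache.mahout.vectorizer.encoders
	import java.util.Map;
	import org.apache.mahout.math.RandomAccessSparseVector;
	import org.apache.mahout.math.Vector;
	import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

	StaticWordValueEncoder encoder = new StaticWordValueEncoder("phrase");
	for (MyDocument doc : trainingData) {
		Vector v = new RandomAccessSparseVector(10000);
		for (Map.Entry<String, Double> entry : doc.getPhraseWeights().entrySet()) {
			// hash each extracted phrase into this document's feature vector
			encoder.addToVector(entry.getKey(), v);
		}
		// v is now the feature vector for this document
	}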

Does this make sense?

I would like to compare the results of SGD and Naive Bayes classification using this data.  However, I am unclear on the vector formation process in Naive Bayes.  I have prepared some input for the Bayes classifier using the prepare20newsgroups "macro" - I was able to get my data into a similar format as the 20 newsgroups dataset.  I guess my main question is: can I use Naive Bayes if I already have the features (the phrases above) and weights that I want to use for training?



Re: feature vector encoding in Mahout

Posted by Chris Schilling <ch...@gmail.com>.
Ted,

Thanks for your answers.  I think I might be getting the hang of this thing :)


On Dec 14, 2010, at 11:25 PM, Ted Dunning wrote:

> On Tue, Dec 14, 2010 at 3:40 PM, Chris Schilling
> <ch...@gmail.com>wrote:
> 
>> 
>> After going through the newest chapters in MIA (very helpful btw), I have a
>> few questions that I think I know the answer to, but just wanted to get some
>> reinforcement.
>> 
> 
> We are revising them based on comments and would be happy to entertain
> suggestions, so fire away if you have any confusions.

Cool.  I have a few thoughts.  I will organize them and get back to you.  I also notice typos here and there.  The forum does not seem to be the place to mention these trivial things.  Is there an appropriate offline contact?
> 
> 
>> Let's say that I have a list of documents and my own pipeline for feature
>> extraction.  So, for each document I have a list of key words (and multi-key
>> word phrases) and corresponding weights.  So each document is now just a
>> list of keyword phrases and weights i.e.
>> 
>> doc1:
>> phrase1   wt1
>> phrase2   wt2
>> phrase3   wt3
>> ...
>> 
>> I would like to use Mahout to train document classifiers using the phrases
>> and weights in these files.
>> 
> 
> Cool.  You may eventually have phrases from different fields as well.  More
> about that in a sec.

Okay, a bit about the problem I am working on:  I have documents from different pre-labeled categories.  From these documents I run feature extraction and basically calculate TF-IDF weights for the keywords and multi-keyword phrases in each document across a fairly large corpus.  So, my final dataset looks something like this:

label1, doc1
phrase1  wt11
phrase2  wt21
phrase3  wt31
...

label1, doc2
phrase1 wt12
phrase4 wt42
phraseX wtX2
...

label2, doc3
phrase2 wt23
phraseY wtY3
....

So basically for phrase i in document j, I calculate w_ij = tfidf_ij.  Document j can belong to any one of nLabels categories, where nLabels < nDocs - probably ~10 labels over millions of docs.  The main difference between my extraction and, say, a 1-gram approach is that my extraction contains n-grams in general, so my features are a mix of mostly 1-grams, 2-grams, and 3-grams, although I do not limit it to 3.  I require minimum support and other cuts along the way so that the n-grams I extract are "important."

Anyway, the tf-idf weights are calculated for each doc across the corpus (as opposed to being calculated for each label)...  If I understand the Naive Bayes implementation correctly, the tf-idf is calculated across each label (as opposed to each training sample/doc).  So, based on what you state below, it would be difficult to implement this with the NB implementation.  It does not seem like I would gain much by using my own feature vectors for Naive Bayes.  In my preliminary tests, I am already at ~80% classification accuracy on a held-out test set.

> 
> 
>> Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, It looks like
>> I can just use the encoder class for these phrases and weights.
> 
> 
> Yes.  Absolutely.
> 
> 
>> Something like this:
>> 
>> RecordValueEncoder encoder =
>>       new StaticWordValueEncoder("variable-name");
>> for (DataRecord ex: trainingData) {
>>       Vector v = new RandomAccessSparseVector(10000);
>>       String word = ex.get("variable-name");
>>       encoder.addToVector(word, v);
>> }
>> 
>> Does this make sense?
>> 
> 
> Yes.  You can use the weight that you had in your original data as well.
> That happens with a line like this:
> 
>       double weight = ... mumble ...
>       encoder.addToVector(word, weight, v);
> 
> Of course, you will need to have comparable weights at classification time.
> Also, the SGD should over-ride your weight in the interest of accuracy.
> Using large weights is also not a great idea because it can cause unstable
> updates.  If you use the AdaptiveLogisticRegression, it should manage by
> adapting the learning rate down.  Combinations of very large and very small
> weights will cause the items with small weights to be essentially ignored.
> 
So, it sounds like LR and NB are in general more concerned with the existence of the keyword in the document than with the weight of the keyword in the document.  The weights I calculate for each phrase lie between 0.5 and 1.0, and the variance between documents is small.  Well, I can test the case of weights vs. no weights...
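
(Just to sketch the comparison I have in mind -- assuming the weighted addToVector() overload quoted above:)

	// run 1: scale each phrase by my tf-idf weight
	encoder.addToVector(phrase, weight, v);

	// run 2: presence only, letting the encoder use its default weight
	encoder.addToVector(phrase, v);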
> 
>> I would like to compare the results of an SGD and Naive Bayes
>> classification using this data.  However, I am unclear of the vector
>> formation process in Naive Bayes.  I have prepared some input for the Bayes
>> classifier using prepare20newsgroups "macro" - I was able to get my data
>> into a similar format as the 20 news groups dataset.  I guess my main
>> question is can I use Naive Bayes if I already have the features (phrases
>> above)  and weights that I want to use for training?
>> 
> 
> Naive Bayes is very much more command line oriented.  The SGD logistic
> regression models are very much API oriented.  That means, as you suggest,
> that you have to format your data appropriately for Naive Bayes.  Moreover,
> NaiveBayes will simply ignore your weights.  SGD may optimize them away
> eventually, but it will pay attention to them in the short run.  NaiveBayes
> can only handle text-like input (at the moment) without any fields.
> 
> You can handle separately fielded data in SGD by using multiple encoders.

Most of this was just rambling.  I want to get deeper into the SGD APIs and get some performance/evaluation studies running.

Thanks again, Ted.  

Re: feature vector encoding in Mahout

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Dec 14, 2010 at 3:40 PM, Chris Schilling
<ch...@gmail.com>wrote:

>
> After going through the newest chapters in MIA (very helpful btw), I have a
> few questions that I think I know the answer to, but just wanted to get some
> reinforcement.
>

We are revising them based on comments and would be happy to entertain
suggestions, so fire away if you have any confusions.


> Let's say that I have a list of documents and my own pipeline for feature
> extraction.  So, for each document I have a list of key words (and multi-key
> word phrases) and corresponding weights.  So each document is now just a
> list of keyword phrases and weights i.e.
>
> doc1:
> phrase1   wt1
> phrase2   wt2
> phrase3   wt3
> ...
>
> I would like to use Mahout to train document classifiers using the phrases
> and weights in these files.
>

Cool.  You may eventually have phrases from different fields as well.  More
about that in a sec.


> Looking at the TrainNewsGroups code in o.a.m.classifier.sgd, It looks like
> I can just use the encoder class for these phrases and weights.


Yes.  Absolutely.


> Something like this:
>
> RecordValueEncoder encoder =
>        new StaticWordValueEncoder("variable-name");
> for (DataRecord ex: trainingData) {
>        Vector v = new RandomAccessSparseVector(10000);
>        String word = ex.get("variable-name");
>        encoder.addToVector(word, v);
> }
>
> Does this make sense?
>

Yes.  You can use the weight that you had in your original data as well.
That happens with a line like this:

       double weight = ... mumble ...
       encoder.addToVector(word, weight, v);

Of course, you will need to have comparable weights at classification time.
Also, the SGD should override your weight in the interest of accuracy.
Using large weights is also not a great idea because it can cause unstable
updates.  If you use the AdaptiveLogisticRegression, it should manage by
adapting the learning rate down.  Combinations of very large and very small
weights will cause the items with small weights to be essentially ignored.
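
For the training side, something along these lines (a sketch from memory, so
check the exact constructors and accessors; the counts are only examples):

	import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
	import org.apache.mahout.classifier.sgd.CrossFoldLearner;
	import org.apache.mahout.classifier.sgd.L1;
	import org.apache.mahout.math.Vector;

	int numCategories = 10;     // however many labels you have
	int numFeatures = 10000;    // must match the size of the encoded vectors

	AdaptiveLogisticRegression learner =
	    new AdaptiveLogisticRegression(numCategories, numFeatures, new L1());

	// for each training document: actual is the integer label in [0, numCategories),
	// v is the encoded feature vector for that document
	learner.train(actual, v);

	// after the last training example, pick the best evolved model
	learner.close();
	CrossFoldLearner model = learner.getBest().getPayload().getLearner();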


> I would like to compare the results of an SGD and Naive Bayes
> classification using this data.  However, I am unclear of the vector
> formation process in Naive Bayes.  I have prepared some input for the Bayes
> classifier using prepare20newsgroups "macro" - I was able to get my data
> into a similar format as the 20 news groups dataset.  I guess my main
> question is can I use Naive Bayes if I already have the features (phrases
> above)  and weights that I want to use for training?
>

Naive Bayes is very much more command line oriented.  The SGD logistic
regression models are very much API oriented.  That means, as you suggest,
that you have to format your data appropriately for Naive Bayes.  Moreover,
NaiveBayes will simply ignore your weights.  SGD may optimize them away
eventually, but it will pay attention to them in the short run.  NaiveBayes
can only handle text-like input (at the moment) without any fields.

You can handle separately fielded data in SGD by using multiple encoders.
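
For instance (sketch only; the field names here are made up):

	// one encoder per field, all hashing into the same feature vector
	StaticWordValueEncoder titleEncoder = new StaticWordValueEncoder("title");
	StaticWordValueEncoder bodyEncoder = new StaticWordValueEncoder("body");

	Vector v = new RandomAccessSparseVector(10000);
	titleEncoder.addToVector(titlePhrase, titleWeight, v);
	bodyEncoder.addToVector(bodyPhrase, bodyWeight, v);

The encoder name goes into the hash, so the same phrase in different fields
lands in (mostly) different locations.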

Re: feature vector encoding in Mahout

Posted by Ted Dunning <te...@gmail.com>.
For SGD, yes.  For NaiveBayes, there is a score (which isn't a probability)
computed internally but not printed.  For SGD without any down-sampling or
rank learning, the score is supposed to be an estimate of a probability.
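
Roughly like this (a sketch, assuming a trained CrossFoldLearner or
OnlineLogisticRegression as "model" and an encoded vector "v" for the new record):

	// classifyFull() returns one entry per category; for plain SGD these
	// are the estimated probabilities and they sum to 1
	Vector scores = model.classifyFull(v);
	for (int label = 0; label < scores.size(); label++) {
		System.out.printf("label %d: %.3f%n", label, scores.get(label));
	}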

On Tue, Dec 14, 2010 at 5:30 PM, Chris Schilling
<ch...@gmail.com>wrote:

> Hello again,
>
> I sent this a bit early.  I have another question.  Using Naive Bayes or
> SGD LR, is it possible to get a metric (i.e. likelihood, probability) of how
> much a new record belongs to each of the categories I want to classify?  For
> instance, in Naive Bayes rather than just classifying the new record, I
> would like a vector of probabilities of that record belonging to each
> class...
>
> Thanks for all your help!
> Chris
>
>

Re: feature vector encoding in Mahout

Posted by Chris Schilling <ch...@gmail.com>.
Hello again,

I sent this a bit early.  I have another question.  Using Naive Bayes or SGD LR, is it possible to get a metric (e.g. a likelihood or probability) of how strongly a new record belongs to each of the categories I want to classify?  For instance, in Naive Bayes, rather than just classifying the new record, I would like a vector of probabilities of that record belonging to each class...

Thanks for all your help!  
Chris