Posted to users@opennlp.apache.org by Benedict Holland <be...@gmail.com> on 2018/04/12 20:22:46 UTC

Multiple document categories for MaxEnt model?

Hello all,

I understand that maximum entropy models are excellent at categorizing
documents. As it turns out, I have a situation where 1 document can be in
many categories (1:m relationship). I believe that I could create training
data that looks something like:

category_1 <text>
category_2 <text>
...

If I do this, will the resulting probability model return category
probabilities as Pr(<text> in category_m) = 1/m for each of the m categories, or will it
return Pr(<text> in category_m) = 1 for each category m?

This is a very important distinction. I really hope it is the latter. If it
isn't, do you have a way to make sure that if I receive a text that is
similar to the training data, I can get a probability close to 1 if it fits
into multiple categories?

Thanks,
~Ben

Re: Multiple document categories for MaxEnt model?

Posted by Daniel Russ <da...@gmail.com>.
I am not sure that will work. My concern is that if you have 3 labels, with NO information you would
guess a class with p = 1/3. So if p < 1/3 there is evidence AGAINST the label, but...

case 1:
p(L1) = 0.7, p(L2) = 0.2, p(L3) = 0.1
does this mean L1 only, or L1 and L2, or L1+L2+L3?
note that the sum p(L1)+p(L2) = 0.9, but p(L2) < 1/3

case 2:
p(L1) = 0.6, p(L2) = 0.3, p(L3) = 0.1
L2 is still < 1/3, and the sum p(L1)+p(L2) = 0.9

case 2 is more plausible to me than case 1.  Maybe you have to require p(L2) to be greater than 1/3.

Another thing to keep in mind is that if the number of labels gets large, the values of p will go down (you need to leave probability mass for the other outcomes)
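That shrinking effect falls straight out of the softmax normalization. Here is a small sketch in plain Python (not OpenNLP's API; the raw scores are made up for illustration):

```python
import math

def softmax(scores):
    # Exponentiate each raw score and normalize so the outputs sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three labels, two of them plausible: the top two probabilities are close.
p3 = softmax([2.0, 2.0, -1.0])

# Ten labels with the same two plausible ones: every individual probability
# shrinks, because the same mass is now spread over more outcomes.
p10 = softmax([2.0, 2.0] + [-1.0] * 8)
```

Both outputs sum to 1, and the top probability in the 10-label case is strictly smaller than in the 3-label case even though the evidence for it is unchanged.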

Here is a paper on the topic.  (My advice earlier, it turns out, is mentioned in the introduction as a naive solution.)
Multi-labelled Classification Using Maximum Entropy ... - CiteSeerX <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.399.2443&rep=rep1&type=pdf>
Daniel


> On Apr 12, 2018, at 11:09 PM, Benedict Holland <be...@gmail.com> wrote:
> 
> I can actually live with a Pr() >> 0 for matching labels, maybe. What might
> be a reasonable option is to specify a sum of probabilities to get over a
> certain margin. Like, sum the probabilities in descending order, and select
> the top few that sum over a threshold. That could actually work.
> 
> ~Ben
> 
> On Thu, Apr 12, 2018 at 10:26 PM, <dr...@apache.org> wrote:
> 
>> Hi Ben,
>> 
>>   if a document can be in multiple categories, you should see it
>> reflected in the probabilities.  The top categories will be close in
>> score.  It will not be 1/m because that would imply that ALL categories are
>> “equally probable” or you have no idea.  However, if you have 3 classes and
>> two are likely, it may be 0.49, 0.49, 0.02.  Remember that the results are
>> normalized by a softmax at the end, so the sum of all probabilities
>> will always be 1.
>>   Sorry, but multi-class classification is more complicated than binary
>> classification.  If you really are interested in multi-label
>> classification, I’m not sure maxent (at least the way openNLP formulated
>> the solution) is appropriate for your needs.  You might want to consider
>> individual binary classifiers for each label.  Have 1 model for each label:
>> 
>> train_cat1.txt...
>> cat_1_TRUE <text>
>> cat_1_FALSE <text>
>> …
>> 
>> train_cat2.txt…
>> cat_2_FALSE <text>
>> cat_2_TRUE <text>
>> 
>> Hope it helps, Let me know what you wind up doing...
>> Daniel
>> 
>>> On Apr 12, 2018, at 4:22 PM, Benedict Holland <
>> benedict.m.holland@gmail.com> wrote:
>>> 
>>> Hello all,
>>> 
>>> I understand that maximum entropy models are excellent at categorizing
>>> documents. As it turns out, I have a situation where 1 document can be in
>>> many categories (1:m relationship). I believe that I could create training
>>> data that looks something like:
>>> 
>>> category_1 <text>
>>> category_2 <text>
>>> ...
>>> 
>>> If I do this, will the resulting probability model return category
>>> probabilities as Pr(<text> in category_m) = 1/m for each of the m categories,
>>> or will it return Pr(<text> in category_m) = 1 for each category m?
>>> 
>>> This is a very important distinction. I really hope it is the latter. If it
>>> isn't, do you have a way to make sure that if I receive a text that is
>>> similar to the training data, I can get a probability close to 1 if it fits
>>> into multiple categories?
>>> 
>>> Thanks,
>>> ~Ben
>> 
>> 


Re: Multiple document categories for MaxEnt model?

Posted by Benedict Holland <be...@gmail.com>.
I can actually live with a Pr() >> 0 for matching labels, maybe. What might
be a reasonable option is to specify a sum of probabilities to get over a
certain margin. Like, sum the probabilities in descending order, and select
the top few that sum over a threshold. That could actually work.
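A minimal sketch of that selection rule, in plain Python (the probabilities here are made up rather than real OpenNLP output, and the 0.85 threshold is just an example):

```python
def labels_over_threshold(probs, threshold=0.85):
    # Sort labels by probability, descending, and keep taking labels
    # until their cumulative probability clears the threshold.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for label, p in ranked:
        chosen.append(label)
        total += p
        if total >= threshold:
            break
    return chosen

# With the case-1 numbers from earlier in the thread, L1 and L2
# together clear 0.85:
print(labels_over_threshold({"L1": 0.7, "L2": 0.2, "L3": 0.1}))
# ['L1', 'L2']
```

One caveat: with a dominant label (say 0.9/0.05/0.05) this rule returns only the top label, so the threshold effectively decides how eagerly secondary labels are admitted.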

~Ben

On Thu, Apr 12, 2018 at 10:26 PM, <dr...@apache.org> wrote:

> Hi Ben,
>
>    if a document can be in multiple categories, you should see it
> reflected in the probabilities.  The top categories will be close in
> score.  It will not be 1/m because that would imply that ALL categories are
> “equally probable” or you have no idea.  However, if you have 3 classes and
> two are likely, it may be 0.49, 0.49, 0.02.  Remember that the results are
> normalized by a softmax at the end, so the sum of all probabilities
> will always be 1.
>    Sorry, but multi-class classification is more complicated than binary
> classification.  If you really are interested in multi-label
> classification, I’m not sure maxent (at least the way openNLP formulated
> the solution) is appropriate for your needs.  You might want to consider
> individual binary classifiers for each label.  Have 1 model for each label:
>
> train_cat1.txt...
> cat_1_TRUE <text>
> cat_1_FALSE <text>
> …
>
> train_cat2.txt…
> cat_2_FALSE <text>
> cat_2_TRUE <text>
>
> Hope it helps, Let me know what you wind up doing...
> Daniel
>
> > On Apr 12, 2018, at 4:22 PM, Benedict Holland <
> benedict.m.holland@gmail.com> wrote:
> >
> > Hello all,
> >
> > I understand that maximum entropy models are excellent at categorizing
> > documents. As it turns out, I have a situation where 1 document can be in
> > many categories (1:m relationship). I believe that I could create training
> > data that looks something like:
> >
> > category_1 <text>
> > category_2 <text>
> > ...
> >
> > If I do this, will the resulting probability model return category
> > probabilities as Pr(<text> in category_m) = 1/m for each of the m categories,
> > or will it return Pr(<text> in category_m) = 1 for each category m?
> >
> > This is a very important distinction. I really hope it is the latter. If it
> > isn't, do you have a way to make sure that if I receive a text that is
> > similar to the training data, I can get a probability close to 1 if it fits
> > into multiple categories?
> > into multiple categories?
> >
> > Thanks,
> > ~Ben
>
>

Re: Multiple document categories for MaxEnt model?

Posted by dr...@apache.org.
Hi Ben,

   if a document can be in multiple categories, you should see it reflected in the probabilities.  The top categories will be close in score.  It will not be 1/m because that would imply that ALL categories are “equally probable” or you have no idea.  However, if you have 3 classes and two are likely, it may be 0.49, 0.49, 0.02.  Remember that the results are normalized by a softmax at the end, so the sum of all probabilities will always be 1.
   Sorry, but multi-class classification is more complicated than binary classification.  If you really are interested in multi-label classification, I’m not sure maxent (at least the way openNLP formulated the solution) is appropriate for your needs.  You might want to consider individual binary classifiers for each label.  Have 1 model for each label:

train_cat1.txt...
cat_1_TRUE <text>   
cat_1_FALSE <text>
…

train_cat2.txt…
cat_2_FALSE <text>
cat_2_TRUE <text>
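Producing those per-label files from multi-labelled data is a small preprocessing step. A sketch in plain Python (the example documents and label names are invented; writing the files is left as a comment):

```python
# Hypothetical multi-labelled examples: (set of true labels, document text).
examples = [
    ({"cat_1", "cat_2"}, "first document text"),
    ({"cat_1"}, "second document text"),
    ({"cat_2"}, "third document text"),
]

def binary_training_lines(label, examples):
    # One line per document, tagged <label>_TRUE or <label>_FALSE,
    # matching the one-model-per-label layout shown above.
    return [
        f"{label}_{'TRUE' if label in labels else 'FALSE'} {text}"
        for labels, text in examples
    ]

for label in ("cat_1", "cat_2"):
    lines = binary_training_lines(label, examples)
    # e.g. open(f"train_{label}.txt", "w").write("\n".join(lines))
```

Each label then gets its own two-class model, and at prediction time a document is assigned every label whose model says TRUE.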

Hope it helps, Let me know what you wind up doing...
Daniel
  
> On Apr 12, 2018, at 4:22 PM, Benedict Holland <be...@gmail.com> wrote:
> 
> Hello all,
> 
> I understand that maximum entropy models are excellent at categorizing
> documents. As it turns out, I have a situation where 1 document can be in
> many categories (1:m relationship). I believe that I could create training
> data that looks something like:
> 
> category_1 <text>
> category_2 <text>
> ...
> 
> If I do this, will the resulting probability model return category
> probabilities as Pr(<text> in category_m) = 1/m for each of the m categories, or will it
> return Pr(<text> in category_m) = 1 for each category m?
> 
> This is a very important distinction. I really hope it is the latter. If it
> isn't, do you have a way to make sure that if I receive a text that is
> similar to the training data, I can get a probability close to 1 if it fits
> into multiple categories?
> 
> Thanks,
> ~Ben