Posted to users@opennlp.apache.org by Viraf Bankwalla <vi...@yahoo.com.INVALID> on 2017/07/05 14:34:13 UTC

TokenNameFinder

I am using OpenNLP 1.8.0 and have trained the NameFinder with approximately 78K sentences (perceptron model).  I have 11 named entity types, and am finding a lot of noise in the output.  The output from training indicates 39 outcomes.  I would have assumed that this would align with the number of named entity types.  Could someone please explain what the Number of Outcomes refers to?
Also, any guidance on data preparation and/or areas to explore to reduce the false positives (FPs) would be helpful.
Thanks
- viraf


Indexing events using cutoff of 3

    Computing event counts...  done. 1315813 events
    Indexing...  done.
Collecting events... Done indexing.
Incorporating indexed data for training...  
done.
    Number of Event Tokens: 1315813
        Number of Outcomes: 39
      Number of Predicates: 290935
Computing model parameters...
Performing 300 iterations.
  1:  . (1313259/1315813) 0.9980589947051747
  2:  . (1314613/1315813) 0.9990880163062684
  3:  . (1314904/1315813) 0.9993091723519983
  4:  . (1315136/1315813) 0.9994854891994531
  5:  . (1315250/1315813) 0.9995721276503576
  6:  . (1315335/1315813) 0.9996367264953303
  7:  . (1315402/1315813) 0.999687645584897
  8:  . (1315451/1315813) 0.9997248849190576
  9:  . (1315517/1315813) 0.9997750440222128
 10:  . (1315509/1315813) 0.9997689641309213
 20:  . (1315687/1315813) 0.9999042417121582
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1315427/1315813) 0.999706645245183
...done.
Compressed 290935 parameters to 13506
2507 outcome patterns 

Re: TokenNameFinder

Posted by Rodrigo Agerri <ro...@ehu.eus>.
Hello,

On Thu, Jul 6, 2017 at 1:55 PM, Viraf Bankwalla
<vi...@yahoo.com.invalid> wrote:
> Thanks.  The data is in opennlp format, and the training set has 74680 lines.  A small percentage of the lines have named entities, as shown below.  As the data and some of the labels refer to sensitive information, I have masked the labels below.
>    3314 <START:type1>
>    1568 <START:type2>
>     398 <START:type3>
>     289 <START:type4>
>     175 <START:type5>
>     159 <START:type6>
>      84 <START:type7>
>      81 <START:type8>
>      67 <START:type9>
>      29 <START:type10>
>      24 <START:type11>
>
> What should I look for to track down the discrepancy between the 39 reported outcomes and the expected 45 using BILOU?

For the less frequent classes, it means that some outcomes never
occur. For example, it could be that an I- tag never occurs for one
of the rare types; you would need to print the outcomes to know. In
any case, I would not worry about that.

> Any suggestions on how to improve accuracy would be appreciated.  My params / feature generator config is below

Well, it is very difficult (almost impossible) to learn a good model
for classes with very few samples, although this depends on the
lexical variability of the entity mentions. If the entities show
substantial variability, then from type4-type5 onwards you will not
learn anything but noise. If you do not have any additional training
data to improve those numbers, you could try the RegexNameFinder or
the DictionaryNameFinder.
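
For the dictionary route, here is a minimal sketch of how a
DictionaryNameFinder could back up the statistical model for the rare
types (the entries and the type name are made up for illustration):

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.namefind.DictionaryNameFinder;
import opennlp.tools.util.Span;
import opennlp.tools.util.StringList;

public class DictionaryFallback {
    public static void main(String[] args) {
        // Build an in-memory gazetteer of known mentions for a rare type.
        Dictionary dict = new Dictionary();
        dict.put(new StringList("Acme", "Holdings")); // multi-token entry
        dict.put(new StringList("Foobar"));           // single-token entry

        // Tags every exact dictionary match with the given type.
        DictionaryNameFinder finder = new DictionaryNameFinder(dict, "type10");

        String[] tokens = {"Contact", "Acme", "Holdings", "about", "Foobar", "."};
        for (Span span : finder.find(tokens)) {
            System.out.println(span); // e.g. [1..3) type10
        }
    }
}

The resulting spans can then be merged with the statistical finder's
output, preferring whichever source you trust more for each type.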

For other improvements via feature engineering, I suggest reading
related literature. In a previous email I listed the papers describing
the best performing NER systems in the CoNLL 2003 newswire benchmark.

http://mail-archives.apache.org/mod_mbox/opennlp-dev/201702.mbox/%3CCAKvDkVA0yHXuNQbA-2tJPm6QHVAKz4eMhG_wx31YuJrWffYZ9g%40mail.gmail.com%3E

Cheers,

Rodrigo

Re: TokenNameFinder

Posted by Viraf Bankwalla <vi...@yahoo.com.INVALID>.
Thanks.  The data is in opennlp format, and the training set has 74680 lines.  A small percentage of the lines have named entities, as shown below.  As the data and some of the labels refer to sensitive information, I have masked the labels below.  
   3314 <START:type1>
   1568 <START:type2>
    398 <START:type3>
    289 <START:type4>
    175 <START:type5>
    159 <START:type6>
     84 <START:type7>
     81 <START:type8>
     67 <START:type9>
     29 <START:type10>
     24 <START:type11> 

What should I look for to track down the discrepancy between the 39 reported outcomes and the expected 45 using BILOU?
Any suggestions on how to improve accuracy would be appreciated.  My params / feature generator config is below:
Parameters are: Algorithm=PERCEPTRON
Iterations=150
Cutoff=3
BeamSize=5

Feature generators are:
<generators>
    <cache>
        <generators>
            <window prevLength="3" nextLength="3">
                <custom class="com.maximus.ird.caimr.sii.fdl.opennlp.LowercaseTokenFeatureGenerator" />
            </window>
            <window prevLength="3" nextLength="3">
                <tokenclass wordAndClass="true" />
            </window>
            <window prevLength="3" nextLength="3">
                <custom class="com.maximus.ird.caimr.sii.fdl.opennlp.TokenPosFeatureGenerator" />
            </window>
            <definition />
            <prevmap />
            <custom class="opennlp.tools.util.featuregen.TrigramNameFeatureGenerator" />
            <sentence begin="true" end="false" />
        </generators>
    </cache>
</generators>

Where: LowercaseTokenFeatureGenerator is a TokenClassFeatureGenerator specifying word and class; TokenPosFeatureGenerator emits the token's POS tag; TrigramNameFeatureGenerator generates trigrams.
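
For anyone reproducing this setup: a custom generator such as the
LowercaseTokenFeatureGenerator above implements OpenNLP's
AdaptiveFeatureGenerator interface and needs a public no-argument
constructor to be loadable from the XML descriptor. The sketch below
is only an assumption of what such a class might look like, since the
original source is not shown:

package com.maximus.ird.caimr.sii.fdl.opennlp;

import java.util.List;

import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

// Hypothetical reconstruction: emits the lowercased token as a feature.
public class LowercaseTokenFeatureGenerator implements AdaptiveFeatureGenerator {

    @Override
    public void createFeatures(List<String> features, String[] tokens,
                               int index, String[] previousOutcomes) {
        // Prefix the feature so it cannot collide with other generators.
        features.add("lc=" + tokens[index].toLowerCase());
    }

    @Override
    public void updateAdaptiveData(String[] tokens, String[] outcomes) {
        // stateless generator: nothing to update
    }

    @Override
    public void clearAdaptiveData() {
        // stateless generator: nothing to clear
    }
}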

Thanks
- viraf

Re: TokenNameFinder

Posted by Rodrigo Agerri <ra...@apache.org>.
Hello,

If you choose BIO encoding, the number of classes is multiplied by 2
(B-type, I-type), and we add the O class. If you have, say, 12
classes, the number of outcomes will be 25.

With BILOU encoding it is classes x 4 (B, I, L, U) plus the O class:
12 classes x 4 combinations + O = 49.

I do not know how many entity types you actually have in the
training data, but with 11 entity types the number of outcomes should
be different:

with BIO: (11 * 2) + 1 = 23
with BILOU: (11 * 4) + 1 = 45
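
A quick way to see where this arithmetic comes from, and why the
observed count can fall short of it, is to enumerate the tag set
yourself. A minimal sketch in Java (generic B-/I-/L-/U- prefixes are
used for illustration; OpenNLP's internal outcome names differ, but
the count works the same):

import java.util.ArrayList;
import java.util.List;

public class BilouOutcomes {
    public static void main(String[] args) {
        // 11 placeholder entity types, standing in for the masked type1..type11.
        List<String> outcomes = new ArrayList<>();
        for (int i = 1; i <= 11; i++) {
            for (String prefix : new String[] {"B-", "I-", "L-", "U-"}) {
                outcomes.add(prefix + "type" + i);
            }
        }
        outcomes.add("O"); // the "outside" class

        // Prints 45. A trained model reports fewer (e.g. 39) when some
        // tags never occur in the data, such as I-/L- tags for types
        // whose mentions are always a single token.
        System.out.println(outcomes.size());
    }
}

Comparing such a list against the outcomes the model actually stores
shows exactly which tags are missing.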

If you have your corpus in opennlp format, could you run the
following?

cat en-6-class-opennlp.txt | perl -pe 's/ /\n/g' | grep "<START" | sort | uniq -c | sort -nr

If I do this with a 6-class corpus, I get:

43820 <START:location>
42882 <START:organization>
38802 <START:person>
23217 <START:date>
22976 <START:misc>
2137 <START:time>

HTH,

R



Re: Document Classification with imbalanced data

Posted by Tommaso Teofili <to...@gmail.com>.
Or you may hook into the training part and give a higher weight to
the rare classes relative to the common class, so that occurrences of
a rare class have a higher impact on the model parameters/weights.
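
As far as I know, OpenNLP's doccat trainer does not expose
per-instance weights directly, so the simplest approximation of this
is to replicate rare-class samples before training. A minimal sketch
(the category name "C2" and the factor of 50 are made-up values):

import java.util.ArrayList;
import java.util.List;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.util.CollectionObjectStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.TrainingParameters;

public class OversampleTrainer {

    // Replicates each sample of a rare category 'factor' times,
    // approximating a per-class weight of 'factor'.
    static List<DocumentSample> oversample(List<DocumentSample> samples,
                                           String rareCategory, int factor) {
        List<DocumentSample> out = new ArrayList<>();
        for (DocumentSample s : samples) {
            out.add(s);
            if (s.getCategory().equals(rareCategory)) {
                for (int i = 1; i < factor; i++) {
                    out.add(s); // duplicate the rare-class sample
                }
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        List<DocumentSample> samples = new ArrayList<>();
        // ... load your labelled pages into 'samples' here ...
        samples.add(new DocumentSample("C2", new String[] {"noisy", "ocr", "text"}));

        ObjectStream<DocumentSample> stream =
            new CollectionObjectStream<>(oversample(samples, "C2", 50));

        DoccatModel model = DocumentCategorizerME.train("en", stream,
            TrainingParameters.defaultParams(), new DoccatFactory());
    }
}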

Regards,
Tommaso

On Wed, 3 Jul 2019 at 17:51, Dan Russ <da...@gmail.com> wrote:

> You may have to run one class at a time and find a way to resolve cases
> where more than 1 class wants a document.
> Daniel

Re: Document Classification with imbalanced data

Posted by Dan Russ <da...@gmail.com>.
You may have to run one class at a time and find a way to resolve cases where more than 1 class wants a document.
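
For illustration, a minimal sketch of that one-class-at-a-time
resolution (everything here is hypothetical: it assumes you have
already trained one binary DoccatModel per class, each scoring its
class against an "other" category):

import java.util.Map;

import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class OneVsRestResolver {

    // Returns the class whose binary model claims the page most
    // confidently, or null if no model beats the threshold.
    public static String resolve(Map<String, DoccatModel> binaryModels,
                                 String[] tokens, double threshold) {
        String best = null;
        double bestScore = threshold;
        for (Map.Entry<String, DoccatModel> e : binaryModels.entrySet()) {
            DocumentCategorizerME cat = new DocumentCategorizerME(e.getValue());
            double[] probs = cat.categorize(tokens);
            double p = probs[cat.getIndex(e.getKey())];
            if (p > bestScore) {
                bestScore = p;
                best = e.getKey();
            }
        }
        return best;
    }
}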
Daniel


Re: Document Classification with imbalanced data

Posted by "viraf.bankwalla@yahoo.com.INVALID" <vi...@yahoo.com.INVALID>.
Thanks, I am unfamiliar with the approaches that you mentioned - will investigate.  I forgot to mention that this is a multi-class classification problem.  Each sample represents a page from a corpus of documents that have been scanned, with the text extracted using OCR (thus noisy text):

Label  | Samples | %
-------+---------+------
C1     | 131613  | 97.71
C2     |    873  |  0.65
C3     |    830  |  0.62
C4     |    492  |  0.37
C5     |    456  |  0.34
C6     |    430  |  0.32
- viraf



Re: Document Classification with imbalanced data

Posted by Dan Russ <da...@gmail.com>.
Have you considered using outlier detection methods?  I’m not really an expert on this, but maybe you can model your majority class very well and treat everything else as outliers.  Another option may be one-class classification (https://en.wikipedia.org/wiki/One-class_classification); SVDD is an example of this. Finally, you might want to look at data augmentation techniques.  I am in the middle of some work using conditional GANs, but it is not working out so great for me at the moment.

Let me know if any of these work out for you.
Daniel




Document Classification with imbalanced data

Posted by "viraf.bankwalla@yahoo.com.INVALID" <vi...@yahoo.com.INVALID>.
I am trying document classification using OpenNLP; however, my data is highly unbalanced (the majority class is 97%).  I recognize that I could randomly over- or under-sample the data set, and am reading up on SMOTE and ADASYN (though I am not sure how to apply these to OpenNLP).
Any suggestions on dealing with the highly unbalanced data would be appreciated.
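
For the random under-sampling route, a minimal sketch of thinning the
majority class before training (the category name "C1" and the keep
rate are assumptions):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import opennlp.tools.doccat.DocumentSample;

public class Undersample {

    // Keeps every minority-class sample, and each majority-class
    // sample with probability keepRate.
    public static List<DocumentSample> undersample(List<DocumentSample> samples,
                                                   String majorityCategory,
                                                   double keepRate, long seed) {
        Random rnd = new Random(seed);
        List<DocumentSample> out = new ArrayList<>();
        for (DocumentSample s : samples) {
            if (!s.getCategory().equals(majorityCategory)
                    || rnd.nextDouble() < keepRate) {
                out.add(s);
            }
        }
        return out;
    }
}

For example, undersample(samples, "C1", 0.05, 42L) would keep roughly
5% of the majority class, bringing it closer in size to the other
classes.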
Thanks
- viraf