Posted to issues@opennlp.apache.org by "Michael (Jira)" <ji...@apache.org> on 2020/09/23 19:48:00 UTC

[jira] [Updated] (OPENNLP-1309) NameFinderME - Unexpected result using unchanged training data

     [ https://issues.apache.org/jira/browse/OPENNLP-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael updated OPENNLP-1309:
-----------------------------
    Description: 
 

Hello,

Based on the _testNameFinder()_ function in [NameFinderMETest.java|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java], I wrote a simple test and changed the [test sentence|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java#L79]
 from *(1)*:

 
{code:java}
String[] sentence = {"Alisa",
 "appreciated",
 "the",
 "hint",
 "and",
 "enjoyed",
 "a",
 "delicious",
 "traditional",
 "meal."};
{code}
 

 

to *(2)*:
{code:java}
String[] sentence = {"Alisa",
 "and",
 "Mike",
 "appreciated",
 "the",
 "hint",
 "and",
 "enjoyed",
 "a",
 "delicious",
 "traditional",
 "meal."};
{code}
 

 

The only change is the added "and Mike". I expected two results (the names _Alisa_ and _Mike_), because both names are annotated in the training data. Instead, I get only one result (Mike) for *(2)*. I used the training data file [AnnotatedSentences.txt|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/resources/opennlp/tools/namefind/AnnotatedSentences.txt] unchanged.

Can anyone tell me what's wrong? Thanks.
h3. +Test code:+

 
{code:java}
import java.io.File;
import java.util.Collections;
import opennlp.tools.namefind.*;
import opennlp.tools.util.*;

// Path to the unchanged training data file
String trainingDatafilePath = "opennlp/tools/namefind/AnnotatedSentences.txt";
String encoding = "ISO-8859-1";
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
    new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File(trainingDatafilePath)), encoding));

TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);

// Train the model with no explicit name type (null) and the BIO codec
TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream,
    params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);

// Now test whether it can detect the names in the sample sentence
String[] sentence = {"Alisa",
    "and",
    "Mike",
    "appreciated",
    "the",
    "hint",
    "and",
    "enjoyed",
    "a",
    "delicious",
    "traditional",
    "meal."};
Span[] names = nameFinder.find(sentence);
if (names != null && names.length != 0) {
    System.out.println(" > Found [" + names.length + "] results");
    int resultNo = 1; // running result number (the original always printed 1)
    for (Span name : names) {
        // Concatenate the tokens covered by the span [start, end)
        String personName = "";
        for (int i = name.getStart(); i < name.getEnd(); i++) {
            personName += sentence[i] + " ";
        }
        System.out.println(" > Result " + resultNo++ + ": Type: [" + name.getType() + "] : PersonName: [" + personName + "]\t [probability=" + name.getProb() + "]");
    }
} else {
    System.out.println(" > No results found");
}
{code}
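
The inner loop in the test code appends a space after every token, which is why the printed names carry a trailing space (for example "[Alisa ]"). As a minimal sketch using only the JDK, the tokens of a half-open range [start, end) could be joined without that trailing space. The class and method names here (SpanText, cover) are hypothetical and not part of OpenNLP.

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP: joins the tokens in the
// half-open range [start, end), the same range the test code walks
// with name.getStart() and name.getEnd().
class SpanText {
    static String cover(String[] tokens, int start, int end) {
        // copyOfRange takes an exclusive end index, matching the loop bounds
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] sentence = {"Alisa", "and", "Mike", "appreciated", "the", "hint"};
        System.out.println("[" + cover(sentence, 0, 1) + "]"); // prints "[Alisa]"
        System.out.println("[" + cover(sentence, 2, 3) + "]"); // prints "[Mike]"
    }
}
```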
 

 
h3. +Result for (1):+

Indexing events with TwoPass using cutoff of 1
 Computing event counts... done. 1392 events
 Indexing... done.
 Collecting events... Done indexing in 0.22 s.
 Incorporating indexed data for training... 
 done.
 Number of Event Tokens: 1392
 Number of Outcomes: 3
 Number of Predicates: 9164
 Computing model parameters...
 Performing 70 iterations.
 1: . (1355/1392) 0.9734195402298851
 2: . (1383/1392) 0.9935344827586207
 3: . (1390/1392) 0.9985632183908046
 4: . (1390/1392) 0.9985632183908046
 5: . (1391/1392) 0.9992816091954023
 6: . (1392/1392) 1.0
 7: . (1392/1392) 1.0
 8: . (1392/1392) 1.0
 9: . (1392/1392) 1.0
 Stopping: change in training set accuracy less than 1.0E-5
 Stats: (1392/1392) 1.0
 ...done.

*Found [1] results*
 *Result 1: Type: [default] : PersonName: [Alisa ] [probability=0.5483001511243855]*
h3. +Result for (2):+

Indexing events with TwoPass using cutoff of 1
 Computing event counts... done. 1392 events
 Indexing... done.
 Collecting events... Done indexing in 0.22 s.
 Incorporating indexed data for training... 
 done.
 Number of Event Tokens: 1392
 Number of Outcomes: 3
 Number of Predicates: 9164
 Computing model parameters...
 Performing 70 iterations.
 1: . (1355/1392) 0.9734195402298851
 2: . (1383/1392) 0.9935344827586207
 3: . (1390/1392) 0.9985632183908046
 4: . (1390/1392) 0.9985632183908046
 5: . (1391/1392) 0.9992816091954023
 6: . (1392/1392) 1.0
 7: . (1392/1392) 1.0
 8: . (1392/1392) 1.0
 9: . (1392/1392) 1.0
 Stopping: change in training set accuracy less than 1.0E-5
 Stats: (1392/1392) 1.0
 ...done.

*Found [1] results*
 *Result 1: Type: [default] : PersonName: [Mike ] [probability=0.460685209028902]*
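
The training log reports 3 outcomes, which would be consistent with a BIO-style tagging of a single name type (start, continue, other). As a rough sketch of how such per-token outcomes map to spans, assuming labels of the form "default-start", "default-cont" and "other" (the label strings and the BioDecodeSketch class are illustrative assumptions, not taken from the OpenNLP source), the expected and the observed result for sentence *(2)* would differ only in the outcome of the first token:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: decodes assumed BIO-style outcome labels into
// {start, end} index pairs (end exclusive). Not OpenNLP's implementation.
class BioDecodeSketch {
    static List<int[]> decode(String[] outcomes) {
        List<int[]> spans = new ArrayList<>();
        int start = -1; // index where the currently open span began, -1 if none
        for (int i = 0; i < outcomes.length; i++) {
            boolean continues = outcomes[i].endsWith("-cont") && start != -1;
            if (!continues && start != -1) {
                spans.add(new int[] {start, i}); // close the open span
                start = -1;
            }
            if (outcomes[i].endsWith("-start")) {
                start = i; // open a new span at this token
            }
        }
        if (start != -1) {
            spans.add(new int[] {start, outcomes.length}); // span runs to the end
        }
        return spans;
    }

    public static void main(String[] args) {
        // First three tokens of sentence (2): "Alisa", "and", "Mike"
        String[] expected = {"default-start", "other", "default-start"};
        String[] observed = {"other", "other", "default-start"};
        System.out.println("expected spans: " + decode(expected).size());
        System.out.println("observed spans: " + decode(observed).size());
    }
}
```

Under these assumptions, the single reported result corresponds to the first token being tagged "other" rather than as the start of a name.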


> NameFinderME - Unexpected result using unchanged training data
> --------------------------------------------------------------
>
>                 Key: OPENNLP-1309
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1309
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: 1.9.2
>            Reporter: Michael
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)