Posted to issues@opennlp.apache.org by "Michael (Jira)" <ji...@apache.org> on 2020/09/23 19:48:00 UTC
[jira] [Updated] (OPENNLP-1309) NameFinderME - Unexpected result using unchanged training data
[ https://issues.apache.org/jira/browse/OPENNLP-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael updated OPENNLP-1309:
-----------------------------
Description:
Hello,
Based on [NameFinderMETest.java|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java] (function _testNameFinder()_), I wrote a simple test and changed the [test sentence|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/java/opennlp/tools/namefind/NameFinderMETest.java#L79]
from *(1)*:
{code:java}
String[] sentence = {"Alisa",
"appreciated",
"the",
"hint",
"and",
"enjoyed",
"a",
"delicious",
"traditional",
"meal."};
{code}
to *(2)*:
{code:java}
String[] sentence = {"Alisa",
"and",
"Mike",
"appreciated",
"the",
"hint",
"and",
"enjoyed",
"a",
"delicious",
"traditional",
"meal."};
{code}
I only added "and Mike" and expected 2 results (the two names _Alisa_ and _Mike_), because both names are annotated in the training data. Instead, I get only 1 result (_Mike_) for *(2)*. The training data file is [AnnotatedSentences.txt|https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/resources/opennlp/tools/namefind/AnnotatedSentences.txt], used unchanged.
Can anyone tell me what's wrong? Thanks.
h3. +Test code:+
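{code:java}
import java.io.File;
import java.util.Collections;
import opennlp.tools.namefind.BioCodec;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.TrainingParameters;

// The statements below are assumed to run inside a method declared with "throws IOException".
// Training data: the unchanged AnnotatedSentences.txt from the OpenNLP test resources.
String trainingDatafilePath = "opennlp/tools/namefind/AnnotatedSentences.txt";
String encoding = "ISO-8859-1";
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
    new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File(trainingDatafilePath)), encoding));

// Same training parameters as in NameFinderMETest.
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);
TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream,
    params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
TokenNameFinder nameFinder = new NameFinderME(nameFinderModel);

// now test if it can detect the sample sentences
String[] sentence = {"Alisa",
    "and",
    "Mike",
    "appreciated",
    "the",
    "hint",
    "and",
    "enjoyed",
    "a",
    "delicious",
    "traditional",
    "meal."};
Span[] names = nameFinder.find(sentence);
if (names != null && names.length != 0) {
  System.out.println(" > Found ["+names.length+"] results");
  int resultNumber = 1; // running index so each result is numbered correctly
  for (Span name : names) {
    // Concatenate the tokens covered by the span (end index is exclusive).
    String personName = "";
    for (int i = name.getStart(); i < name.getEnd(); i++) {
      personName += sentence[i] + " ";
    }
    System.out.println(" > Result "+resultNumber+": Type: ["+name.getType()+"] : PersonName: ["+personName+"]\t [probability="+name.getProb()+"]");
    resultNumber++;
  }
} else {
  System.out.println(" > No results found");
}
{code}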
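The inner loop in the test code appends a space after every token, which is why the printed names carry a trailing space (for example {{[Alisa ]}}). A minimal sketch of joining a token range without that trailing space, using only the JDK. {{SpanText}} and {{join}} are hypothetical names used for illustration (they are not part of OpenNLP), and {{start}}/{{end}} stand in for {{Span.getStart()}}/{{Span.getEnd()}}, with the end index treated as exclusive as in the loop above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class SpanText {
    private SpanText() {}

    /** Joins tokens[start..end) with single spaces; the end index is exclusive. */
    static String join(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        String[] sentence = {"Alisa", "and", "Mike", "appreciated", "the", "hint"};
        // Token ranges that a name finder could report for the two names.
        System.out.println("[" + join(sentence, 0, 1) + "]");
        System.out.println("[" + join(sentence, 2, 3) + "]");
    }
}
```

In the test code, the call would be {{SpanText.join(sentence, name.getStart(), name.getEnd())}} in place of the inner loop.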
h3. +Result for (1):+
Indexing events with TwoPass using cutoff of 1
Computing event counts... done. 1392 events
Indexing... done.
Collecting events... Done indexing in 0.22 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 1392
Number of Outcomes: 3
Number of Predicates: 9164
Computing model parameters...
Performing 70 iterations.
1: . (1355/1392) 0.9734195402298851
2: . (1383/1392) 0.9935344827586207
3: . (1390/1392) 0.9985632183908046
4: . (1390/1392) 0.9985632183908046
5: . (1391/1392) 0.9992816091954023
6: . (1392/1392) 1.0
7: . (1392/1392) 1.0
8: . (1392/1392) 1.0
9: . (1392/1392) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1392/1392) 1.0
...done.
*Found [1] results*
*Result 1: Type: [default] : PersonName: [Alisa ] [probability=0.5483001511243855]*
h3. +Result for (2):+
Indexing events with TwoPass using cutoff of 1
Computing event counts... done. 1392 events
Indexing... done.
Collecting events... Done indexing in 0.22 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 1392
Number of Outcomes: 3
Number of Predicates: 9164
Computing model parameters...
Performing 70 iterations.
1: . (1355/1392) 0.9734195402298851
2: . (1383/1392) 0.9935344827586207
3: . (1390/1392) 0.9985632183908046
4: . (1390/1392) 0.9985632183908046
5: . (1391/1392) 0.9992816091954023
6: . (1392/1392) 1.0
7: . (1392/1392) 1.0
8: . (1392/1392) 1.0
9: . (1392/1392) 1.0
Stopping: change in training set accuracy less than 1.0E-5
Stats: (1392/1392) 1.0
...done.
*Found [1] results*
*Result 1: Type: [default] : PersonName: [Mike ] [probability=0.460685209028902]*
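To make the expectation concrete: the log reports 3 outcomes, which is consistent with a begin/continue/other tagging scheme such as the one used with {{BioCodec}}. The sketch below is a simplified, JDK-only illustration of how such a tag sequence maps to spans. It is not OpenNLP's decoder, and the class name {{BioSketch}}, the method {{decode}}, and the tag strings {{start}}, {{cont}} and {{other}} are assumptions chosen for illustration. Under this scheme, two results for sentence (2) would require a {{start}} tag on both _Alisa_ and _Mike_; the observed output would correspond to only _Mike_ being tagged {{start}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration of begin/continue/other decoding; not OpenNLP code.
final class BioSketch {
    private BioSketch() {}

    /** Returns one {start, end} pair per name; the end index is exclusive. */
    static List<int[]> decode(String[] tags) {
        List<int[]> spans = new ArrayList<>();
        int begin = -1; // -1 means no span is currently open
        for (int i = 0; i < tags.length; i++) {
            if (tags[i].equals("start")) {
                // A new name begins; close any span that is still open.
                if (begin >= 0) spans.add(new int[] {begin, i});
                begin = i;
            } else if (tags[i].equals("other")) {
                // A non-name token ends the open span, if there is one.
                if (begin >= 0) spans.add(new int[] {begin, i});
                begin = -1;
            }
            // "cont" extends the open span, so nothing needs to change here.
        }
        if (begin >= 0) spans.add(new int[] {begin, tags.length});
        return spans;
    }

    public static void main(String[] args) {
        // Sentence (2) has 12 tokens: Alisa=0, and=1, Mike=2, ...
        String[] expected = new String[12];
        Arrays.fill(expected, "other");
        expected[0] = "start"; // Alisa
        expected[2] = "start"; // Mike

        String[] observed = new String[12];
        Arrays.fill(observed, "other");
        observed[2] = "start"; // only Mike

        System.out.println(decode(expected).size()); // prints 2
        System.out.println(decode(observed).size()); // prints 1
    }
}
```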
> NameFinderME - Unexpected result using unchanged training data
> --------------------------------------------------------------
>
> Key: OPENNLP-1309
> URL: https://issues.apache.org/jira/browse/OPENNLP-1309
> Project: OpenNLP
> Issue Type: Bug
> Components: Name Finder
> Affects Versions: 1.9.2
> Reporter: Michael
> Priority: Major
>
>