Posted to users@opennlp.apache.org by Amal Elmah <am...@hotmail.com> on 2011/06/21 18:59:42 UTC

What is the problem with the training file

When I ran the command-line training tool on my data (trainingFile.txt), it gave the following error:
------------------------------------------------------------------------------------------------------------------------
C:\OpenNLP\apache-opennlp-1.5.1-incubating-bin\apache-opennlp-1.5.1-incubating>java -jar lib\opennlp-tools-*.jar TokenNameFinderTrainer -encoding UTF-8 -lang en -data trainingFile.txt -model mymodel.bin
Indexing events using cutoff of 5
        Computing event counts...  java.nio.charset.MalformedInputException: Input length = 1
Incorporating indexed data for training...
Exception in thread "main" java.lang.NullPointerException
        at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:272)
        at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:252)
        at opennlp.maxent.GIS.trainModel(GIS.java:228)
        at opennlp.maxent.GIS.trainModel(GIS.java:179)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:345)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:356)
        at opennlp.tools.cmdline.namefind.TokenNameFinderTrainerTool.run(TokenNameFinderTrainerTool.java:87)
        at opennlp.tools.cmdline.CLI.main(CLI.java:183) 
---------------------------------------------------------------------------
I do not know what the problem is. This is part of the data in my text file:
 
Professor <START> Michael <END> 
Professor <START> Naci  <END>
Dr <START> Richard <END> ( p / t ) 
Dr <START> David  <END>
Professor <START> Vic <END> 
Dr <START> Adrian  <END>
Dr <START> Martin <END>
Dr <START> Timothy  <END>
Dr <START> Ian  <END>
Dr <START> Ali <END> 
-----------------------------------------------------------------------------------------------------------------------
  		 	   		  

Re: What is the problem with the training file

Posted by James Kosin <ja...@gmail.com>.
Amal,

Here is a list of the character encodings supported by Java:
http://download.oracle.com/javase/1.4.2/docs/guide/intl/encoding.doc.html
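
The same information can also be read from the running JVM. A minimal sketch, assuming Java 7 or later; the class name ListCharsets and the helper supported are only illustrative, and the exact list printed depends on the installed JVM:

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

public class ListCharsets {

    /** True if this JVM recognises the given encoding name or alias. */
    static boolean supported(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // Every charset this JVM knows, keyed by canonical name.
        SortedMap<String, Charset> all = Charset.availableCharsets();
        for (Charset cs : all.values()) {
            // Print the canonical name followed by its accepted aliases.
            System.out.println(cs.name() + " " + cs.aliases());
        }
        System.out.println("UTF-8 supported: " + supported("UTF-8"));
    }
}
```

Any canonical name or alias printed here should be a safe value to pass as the encoding name.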

James

On 6/21/2011 9:03 PM, Amal Elmah wrote:
> Thanks 
>  
> I noticed that and corrected my file, and now it works. The problem is with the following data: I could not find any error in the format, but the trainer does not accept it.
>  
> Throughout <START> Ray <END> ’ s career , he was committed to developing public engagement with sociology and ensuring the value of sociological research is understood by decision makers .
>
>  thanks 
>  
>  
>


Re: What is the problem with the training file

Posted by James Kosin <ja...@gmail.com>.
On 6/21/2011 9:03 PM, Amal Elmah wrote:
> Thanks 
>  
> I noticed that and corrected my file, and now it works. The problem is with the following data: I could not find any error in the format, but the trainer does not accept it.
>  
> Throughout <START> Ray <END> ’ s career , he was committed to developing public engagement with sociology and ensuring the value of sociological research is understood by decision makers .
>
>  thanks 
>  
>  
>
The error is caused by the file's encoding not matching the encoding
given on the command line. I saved the file with the Windows default for
Notepad and got an error like this when I specified utf-8 as the encoding:
> C:\Users\James Kosin\Documents\NetBeansProjects\thesis\DocCompare>opennlp.bat TokenNameFinderTrainer -lang en -encoding utf-8 -cutoff 0 -data temp2.txt -model temp.model
> Indexing events using cutoff of 0
>
>         Computing event counts...  java.nio.charset.MalformedInputException: Input length = 1
> Incorporating indexed data for training...
> Exception in thread "main" java.lang.NullPointerException
>         at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:263)
>         at opennlp.maxent.GIS.trainModel(GIS.java:256)
>         at opennlp.model.TrainUtil.train(TrainUtil.java:170)
>         at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:381)
>         at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:453)
>         at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:476)
>         at opennlp.tools.cmdline.namefind.TokenNameFinderTrainerTool.run(TokenNameFinderTrainerTool.java:188)
>         at opennlp.tools.cmdline.CLI.main(CLI.java:187)
Notepad on Windows saves files as ANSI by default, which probably causes
the problem with the ’ (apostrophe) character in the string. You can force
UTF-8 by using Save As... and choosing UTF-8 as the encoding, instead of
the normal Save.
Java doesn't accept "ANSI" as an encoding name; at least, it didn't take
it when I tried.
I'm sure other encoding problems can show up if the encoding is not
specified properly on the command line.
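
To check a file's bytes before training, something like the following may help. It is only a sketch, assuming that "ANSI" on the machine in question means windows-1252 (the usual case on Western-European Windows, but an assumption here); the class and method names are illustrative and are not part of OpenNLP. In windows-1252 the ’ character is the single byte 0x92, which is not a valid UTF-8 sequence on its own.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    /** Strict decode: returns true only if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // A stray byte such as 0x92 is reported as malformed input.
            return false;
        }
    }

    /** Re-encodes bytes written in the given source charset as UTF-8. */
    static byte[] toUtf8(byte[] bytes, Charset source) {
        return new String(bytes, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Ray ’ s" as saved in windows-1252: the ’ is the single byte 0x92.
        byte[] ansi = {'R', 'a', 'y', ' ', (byte) 0x92, ' ', 's'};
        System.out.println("valid UTF-8 before: " + isValidUtf8(ansi));

        // Assumption: the source file really is windows-1252.
        byte[] utf8 = toUtf8(ansi, Charset.forName("windows-1252"));
        System.out.println("valid UTF-8 after:  " + isValidUtf8(utf8));
    }
}
```

The alternative to converting the file is to keep it as it is and pass the matching encoding name (for example windows-1252) to -encoding instead of UTF-8.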

James


RE: What is the problem with the training file

Posted by Amal Elmah <am...@hotmail.com>.
Thanks.
 
I noticed that and corrected my file, and now it works. The problem is with the following data: I could not find any error in the format, but the trainer does not accept it.
 
Throughout <START> Ray <END> ’ s career , he was committed to developing public engagement with sociology and ensuring the value of sociological research is understood by decision makers .

 thanks 
 
 


Re: What is the problem with the training file

Posted by James Kosin <ja...@gmail.com>.
On 6/21/2011 2:25 PM, Amal Elmah wrote:
>>> ---------------------------------------------------------------------------
>>> I do not know what is the problem and this is part of my data in the text file
>>>
>>> Professor<START> Michael<END>
>>> Professor<START> Naci<END>
>>> Dr<START> Richard<END> ( p / t )
>>> Dr<START> David<END>
>>> Professor<START> Vic<END>
>>> Dr<START> Adrian<END>
>>> Dr<START> Martin<END>
>>> Dr<START> Timothy<END>
>>> Dr<START> Ian<END>
>>> Dr<START> Ali<END>
>>> -----------------------------------------------------------------------------------------------------------------------
>>>
Amal,

This isn't exactly the correct format. Each <START> and <END> tag must be
separated from the neighbouring tokens by a space, like this:

Professor <START> Michael <END>
Professor <START> Naci <END>
Dr <START> Richard <END> ( p / t )
Dr <START> David <END>
Professor <START> Vic <END>
Dr <START> Adrian <END>
Dr <START> Martin <END>
Dr <START> Timothy <END>
Dr <START> Ian <END>
Dr <START> Ali <END>
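
A quick way to spot tags that are glued to a word is to split each line on whitespace and look at the tokens. This is only a sketch, not OpenNLP's own parser; the class name TagSpacingCheck and the method gluedTokens are made up for illustration, and the pattern also allows typed tags such as <START:person>, assuming those appear in the data.

```java
import java.util.ArrayList;
import java.util.List;

public class TagSpacingCheck {

    /** Returns the tokens in which a tag is glued to a neighbouring word. */
    static List<String> gluedTokens(String line) {
        List<String> bad = new ArrayList<>();
        for (String tok : line.trim().split("\\s+")) {
            boolean mentionsTag = tok.contains("<START") || tok.contains("<END>");
            // A clean tag is a whole token: <START>, <START:type> or <END>.
            boolean isExactTag = tok.matches("<START(:[^:>\\s]*)?>") || tok.equals("<END>");
            if (mentionsTag && !isExactTag) {
                bad.add(tok);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Tags glued to words: both tokens are reported.
        System.out.println(gluedTokens("Professor<START> Michael<END>"));
        // Tags separated by spaces: nothing is reported.
        System.out.println(gluedTokens("Professor <START> Michael <END>"));
    }
}
```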



RE: What is the problem with the training file

Posted by Amal Elmah <am...@hotmail.com>.
Hi Jörn,
 
Thanks for replying. I changed the encoding of the file to ANSI, but I got another error:
-----------------------------------------------------------------------------------------------------
C:\OpenNLP\apache-opennlp-1.5.1-incubating-bin\apache-opennlp-1.5.1-incubating>java -jar lib\opennlp-tools-*.jar TokenNameFinderTrainer -encoding UTF-8 -lang en -data data1.txt -model maha.bin
Indexing events using cutoff of 5
        Computing event counts...  java.io.IOException: Found unexpected annotation <END>.
Incorporating indexed data for training...
Exception in thread "main" java.lang.NullPointerException
        at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:272)
        at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:252)
        at opennlp.maxent.GIS.trainModel(GIS.java:228)
        at opennlp.maxent.GIS.trainModel(GIS.java:179)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:345)
        at opennlp.tools.namefind.NameFinderME.train(NameFinderME.java:356)
        at opennlp.tools.cmdline.namefind.TokenNameFinderTrainerTool.run(TokenNameFinderTrainerTool.java:87)
        at opennlp.tools.cmdline.CLI.main(CLI.java:183)
--------------------------------------------------------------------------------------------------------
 
I am sure there is no <END> annotation followed directly by a period in my file; there is always a space between <END> and the period.
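
One way to narrow this down is to scan the file for lines where an <END> token appears without an open <START>. The sketch below rests on two assumptions that the thread does not confirm: that the period in the message is simply the end of the sentence, and that the trainer complains when it meets an <END> it was not expecting. It is not OpenNLP's parser, and the names TagBalanceCheck and firstProblem are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TagBalanceCheck {

    /** Returns null if the tags on the line pair up, else a short description. */
    static String firstProblem(String line) {
        boolean inName = false;
        for (String tok : line.trim().split("\\s+")) {
            if (tok.startsWith("<START") && tok.endsWith(">")) {
                if (inName) return "nested " + tok;
                inName = true;
            } else if (tok.equals("<END>")) {
                if (!inName) return "unexpected <END>";
                inName = false;
            }
        }
        return inName ? "missing <END>" : null;
    }

    public static void main(String[] args) throws IOException {
        // With a file argument, check that file; otherwise use two sample lines.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                : Arrays.asList("Dr <START> Ian <END>", "Professor<START> Michael <END>");
        for (int i = 0; i < lines.size(); i++) {
            String problem = firstProblem(lines.get(i));
            if (problem != null) {
                System.out.println("line " + (i + 1) + ": " + problem);
            }
        }
    }
}
```

In the second sample line the <START> is glued to "Professor", so it is never seen as a tag and the following <END> has nothing to close.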
 
 


Re: What is the problem with the training file

Posted by Jörn Kottmann <ko...@gmail.com>.
Hi,

There is an issue with the encoding of your trainingFile.txt: for some
reason it cannot be decoded as UTF-8. Try opening it as UTF-8 in a text
editor and you will get an error too.

Hope that helps,
Jörn
