Posted to dev@opennlp.apache.org by mark meiklejohn <ma...@yahoo.co.uk> on 2011/08/30 22:55:15 UTC

Case Sensitive POSDictionary usage

Hi,

What is the best way to instantiate a POSDictionary with a custom tag
dictionary and the case sensitive flag set to false?

I need this for full parse tree output, as I have no control over the
input. I've managed to create a parse model; the only thing holding me
back is defining the case-insensitive tag dictionary.

I know a fix in this area was made recently, and I have a copy of
1.5.2rc, but I just can't see how to go about it.

Most of the methods in POSDictionary are deprecated and it's
recommended to use POSDictionary.create(), but that offers no way to
set the case sensitivity flag, which is true by default.

I can use the deprecated POSDictionary(String file, boolean
caseSensitive) constructor, but this leads to an NPE when calling
getTags(String word): the lookup for a word loaded into the dictionary
fails because the word was stored in its original case (e.g. 'Italy'
is stored, but 'italy' is looked up).

The reason is that this input:

'I INTEND TO COMPLETE SCHOOL AND GET A CERTIFICATE THIS SUMMER.'

results in

(TOP (S (NP (PRP I)) (VP (VBD INTEND) (VP (TO TO) (VP (VB COMPLETE) (NP 
(NNP SCHOOL) (CC AND) (NNP GET) (NNP A) (NNP CERTIFICATE) (NNP THIS) 
(NNP SUMMER))))) (. .)))

with all nouns identified as NNP(s). The only way to avoid this is to
set case sensitive to 'false', which was possible with 1.3.1 and gave
the correct output:

(TOP (S (NP (PRP I)) (VP (VBP INTEND) (VP (TO TO) (VP (VP (VB COMPLETE) 
(NP (NN SCHOOL))) (CC AND) (VP (VB GET) (NP (DT A) (NN CERTIFICATE) (NN 
THIS) (NNP SUMMER.))))))))

Maybe I'm being blind to the solution.

Your help is greatly appreciated.


Re: Case Sensitive POSDictionary usage

Posted by Jörn Kottmann <ko...@gmail.com>.
Hello,

We now have the following JIRA issues to improve things:
https://issues.apache.org/jira/browse/OPENNLP-286
https://issues.apache.org/jira/browse/OPENNLP-287
https://issues.apache.org/jira/browse/OPENNLP-288

Contributions are very welcome.

Jörn

On 8/31/11 1:25 AM, Jörn Kottmann wrote:
> On 8/30/11 10:55 PM, mark meiklejohn wrote:
>> Hi,
>>
>> What is the best way to instantiate a POSDictionary with a custom
>> tag dictionary and the case sensitive flag set to false?
>>
>
> The preferred way is to create it from the XML dictionary file. There
> you can set the case_sensitive attribute to false, and it will create
> a case-insensitive POS dictionary.
> We don't really have an API to create this dictionary; that looks like
> something we should add.
>
> For example:
> <dictionary case_sensitive="false">
> <entry tags="JJ VB">
> <token>brave</token>
> </entry>
> </dictionary>
>
> Such a dictionary should output JJ and VB for BRAVE as input.
> Looks like we could use a unit test for that.
>
>> I need this for full parse tree output, as I have no control over
>> the input. I've managed to create a parse model; the only thing
>> holding me back is defining the case-insensitive tag dictionary.
>>
>> I know a fix in this area was made recently, and I have a copy of
>> 1.5.2rc, but I just can't see how to go about it.
>>
>> Most of the methods in POSDictionary are deprecated and it's
>> recommended to use POSDictionary.create(), but that offers no way to
>> set the case sensitivity flag, which is true by default.
>>
>> I can use the deprecated POSDictionary(String file, boolean
>> caseSensitive) constructor, but this leads to an NPE when calling
>> getTags(String word): the lookup for a word loaded into the
>> dictionary fails because the word was stored in its original case
>> (e.g. 'Italy' is stored, but 'italy' is looked up).
>>
>
> That is a bug. When the dictionary is case insensitive, the
> constructor does not convert the words to lower case tokens, but the
> lookup assumes it has. We will fix that before we release.
>
> If you create a dictionary this way, then serialize it to xml, and 
> re-create it, it should work.
>
> Hope this helps,
> Jörn


Re: Case Sensitive POSDictionary usage

Posted by Jörn Kottmann <ko...@gmail.com>.
On 8/30/11 10:55 PM, mark meiklejohn wrote:
> Hi,
>
> What is the best way to instantiate a POSDictionary with a custom
> tag dictionary and the case sensitive flag set to false?
>

The preferred way is to create it from the XML dictionary file. There
you can set the case_sensitive attribute to false, and it will create a
case-insensitive POS dictionary.
We don't really have an API to create this dictionary; that looks like
something we should add.

For example:
<dictionary case_sensitive="false">
<entry tags="JJ VB">
<token>brave</token>
</entry>
</dictionary>

Such a dictionary should output JJ and VB for BRAVE as input.
Looks like we could use a unit test for that.
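
To make the format concrete, here is a minimal stdlib-only Java sketch
of how a dictionary file like the example above could be read and
queried. It is an illustration, not OpenNLP's loader: the class
XmlTagDictionarySketch and its methods are hypothetical names, and
treating a missing case_sensitive attribute as "true" is an assumption
based on the default mentioned in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration only; this is not OpenNLP's POSDictionary.
final class XmlTagDictionarySketch {
    private final Map<String, String[]> entries = new HashMap<>();
    private final boolean caseSensitive;

    private XmlTagDictionarySketch(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Reads <dictionary case_sensitive="..."> containing
    // <entry tags="..."><token>...</token></entry> children.
    static XmlTagDictionarySketch load(InputStream in) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in).getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not read dictionary XML", e);
        }
        // Assumption: anything other than "false" (including a missing
        // attribute) means the dictionary is case sensitive.
        boolean sensitive = !"false".equals(root.getAttribute("case_sensitive"));
        XmlTagDictionarySketch dict = new XmlTagDictionarySketch(sensitive);

        NodeList entryNodes = root.getElementsByTagName("entry");
        for (int i = 0; i < entryNodes.getLength(); i++) {
            Element entry = (Element) entryNodes.item(i);
            // Assumption: each entry holds exactly one <token> element.
            String token = entry.getElementsByTagName("token")
                    .item(0).getTextContent().trim();
            String[] tags = entry.getAttribute("tags").trim().split("\\s+");
            dict.entries.put(dict.key(token), tags);
        }
        return dict;
    }

    // Convenience wrapper for XML held in a String.
    static XmlTagDictionarySketch loadFromString(String xml) {
        return load(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // The same normalization is applied when storing and when looking up.
    private String key(String word) {
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }

    // Returns null for unknown words, so callers should check the result.
    String[] getTags(String word) {
        return entries.get(key(word));
    }
}
```

Loading the example dictionary above with loadFromString and calling
getTags("BRAVE") returns the tags JJ and VB, because both the stored
token and the queried word are lower-cased before comparison.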

> I need this for full parse tree output, as I have no control over
> the input. I've managed to create a parse model; the only thing
> holding me back is defining the case-insensitive tag dictionary.
>
> I know a fix in this area was made recently, and I have a copy of
> 1.5.2rc, but I just can't see how to go about it.
>
> Most of the methods in POSDictionary are deprecated and it's
> recommended to use POSDictionary.create(), but that offers no way to
> set the case sensitivity flag, which is true by default.
>
> I can use the deprecated POSDictionary(String file, boolean
> caseSensitive) constructor, but this leads to an NPE when calling
> getTags(String word): the lookup for a word loaded into the
> dictionary fails because the word was stored in its original case
> (e.g. 'Italy' is stored, but 'italy' is looked up).
>

That is a bug. When the dictionary is case insensitive, the constructor
does not convert the words to lower case tokens, but the lookup assumes
it has. We will fix that before we release.

If you create a dictionary this way, then serialize it to xml, and 
re-create it, it should work.
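
The mismatch described above can be sketched in a few lines of
stdlib-only Java. This is a hypothetical stand-in, not the
POSDictionary source: the class LookupMismatchSketch and its method
names are made up for illustration, and a plain Map plays the role of
the dictionary's internal storage. It shows why a word stored as
'Italy' is missed by a lookup that lower-cases its argument, and how
re-keying the entries (presumably what the serialize and re-create
round trip achieves) makes the lookup succeed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of the store/lookup mismatch; not OpenNLP code.
final class LookupMismatchSketch {

    // Lookup that assumes every stored key is already lower case.
    static String[] lookup(Map<String, String[]> stored, String word) {
        return stored.get(word.toLowerCase(Locale.ROOT));
    }

    // Returns a copy whose keys are lower-cased, matching what lookup() assumes.
    static Map<String, String[]> lowerCaseKeys(Map<String, String[]> stored) {
        Map<String, String[]> normalized = new HashMap<>();
        for (Map.Entry<String, String[]> e : stored.entrySet()) {
            normalized.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return normalized;
    }
}
```

With a map that holds "Italy" under its original spelling, lookup
returns null for "Italy", and dereferencing that result is where an NPE
would come from. After lowerCaseKeys, the same lookup finds the entry
for "Italy", "italy" or "ITALY".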

Hope this helps,
Jörn