Posted to users@opennlp.apache.org by "Jim - FooBar();" <ji...@gmail.com> on 2012/10/20 18:36:23 UTC

subclassing Dictionary failure...

Hello everyone,

as you've probably gathered from the subject line, I'm having issues 
subclassing the Dictionary class. In a nutshell, I want to be able to 
provide a mapping between names and synonyms, which the current 
Dictionary doesn't seem to support. So my approach is the following:

 1. create a class (MyDictionary) that extends Dictionary and overrides
    '.contains()'
 2. MyDictionary calls the constructor of the superclass with no args
    (per 'super();')
 3. it also hard-codes 'maxTokenCount' and 'minTokenCount' (10 & 1
    respectively) because there are no setter methods for them and they
    are declared private. Unless one reads entries in from an XML file
    and 'puts' them in the Dictionary via 'EntryInserter', there is no
    way to provide them even if we know the correct values. Without
    doing this the name-finder breaks out of the for-loop because this
    condition is true straight away:

if (lengthSearching > d.getMaxTokenCount()) { // maxTokenCount stays 0 if .put() is never called
    break; // breaks immediately
} else ...
    ....

Anyway, I know it is not pretty to hard-code it like that but I couldn't 
think of anything else... I have verified that MyDictionary works as 
expected on its own. That is, I'm passing a string to .contains() and it 
correctly says true / false. I have also verified that when MyDictionary 
is passed to the DictionaryNameFinder it spends some time searching, 
which implies that the for-loop continues as usual. Before hard-coding 
'maxTokenCount' and 'minTokenCount' the call to .find() would finish 
instantly because it never reached the call to .contains()...
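
To make the early break concrete, here is a minimal, self-contained sketch. It is not the OpenNLP source: SketchDictionary, SynonymOnlyDictionary and FinderSketch are hypothetical names invented for illustration, and the loop only assumes the shape of the condition quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a dictionary; NOT the OpenNLP Dictionary class.
class SketchDictionary {
    private final Set<List<String>> entries = new HashSet<>();
    private int maxTokenCount = 0; // stays 0 until put() is called

    void put(String... tokens) {
        entries.add(Arrays.asList(tokens));
        maxTokenCount = Math.max(maxTokenCount, tokens.length);
    }

    boolean contains(List<String> tokens) {
        return entries.contains(tokens);
    }

    int getMaxTokenCount() {
        return maxTokenCount;
    }
}

// Overrides contains() but never calls put(), so maxTokenCount stays 0.
class SynonymOnlyDictionary extends SketchDictionary {
    @Override
    boolean contains(List<String> tokens) {
        return tokens.equals(Arrays.asList("ibuprofen"));
    }
}

class FinderSketch {
    // Returns matches as {start, end} pairs with an exclusive end index.
    static List<int[]> find(List<String> tokens, SketchDictionary d) {
        List<int[]> found = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            int bestEnd = -1;
            for (int end = start + 1; end <= tokens.size(); end++) {
                int lengthSearching = end - start;
                if (lengthSearching > d.getMaxTokenCount()) {
                    break; // with maxTokenCount == 0 this fires before contains() runs
                }
                if (d.contains(tokens.subList(start, end))) {
                    bestEnd = end; // remember the longest match from this start
                }
            }
            if (bestEnd != -1) {
                found.add(new int[] {start, bestEnd});
                start = bestEnd - 1; // resume scanning after the match
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("I", "use", "ibuprofen", ".");
        System.out.println(find(sentence, new SynonymOnlyDictionary()).size());
    }
}
```

In this sketch the subclass finds nothing even though its contains() is correct, because the length check stops the inner loop first. A dictionary filled through put() finds the token at index 2.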

My problem is that despite everything seemingly working as expected I 
get 0 in all statistics!!! I can verify that when I use a proper XML 
file with the regular Dictionary I get decent statistics, so adding 
synonyms to the search space should only improve the statistics and 
decrease performance. I do see a decrease in performance but certainly 
no improvement in the statistics! Very weird stuff...

would you say that I need to follow any extra steps for subclassing the 
Dictionary? It has many methods but only the .contains() and the 
getMaxTokenCount() are used in the actual .find() method of the NameFinder.

any clues/pointers? the odd thing is that MyDictionary on its own does 
the right thing (regardless of what happens inside the .contains() method).

Help please...
thanks in advance

Jim

Re: subclassing Dictionary failure...

Posted by "Jim foo.bar" <ji...@gmail.com>.
On 25/10/12 14:43, Jörn Kottmann wrote:
> the dictionary class was not designed to be sub-classed.
> If you want to implement a custom dictionary you can use our event style
> interface to parse a dictionary like file. 

I subclassed it just fine... The only real problem I faced was the fact 
that unless the entries are inserted into the Dictionary one-by-one (per 
EntryInserter), maxTokenCount & minTokenCount are not initialised 
properly. So, for example, I've got a giant HashMap with official terms 
as keys and lists of synonyms as the values. The data is already in 
there - I shouldn't need to insert anything into the Dictionary, just 
override .contains()... However, with max/minTokenCount being part of 
the Dictionary's internal state there is no other option but to 
calculate those separately and 'set' them before .contains() is called 
for the first time. Anyway, what I'm trying to say is that it is not 
impossible, it's just not as pretty as I expected! It would have been 
nice to have setters for these 2 fields - they are not declared final 
or anything like that! I've got it working just fine - the problem is 
that my statistics are worse when including synonyms because most 
synonyms are acronyms and acronyms have not been annotated in my 
gold-corpus! :-(
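
For what it's worth, calculating the two counts separately from such a map could look roughly like the sketch below. It is only an illustration: TokenCountSketch and its methods are hypothetical names, the sample entries are just examples, and splitting on whitespace is an assumption that may not match the tokenizer actually used.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP.
class TokenCountSketch {

    // Returns {min, max} token counts over all keys and all synonyms.
    static int[] tokenCountRange(Map<String, List<String>> synonyms) {
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (Map.Entry<String, List<String>> entry : synonyms.entrySet()) {
            int keyCount = countTokens(entry.getKey());
            min = Math.min(min, keyCount);
            max = Math.max(max, keyCount);
            for (String synonym : entry.getValue()) {
                int count = countTokens(synonym);
                min = Math.min(min, count);
                max = Math.max(max, count);
            }
        }
        if (max == 0) {
            min = 0; // empty map (or only empty terms): report 0 for both
        }
        return new int[] {min, max};
    }

    // Assumption: tokens are separated by whitespace.
    static int countTokens(String term) {
        String trimmed = term.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new LinkedHashMap<>();
        synonyms.put("ibuprofen", Arrays.asList("Advil"));
        synonyms.put("acetylsalicylic acid", Arrays.asList("aspirin", "ASA"));
        int[] range = tokenCountRange(synonyms);
        System.out.println("min=" + range[0] + " max=" + range[1]);
    }
}
```

The resulting values are what a subclass would then have to report as its min/max token counts, in place of the hard-coded 10 & 1.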

thanks for your time Jorn

Jim

Re: subclassing Dictionary failure...

Posted by Jörn Kottmann <ko...@gmail.com>.
Hi Jim

the Dictionary class was not designed to be sub-classed.
If you want to implement a custom dictionary you can use our event-style
interface to parse a dictionary-like file.

Have a look at the code in the Dictionary class which loads and 
serializes it.

HTH,
Jörn


Re: subclassing Dictionary failure...

Posted by "Jim - FooBar();" <ji...@gmail.com>.
Thanks James, I'll look into it...at the moment however, I'm working 
with DrugBank which does provide names + synonyms...

Jim


On 21/10/12 01:47, James Kosin wrote:
> On 10/20/2012 2:10 PM, Jim - FooBar(); wrote:
>> Never mind, I found the problem...It has nothing to do with the finding
>> process...it is simply because the tag produced during evaluation
>> doesn't include the type! It is predicting <START> ... <END> instead
>> of <START:drug> ... <END>.
>>
>> I could hardcode that as well for the purposes of getting some results...
>>
>> Jim
>>
>>
> Jim,
>
> Have you tried the jwnd and the Word Net Dictionary?  It has limitless
> tagging capability for finding synonyms and other word relationships.
>
>      http://wordnet.princeton.edu/
>
>
> It provides more relationships and may be something to look into for your
> work.
>
> James
>


Re: subclassing Dictionary failure...

Posted by James Kosin <ja...@gmail.com>.
On 10/20/2012 2:10 PM, Jim - FooBar(); wrote:
> Never mind, I found the problem...It has nothing to do with the finding
> process...it is simply because the tag produced during evaluation
> doesn't include the type! It is predicting <START> ... <END> instead
> of <START:drug> ... <END>.
>
> I could hardcode that as well for the purposes of getting some results...
>
> Jim
>
>
Jim,

Have you tried the jwnd and the Word Net Dictionary?  It has limitless
tagging capability for finding synonyms and other word relationships.

    http://wordnet.princeton.edu/


It provides more relationships and may be something to look into for your
work.

James


Re: subclassing Dictionary failure...

Posted by "Jim - FooBar();" <ji...@gmail.com>.
Never mind, I found the problem...It has nothing to do with the finding 
process...it is simply because the tag produced during evaluation 
doesn't include the type! It is predicting <START> ... <END> instead of 
<START:drug> ... <END>.

I could hardcode that as well for the purposes of getting some results...
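
A small sketch of why a missing type would give 0 everywhere, assuming the evaluator only counts a predicted span as correct when both the offsets and the type equal the gold span. TypedSpan and EvalSketch are hypothetical names made up for this illustration; they are not the OpenNLP Span class or evaluator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical span type for illustration; NOT opennlp.tools.util.Span.
final class TypedSpan {
    final int start;
    final int end;      // exclusive
    final String type;  // null when no type was assigned

    TypedSpan(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }

    TypedSpan(int start, int end) {
        this(start, end, null);
    }

    // Returns a copy of this span carrying the given type.
    TypedSpan withType(String newType) {
        return new TypedSpan(start, end, newType);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TypedSpan)) return false;
        TypedSpan other = (TypedSpan) o;
        return start == other.start && end == other.end
                && Objects.equals(type, other.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, end, type);
    }
}

class EvalSketch {
    // Counts predictions that exactly equal a reference span (offsets AND type).
    static int truePositives(List<TypedSpan> reference, List<TypedSpan> predicted) {
        int count = 0;
        for (TypedSpan p : predicted) {
            if (reference.contains(p)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<TypedSpan> gold = Arrays.asList(new TypedSpan(2, 3, "drug"));
        TypedSpan untyped = new TypedSpan(2, 3);
        System.out.println(truePositives(gold, Arrays.asList(untyped)));
        System.out.println(truePositives(gold, Arrays.asList(untyped.withType("drug"))));
    }
}
```

Under that assumption the offsets can be perfectly right and still count as misses. Re-wrapping each found span with the expected type (here "drug") before evaluation is the kind of hard-coding mentioned above.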

Jim


On 20/10/12 18:30, Jim - FooBar(); wrote:
> It's getting weirder and weirder !!!
> I just thought I'd try calling the .find() of the DictionaryNameFinder 
> and it works just fine with MyDictionary. For example I tried passing 
> these: ("ibuprofen" is the named-entity)
>
>  ["I" "use" "ibuprofen" "."]         // got back a <Span [2..3)>
>  ["I" "do" "use" "ibuprofen" "."]   //got back a <Span [3..4)>
>
> so it seems that my implementation is just fine! it does what it is 
> supposed to do...what is different in the evaluator?
> I'm at a loss! This should work...
>
> Jim
>
>
>
>
> On 20/10/12 17:38, Jim - FooBar(); wrote:
>> On 20/10/12 17:36, Jim - FooBar(); wrote:
>>> that is, I'm passing a string to .contains() 
>>
>> sorry I meant  a StringList object exactly as it expects...the 
>> compiler would complain otherwise anyway!
>>
>> Jim
>


Re: subclassing Dictionary failure...

Posted by "Jim - FooBar();" <ji...@gmail.com>.
It's getting weirder and weirder !!!
I just thought I'd try calling the .find() of the DictionaryNameFinder 
and it works just fine with MyDictionary. For example I tried passing 
these: ("ibuprofen" is the named-entity)

  ["I" "use" "ibuprofen" "."]         // got back a <Span [2..3)>
  ["I" "do" "use" "ibuprofen" "."]   //got back a <Span [3..4)>

so it seems that my implementation is just fine! it does what it is 
supposed to do...what is different in the evaluator?
I'm at a loss! This should work...

Jim




On 20/10/12 17:38, Jim - FooBar(); wrote:
> On 20/10/12 17:36, Jim - FooBar(); wrote:
>> that is, I'm passing a string to .contains() 
>
> sorry I meant  a StringList object exactly as it expects...the 
> compiler would complain otherwise anyway!
>
> Jim


Re: subclassing Dictionary failure...

Posted by "Jim - FooBar();" <ji...@gmail.com>.
On 20/10/12 17:36, Jim - FooBar(); wrote:
> that is, I'm passing a string to .contains() 

sorry I meant  a StringList object exactly as it expects...the compiler 
would complain otherwise anyway!

Jim