Posted to users@opennlp.apache.org by "Jim - FooBar();" <ji...@gmail.com> on 2012/10/20 18:36:23 UTC
subclassing Dictionary failure...
Hello everyone,
as you've probably gathered from the subject line, I'm having issues
subclassing the Dictionary class. In a nutshell, I want to be able to
provide a mapping between names and synonyms which the current
dictionary doesn't seem to support. So my approach is the following:
1. create a class (MyDictionary) that extends Dictionary and overrides
'.contains()'
2. MyDictionary calls the constructor of the superclass with no args
(per 'super();')
3. it also hard-codes 'maxTokenCount' and 'minTokenCount' (10 & 1
respectively) because they are declared private and there is no setter
method for them. Unless one reads entries in from an XML file and
'puts' them in the Dictionary via 'EntryInserter', there is no way to
provide them even if we know the correct values. Without doing this
the name-finder breaks out of the for-loop because this condition
is already true on the very first pass:
// maxTokenCount stays 0 if .put() is never called
if (lengthSearching > d.getMaxTokenCount()) {
    break; // breaks immediately
} else ...
....
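To make the effect concrete, here is a minimal, self-contained sketch of
a dictionary-driven search loop. It is a simplified stand-in and not the
actual DictionaryNameFinder source: the TokenDictionary interface, the
LoopSketch class and its find() and singleEntry() methods are
hypothetical names, and only the guard on getMaxTokenCount() is taken
from the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the two Dictionary methods the finder relies on.
interface TokenDictionary {
    boolean contains(List<String> tokens);
    int getMaxTokenCount();
}

final class LoopSketch {
    // Returns {start, end} pairs (end exclusive) for the longest match at each position.
    static List<int[]> find(String[] tokens, TokenDictionary d) {
        List<int[]> found = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            int matchEnd = -1;
            for (int end = start; end < tokens.length; end++) {
                int lengthSearching = end - start + 1;
                if (lengthSearching > d.getMaxTokenCount()) {
                    break; // with a max token count of 0 this fires before contains() runs
                }
                List<String> candidate = Arrays.asList(tokens).subList(start, end + 1);
                if (d.contains(candidate)) {
                    matchEnd = end + 1; // remember the longest match so far
                }
            }
            if (matchEnd != -1) {
                found.add(new int[] {start, matchEnd});
                start = matchEnd - 1; // continue after the matched tokens
            }
        }
        return found;
    }

    // Builds a one-entry dictionary that reports the given max token count.
    static TokenDictionary singleEntry(String entry, int maxTokenCount) {
        return new TokenDictionary() {
            @Override
            public boolean contains(List<String> tokens) {
                return tokens.size() == 1 && tokens.get(0).equalsIgnoreCase(entry);
            }

            @Override
            public int getMaxTokenCount() {
                return maxTokenCount;
            }
        };
    }
}
```

In this sketch, a dictionary that reports a max token count of 0 never
has contains() called on it, so find() returns an empty list no matter
what the dictionary holds. Reporting any value of at least 1 lets the
single-token entry be found.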
Anyway, I know it is not pretty to hard-code it like that but I couldn't
think of anything else... I have verified that MyDictionary works as
expected on its own. That is, I'm passing a string to .contains() and it
correctly says true / false. I have also verified that when MyDictionary
is passed to the DictionaryNameFinder it spends some time searching,
which implies that the for-loop continues as usual. Before hard-coding
'maxTokenCount' and 'minTokenCount' the call to .find() would finish
instantly because it was never reaching the call to .contains()...
My problem is that despite everything seemingly working as expected I
get 0 in all statistics!!! I can verify that when I use a proper XML
file with the regular Dictionary I get decent statistics, so adding
synonyms to the search space should only improve the statistics and
decrease performance. I do see a decrease in performance but certainly
no improvement in the statistics! Very weird stuff...
Would you say that I need to follow any extra steps for subclassing the
Dictionary? It has many methods but only .contains() and
getMaxTokenCount() are used in the actual .find() method of the NameFinder.
Any clues/pointers? The odd thing is that MyDictionary on its own does
the right thing (regardless of what happens inside the .contains() method).
Help please...
thanks in advance
Jim
Re: subclassing Dictionary failure...
Posted by "Jim foo.bar" <ji...@gmail.com>.
On 25/10/12 14:43, Jörn Kottmann wrote:
> The Dictionary class was not designed to be subclassed.
> If you want to implement a custom dictionary, you can use our
> event-style interface to parse a dictionary-like file.
I subclassed it just fine... The only real problem I faced was the fact
that unless the entries are inserted into the Dictionary one-by-one (per
EntryInserter), maxTokenCount & minTokenCount are not initialised
properly. So, for example, I've got a giant HashMap with official terms
as keys and lists of synonyms as the values for the respective keys.
The data is already in there - I shouldn't need to insert anything into
the Dictionary, just override .contains()... However, with
max/minTokenCount being part of the Dictionary's private state, there
is no other option but to calculate those separately and 'set' them
before .contains() is called for the first time. Anyway, what I'm trying
to say is that it is not impossible, it's just not as pretty as I
expected! It would have been nice to have setters for these 2 fields -
they are not declared final or anything like that! I've got it working
just fine - the problem is that my statistics are worse when including
synonyms because most synonyms are acronyms and acronyms have not been
annotated in my gold-corpus! :-(
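As an illustration of calculating the two counts separately, here is a
minimal sketch that derives both from a term-to-synonyms map up front
instead of hard-coding them. It is plain Java and deliberately not tied
to the OpenNLP API: SynonymLookup and its methods are hypothetical
names, and whitespace splitting with case-insensitive matching is an
assumption standing in for real tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: wraps a term -> synonyms map and precomputes token counts.
final class SynonymLookup {
    private final Set<List<String>> entries = new HashSet<>();
    private int minTokenCount = Integer.MAX_VALUE;
    private int maxTokenCount = 0;

    SynonymLookup(Map<String, List<String>> termToSynonyms) {
        for (Map.Entry<String, List<String>> e : termToSynonyms.entrySet()) {
            add(e.getKey());
            for (String synonym : e.getValue()) {
                add(synonym);
            }
        }
        if (entries.isEmpty()) {
            minTokenCount = 0; // nothing to search for
        }
    }

    // Assumption: names are split on whitespace and lower-cased before storing.
    private void add(String name) {
        String trimmed = name.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        List<String> tokens = Arrays.asList(trimmed.toLowerCase(Locale.ROOT).split("\\s+"));
        entries.add(tokens);
        minTokenCount = Math.min(minTokenCount, tokens.size());
        maxTokenCount = Math.max(maxTokenCount, tokens.size());
    }

    // True if the token sequence equals an official term or one of its synonyms.
    boolean contains(List<String> tokens) {
        List<String> key = new ArrayList<>();
        for (String token : tokens) {
            key.add(token.toLowerCase(Locale.ROOT));
        }
        return entries.contains(key);
    }

    int getMinTokenCount() {
        return minTokenCount;
    }

    int getMaxTokenCount() {
        return maxTokenCount;
    }
}
```

A subclass along the lines described above could delegate to such a
helper, so the counts always reflect the data actually held in the map
rather than guessed constants such as 10 and 1.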
Thanks for your time, Jörn
Jim
Re: subclassing Dictionary failure...
Posted by Jörn Kottmann <ko...@gmail.com>.
Hi Jim
The Dictionary class was not designed to be subclassed.
If you want to implement a custom dictionary, you can use our
event-style interface to parse a dictionary-like file.
Have a look at the code in the Dictionary class which loads and
serializes it.
HTH,
Jörn
Re: subclassing Dictionary failure...
Posted by "Jim - FooBar();" <ji...@gmail.com>.
Thanks James, I'll look into it... At the moment, however, I'm working
with DrugBank, which does provide names + synonyms...
Jim
Re: subclassing Dictionary failure...
Posted by James Kosin <ja...@gmail.com>.
On 10/20/2012 2:10 PM, Jim - FooBar(); wrote:
> Never-mind I found the problem...It has nothing to do with the finding
> process...it is simply because the tag produced during evaluation
> doesn't include the type! It is predicting <START> ... <END> instead
> of <START:drug> ... <END>.
>
> I could hardcode that as well for the purposes of getting some results...
>
> Jim
>
>
Jim,
Have you tried the jwnd and the WordNet dictionary? It has limitless
tagging capability for finding synonyms and other word relationships.
http://wordnet.princeton.edu/
It provides more relationships and may be something to look into for your
work.
James
Re: subclassing Dictionary failure...
Posted by "Jim - FooBar();" <ji...@gmail.com>.
Never mind, I found the problem... It has nothing to do with the finding
process. It is simply that the tag produced during evaluation
doesn't include the type! It is predicting <START> ... <END> instead of
<START:drug> ... <END>.
I could hardcode that as well for the purposes of getting some results...
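Why a missing type can push every statistic to zero can be sketched as
follows. This is a self-contained illustration and not the OpenNLP
evaluator: TypedSpan and SpanScore are hypothetical stand-ins, and the
assumption is that a predicted span only counts as correct when its
start, end and type all equal those of a gold span.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for a span with an optional entity type.
final class TypedSpan {
    final int start;
    final int end;
    final String type; // null models an untyped <START> ... <END> prediction

    TypedSpan(int start, int end, String type) {
        this.start = start;
        this.end = end;
        this.type = type;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TypedSpan)) {
            return false;
        }
        TypedSpan s = (TypedSpan) o;
        // Assumption: the type takes part in equality, not just the offsets.
        return start == s.start && end == s.end && Objects.equals(type, s.type);
    }

    @Override
    public int hashCode() {
        return Objects.hash(start, end, type);
    }
}

final class SpanScore {
    // Counts predictions that exactly equal some gold span (type included).
    static int truePositives(List<TypedSpan> gold, List<TypedSpan> predicted) {
        int tp = 0;
        for (TypedSpan p : predicted) {
            if (gold.contains(p)) {
                tp++;
            }
        }
        return tp;
    }

    static double precision(List<TypedSpan> gold, List<TypedSpan> predicted) {
        return predicted.isEmpty() ? 0.0 : (double) truePositives(gold, predicted) / predicted.size();
    }

    static double recall(List<TypedSpan> gold, List<TypedSpan> predicted) {
        return gold.isEmpty() ? 0.0 : (double) truePositives(gold, predicted) / gold.size();
    }
}
```

Under that assumption, a prediction with the right offsets but no type
never equals a gold span typed "drug", so precision and recall both come
out as 0 even though the finder located the right tokens.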
Jim
Re: subclassing Dictionary failure...
Posted by "Jim - FooBar();" <ji...@gmail.com>.
It's getting weirder and weirder!!!
I just thought I'd try calling the .find() of the DictionaryNameFinder
directly and it works just fine with MyDictionary. For example, I tried
passing these ("ibuprofen" is the named entity):
["I" "use" "ibuprofen" "."] // got back a <Span [2..3)>
["I" "do" "use" "ibuprofen" "."] // got back a <Span [3..4)>
So it seems that my implementation is just fine! It does what it is
supposed to do... What is different in the evaluator?
I'm at a loss! This should work...
Jim
Re: subclassing Dictionary failure...
Posted by "Jim - FooBar();" <ji...@gmail.com>.
On 20/10/12 17:36, Jim - FooBar(); wrote:
> that is, I'm passing a string to .contains()
Sorry, I meant a StringList object, exactly as it expects... The compiler
would complain otherwise anyway!
Jim