Posted to dev@ctakes.apache.org by "Masanz, James J." <Ma...@mayo.edu> on 2014/04/21 17:57:07 UTC

new dictionary lookup {was RE: lvg entries]

Sean,

Will the new dictionary lookup use the canonicalForm? If not, perhaps you can remove LVG from at least some of the pipelines (drug-ner does not include the dependency parser)

-----Original Message-----
From: Finan, Sean [mailto:Sean.Finan@childrens.harvard.edu] 
Sent: Thursday, April 17, 2014 12:52 PM
To: dev@ctakes.apache.org
Subject: RE: lvg entries

Those variants are not used by the dictionary lookup.  I did look at them to see if it was worthwhile for the new dictionary, but they are all over the place so I passed.  
________________________________________
From: Miller, Timothy [Timothy.Miller@childrens.harvard.edu]
Sent: Thursday, April 17, 2014 1:25 PM
To: dev@ctakes.apache.org
Subject: Re: lvg entries

Pei and I had a similar discussion in person -- mapping from lexical
variants to a stem might be useful. Pei also mentioned that one intended
use might have been searching the dictionary with lexical variants, but
I don't think that is done. Looking at the precision of the variants, I
think it's highly unlikely the speed tradeoff would be worth any
improvements in recall.

Finally, at least in Eclipse, a search for references to the method
that retrieves the lemma entries turns up nothing.

Tim


On 04/17/2014 01:14 PM, Dligach, Dmitriy wrote:
> I don't know of any applications within cTAKES that make use of this... The reverse (mapping from these "variants" to the normal form) may be useful though.
>
> Dima
>
>
>
>
> On Apr 17, 2014, at 11:50, Miller, Timothy <Ti...@childrens.harvard.edu> wrote:
>
>> Sure, just as an example, I gave it a note with about 1000 words. It
>> generates 11500 NonEmptyFSList elements (each is basically one lexical
>> variant).
>>
>> For the word "symptomatic", these are the first 10 of 20 lexical variants:
>> Symptomaticer/JJ
>> Symptomaticer/RB
>> Symptomaticed/VB
>> Symptomaticcing/VB
>> Symptomatics/VB
>> Symptomatics/NN
>> Symptomaticked/VB
>> Symptomatic/VB
>> Symptomatic/JJ
>> Symptomatic/RB
>>
>> Tim
>>
>>
>> On 04/17/2014 12:31 PM, Dligach, Dmitriy wrote:
>>> Tim, this is a very interesting observation. Could you please send a few examples of what LVG generates? Both sensical and non :)
>>>
>>> Dima
>>>
>>>
>>>
>>>
>>> On Apr 17, 2014, at 11:28, Miller, Timothy <Ti...@childrens.harvard.edu> wrote:
>>>
>>>> The LVG annotator creates an enormous number of "lemmas" for every
>>>> WordToken in the CAS, and I'm wondering what the original purpose was? I
>>>> think this is probably a minor bottleneck for speed but mostly a pretty
>>>> big space hog (at least 50% of the space of xmi files in my tests).
>>>>
>>>> As of right now I'm not sure if any downstream components are using
>>>> these lemmas, and on a manual inspection the precision seems to be
>>>> pretty abysmal (meaning most of them are nonsensical as lexical
>>>> variants), so as I said, just wondering if we can revisit why cTAKES
>>>> generates so many and whether that component can be optimized.
>>>>
>>>> Thanks
>>>> Tim
>>>>
>
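
The reverse mapping suggested in the quoted thread (from a lexical variant back to its normal form) could be sketched roughly as below. This is only an illustration, not LVG or cTAKES code: the class name VariantToBaseIndex and the method buildIndex are hypothetical, and it assumes each variant arrives as a "surface/POS" string like the ones Tim listed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class VariantToBaseIndex {

    /**
     * Builds a lookup from lowercased variant surface form to base word.
     * Input maps each base word to its "surface/POS" variants (assumed format).
     */
    static Map<String, String> buildIndex(Map<String, List<String>> variantsByBase) {
        Map<String, String> index = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : variantsByBase.entrySet()) {
            for (String tagged : entry.getValue()) {
                // Strip the trailing "/POS" tag if there is one.
                int slash = tagged.lastIndexOf('/');
                String surface = slash >= 0 ? tagged.substring(0, slash) : tagged;
                // Keep the first base word seen for a given surface form.
                index.putIfAbsent(surface.toLowerCase(Locale.ROOT), entry.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Variants taken from the "symptomatic" example in this thread.
        Map<String, List<String>> variants = Collections.singletonMap(
                "symptomatic",
                Arrays.asList("Symptomaticer/JJ", "Symptomatics/NN", "Symptomatic/JJ"));
        Map<String, String> index = buildIndex(variants);
        System.out.println(index.get("symptomatics"));
    }
}
```

Storing only the surface-to-base direction keeps one small map entry per distinct variant, instead of a list of variants attached to every token.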


RE: new dictionary lookup {was RE: lvg entries]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi James,

>> Will the new dictionary lookup use the canonicalForm?

It does use WordToken.getCanonicalForm().
Usually this seems to be empty, but whenever it is present it will be used.
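
A minimal sketch of that fallback, not the actual lookup code: Token is a hypothetical stand-in for the cTAKES WordToken (only getCanonicalForm() is named in this thread), lookupKey is an invented helper name, and falling back to the token's own text when the canonical form is empty is an assumption.

```java
public class CanonicalFormFallback {

    /** Hypothetical stand-in for a cTAKES WordToken; not the real class. */
    static final class Token {
        private final String coveredText;
        private final String canonicalForm; // may be null or empty

        Token(String coveredText, String canonicalForm) {
            this.coveredText = coveredText;
            this.canonicalForm = canonicalForm;
        }

        String getCoveredText() {
            return coveredText;
        }

        String getCanonicalForm() {
            return canonicalForm;
        }
    }

    /** Returns the canonical form when present, otherwise the token text (assumed fallback). */
    static String lookupKey(Token token) {
        String canonical = token.getCanonicalForm();
        if (canonical != null && !canonical.isEmpty()) {
            return canonical;
        }
        return token.getCoveredText();
    }

    public static void main(String[] args) {
        // Canonical form present: it is used as the lookup key.
        System.out.println(lookupKey(new Token("symptoms", "symptom")));
        // Canonical form empty: the token text is used instead.
        System.out.println(lookupKey(new Token("symptomatic", "")));
    }
}
```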


-----Original Message-----
From: andy mcmurry [mailto:mcmurry.andy@gmail.com] 
Sent: Tuesday, April 22, 2014 4:23 AM
To: dev@ctakes.apache.org
Subject: Re: new dictionary lookup {was RE: lvg entries]

Highly Relevant

*DNorm: disease name normalization*
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/

"Disease names are often created by combining roots and affixes from Greek or Latin (e.g. ‘hemochromatosis’)" ....







Re: new dictionary lookup {was RE: lvg entries]

Posted by andy mcmurry <mc...@gmail.com>.
Highly Relevant

*DNorm: disease name normalization*
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3810844/

"Disease names are often created by combining roots and affixes from Greek
or Latin (e.g. ‘hemochromatosis’)" ....





