Posted to dev@ctakes.apache.org by britt fitch <br...@wiredinformatics.com> on 2015/07/14 21:56:52 UTC

periods and the interaction with PTB & Fast Dict Lookup.

Another question/topic likely for Sean & Tim. Happy to get others’ feedback as well.

I am trying to identify gene-related information.

It appears that the PTB tokenization logic in places like the tokenizer & dictionary building will split a string into multiple tokens if it is not a number and contains a period.

For example, given “22q11.2 deletion syndrome”:

PTB tokenizer: [22q11, .2, deletion, syndrome]
POS for the above term: [CD, CD, NN, NN]
Chunks for the above term: [B-NP, I-NP, I-NP, I-NP]

The same string is split differently, as [22q11, ., 2, deletion, syndrome], in the new dictionary module (RareWordTermMapCreator.getTokens).
When the _rareWordTermMap is created, it uses the first token as the key: 22q11=[org.apache.ctakes.dictionary.lookup2.term.RareWordTerm@37917c4d]
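
To make the two splits concrete, here is a minimal Python sketch. It is not the cTAKES code: the two regular expressions are toy rules that I am assuming only so that they reproduce the two outputs reported above, and the names PTB_LIKE, DICT_LIKE and first_divergence are hypothetical.

```python
import re

# Toy rule A (assumption): a period immediately followed by digits stays
# attached to those digits, which reproduces the PTB tokenizer output above.
PTB_LIKE = re.compile(r"\.\d+|[A-Za-z0-9]+|[^\sA-Za-z0-9]")

# Toy rule B (assumption): every non-alphanumeric character, including a
# period, becomes its own token, which reproduces the dictionary-side split.
DICT_LIKE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def ptb_like_tokens(text):
    """Tokenize with toy rule A."""
    return PTB_LIKE.findall(text)


def dict_like_tokens(text):
    """Tokenize with toy rule B."""
    return DICT_LIKE.findall(text)


def first_divergence(a, b):
    """Index of the first position where two token lists differ, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


term = "22q11.2 deletion syndrome"
ptb = ptb_like_tokens(term)
dic = dict_like_tokens(term)
print(ptb)                         # tokens as the tokenizer would see them
print(dic)                         # tokens as the dictionary builder would see them
print(first_divergence(ptb, dic))  # position where the two views stop agreeing
```

Both views agree on the first token, so the map key is the same either way; they disagree from the second token onward, which is where a multi-token comparison would fail.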

The period-split difference above (period alone vs. period + number) might be irrelevant here, because for the input “22q11.2 deletion syndrome” the lookup indices are [2,3].
The new lookup will ignore the incoming token “22q11” because it is tagged CD, and “.2” because it is a number.
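
A small Python sketch of why the term is never reached. This is an illustration under assumptions, not the lookup module's code: lookup_indices and find_term are hypothetical helpers, and for simplicity both exclusions (CD and "is a number") are collapsed into a single POS check.

```python
def lookup_indices(pos_tags, excluded_pos):
    """Indices of tokens whose POS tag is allowed to start a lookup."""
    return [i for i, pos in enumerate(pos_tags) if pos not in excluded_pos]


def find_term(tokens, indices, term_tokens):
    """Return the start index of term_tokens in tokens, or None.

    A term is only tried when its first token (the map key described
    above) sits at one of the lookup indices.
    """
    key = term_tokens[0]
    n = len(term_tokens)
    for i in indices:
        if tokens[i] == key and tokens[i:i + n] == term_tokens:
            return i
    return None


tokens = ["22q11", ".2", "deletion", "syndrome"]
pos_tags = ["CD", "CD", "NN", "NN"]
term = ["22q11", ".2", "deletion", "syndrome"]

default = lookup_indices(pos_tags, excluded_pos={"CD"})  # CD excluded
with_cd = lookup_indices(pos_tags, excluded_pos=set())   # CD allowed

print(default, find_term(tokens, default, term))
print(with_cd, find_term(tokens, with_cd, term))
```

With CD excluded, the only lookup indices are those of “deletion” and “syndrome”, so the key “22q11” is never consulted, regardless of how the dictionary side splits the period.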

It looks like this concept cannot be identified unless CD is allowed as a lookup token POS.
Even if this is allowed though, in the case of gene locations I think the PTB rules might not be the best fit.

Are there any thoughts/experiences regarding how to handle gene location mentions like this?
Should the Fast Dict tokenization logic match the PTB tokenizer logic to produce the same components?

Let me know if I have misread any of these points. Since these items would likely cause large changes, I am looking to get some feedback before moving forward.

Cheers,

Britt


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com


Re: periods and the interaction with PTB & Fast Dict Lookup.

Posted by britt fitch <br...@wiredinformatics.com>.
Hi Sean, do you want a ticket for the PTB update?

Cheers,

Britt




> On Jul 15, 2015, at 9:07 AM, britt fitch <br...@wiredinformatics.com> wrote:
> 
> Thanks Sean.
> 
> The other part of the concern is whether it's reasonable/feasible to alter tokenization rules for things like gene locations. I can work around this in a few ways, but if this might come up in other cases it could be worth looking at a blanket change. Sadly I don’t have another example off the top of my head; maybe organism names? From a few queries for UMLS terms containing periods, the majority seem to be things you really would want to split on. Perhaps genes are just an edge case.
> 
> I was looking at gene locations overall, not any particular gene or disorder grouping. The term I mentioned was just meant to be an example.


Re: periods and the interaction with PTB & Fast Dict Lookup.

Posted by britt fitch <br...@wiredinformatics.com>.
Thanks Sean.

The other part of the concern is whether it's reasonable/feasible to alter tokenization rules for things like gene locations. I can work around this in a few ways, but if this might come up in other cases it could be worth looking at a blanket change. Sadly I don’t have another example off the top of my head; maybe organism names? From a few queries for UMLS terms containing periods, the majority seem to be things you really would want to split on. Perhaps genes are just an edge case.

I was looking at gene locations overall, not any particular gene or disorder grouping. The term I mentioned was just meant to be an example.
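
One possible workaround, sketched in Python purely as an illustration: try a pattern for cytogenetic band notation before the general rules, so that a location such as 22q11.2 survives as a single token while other periods are still split. The pattern (chromosome 1-22, X or Y, then arm p or q, then band digits and an optional sub-band) and the names BAND, TOKEN and tokens are my own assumptions; nothing here is existing cTAKES or PTB behavior.

```python
import re

# Assumed pattern for a cytogenetic band: chromosome, arm, band, optional sub-band.
BAND = r"(?:[1-9]|1[0-9]|2[0-2]|[XY])[pq][0-9]+(?:\.[0-9]+)?"

# The band alternative is tried first; the remaining alternatives are toy
# stand-ins for the general rules (".2"-style fragments, words, punctuation).
TOKEN = re.compile(
    BAND + r"(?![A-Za-z0-9])"   # a band must not run into more letters or digits
    r"|\.\d+"                   # period glued to following digits
    r"|[A-Za-z0-9]+"            # plain alphanumeric run
    r"|[^\sA-Za-z0-9]"          # any other single non-space character
)


def tokens(text):
    """Tokenize text, keeping band-like gene locations whole."""
    return TOKEN.findall(text)


print(tokens("22q11.2 deletion syndrome"))
print(tokens("Xp21.1 and 7q11.23"))
print(tokens("deletion syndrome."))
```

Whatever rule is chosen, it would have to be applied identically when building the dictionary and when tokenizing documents, otherwise the two sides disagree again.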



> On Jul 15, 2015, at 8:57 AM, Finan, Sean <Se...@childrens.harvard.edu> wrote:
> 
> Hi Britt,
> 
> The dictionary should be using PTB tokenization, but I obviously missed a rule and separated the . from the following 2 in the dictionary.
> 
> I will double-check everything.
> 
> Sean
> 
> p.s. if you don’t mind my asking, are you looking into all connective tissue disorders or just Shprintzen?


RE: periods and the interaction with PTB & Fast Dict Lookup.

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Britt,

The dictionary should be using PTB tokenization, but I obviously missed a rule and separated the . from the following 2 in the dictionary.

I will double-check everything.

Sean

p.s. if you don’t mind my asking, are you looking into all connective tissue disorders or just Shprintzen?
