Posted to dev@ctakes.apache.org by britt fitch <br...@wiredinformatics.com> on 2015/07/08 20:22:19 UTC

dictionary-look-fast fails to handle alternative CUIs

This is largely directed to Sean but open to other feedback as well.

The current fast lookup using a BSV parses the first field as “C” and up to 7 numerals, padding with “0” as needed to reach that length when applicable [see CuiCodeUtil.getCuiCode(String)].

The CUI string is then substring’d from 1 to len and parsed as a Long.

This is producing issues with other related, but separate, ontologies (MedGen) where the bulk of concepts use UMLS CUIs but some additional concepts were created by the NCBI where no CUI previously existed.
These MedGen-specific concepts are created with a prefix “CN” + 6 numerals, resulting in “N123456” failing to produce a Long.
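
To make the failure concrete, here is a minimal Java sketch of the behaviour described above. It is not the actual CuiCodeUtil source: the names LegacyCuiParseSketch and parseLegacy are made up for illustration, and the sketch only assumes that the leading “C” is dropped and the remainder is handed to Long.parseLong.

```java
final class LegacyCuiParseSketch {

    // Hypothetical stand-in for the described behaviour; not the real CuiCodeUtil.
    static long parseLegacy(String cui) {
        // Drop the first character (the "C") and parse the rest as a number.
        return Long.parseLong(cui.substring(1));
    }

    public static void main(String[] args) {
        // A standard UMLS-style CUI parses fine.
        System.out.println(parseLegacy("C0000123"));

        // A MedGen-style "CN" code leaves "N123456", which is not a number.
        try {
            parseLegacy("CN123456");
        } catch (NumberFormatException e) {
            System.out.println("CN123456 failed: " + e.getMessage());
        }
    }
}
```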

I wanted Sean’s thoughts on this, and some feedback on whether others are running into this issue and whether the community wants a solution that supports CUI formats beyond the standard C + 7 numerals.

I’m happy to make these edits and check them in, whether that means updating the CuiCodeUtil class or creating an entirely new BSVConceptFactory, if that’s what makes the most sense.

Thoughts?


Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

RE: dictionary-look-fast fails to handle alternative CUIs

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hopefully the speed difference will be negligible.  The conversion only happens at two points: 1. when internally storing a custom dictionary, and 2. when storing discovered CUIs in the CAS.  Since custom dictionaries are only read once, #1 shouldn’t have any real impact.  #2 should require one execution per unique CUI in the document, so at 100 CUIs per doc * 10,000,000 docs it will probably add up to a few seconds – minor in the grand scheme of things.  However, there may be a situation that I’m missing.
There shouldn’t be any impact upon accuracy as the adjustments occur completely outside the lookup loop.
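
As a rough back-of-the-envelope check on that estimate, the arithmetic can be written out as below. The 100 CUIs per document and 10,000,000 documents come from the message; the 3 ns per conversion is purely an assumed placeholder, not a measurement, and the class and method names are hypothetical.

```java
final class ConversionCostEstimate {

    // Total seconds spent converting, given an assumed cost per conversion in nanoseconds.
    static double estimateSeconds(long cuisPerDoc, long docs, double nanosPerConversion) {
        long conversions = cuisPerDoc * docs;          // 100 * 10,000,000 = 1,000,000,000
        return conversions * nanosPerConversion / 1e9; // nanoseconds -> seconds
    }

    public static void main(String[] args) {
        // Assumption: roughly 3 ns per conversion (placeholder value only).
        System.out.println(estimateSeconds(100, 10_000_000L, 3.0) + " seconds");
    }
}
```

With a different assumed per-conversion cost the total scales linearly, so even a tenfold slower conversion would still amount to well under a minute over the whole corpus.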

Re: dictionary-look-fast fails to handle alternative CUIs

Posted by britt fitch <br...@wiredinformatics.com>.

No issues so far.

I think you are already handling the one edge case I could come up with, which was the numeral portion of the code starting with a 0 and that 0 being lost during the divide step, but it looks like you are inserting leading zeros into the numeral portion if needed with digitCount.
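
For readers following along, the leading-zero edge case can be sketched in a few lines of Java. This is only an illustration of the idea, not the code that was checked in: the digitCount name is taken from the message, while LeadingZeroSketch and restore are hypothetical.

```java
import java.util.Locale;

final class LeadingZeroSketch {

    // Rebuilds the text form of a code from its prefix, expected digit count and numeric value.
    static String restore(String prefix, int digitCount, long number) {
        // "%0Nd" left-pads the number with zeros up to N digits.
        return prefix + String.format(Locale.ROOT, "%0" + digitCount + "d", number);
    }

    public static void main(String[] args) {
        // "CN001234" parsed as a number loses its two leading zeros (1234);
        // padding back to 6 digits recovers the original text.
        System.out.println(restore("CN", 6, 1234L));
    }
}
```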

I’ll definitely report back if I notice any performance change given the new logic though.


RE: dictionary-look-fast fails to handle alternative CUIs

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Great, thanks.   Any issues or concerns?  Possible enhancements?  Like the source, I’m open to change …

Re: dictionary-look-fast fails to handle alternative CUIs

Posted by britt fitch <br...@wiredinformatics.com>.

Thanks, just finished testing and closed the ticket.



RE: dictionary-look-fast fails to handle alternative CUIs

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Checked in, please give it a test and close the ticket if it fits your purposes.

Re: dictionary-look-fast fails to handle alternative CUIs

Posted by britt fitch <br...@wiredinformatics.com>.

Linking ticket here for completeness: https://issues.apache.org/jira/browse/CTAKES-368


Re: dictionary-look-fast fails to handle alternative CUIs

Posted by britt fitch <br...@wiredinformatics.com>.

Absolutely. I’ll create it now.

Thanks!



RE: dictionary-look-fast fails to handle alternative CUIs

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Britt,

I’ve got some code and tests to check in.  Would you like to write the jira item?

Re: dictionary-look-fast fails to handle alternative CUIs

Posted by britt fitch <br...@wiredinformatics.com>.

I don’t think that is too much of a constraint, at least initially, to have all CUI values a consistent length for a given prefix.

Thanks Sean, let me know if there is any part of this you’d like a hand with.

Cheers,

Britt


RE: dictionary-look-fast fails to handle alternative CUIs

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Britt,

You’ve got it exactly.

I actually started working on this right before a meeting right before I left work right before I went to the store … but I’m now back to it and I’m going to move forward with the tiny bit that I’ve got.  I don’t think that it will take too long …

One reason that I like the “pair” idea is that something like “CN123456” won’t get converted to “CN0123456” by assuming that it has a seven-digit numerical base.  Likewise somebody could make a tiny dictionary with “SEAN01, SEAN02, SEAN03…” through 99.  Then their output would still be formatted as “SEAN01 .. SEAN99”.  They couldn’t mix in “SEAN1, SEAN2 …” though.  Is that too much of a constraint?  Hmmm.  Well, I’m going to push forward with this idea.
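
The consistent-length constraint is easier to see in code. The following Java sketch is an assumption-laden illustration rather than the implementation that was later checked in: it remembers one digit count per prefix (the first one seen), and PrefixDigitCountSketch, register and format are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class PrefixDigitCountSketch {

    // prefix -> digit count first seen with that prefix (hypothetical registry)
    private final Map<String, Integer> digitCounts = new LinkedHashMap<>();

    // Splits e.g. "SEAN01" into "SEAN" + "01", records the digit count, returns the number.
    // Assumes the code has a non-numeric prefix followed by at least one digit.
    long register(String code) {
        int split = 0;
        while (split < code.length() && !Character.isDigit(code.charAt(split))) {
            split++;
        }
        String prefix = code.substring(0, split);
        String digits = code.substring(split);
        digitCounts.putIfAbsent(prefix, digits.length());
        return Long.parseLong(digits);
    }

    // Rebuilds the text form; only valid for prefixes that were registered earlier.
    String format(String prefix, long number) {
        int digitCount = digitCounts.get(prefix);
        return prefix + String.format(Locale.ROOT, "%0" + digitCount + "d", number);
    }

    public static void main(String[] args) {
        PrefixDigitCountSketch sketch = new PrefixDigitCountSketch();
        long cn = sketch.register("CN123456");
        System.out.println(sketch.format("CN", cn));   // stays 6 digits, no extra zero
        long a = sketch.register("SEAN01");
        long b = sketch.register("SEAN1");             // same number, different width
        System.out.println(sketch.format("SEAN", a) + " " + sketch.format("SEAN", b));
    }
}
```

Because only one width is kept per prefix, “SEAN1” comes back out as “SEAN01”, which is exactly the mixing limitation described above.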

I’ll check in whatever I get done tonight.

Cheers,
Sean


Re: dictionary-look-fast fails to handle alternative CUIs

Posted by britt fitch <br...@wiredinformatics.com>.

Thanks for the details Sean. I had assumed the conversion to Long was related to sort/search efficiency but that makes sense.

I had been thinking of something similar: parsing out the non-numerals and converting them to 2-digit numeral values, i.e. “a” = 01, “z” = 26. Ultimately CN123456 would become 0314123456, but I don’t think it’s sophisticated enough to avoid issues with leading zeros. We could prepend a 9 to it to avoid losing digits and use something like:

if(length>8 && begins with 9)
	discard 9
	while (length > 8)
		convert first 2 numbers to a letter
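
One possible reading of that pseudocode, as a hedged Java sketch: it assumes ASCII letters only, that every decoded code is exactly 8 characters long, and that “length” counts letters already converted plus digits still remaining. LetterDigitCodecSketch, encode and decode are hypothetical names.

```java
final class LetterDigitCodecSketch {

    // Encodes e.g. "CN123456" as 9 + "03" + "14" + "123456" = 90314123456.
    static long encode(String code) {
        StringBuilder sb = new StringBuilder("9"); // sentinel so leading zeros survive
        for (char c : code.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') {
                int value = upper - 'A' + 1;       // A = 1 .. Z = 26
                if (value < 10) {
                    sb.append('0');                // keep every letter at two digits
                }
                sb.append(value);
            } else {
                sb.append(c);
            }
        }
        return Long.parseLong(sb.toString());
    }

    // Reverses encode(), assuming the original code was 8 characters long.
    static String decode(long value) {
        String text = Long.toString(value);
        if (text.length() <= 8 || text.charAt(0) != '9') {
            return text;                           // not produced by encode()
        }
        String digits = text.substring(1);         // discard the 9
        StringBuilder letters = new StringBuilder();
        while (letters.length() + digits.length() > 8) {
            int v = Integer.parseInt(digits.substring(0, 2));
            letters.append((char) ('A' + v - 1));  // first 2 digits -> one letter
            digits = digits.substring(2);
        }
        return letters + digits;
    }

    public static void main(String[] args) {
        long encoded = encode("CN123456");
        System.out.println(encoded + " -> " + decode(encoded));
    }
}
```

The fixed length of 8 is what tells the decoder when to stop turning digit pairs back into letters, so codes of other lengths would need extra bookkeeping.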

I think your suggestion sounds good to me. To run the example through:

“NLM300” gets parsed to “NLM” + “300”
Store Pair<Integer,String>(3, NLM) at Pair[0]
Produce a Long of 0 * 10000000 + 300 = 300L (Pair index * 10000000 + numeral portion)
Backtrack to the Pair index for the actual “CUI”: floor(300/10000000) = 0
300L - 0 * 10000000 = 300L
Pair[0] = NLM
CUI = NLM + 300
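
The same walk-through as a small Java sketch. It follows the steps above under stated assumptions: the multiplier of 10,000,000 is taken from the example, the numeral portion must therefore stay below that value, and the two parallel lists stand in for the Pair array. PrefixIndexCodecSketch and its methods are hypothetical names, not the checked-in code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrefixIndexCodecSketch {

    private static final long MULTIPLIER = 10_000_000L; // from the worked example

    // Parallel lists standing in for Pair<Integer,String>[]: digit count and prefix per index.
    private final List<String> prefixes = new ArrayList<>();
    private final List<Integer> digitCounts = new ArrayList<>();

    // "NLM300" -> index of "NLM" * 10,000,000 + 300
    long encode(String code) {
        int split = 0;
        while (split < code.length() && !Character.isDigit(code.charAt(split))) {
            split++;
        }
        String prefix = code.substring(0, split);
        String digits = code.substring(split);
        int index = prefixes.indexOf(prefix);
        if (index < 0) {                 // first time this prefix is seen
            prefixes.add(prefix);
            digitCounts.add(digits.length());
            index = prefixes.size() - 1;
        }
        return index * MULTIPLIER + Long.parseLong(digits);
    }

    // Backtracks to the prefix index, then restores the zero-padded numeral portion.
    String decode(long value) {
        int index = (int) (value / MULTIPLIER);   // floor(value / 10,000,000)
        long number = value - index * MULTIPLIER; // numeral portion
        String pattern = "%0" + digitCounts.get(index) + "d";
        return prefixes.get(index) + String.format(Locale.ROOT, pattern, number);
    }

    public static void main(String[] args) {
        PrefixIndexCodecSketch codec = new PrefixIndexCodecSketch();
        long nlm = codec.encode("NLM300");
        long cn = codec.encode("CN123456");
        System.out.println(nlm + " -> " + codec.decode(nlm));
        System.out.println(cn + " -> " + codec.decode(cn));
    }
}
```

Keeping the digit count alongside the prefix is what lets decode restore leading zeros, which a bare String[] of prefixes could not do on its own.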

In that case, do we need to store it as a Pair at all or is just storing the prefix in a String[] sufficient?

I’m happy to start working on this unless you have a preference for splitting it out into multiple tasks.



Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
Britt.Fitch@wiredinformatics.com

> On Jul 8, 2015, at 2:54 PM, Finan, Sean <Se...@childrens.harvard.edu> wrote:
> 
> By the way, in case you are wondering why it does this … the umls database that we use has roughly half a million cuis.  Storing cuis in the various tables as longs takes up a lot less space than storing them as 8 character strings.

RE: dictionary-look-fast fails to handle alternative CUIs

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

By the way, in case you are wondering why it does this … the umls database that we use has roughly half a million cuis.  Storing cuis in the various tables as longs takes up a lot less space than storing them as 8 character strings.

RE: dictionary-look-fast fails to handle alternative CUIs

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Britt,

This did come up briefly wrt NCI custom cuis, but there was no urgency so it got shoved onto a back burner.  I’m glad that you brought it up again as it is something that I’ve been wanting to enhance.

My thought at the moment is to change CuiCodeUtil (…lookup2.util.) from a non-instantiable utility class to a singleton.  In the singleton we could keep an array of String prefixes.  At each conversion of a custom cui String to long, store each unique String prefix in the array and add [arrayIndex*10000000] to the long version of the cui.  That should allow for more than enough custom cui prefixes.  Upon conversion back to the String version, grab the prefix according to long/10000000 and prepend it accordingly.  Of course, this would work just fine for single-character prefixes with a 7 digit cui.  Multiple-character prefixes might require an array of Integer,prefix pairs, where each integer is the number of digits in the cui:  NLM003 ->  3, “NLM”.

I think that the array of pairs would work – it would just need to be carried through and tested.  Any thoughts?
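
A minimal sketch of the idea in Java, not the actual CuiCodeUtil. The class name, method names, and the choice to reserve index 0 for the standard “C” + 7 digits (so existing codes stay unchanged) are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixedCuiCodec {

    private static final long MULTIPLIER = 10_000_000L;
    private static final PrefixedCuiCodec INSTANCE = new PrefixedCuiCodec();

    // Parallel lists acting as the array of (digit count, prefix) pairs.
    private final List<String> prefixes = new ArrayList<>();
    private final List<Integer> digitCounts = new ArrayList<>();

    private PrefixedCuiCodec() {
        register("C", 7); // assumption: index 0 is the standard UMLS format
    }

    static PrefixedCuiCodec getInstance() {
        return INSTANCE;
    }

    private int register(String prefix, int digitCount) {
        prefixes.add(prefix);
        digitCounts.add(digitCount);
        return prefixes.size() - 1;
    }

    private int indexOf(String prefix, int digitCount) {
        for (int i = 0; i < prefixes.size(); i++) {
            if (prefixes.get(i).equals(prefix) && digitCounts.get(i) == digitCount) {
                return i;
            }
        }
        return -1;
    }

    synchronized long toCode(String cui) {
        int split = 0;
        while (split < cui.length() && !Character.isDigit(cui.charAt(split))) {
            split++;
        }
        String prefix = cui.substring(0, split);
        String digits = cui.substring(split);
        if (prefix.isEmpty() || digits.isEmpty() || digits.length() > 7) {
            throw new IllegalArgumentException("Unsupported CUI: " + cui);
        }
        int index = indexOf(prefix, digits.length());
        if (index < 0) {
            index = register(prefix, digits.length()); // first time this prefix is seen
        }
        return index * MULTIPLIER + Long.parseLong(digits);
    }

    synchronized String toCui(long code) {
        int index = (int) (code / MULTIPLIER);
        if (code < 0 || index >= prefixes.size()) {
            throw new IllegalArgumentException("Unknown CUI code: " + code);
        }
        long number = code % MULTIPLIER;
        return prefixes.get(index) + String.format("%0" + digitCounts.get(index) + "d", number);
    }

    public static void main(String[] args) {
        PrefixedCuiCodec codec = getInstance();
        for (String cui : new String[] {"C0001234", "CN123456", "NLM003"}) {
            long code = codec.toCode(cui);
            System.out.println(cui + " -> " + code + " -> " + codec.toCui(code));
        }
    }
}
```

Note that the codes for custom prefixes depend on the order in which prefixes are first seen, so they would only be stable within a single run unless the table is persisted.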

Sean
