Posted to dev@ctakes.apache.org by andy mcmurry <mc...@gmail.com> on 2014/11/11 23:02:08 UTC

Announcement: UMLS MedGen-MySQL dataset now available as open access download

Hello!

https://bitbucket.org/invitae/medgen-mysql (Apache Licensed ASL2)

We just released a new library containing a huge chunk of UMLS concepts
that are available without registering accounts, usernames, or passwords.
LEGALLY. Yes, really!

The subset is from NCBI and it contains *thousands of concepts from SNOMED
and other vocabularies*.

The code is essentially:
1. a list of wget targets on various NCBI FTP site mirrors
2. a Makefile for building the databases of interest
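
To make those two pieces concrete, here is a minimal Python sketch of the
same idea: a list of download targets plus a build step that skips files
already on disk. It is not the medgen-mysql code itself. The second entry in
TARGETS and the "downloads" directory are hypothetical placeholders, and only
the README URL is taken from this thread.

```python
import os
import urllib.request
from urllib.parse import urlparse

# Assumption: illustrative targets only. The real project keeps its own list.
BASE_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/"
TARGETS = [
    BASE_URL + "README.html",
    BASE_URL + "example_names.txt.gz",  # hypothetical file name
]


def local_path_for(url, download_dir="downloads"):
    """Map a URL to the local file it would be saved as."""
    filename = os.path.basename(urlparse(url).path)
    if not filename:
        raise ValueError(f"URL has no file name: {url}")
    return os.path.join(download_dir, filename)


def pending_targets(urls, download_dir="downloads"):
    """Return (url, path) pairs not yet on disk, much as make skips built targets."""
    plan = [(url, local_path_for(url, download_dir)) for url in urls]
    return [(url, path) for url, path in plan if not os.path.exists(path)]


def fetch_all(urls, download_dir="downloads"):
    """Download every missing target. The user runs this; no content is bundled."""
    os.makedirs(download_dir, exist_ok=True)
    for url, path in pending_targets(urls, download_dir):
        urllib.request.urlretrieve(url, path)
```

Calling fetch_all(TARGETS) would perform the downloads; nothing is fetched
when the module is merely imported.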

Our legal team has approved distribution for Open Access work, ASL2
LICENSE.

I recommend we use this opportunity to make this the default distribution
for cTAKES UMLS connections, because it obviates the need for so much
painful credentialing and back-and-forth agreements with the US National
Library of Medicine.

Cheers!
--Andy


On Wed, Sep 10, 2014 at 12:13 PM, Masanz, James J. <Ma...@mayo.edu>
wrote:

>
> I would love to see the install be as simple as apt-get install to end up
> with some working dictionary that has more than a handful of entries to
> get them started.
>
> Regards,
> James Masanz
>
> -----Original Message-----
> From: andy mcmurry [mailto:mcmurry.andy@gmail.com]
> Sent: Tuesday, September 09, 2014 4:32 PM
> To: ctakes-dev@incubator.apache.org
> Subject: Recommendation for ctakes default (UMLS) dictionaries
>
> Greetings ctakes-dev:
>
> *UMLS license restrictions have been getting more lax over the years* --
> much of the UMLS can be downloaded directly from the NCBI official FTP
> site.
>
> In fact, the NIH (and implicitly the NLM) *have already made the standard
> terms public for some medical specialities*.
>
> For example: Here is the UMLS subset specific to Medical Genetics (MedGen)
> and Genetic Testing (GTR) complete with SNOMED-CT concept CUI(s) and names,
> etc :
>
> [  ftp://ftp.ncbi.nlm.nih.gov/pub/medgen/README.html  ]
>
> My team has developed a JVM-based wrapper for MetaMap 2013AB, written in
> Clojure, which I intend to open source soon. It includes REST support for
> invoking MetaMap with any or all of the command line arguments.
> We do not integrate with UIMA; we are basically a wrapper around the
> binary installation of MetaMap. The emphasis is on publication text, not
> clinical text. Still, some services are common to both (such as LVG).
>
> Strangely, the NLM still requires a UMLS license to download the MetaMap
> binaries. The MetaMap binary install is better, but customizing its
> dictionaries (DataFileBuilder) is not as easy as with cTAKES and YTEX:
>
> [ https://cwiki.apache.org/confluence/display/CTAKES/YTEX+Installation ]
>
> *Hence, there is a real opportunity here to enable Apache cTAKES to
> have a stronger default dictionary.*
>
> Imagine if we could run
> $ apt-get install apache-ctakes
>
> and instantly have a working package for SOME problem domain.
> In my case (Medical Genetics) the UMLS definitions are already available
> and the UMLS license problem becomes a non-issue, at least for many
> first-time users.
>
> Your thoughts?
> AndyMC
>

RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.
Hi Andy,

Great stuff!  I think that I understand the method, but I have a question about the statement:

>the content is publicly available per the NCBI policy and license for MedGen sources

Does this mean that I, Joe Anybody, could download the content, place some of the content in a database structured in my own fashion, package the -new- database, and include it in a cTAKES distribution?
Or does it mean that content downloaded by script is usable as-is and only as-is? The whole "if I'd known you were going to do that I wouldn't have given it to you ..." problem.

Thanks,
Sean

________________________________________
From: andy mcmurry [mcmurry.andy@gmail.com]
Sent: Thursday, November 13, 2014 6:59 PM
To: dev@ctakes.apache.org
Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Pei: Yes, specifically:

The source code was released by Invitae under Apache ASL 2.0 per my request
and with full blessing from our legal counsel and software team. I also
reviewed the idea in principle with John Wilbanks of Sage Bionetworks (and
formerly Creative Commons). This is legit, or I wouldn't have spent tons of
hours doing it.

The raw content is a set of scripts which wget a list of URLs from the NCBI
public FTP repositories. This code DOES NOT redistribute any content
whatsoever, just a list of URLs to download, unzip, and insert into a local
MySQL database. To repeat: I am NOT circulating any content, just URL links
-- you must download the content yourself. And that is the beauty -- all
content is downloaded BY THE USER and the content is publicly available per
the NCBI policy and license for MedGen sources.



Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by andy mcmurry <mc...@gmail.com>.
Pei: Yes, specifically:

The source code was released by Invitae under Apache ASL 2.0 per my request
and with full blessing from our legal counsel and software team. I also
reviewed the idea in principle with John Wilbanks of Sage Bionetworks (and
formerly Creative Commons). This is legit, or I wouldn't have spent tons of
hours doing it.

The raw content is a set of scripts which wget a list of URLs from the NCBI
public FTP repositories. This code DOES NOT redistribute any content
whatsoever, just a list of URLs to download, unzip, and insert into a local
MySQL database. To repeat: I am NOT circulating any content, just URL links
-- you must download the content yourself. And that is the beauty -- all
content is downloaded BY THE USER and the content is publicly available per
the NCBI policy and license for MedGen sources.
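
As a rough picture of the "unzip, and insert into a local database" half of
that pipeline, here is a minimal Python sketch. It rests on several
assumptions: sqlite3 stands in for MySQL so that it runs without a server,
the three-column pipe-delimited layout (cui|name|source) is hypothetical, and
none of this is taken from the medgen-mysql scripts.

```python
import csv
import gzip
import sqlite3


def read_gzipped_lines(path):
    """Unzip step: yield text lines from a .gz file the user has downloaded."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def load_concepts(conn, lines):
    """Insert step: parse pipe-delimited rows into a local table.

    Assumes a hypothetical cui|name|source layout; lines starting with '#'
    and rows with fewer than three fields are skipped. Returns the number
    of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept (cui TEXT, name TEXT, source TEXT)"
    )
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    rows = [
        tuple(fields[:3])
        for fields in reader
        if len(fields) >= 3 and not fields[0].startswith("#")
    ]
    conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With a file on disk, load_concepts(sqlite3.connect("medgen.db"),
read_gzipped_lines(path)) would chain the two steps; the database file name
is again only an example.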


On Thu, Nov 13, 2014 at 11:18 AM, Chen, Pei <Pe...@childrens.harvard.edu>
wrote:

> John- I believe that was the thinking.
> Andy- Just to confirm- Is the raw content of this dataset released under
> ASL2.0?  i.e. can you contribute it as a CSV or similar so that cTAKES may
> re-tokenize it using the same PTB rules, format it for cTAKES' dictionary
> lookup, etc., and then redistribute it under the same License.
>

RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by "Chen, Pei" <Pe...@childrens.harvard.edu>.
John - I believe that was the thinking.
Andy - Just to confirm: is the raw content of this dataset released under ASL 2.0? That is, can you contribute it as a CSV or similar so that cTAKES may re-tokenize it using the same PTB rules, format it for cTAKES' dictionary lookup, etc., and then redistribute it under the same license?

> -----Original Message-----
> From: John Green [mailto:john.travis.green@gmail.com]
> Sent: Thursday, November 13, 2014 1:55 PM
> To: dev@ctakes.apache.org
> Cc: dev@ctakes.apache.org
> Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available
> as open access download
> 
> The old licensed setup would be kept as a packaged option? Much as it is
> now.... With the unlicensed going out in place of the current "free"
> dictionary? Am I understanding that right?
> 
> 
> JG
> —
> Sent from Mailbox
> 

Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by John Green <jo...@gmail.com>.
So the old licensed setup would be kept as a packaged option, much as it is now, with the unlicensed one going out in place of the current "free" dictionary? Am I understanding that right?


JG
—
Sent from Mailbox

On Thu, Nov 13, 2014 at 1:40 PM, andy mcmurry <mc...@gmail.com>
wrote:

> I'll crunch the numbers -- in the meantime I can tell you that phenotypes
> vary by semantic type. clinical attributes  from SNOMED are abundant, many
> concepts in mesh that are mapped to diseases. Tons of "pharmacological
> substances"

Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by andy mcmurry <mc...@gmail.com>.
I'll crunch the numbers. In the meantime, I can tell you that phenotypes
vary by semantic type. Clinical attributes from SNOMED are abundant, and many
concepts in MeSH are mapped to diseases. There are also tons of
"pharmacological substances".

On Nov 12, 2014 6:19 AM, "Dligach, Dmitriy" <
Dmitriy.Dligach@childrens.harvard.edu> wrote:

> Andy, thank you for this resource!
>
> Do you have an estimate of what percentage of UMLS concepts were left out?
>
> Dima

RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.
Agree with everyone else, this is very exciting.

As far as adopting it as the default, I would like to try it first, of course. But if all that means is replacing the toy dictionary, that sounds great: new users will hopefully stop being confused about why "ctakes doesn't find obvious things." Would we still maintain the SNOMED resources for those who use them? I think that, despite the licensing issues, SNOMED is a widely used standard and some users probably want to continue using those codes. With this new resource we would probably not be able to fill in the SNOMED codes even if there is an equivalent SNOMED concept.

Tim
________________________________________
From: Dligach, Dmitriy [Dmitriy.Dligach@childrens.harvard.edu]
Sent: Wednesday, November 12, 2014 9:19 AM
To: cTAKES Developer list
Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Andy, thank you for this resource!

Do you have an estimate of what percentage of UMLS concepts were left out?

Dima






Re: building a *real sample dictionary* without UMLS login

Posted by "AndyMC@apache.org (forwarding)" <mc...@gmail.com>.
Option 1: 
hg clone https://bitbucket.org/invitae/medgen-mysql
cd medgen-mysql 
make user 
make medgen 

Then we configure the cTAKES DictionaryLookup to use the MySQL database,
which is nicely indexed, customizable, linkable, etc.

Option 2: 
hg clone https://bitbucket.org/invitae/medgen-mysql
cd medgen-mysql 
./mirror.sh medgen/urls 

This will fetch the dictionary files; then replace MRCONSO with MGCONSO from the medgen/mirror directory.

Option 3: 
Directly bake this process right into the cTAKES installation. 
Interested in what you and others feel would be the fastest way to get new users online with cTAKES. 
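
One way to make Options 1 and 2 fail early rather than halfway through is to check for the needed command-line tools first. The sketch below is an illustration only: the function name require_tools is hypothetical, and the tool list in the usage comment (hg, make, mysql) is an assumption about what the medgen-mysql Makefile needs, not something taken from the repository.

```shell
# Report any required commands that are missing from PATH.
# Returns 0 when every tool is present, 1 otherwise.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use with Option 1 (the tool list is an assumption):
# require_tools hg make mysql && {
#   hg clone https://bitbucket.org/invitae/medgen-mysql &&
#   cd medgen-mysql && make user && make medgen
# }
```

Checking up front keeps a missing client from surfacing only after the clone and download steps have already run.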

Hope this helps, 
AndyMC@apache.org

On Oct 2, 2015, at 7:02 AM, "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov> wrote:

> Hi,
> 
> I would be extremely interested in a sample dictionary that
> doesn’t require a UMLS login.
> 
> How would I use this?
> 
> Thanks,
> Chris


Re: building a *real sample dictionary* without UMLS login

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi,

I would be extremely interested in a sample dictionary that
doesn’t require a UMLS login.

How would I use this?

Thanks,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: "AndyMC@apache.org (forwarding)" <mc...@gmail.com>
Reply-To: "dev@ctakes.apache.org" <de...@ctakes.apache.org>
Date: Friday, October 2, 2015 at 12:43 AM
To: "dev@ctakes.apache.org" <de...@ctakes.apache.org>
Subject: building a *real sample dictionary* without UMLS login

>Greetings ctakes-dev!
>
>I have been polishing MedGen (UMLS) dictionaries for over a year now and
>I am confident in saying "this is solid".
>As a reminder, the medgen-mysql package contains a large subset of the
>UMLS that can be downloaded without UMLS login, greatly simplifying the
>creation of an example dictionary.
>
>QUESTION: 
>Would you like me to integrate this into ctakes to simplify installations
>for new-users, and if so, what would be your preferred method?
>
>Source Vocabularies (SAB)
>+-------------+--------+
>| SourceVocab | cnt    |
>+-------------+--------+
>| MSH         | 245435 | Medical Subject Headings
>| SNOMEDCT_US | 156105 | SNOMED Clinical Terms
>| NCI         | 136888 | NCI Cancer Terms
>| ...         |  ...   |
>+-------------+--------+
>
>Semantic Types (STY)
>+-------------------------------------------+--------+
>| SemanticType                              | cnt    |
>+-------------------------------------------+--------+
>| Pharmacologic Substance                   | 102511 |
>| Finding                                   |  90413 |
>| Organic Chemical                          |  81329 |
>| Disease or Syndrome                       |  47223 |
>| Neoplastic Process                        |  16151 |
>| Amino Acid, Peptide, or Protein           |   9383 |
>| Congenital Abnormality                    |   6536 |
>| Pathologic Function                       |   5655 |
>| Steroid                                   |   3919 |
>| Sign or Symptom                           |   2909 |
>| ...                                       |   ...  |
>
>
>What would you like to see?
>AndyMC@apache.org	


building a *real sample dictionary* without UMLS login

Posted by "AndyMC@apache.org (forwarding)" <mc...@gmail.com>.
Greetings ctakes-dev! 

I have been polishing MedGen (UMLS) dictionaries for over a year now and I am confident in saying "this is solid". 
As a reminder, the medgen-mysql package contains a large subset of the UMLS that can be downloaded without UMLS login, greatly simplifying the creation of an example dictionary. 

QUESTION: 
Would you like me to integrate this into cTAKES to simplify installation for new users, and if so, what would be your preferred method?

Source Vocabularies (SAB)
+-------------+--------+--------------------------+
| SourceVocab | cnt    | Description              |
+-------------+--------+--------------------------+
| MSH         | 245435 | Medical Subject Headings |
| SNOMEDCT_US | 156105 | SNOMED Clinical Terms    |
| NCI         | 136888 | NCI Cancer Terms         |
| ...         |  ...   | ...                      |
+-------------+--------+--------------------------+

Semantic Types (STY)
+-------------------------------------------+--------+
| SemanticType                              | cnt    |
+-------------------------------------------+--------+
| Pharmacologic Substance                   | 102511 |
| Finding                                   |  90413 |
| Organic Chemical                          |  81329 |
| Disease or Syndrome                       |  47223 |
| Neoplastic Process                        |  16151 |
| Amino Acid, Peptide, or Protein           |   9383 |
| Congenital Abnormality                    |   6536 |
| Pathologic Function                       |   5655 |
| Steroid                                   |   3919 |
| Sign or Symptom                           |   2909 |
| ...                                       |   ...  |
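
For anyone who wants to sanity-check counts like these without a database, a per-vocabulary tally can be approximated straight from a pipe-delimited dump. This is a rough sketch and not necessarily how the tables above were produced: the function name count_by_column is hypothetical, and it assumes the file has a header row naming its columns (for example one containing SAB), which should be verified against the actual file.

```shell
# Count rows per distinct value of a named column in a pipe-delimited
# file whose first line is a header. Prints "count<TAB>value",
# largest count first.
# Usage: count_by_column FILE COLUMN_NAME
count_by_column() {
  awk -F'|' -v name="$2" '
    NR == 1 {
      sub(/^#/, "", $1)                      # tolerate a leading "#" on the header
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
      next
    }
    $col != "" { count[$col]++ }             # skip rows with an empty value
    END { for (v in count) print count[v] "\t" v }
  ' "$1" | sort -k1,1nr -k2,2
}

# Example (file name is an assumption):
# count_by_column medgen/mirror/MGCONSO.RRF SAB
```

Looking the column up by header name, rather than by position, avoids hard-coding a layout that may differ between files.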


What would you like to see?
AndyMC@apache.org	




Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by "Dligach, Dmitriy" <Dm...@childrens.harvard.edu>.
Andy, thank you for this resource!

Do you have an estimate of what percentage of UMLS concepts were left out?

Dima






RE: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by "Savova, Guergana" <Gu...@childrens.harvard.edu>.
This is great!!!! Thank you so much, Andy!!!
I agree that it will make life for many users MUCH easier.
--guergana

-----Original Message-----
From: Jay Vyas [mailto:jayunit100.apache@gmail.com] 
Sent: Tuesday, November 11, 2014 5:31 PM
To: dev@ctakes.apache.org
Subject: Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

+1000 on this!  Great, let's make a JIRA!!!


Re: Announcement: UMLS MedGen-MySQL dataset now available as open access download

Posted by Jay Vyas <ja...@gmail.com>.
+1000 on this!  Great, let's make a JIRA!!!
