Posted to user@ctakes.apache.org by Niraj Shrestha <ns...@gmail.com> on 2016/08/02 12:13:32 UTC

ctakes concept and relation extraction

Dear Sir,
I am trying to extract named entities and their relations from medical
documents. If I understand correctly, concepts are basically entities.
I have used two different analysis engines:
     AggregatePlaintextFastUMLSProcessor.xml for concept extraction and
     RelationExtractorAggregate for relation extraction.

My first question is: how can I combine both engines to obtain concepts and
relations in a single file?

If I understand correctly, to extract all the entities (concepts) I need to
get all the "org.apache.ctakes.typesystem.type.refsem.UmlsConcept" nodes
from the output XML file. But how can I choose a single entity or concept
from a list of many concepts?

Also, what is the FSArray in which all the concept ids are listed?

I found that some concepts appear in the output data although they are not
mentioned in the input data. For example, when I run the following engine
on the "note.txt" file:

<import
location="../analysis_engine/AggregatePlaintextFastUMLSProcessor.xml"/>
The output file is "note.txt4.xml" (attached here).

One of the concepts is the following, where "Kidney" is given as the
preferredText, but the word "kidney" is not found in the input data.

<org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4503"
codingScheme="SNOMEDCT" code="64033007" oid="64033007#SNOMEDCT" score="0.0"
disambiguated="false" cui="C0022646" tui="T023" preferredText="Kidney"/>
    <org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4493"
codingScheme="SNOMEDCT" code="17373004" oid="17373004#SNOMEDCT" score="0.0"
disambiguated="false" cui="C0227665" tui="T023" preferredText="Both
kidneys"/>
    <org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4483"
codingScheme="SNOMEDCT" code="181414000" oid="181414000#SNOMEDCT"
score="0.0" disambiguated="false" cui="C1278978" tui="T023"
preferredText="Entire kidney"/>
    <uima.cas.FSArray _id="4513" size="3">
        <i>4483</i>
        <i>4493</i>
        <i>4503</i>
    </uima.cas.FSArray>


************************************
My next query concerns relation extraction, for which I use the following
engine.

<import
location="../../../ctakes-relation-extractor/desc/analysis_engine/RelationExtractorAggregate.xml"/>
The output file is "note.txt_relation.xml" (attached here).

I am not able to interpret the output file (note.txt_relation.xml). It lists
relations and their locations, but I cannot figure out which entities are
involved, or what the relation between them is, in terms of words.

For example:

<org.apache.ctakes.typesystem.type.relation.RelationArgument _indexed="1"
_id="12422" id="0" _ref_argument="10680" role="Argument"/>
    <org.apache.ctakes.typesystem.type.relation.RelationArgument
_indexed="1" _id="12427" id="0" _ref_argument="10989" role="Related_to"/>
    <org.apache.ctakes.typesystem.type.relation.RelationArgument
_indexed="1" _id="12446" id="0" _ref_argument="10680" role="Argument"/>
.
.
.
.
<org.apache.ctakes.typesystem.type.relation.RelationArgument _indexed="1"
_id="12851" id="0" _ref_argument="12181" role="Related_to"/>
    <org.apache.ctakes.typesystem.type.relation.LocationOfTextRelation
_indexed="1" _id="12432" id="0" category="location_of"
discoveryTechnique="0" confidence="0.0" polarity="0" uncertainty="0"
conditional="false" _ref_arg1="12422" _ref_arg2="12427"/>
    <org.apache.ctakes.typesystem.type.relation.LocationOfTextRelation
_indexed="1" _id="12456" id="0" category="location_of"
discoveryTechnique="0" confidence="0.0" polarity="0" uncertainty="0"
conditional="false" _ref_arg1="12446" _ref_arg2="12451"/>
    <org.apache.ctakes.typesystem.type.relation.LocationOfTextRelation
_indexed="1" _id="12480" id="0" category="location_of"
discoveryTechnique="0" confidence="0.0" polarity="0" uncertainty="0"
conditional="false" _ref_arg1="12470" _ref_arg2="12475"/>
    <org.apache.ctakes.typesystem.type.relation.LocationOfTextRelation
_indexed="1" _id="12508" id="0" category="location_of"
discoveryTechnique="0" confidence="0.0" polarity="0" uncertainty="0"
conditional="false" _ref_arg1="12498" _ref_arg2="12503"/>


Sorry for so many long queries at once.

Thanks a lot in advance for your suggestions.

With regards,
Shrestha

Re: ctakes concept and relation extraction

Posted by Niraj Shrestha <ns...@gmail.com>.

Hi Timothy,
Thanks for the prompt reply.
Is it possible to use IdentifiedAnnotation in the CPE?
I saw IdentifiedAnnotation in the CVD, which selects one concept from the
collection.
I would like to use the CPE since I need to process many documents. I believe
I cannot run the CVD on many documents, am I right?

Regards,
Shrestha

On Tue, Aug 2, 2016 at 3:11 PM, Miller, Timothy <
Timothy.Miller@childrens.harvard.edu> wrote:

> I don't know if there is a single pipeline that does both concepts and
> relations. If not, you will have to use uimaFIT calls to add additional
> extractors to the fast pipeline descriptor you are currently using.
>
> You may want "IdentifiedAnnotation" and its subclasses as your type
> because it has a definite span. Each IA may correspond to a number of
> different concepts in the UMLS dictionary, so we have a data structure
> that contains all the matches for a given span. That is the FSArray (a
> UIMA data type; the name stands for feature structure array). The UMLS
> dictionary annotators will create UmlsConcept instances in that array
> based on the results of the dictionary lookup.
> Finding the "best" one for any span is not something that cTAKES will do
> for you; it probably depends on your application. Sometimes we output
> them all, sometimes we output the first one. You may need to dig in to
> see how many of them are relevant and filter against a subset of things
> you are looking for.
>
>
> Looks like the word "kidney" is indeed in the input:
>
> > human embryo kidney 293T cells
>
> cTAKES will find mentions even as modifiers inside larger phrases.
>
>
> Finally, I would not try to interpret the UIMA XML manually. I would use
> the UIMA CVD (CAS Visual Debugger) to read the .xmi files that cTAKES
> outputs. (I believe they should be xmi.)
>
> Tim
>
>

Re: ctakes concept and relation extraction

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

I don't know if there is a single pipeline that does both concepts and
relations. If not, you will have to use uimaFIT calls to add additional
extractors to the fast pipeline descriptor you are currently using.

You may want "IdentifiedAnnotation" and its subclasses as your type
because it has a definite span. Each IA may correspond to a number of
different concepts in the UMLS dictionary, so we have a data structure
that contains all the matches for a given span. That is the FSArray (a
UIMA data type; the name stands for feature structure array). The UMLS
dictionary annotators will create UmlsConcept instances in that array
based on the results of the dictionary lookup.
Finding the "best" one for any span is not something that cTAKES will do
for you; it probably depends on your application. Sometimes we output
them all, sometimes we output the first one. You may need to dig in to
see how many of them are relevant and filter against a subset of things
you are looking for.
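
For output like the sample quoted in this thread, following the FSArray
entries back to their UmlsConcept elements can be scripted. The following is
only a minimal Python sketch and not part of cTAKES. The <CAS> wrapper
element, the function names concepts_per_array and pick_concept, and the
selection rule (first concept whose CUI is in a wanted set, otherwise the
first one in the array) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

UMLS_TAG = "org.apache.ctakes.typesystem.type.refsem.UmlsConcept"

# Sample taken from the message above; the <CAS> wrapper is an assumption.
SAMPLE_CONCEPTS = """<CAS>
<org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4503"
 codingScheme="SNOMEDCT" code="64033007" cui="C0022646" tui="T023"
 preferredText="Kidney"/>
<org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4493"
 codingScheme="SNOMEDCT" code="17373004" cui="C0227665" tui="T023"
 preferredText="Both kidneys"/>
<org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4483"
 codingScheme="SNOMEDCT" code="181414000" cui="C1278978" tui="T023"
 preferredText="Entire kidney"/>
<uima.cas.FSArray _id="4513" size="3">
  <i>4483</i>
  <i>4493</i>
  <i>4503</i>
</uima.cas.FSArray>
</CAS>"""


def concepts_per_array(root):
    """Map each FSArray _id to the UmlsConcept elements it points to."""
    by_id = {el.get("_id"): el for el in root.iter() if el.get("_id")}
    result = {}
    for arr in root.iter("uima.cas.FSArray"):
        members = [by_id.get((i.text or "").strip()) for i in arr.findall("i")]
        # Keep only entries that resolve to UmlsConcept elements.
        result[arr.get("_id")] = [
            m for m in members if m is not None and m.tag == UMLS_TAG
        ]
    return result


def pick_concept(concepts, wanted_cuis=None):
    """Hypothetical rule: first concept with a wanted CUI, else the first."""
    if wanted_cuis:
        for concept in concepts:
            if concept.get("cui") in wanted_cuis:
                return concept
    return concepts[0] if concepts else None


if __name__ == "__main__":
    groups = concepts_per_array(ET.fromstring(SAMPLE_CONCEPTS))
    for array_id, concepts in groups.items():
        chosen = pick_concept(concepts, wanted_cuis={"C0022646"})
        print(array_id, chosen.get("cui"), chosen.get("preferredText"))
```

Which rule is right depends on the application, so pick_concept is the
part to replace with your own filter.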


Looks like the word "kidney" is indeed in the input:

> human embryo kidney 293T cells

cTAKES will find mentions even as modifiers inside larger phrases.


Finally, I would not try to interpret the UIMA XML manually. I would use
the UIMA CVD (CAS Visual Debugger) to read the .xmi files that cTAKES
outputs. (I believe they should be xmi.)
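
If the XML does have to be read by a script, for example across many
documents, the references in the relation output can be followed
mechanically: _ref_arg1 and _ref_arg2 point to RelationArgument elements,
and each _ref_argument points to an annotation. The Python sketch below
shows this under stated assumptions. The referenced annotations are not
shown in the quoted output, so the two annotation elements, their begin and
end offsets, the document text, the <CAS> wrapper and the function name
relation_texts are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up document text; in practice this would be the content of note.txt.
DOC_TEXT = "pain in the left kidney"

# The relation and its arguments use ids from the quoted output. The two
# annotation elements and their begin/end offsets are hypothetical.
SAMPLE_RELATIONS = """<CAS>
<org.apache.ctakes.typesystem.type.textsem.SignSymptomMention
 _id="10680" begin="0" end="4"/>
<org.apache.ctakes.typesystem.type.textsem.AnatomicalSiteMention
 _id="10989" begin="12" end="23"/>
<org.apache.ctakes.typesystem.type.relation.RelationArgument
 _id="12422" _ref_argument="10680" role="Argument"/>
<org.apache.ctakes.typesystem.type.relation.RelationArgument
 _id="12427" _ref_argument="10989" role="Related_to"/>
<org.apache.ctakes.typesystem.type.relation.LocationOfTextRelation
 _id="12432" category="location_of" _ref_arg1="12422" _ref_arg2="12427"/>
</CAS>"""


def relation_texts(root, doc_text):
    """Return (category, arg1 text, arg2 text) for every relation element."""
    by_id = {el.get("_id"): el for el in root.iter() if el.get("_id")}
    rows = []
    for rel in root.iter():
        if rel.get("_ref_arg1") is None or rel.get("_ref_arg2") is None:
            continue  # not a relation element
        texts = []
        for key in ("_ref_arg1", "_ref_arg2"):
            arg = by_id.get(rel.get(key))  # the RelationArgument
            ann = by_id.get(arg.get("_ref_argument")) if arg is not None else None
            if ann is None or ann.get("begin") is None or ann.get("end") is None:
                texts.append(None)  # reference could not be resolved
            else:
                texts.append(doc_text[int(ann.get("begin")):int(ann.get("end"))])
        rows.append((rel.get("category"), texts[0], texts[1]))
    return rows


if __name__ == "__main__":
    for row in relation_texts(ET.fromstring(SAMPLE_RELATIONS), DOC_TEXT):
        print(row)
```

On this made-up sample the relation comes out as location_of between
"pain" and "left kidney". Real output should be checked against the CVD
view before trusting a script like this.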

Tim



Re: ctakes concept and relation extraction

Posted by Niraj Shrestha <ns...@gmail.com>.

Dear Sir,
I have updated the message I sent earlier.

On Tue, Aug 2, 2016 at 2:13 PM, Niraj Shrestha <ns...@gmail.com> wrote:

> Dear Sir,
> I am trying to extract named entities and their relations from medical
> documents. If I understand correctly, concepts are basically entities.
> I have used two different analysis engines:
>      AggregatePlaintextFastUMLSProcessor.xml for concept extraction and
>      RelationExtractorAggregate for relation extraction.
>
> My first question is: how can I combine both engines to obtain concepts and
> relations in a single file?
>
> If I understand correctly, to extract all the entities (concepts) I need to
> get all the "org.apache.ctakes.typesystem.type.refsem.UmlsConcept" nodes
> from the output XML file. But how can I choose a single entity or concept
> from a list of many concepts?
>
> Also, what is the FSArray in which all the concept ids are listed?
>
> I found that some concepts appear in the output data although they are not
> mentioned in the input data. For example, when I run the following engine
> on the "note.txt" file:
>
> <import
> location="../analysis_engine/AggregatePlaintextFastUMLSProcessor.xml"/>
> The output file is "note.txt4.xml" (attached here).
>
> One of the concepts is the following, where "Kidney" is given as the
> preferredText. (Text deleted here.)
>
> <org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4503"
> codingScheme="SNOMEDCT" code="64033007" oid="64033007#SNOMEDCT" score="0.0"
> disambiguated="false" cui="C0022646" tui="T023" preferredText="Kidney"/>
>     <org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4493"
> codingScheme="SNOMEDCT" code="17373004" oid="17373004#SNOMEDCT" score="0.0"
> disambiguated="false" cui="C0227665" tui="T023" preferredText="Both
> kidneys"/>
>     <org.apache.ctakes.typesystem.type.refsem.UmlsConcept _id="4483"
> codingScheme="SNOMEDCT" code="181414000" oid="181414000#SNOMEDCT"
> score="0.0" disambiguated="false" cui="C1278978" tui="T023"
> preferredText="Entire kidney"/>
>     <uima.cas.FSArray _id="4513" size="3">
>         <i>4483</i>
>         <i>4493</i>
>         <i>4503</i>
>     </uima.cas.FSArray>
>

If I need to pick just one of the concepts above, how should I choose?
