Posted to user@ctakes.apache.org by "Yadav, Harish" <hy...@live.unc.edu> on 2018/03/28 01:33:17 UTC

Sentence extraction from cTAKES XML output.

Hi All,

I am trying to extract sentences from cTAKES XML output. I take the begin and end attributes of an org.apache.ctakes.typesystem.type.textspan.Sentence annotation (for example, begin="5740" and end="5749") and slice the input text between those character offsets. However, the extracted text is often not the complete sentence, and it sometimes misses the concept (the CUI's preferred text) as well.

I am also analyzing the sentences in which each concept is tagged, so I need them to be complete. Any pointers would be of great help.

Regards,
Harish.

RE: Sentence extraction from cTAKES XML output.

Posted by "Yadav, Harish" <hy...@live.unc.edu>.
Hi Reed,

Thanks for the additional comment on Python.

I get the exact sentence if, instead of using the text in the sofaString attribute of the cTAKES XML output, I open the original text (not processed by cTAKES) with UTF-8 encoding and slice it using the begin and end offsets.

Apparently the sofaString in the XML output carries extra characters, such as &amp; in place of & (XML escaping), so its length does not match the length of the original text (not processed by cTAKES). Note that all begin and end offsets are relative to this unprocessed text.
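
A minimal sketch of the same observation, assuming the XMI is read with an XML parser rather than as raw file text. The parser turns &amp; back into &, so the begin and end offsets should line up with sofaString again. The XMI snippet, its text, and its offsets are hand-written stand-ins, not real cTAKES output.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for a cTAKES XMI file (not real cTAKES output).
XMI = (
    '<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" '
    'xmlns:cas="http:///uima/cas.ecore" '
    'xmlns:textspan="http:///org/apache/ctakes/typesystem/type/textspan.ecore">'
    '<cas:Sofa xmi:id="1" sofaNum="1" '
    'sofaString="pt c/o pain &amp; sob. no significant jvd."/>'
    '<textspan:Sentence xmi:id="2" sofa="1" begin="19" end="38"/>'
    '</xmi:XMI>'
)

root = ET.fromstring(XMI)

# The XML parser unescapes entities, so "&amp;" is a single "&" again.
sofa_text = next(el.get("sofaString") for el in root.iter()
                 if el.get("sofaString") is not None)

# Matching on the tag suffix avoids hard-coding the namespace URI.
sentences = [sofa_text[int(el.get("begin")):int(el.get("end"))]
             for el in root.iter() if el.tag.endswith("}Sentence")]

# For contrast: the escaped attribute text exactly as it sits in the file.
escaped_text = re.search(r'sofaString="([^"]*)"', XMI).group(1)
shifted = escaped_text[19:38]

print(sentences)      # ['no significant jvd.']
print(repr(shifted))  # 'ob. no significant '
```

Slicing the escaped text gives a span that starts too early, much like the "es; no significant" result earlier in this thread.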

Regards,
Harish.

From: Reed Villanueva <vi...@gmail.com>
Sent: Monday, April 2, 2018 2:38 AM
To: user@ctakes.apache.org
Subject: Re: Sentence extraction from cTAKES XML output.


Using Python to play a bit with some test XMIs, I get results like:

        import unicodedata

        # `sofa` is assumed to hold the attributes of the XMI Sofa element
        # length of the XMI (Unicode) raw text
        print(len(sofa['sofaString']))
        # ==> 1711

        # length of the raw text after converting to ASCII
        print(len(unicodedata
                  .normalize('NFKD', sofa['sofaString'])
                  .encode('ascii', 'ignore')))
        # ==> 1707

So if you wanted to remove all of the non-ASCII characters that are being counted as having a size greater than 1, you could try encoding the sofa Unicode string to ASCII the way I do above. However, I don't know whether this will 'jive well' with the sentence "begin" and "end" index values (i.e. whether they will still point to the correct locations in the converted text). For more on the Unicode/ASCII differences, see https://stackoverflow.com/a/19212345/8236733.



On Sun, Apr 1, 2018 at 7:22 PM, Yadav, Harish <hy...@live.unc.edu> wrote:
Hi Reed,

Thanks for responding. Please find my replies to your points below:


  1.  Yes, I am extracting the sentence from “sofaString” using the begin and end indices of the sentence. I am using Python to do this.

One interesting thing I found is that a couple of symbols, namely [image] (the dot before "aspirin") and [image] (the degree sign before "F"), are counted as 3 and 2 characters respectively by Python but as single characters by cTAKES. This hinted that some characters above code point 128, which take several bytes in UTF-8, are counted as one character by cTAKES but as several by Python. So I replaced those characters with a single space, and this brought me very close to “no significant jvd”, i.e. to “es; no significant”.

Also, the segment tag in the cTAKES output (see the snapshot below) gives the total length of the raw_text in “sofaString” as 13135, whereas the length of “sofaString” in Python, after replacing characters such as the dot and the degree sign with single characters, is 13158.

This leaves a lag of a few characters, and I am not sure which other characters cTAKES might be counting as one while Python counts them as more than one. Any ideas on this?

[image: snapshot of the segment tag]


  2.  I have a large and diverse dataset of raw_text with 50,000 other files, so going through the ctakesCVD manually to check how cTAKES assigns the begin and end tags for sentences under different conditions would not be feasible.

Regards,
Harish.

From: Reed Villanueva <vi...@gmail.com>
Sent: Sunday, April 1, 2018 11:39 PM
To: user@ctakes.apache.org
Subject: Re: Sentence extraction from cTAKES XML output.

There are a few things I can think of to check:

1. The indices may be shifted relative to the representation of the raw text that cTAKES actually uses. For example, there should be a tag in the XMI called "Sofa" with the attributes "sofaNum" and "sofaString". I think the sofaString text is what the sentence begin/end indices actually refer to (I only started looking at cTAKES a few weeks ago).
2. Use the ctakesCVD to manually go through the sentences (in XXX) and see how they are segmented when you run whatever AE you're using here.
3. If anything, it may just be easier to code something that expands the extracted substring to the nearest delimiters (i.e. '.' or ';' characters) within the larger raw text (not really an answer so much as a workaround :P).

By the way, it's not just period characters that I have seen cause confusion when segmenting sentences. I have also seen odd sentence segmentation around close-parens ')' and semicolons ';'; the example I gave earlier was just a very clear illustration of this. E.g. the sentence

* "Example A: Buggsy R. presents as a 53 year old divorced Latina who has been working for the (name of employer) as a (job title) for the last 22 years."

gets segmented as

* "Example A : Buggsy R.  presents as a  53 year old divorced Latina who has been working for the (name of employer)"

* "as a  (job title)"

* "for the last 22 years."

On Wed, Mar 28, 2018 at 10:18 PM, Yadav, Harish <hy...@live.unc.edu> wrote:
Hi,
I can see that UmlsConcept.preferredText provides the term “Orthopnea”, which is the medical concept for the CUI C0085619.

However, I am not just looking for the concept corresponding to the begin and end tags of textspan.Sentence; I am trying to get the whole sentence that contains the concept. In the earlier example, the concept and the whole sentence happen to be the same, i.e. Orthopnea. Here is a different case that better illustrates my problem of extracting the relevant sentence:

Step 1
[image]

Step 2
[image]

Step 3
[image]

textspan.Sentence gives begin = 5117 and end = 5136, and raw_text[5117:5136] gives: es; no significant

Instead, the output should have been: no significant jvd (capturing the whole phrase/sentence in which the concept jvd, jugular venous engorgement, appears)

The snapshot of the raw_text (marked with a red box):
[image]

Also, if you suspect that the period characters in the raw text cause this issue, do you think that slicing raw_text[5117:5136] after removing the period characters would give “no significant jvd”?

I am viewing the XMI in Notepad++, i.e. I open the output.xml file generated by cTAKES directly in Notepad++ and take the tags shown in the snapshots.

Regards,
Harish.




From: Reed Villanueva [mailto:villanuevareed@gmail.com]
Sent: Wednesday, March 28, 2018 2:42 AM
To: user@ctakes.apache.org
Subject: Re: Sentence extraction from cTAKES XML output.

Just looking at what you wrote as the desired output, it looks like you just want the associated ontology concept text (i.e. in this case input=<the XMI document>, output="Orthopnea"). Is this correct? Note that for the annotation mention that you showed (i.e. the SignSymptomMention), the ref_ontologyConceptArr maps to the UmlsConcept _id value. The documentation seems sparse on exactly how the relationships in the cTAKES XMI output work, but this relation between annotation mentions and UMLS concept tags seems to hold across all the other XMIs that I have seen. You could use this relation to get the UmlsConcept.preferredText output that (I think) you are looking for.
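
The mapping described above can be sketched roughly as follows. This assumes the XMI form of the output, where ontologyConceptArr holds space-separated xmi:id references (as in the SignSymptomMention tag quoted later in this message) and the UmlsConcept element carries cui and preferredText attributes. The snippet itself is a hand-written stand-in, and preferred_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xmi:id attribute as ElementTree reports it.
XMI_ID = "{http://www.omg.org/XMI}id"

# Hand-written stand-in for a cTAKES XMI file (not real cTAKES output).
XMI = (
    '<xmi:XMI xmlns:xmi="http://www.omg.org/XMI" '
    'xmlns:refsem="http:///org/apache/ctakes/typesystem/type/refsem.ecore" '
    'xmlns:textsem="http:///org/apache/ctakes/typesystem/type/textsem.ecore">'
    '<refsem:UmlsConcept xmi:id="340" cui="C0085619" preferredText="Orthopnea"/>'
    '<textsem:SignSymptomMention xmi:id="353" sofa="1" begin="46" end="55" '
    'ontologyConceptArr="340"/>'
    '</xmi:XMI>'
)


def preferred_texts(root):
    """Map each mention span (begin, end) to the preferredText of its concepts."""
    by_id = {el.get(XMI_ID): el for el in root.iter() if el.get(XMI_ID)}
    result = {}
    for el in root.iter():
        refs = el.get("ontologyConceptArr")
        if refs is None:
            continue
        span = (int(el.get("begin")), int(el.get("end")))
        # Assumption: the attribute lists xmi:id values separated by spaces.
        result[span] = [by_id[r].get("preferredText")
                        for r in refs.split() if r in by_id]
    return result


mentions = preferred_texts(ET.fromstring(XMI))
print(mentions)  # {(46, 55): ['Orthopnea']}
```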

I don't know anything about how cTAKES parses sentence segments, but I notice that the raw text you provided has a lot of period characters for abbreviations. cTAKES seems to have problems segmenting these kinds of sentences; e.g. here are the sentence segments I get when feeding an abbreviation-heavy string into the ctakesCVD using AggregatePlainTextFastUMLSPipeline.xmi:

"
[pt.]
[desc.]
[not having any reason to con't.]
[living;]
[clinical depression.]
"

This could be the reason for some weirdness in trying to extract sentence information from the XMI fields.

By the way, what are you using to view the XMI? The tags in your images look different from what I see when running cTAKES; e.g. mine look like

<textsem:SignSymptomMention xmi:id="353" sofa="1" begin="46" end="55" id="0" ontologyConceptArr="340" typeID="3" discoveryTechnique="1" confidence="0.0" polarity="1" uncertainty="0" conditional="false" generic="false" subject="patient" historyOf="0"/>

Hope this helps.

On Tue, Mar 27, 2018 at 8:06 PM, Yadav, Harish <hy...@live.unc.edu> wrote:
Hi Reed,

Thanks for responding. Below is an example, along with the output I am trying to get.

cTAKES processes the raw_text (a clinical document) and produces its output as XML. The snapshots below show what I am trying to extract from that XML:

Step 1
Find the CUI and the id in the XML (in the snapshot below, the CUI is C0085619 and the id is 39838, marked with a red rectangle).
[image]

Step 2
Find the begin and end tags for the corresponding CUI (in the snapshot below, begin = 5740 and end = 5749, marked with a red rectangle).
[image]

Step 3
Find the begin and end tags for the sentence containing the corresponding CUI (in the snapshot below, begin = 5740 and end = 5750, marked with a red rectangle).
[image]
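
Steps 2 and 3 above amount to finding the sentence span that covers a mention span. A minimal sketch, assuming the spans have already been read out of the XML as (begin, end) pairs; covering_sentence is a hypothetical helper, and the two neighbouring sentence spans are made up for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (begin, end), end exclusive, as in the XML attributes


def covering_sentence(sentences: List[Span], mention: Span) -> Optional[Span]:
    """Return the first sentence span that fully contains the mention span."""
    m_begin, m_end = mention
    for s_begin, s_end in sentences:
        if s_begin <= m_begin and m_end <= s_end:
            return (s_begin, s_end)
    return None


# The middle span and the mention use the offsets quoted in the steps above;
# the first and last sentence spans are invented neighbours.
sentences = [(5700, 5739), (5740, 5750), (5751, 5790)]
found = covering_sentence(sentences, (5740, 5749))
print(found)  # (5740, 5750)
```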


Now, when I try to get the complete sentence in which the CUI was tagged from the raw_text (the clinical document that was fed to cTAKES as input), using the sentence begin and end tags from Step 3 and simply evaluating raw_text[5740:5750], I get the output:

OUTPUT :- o pnd ort

Instead, I was expecting the complete sentence from the raw_text: Orthopnea (since the CUI corresponds to Orthopnea, the tagged sentence should contain the tagged concept, i.e. Orthopnea).

Below is the snippet of the raw_text in which I have marked, with a red rectangular box, the sentence that yields “o pnd. ort” instead of “orthopnea”:

[image]

Please let me know if you have any questions about the example or the output I am trying to get.

Regards,
Harish.



From: Reed Villanueva [mailto:villanuevareed@gmail.com]
Sent: Wednesday, March 28, 2018 12:35 AM
To: user@ctakes.apache.org
Subject: Re: Sentence extraction from cTAKES XML output.

Could you provide an example of the problem you are seeing and a bit more about the kind of output you are trying to end up with?








Re: Sentence extraction from cTAKES XML output.

Posted by Reed Villanueva <vi...@gmail.com>.
Using Python to play a bit with some test XMIs, I get results like:

        import unicodedata

        # `sofa` is assumed to hold the attributes of the XMI Sofa element
        # length of the XMI (Unicode) raw text
        print(len(sofa['sofaString']))
        # ==> 1711

        # length of the raw text after converting to ASCII
        print(len(unicodedata
                  .normalize('NFKD', sofa['sofaString'])
                  .encode('ascii', 'ignore')))
        # ==> 1707

So if you wanted to remove all of the non-ASCII characters that are being
counted as having a size greater than 1, you could try encoding the sofa
Unicode string to ASCII the way I do above. However, I don't know whether
this will 'jive well' with the sentence "begin" and "end" index values
(i.e. whether they will still point to the correct locations in the
converted text). For more on the Unicode/ASCII differences, see
https://stackoverflow.com/a/19212345/8236733.
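
A minimal sketch of the caveat above, using a made-up string rather than real cTAKES output. Once a non-ASCII character is dropped by the ASCII conversion, every offset after it moves by one, so begin and end values computed on the original text no longer point at the same span.

```python
import unicodedata

# Made-up text; "\u00b0" is the degree sign.
text = "temp 101 \u00b0F; no significant jvd"

# Offsets of the last sentence, computed on the original text.
begin = text.index("no significant")
end = len(text)

# Same conversion as above, decoded back to str so it can be sliced.
ascii_text = (unicodedata.normalize("NFKD", text)
              .encode("ascii", "ignore")
              .decode("ascii"))

print(len(text), len(ascii_text))  # 31 30
print(text[begin:end])             # no significant jvd
print(ascii_text[begin:end])       # o significant jvd
```

Under this assumption it seems safer to keep the text as a Unicode string and leave the offsets untouched.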




RE: Sentence extraction from cTAKES XML output.

Posted by "Yadav, Harish" <hy...@live.unc.edu>.
Hi Reed,

Thanks for responding. Please find my replies to your points below:


  1.  Yes, I am extracting the sentence from “sofaString” using the begin and end indices of the sentence. I am using Python to do this.

One interesting thing I found is that a couple of symbols, namely [image] (the dot before "aspirin") and [image] (the degree sign before "F"), are counted as 3 and 2 characters respectively by Python but as single characters by cTAKES. This hinted that some characters above code point 128, which take several bytes in UTF-8, are counted as one character by cTAKES but as several by Python. So I replaced those characters with a single space, and this brought me very close to “no significant jvd”, i.e. to “es; no significant”.

Also, the segment tag in the cTAKES output (see the snapshot below) gives the total length of the raw_text in “sofaString” as 13135, whereas the length of “sofaString” in Python, after replacing characters such as the dot and the degree sign with single characters, is 13158.

This leaves a lag of a few characters, and I am not sure which other characters cTAKES might be counting as one while Python counts them as more than one. Any ideas on this?

[image: snapshot of the segment tag]


  2.  I have a large and diverse dataset of raw_text with 50,000 other files, so going through the ctakesCVD manually to check how cTAKES assigns the begin and end tags for sentences under different conditions would not be feasible.
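
On the 3-versus-1 and 2-versus-1 counts in point 1: a bullet and a degree sign take 3 and 2 bytes in UTF-8, so those are the counts a byte string reports. cTAKES runs on Java, whose strings are indexed in UTF-16 code units, so the same characters count as one each there. A minimal sketch of the three ways of counting; lengths is a hypothetical helper, and the assumption that the symbols in the images are U+2022 and U+00B0 is a guess.

```python
def lengths(s):
    """Count one string three ways: code points, UTF-8 bytes, UTF-16 units."""
    return {
        "code_points": len(s),                         # Python 3 str length
        "utf8_bytes": len(s.encode("utf-8")),          # length of a byte string
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # Java String length
    }


bullet = lengths("\u2022")  # assumed to be the dot before "aspirin"
degree = lengths("\u00b0")  # assumed to be the degree sign before "F"

print(bullet)  # {'code_points': 1, 'utf8_bytes': 3, 'utf16_units': 1}
print(degree)  # {'code_points': 1, 'utf8_bytes': 2, 'utf16_units': 1}
```

Decoding the file to a Unicode string before slicing (for example, opening it with encoding="utf-8" in Python 3) should make these characters count as one each.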

Regards,
Harish.





Re: Sentence extraction from cTAKES XML output.

Posted by Reed Villanueva <vi...@gmail.com>.
There are three things I can think of to check:

1. The indices may be shifted relative to the representation of the raw
text that ctakes is actually considering. For example, there should be a
tag in the XMI called "Sofa" that has attributes "sofaNum" and
"sofaString". I think the sofaString text is what the sentence begin-end
indices actually refer to (I only started looking at ctakes a few weeks
ago).
2. Use the ctakesCVD to go through the sentences manually (in XXX) and
see how they are segmented when you run whatever AE you're using here.
3. If anything, it may just be easier to code something that will expand
the extracted substring to the nearest delimiters (i.e. '.' or ';'
characters) within the larger raw text (not really an answer so much as a
workaround :P).
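
A minimal sketch of check 1, assuming the XMI root holds a Sofa element with a sofaString attribute and Sentence elements with begin/end attributes, as described above. The sample document and the helper names are made up for illustration and are not taken from any real cTAKES output. The point of going through an XML parser is that escapes such as &amp; are decoded back to single characters before slicing, so the decoded sofaString and the begin/end offsets should line up.

```python
import xml.etree.ElementTree as ET

# Made-up miniature XMI; real cTAKES output has many more elements and attributes.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:textspan="http:///org/apache/ctakes/typesystem/type/textspan.ecore"
    xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
      sofaString="R&amp;D note. No significant jvd."/>
  <textspan:Sentence xmi:id="10" sofa="1" begin="0" end="9"/>
  <textspan:Sentence xmi:id="11" sofa="1" begin="10" end="29"/>
</xmi:XMI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def sentences_from_xmi(xmi_text):
    """Return the text covered by each Sentence annotation (hypothetical helper)."""
    root = ET.fromstring(xmi_text)
    # Matching on the local name avoids hard-coding namespace URIs.
    sofa = next(el for el in root if local_name(el.tag) == "Sofa")
    text = sofa.attrib["sofaString"]  # entities such as &amp; are already decoded here
    return [
        text[int(el.attrib["begin"]):int(el.attrib["end"])]
        for el in root
        if local_name(el.tag) == "Sentence"
    ]


if __name__ == "__main__":
    for sentence in sentences_from_xmi(SAMPLE_XMI):
        print(repr(sentence))
```

One caveat: UIMA offsets count Java (UTF-16) string positions, while Python indexes by code point, so the two can still diverge when the text contains characters outside the Basic Multilingual Plane.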

By the way, it's not just period characters that I have seen cause
confusion when segmenting sentences. I have also seen weird sentence
segmentation with close-parens ')' and semi-colons ';'; the example I gave
earlier was just a particularly clear illustration. E.g. the sentence

* "Example A: Buggsy R. presents as a 53 year old divorced Latina who has
been working for the (name of employer) as a (job title) for the last 22
years."

gets segmented as

* "Example A : Buggsy R. presents as a 53 year old divorced Latina who has
been working for the (name of employer)"

* "as a (job title)"

* "for the last 22 years."
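
The workaround from item 3 could be sketched roughly as below: take a begin/end span and widen it until it touches a delimiter on each side. This is only an illustration; the function name, the default delimiter set and the sample note are assumptions, not anything cTAKES provides.

```python
def expand_to_delimiters(text, begin, end, delimiters=".;\n"):
    """Widen the span [begin, end) to the nearest delimiter on each side."""
    # Walk left until the character just before the span is a delimiter.
    left = begin
    while left > 0 and text[left - 1] not in delimiters:
        left -= 1

    # Walk right to the next delimiter, unless the span already ends on one.
    right = end
    already_closed = end > begin and text[end - 1] in delimiters
    if not already_closed:
        while right < len(text) and text[right] not in delimiters:
            right += 1
        right = min(right + 1, len(text))  # include the delimiter itself

    return text[left:right].strip()


if __name__ == "__main__":
    note = "no chest pain; no significant jvd. lungs clear."
    start = note.index("significant")
    print(expand_to_delimiters(note, start, start + len("significant")))
```

Abbreviations still get in the way: in the "Buggsy R." sentence above, expanding the "as a (job title)" fragment stops at the period after "R", so the "Example A: Buggsy R." prefix is not recovered.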


RE: Sentence extraction from cTAKES XML output.

Posted by "Yadav, Harish" <hy...@live.unc.edu>.
Hi,
I can see that UmlsConcept.preferredText provides the term “Orthopnea”, which is simply the medical concept for the CUI C0085619.

But I am not just looking for the concept that corresponds to the begin and end tags of textspan.Sentence; I am trying to get the whole sentence that contains the concept. In the earlier example, the concept and the whole sentence happen to be the same, i.e. Orthopnea. Below is a different case that better illustrates my problem of extracting the relevant sentence:

Step 1
[cid:image006.png@01D3C712.96C1FA30]

Step 2
[cid:image007.png@01D3C713.0A7E45F0]

Step 3
[cid:image005.png@01D3C712.0DF22C70]

textspan.Sentence gives begin = 5117 and end = 5136, and raw_text[5117:5136] gives: es; no significant

Instead, the output should have been: no significant jvd (capturing the whole phrase/sentence in which the concept jvd, i.e. jugular venous engorgement, appears)

The snapshot for raw_text (marked in red box):
[cid:image008.png@01D3C714.9A3C1AE0]

Also, if you suspect that the period characters in the raw text might be causing this issue, do you think that slicing raw_text[5117:5136] after removing the period characters would give “no significant jvd”?
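
Removing characters from the raw text moves every later offset, so a more direct check is to measure how far the expected phrase actually sits from the reported begin offset. The helper below is a hypothetical sketch of that check (the name, the window size and the sample string are assumptions for illustration); a drift that grows with position would point to extra or missing characters earlier in the text rather than to sentence segmentation.

```python
import re


def offset_drift(text, begin, expected, window=50):
    """Return (actual start - begin) for `expected` near `begin`, or None if absent."""
    lo = max(0, begin - window)
    hi = min(len(text), begin + window + len(expected))
    # Case-insensitive literal search restricted to the window around `begin`.
    pattern = re.compile(re.escape(expected), re.IGNORECASE)
    match = pattern.search(text, lo, hi)
    return None if match is None else match.start() - begin


if __name__ == "__main__":
    raw = "no murmurs or gallops; no significant jvd."
    reported_begin = raw.index("no significant jvd") - 4  # pretend cTAKES reported 4 too early
    print(offset_drift(raw, reported_begin, "no significant jvd"))
```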

I am viewing the XMI in Notepad++, i.e. I open the output.xml file generated by cTAKES directly in Notepad++ and read the tags shown in the snapshots.

Regards,
Harish.





Re: Sentence extraction from cTAKES XML output.

Posted by Reed Villanueva <vi...@gmail.com>.
Just looking at what you wrote as the desired output, it looks like you just
want the associated ontology concept text (i.e. in this case input=<the XMI
document> output="Orthopnea"). Is this correct? Note that for the
annotation mention that you showed (i.e. the SignSymptomMention), the
ref_ontologyConceptArr maps to the UmlsConcept _id value. The documentation
seems sparse on exactly how all of the relationships in the ctakes XMI
output work, but this relation between annotation mentions and UMLS concept
tags seems to hold across all the other XMIs that I have seen. You could
use this relation to get the UmlsConcept.preferredText output that (I
think) you are looking for.
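
A rough sketch of that mapping, assuming the raw XMI spells the attributes ontologyConceptArr (a space-separated list of ids) and xmi:id, as in the SignSymptomMention tag quoted further down. The sample document, the cui/preferredText values placed on the UmlsConcept element, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xmi:id attribute under its full namespace.
XMI_ID = "{http://www.omg.org/XMI}id"

# Made-up miniature XMI; the namespace URIs are assumptions about cTAKES output.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:refsem="http:///org/apache/ctakes/typesystem/type/refsem.ecore"
    xmlns:textsem="http:///org/apache/ctakes/typesystem/type/textsem.ecore"
    xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaString="Patient reports orthopnea."/>
  <refsem:UmlsConcept xmi:id="340" cui="C0085619" preferredText="Orthopnea"/>
  <textsem:SignSymptomMention xmi:id="353" sofa="1" begin="16" end="25"
      ontologyConceptArr="340"/>
</xmi:XMI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def concepts_for_mentions(xmi_text):
    """Return (begin, end, cui, preferredText) for every mention/concept pair."""
    root = ET.fromstring(xmi_text)
    # Index the UmlsConcept elements by their xmi:id.
    concepts = {
        el.attrib[XMI_ID]: el for el in root if local_name(el.tag) == "UmlsConcept"
    }
    results = []
    for el in root:
        refs = el.attrib.get("ontologyConceptArr")
        if refs is None:
            continue  # not a mention that points at ontology concepts
        for ref in refs.split():
            concept = concepts.get(ref)
            if concept is not None:
                results.append((
                    int(el.attrib["begin"]),
                    int(el.attrib["end"]),
                    concept.attrib.get("cui"),
                    concept.attrib.get("preferredText"),
                ))
    return results


if __name__ == "__main__":
    print(concepts_for_mentions(SAMPLE_XMI))
```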

I don't know anything about how ctakes parses for sentence segments, but I
notice that the raw text you provide has a lot of period characters for
abbreviations. Ctakes seems to have problems segmenting these kinds of
sentences, e.g. here are the sentence segments I get when inputting an
abbreviation-heavy string into the ctakesCVD and using the
AggregatePlainTextFastUMLSPipeline.xmi:

"
[pt.]
[desc.]
[not having any reason to con't.]
[living;]
[clinical depression.]
"


This could be the reason for some weirdness in trying to extract sentence
information from the XMI fields.

By the way, what are you using to view the XMI? The tags in your images look
different from what I see when running ctakes, e.g. mine look like

<textsem:SignSymptomMention xmi:id="353" sofa="1" begin="46" end="55"
id="0" ontologyConceptArr="340" typeID="3" discoveryTechnique="1"
confidence="0.0" polarity="1" uncertainty="0" conditional="false"
generic="false" subject="patient" historyOf="0"/>


Hope this helps.


RE: Sentence extraction from cTAKES XML output.

Posted by "Yadav, Harish" <hy...@live.unc.edu>.
Hi Reed,

Thanks for responding. Below are the example and the output I am trying to get:

cTAKES processes the raw_text (the clinical document) and gives its output as XML. The snapshots below show what I am trying to extract from that XML:

Step 1
Finding the CUI and the id in the XML (in the snapshot below, the cui is C0085619 and the id is 39838, marked with a red rectangle).
[cid:image001.png@01D3C634.B0363D20]

Step 2
Finding the begin and end tags for the corresponding CUI (in the snapshot below, begin = 5740 and end = 5749, marked with a red rectangle)
[cid:image002.png@01D3C635.30FFC070]

Step 3
Finding the begin and end tags for the sentence of the corresponding CUI (in the snapshot below, begin = 5740 and end = 5750, marked with a red rectangle)
[cid:image003.png@01D3C636.EC021750]


Now, when I try to get the complete sentence in which the CUI was tagged from the raw_text (the clinical document that was fed to cTAKES as input), using the sentence begin and end tags found in step 3 and simply evaluating raw_text[5740:5750], I get the following output:

OUTPUT :- o pnd ort

Instead, I was expecting the complete sentence from the raw_text: Orthopnea (since the CUI corresponds to Orthopnea, the tagged sentence should contain the tagged concept as well, i.e. Orthopnea)

Below is a snippet of the raw_text, in which the red rectangle marks the sentence that yields “o pnd. ort” instead of “orthopnea”:

[cid:image004.png@01D3C638.5DA307B0]
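
One possible cause of a shift like this, offered as an assumption and not something confirmed in this thread, is newline translation: Python's open() converts "\r\n" to "\n" by default, so if the annotation offsets were computed on text that still had its carriage returns, every line break before the span moves the slice by one character. The sketch below shows the effect on a made-up two-line file; the function names are hypothetical.

```python
import os
import tempfile


def read_preserving_newlines(path, encoding="utf-8"):
    """Read text with newline translation switched off (keeps any '\\r\\n')."""
    with open(path, encoding=encoding, newline="") as handle:
        return handle.read()


def read_with_default_translation(path, encoding="utf-8"):
    """Read text the default way, which turns '\\r\\n' into '\\n'."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def demo():
    """Write a tiny CRLF file and return both readings of it."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"line one\r\nOrthopnea\r\n")
        return read_preserving_newlines(path), read_with_default_translation(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    preserved, translated = demo()
    # Offsets 10..19 cover "Orthopnea" only when the carriage return is kept.
    print(repr(preserved[10:19]), repr(translated[10:19]))
```

Whether this applies depends on how the pipeline read the input file, so it is worth comparing len(raw_text) with the length of the decoded sofaString before relying on it.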

Please let me know if you have any questions about the example or the output I am trying to get.

Regards,
Harish.




Re: Sentence extraction from cTAKES XML output.

Posted by Reed Villanueva <vi...@gmail.com>.
Could you provide an example of the problem you are seeing and a bit more
about the kind of output you are trying to end up with?


