Posted to dev@ctakes.apache.org by Ab...@cognizant.com on 2020/06/11 14:17:23 UTC

Sentence detector changes

Hi Team,

We are trying to make full use of cTAKES to meet the requirements of our profile, one of which is to extract the sentences from a medical document. We have seen that cTAKES already provides the list of sentences in the clinical text within the object shown below:

[cid:image002.png@01D64027.944FD390]

We also noticed that sentences are delimited using the predefined delimiters below. This is a problem for our requirement, because a sentence is segregated whenever one of these tokens is encountered.

[cid:image005.jpg@01D64029.1E6AC980]

For example, "Patient was taking Paracetamol (650 mg) thrice daily" was split into two sentences (because a ')' was encountered):

1. Patient was taking Paracetamol (650 mg)

2. thrice daily

So we tried to customize it by removing some of the defined delimiters to meet our requirement. We tried with just '.' as the delimiter and found that sentences are split whenever a '.' is encountered. Since this is a change to the core module, we would like to know whether it will affect clinical token identification or the information cTAKES already provides, such as tlink, timex or any other critical attribute. Kindly advise.

Thanks & Regards
Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028

This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.

Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Abad,

Check the README file in the ctakes-core-res module.  It is in resources/org/apache/ctakes/core/sentdetect/

The README has some basic information about how the sentence detector works, setting up training data and training a new model.

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Wednesday, June 17, 2020 8:34 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

* External Email - Caution *


Hi Team,

A gentle reminder on the below request. Any advice would be of great help.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Ayyub, Abad (Cognizant)
Sent: Tuesday, June 16, 2020 9:27 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Hi Team,

Thank you for that valuable input. While going through the SentenceDetector AE we also saw that it uses a model that can be trained for sentence identification, so we wanted to ask whether retraining the .model file used by the sentence detector could improve the results. We also saw a main method in that AE that generates a model from training data (passed in as a file). Please advise us on the points below:

1. How can training data be created? Is there a sample or documentation we can refer to?
2. Once a model is created, do we need to replace or update the existing model with the new one?
3. Should we use the same TrainingParameters when generating the model, where the algorithm is MAXENT and the number of iterations and the cutoff are 100 and 5 respectively? Please explain what each signifies when creating a model.

Kindly excuse the basic questions about training; our knowledge of NLP is very minimal.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Miller, Timothy <Ti...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 11:03 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

[External]


Hi Abad,
I've been following the thread but don't have much to add on top of what Sean's saying. The BIO version has one major benefit, in that it allows sentences to wrap newlines. But it does seem to break on Mr. and Dr. unfortunately. The solution is to create more training data but it's hard to get people excited about that. The next best solution is along the lines of what Sean suggested, to use post-processing to fix mistakes.
Tim

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 1:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Hi Abad,

I can't say anything about Timothy Miller's availability.  He is on the ctakes dev mailing list so he may respond if he feels it is necessary.  He is quite busy with a lot of groundbreaking work, but I wanted to make sure that he got credit for the ..BIO annotator.

The piper file would be just as it was before for the Sentence..BIO with the classifier specified.
That would be followed by the lines

add EolSentenceFixer
add MrsDrSentenceJoiner
add AbadsNewDigitJoiner

where AbadsNewDigitJoiner is a custom AE using the logic of MrsDr.. that checks for digits before and after the dot (eg "5.5") instead of checking for a person title before the dot (eg "Mrs.")

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:50 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

Thank you for that quick response, Sean :). So you mean we can add a new custom AE that uses logic similar to MrsDr... and reference it in the piper file? In that case, do we need to specify the classifier jar path again, as "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar"?

Also, is Timothy Miller available to help us with the issues in 'SentenceDetectorAnnotatorBIO', where sentences are split on decimals or on dates separated with '.'? I hope you are all safe and doing well during this lockdown. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but as an AI trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, make a copy of MrsDr.. and change the condition at about line 62 from
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         // i < sentenceCount - 1 guards the sentences.get( i+1 ) lookups below
         if ( i < sentenceCount - 1
              && text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.
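
A minimal, dependency-free sketch of how the digit check above might be applied across a list of sentences. It is not the cTAKES implementation: sentences are modelled as {begin, end} character offsets into the document text instead of Sentence annotations, and the class name, the method names and the rule that only touching spans are joined are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spans are {begin, end} offsets into the document text,
// standing in for cTAKES Sentence annotations.
final class DigitJoinerSketch {

    // True when the sentence ends with "<digit>." and the next one starts with a digit.
    static boolean splitsNumber(String current, String next) {
        return current.length() > 1
                && current.charAt(current.length() - 1) == '.'
                && Character.isDigit(current.charAt(current.length() - 2))
                && !next.isEmpty()
                && Character.isDigit(next.charAt(0));
    }

    // Merges adjacent spans that were split inside a number such as "5.5" or "01.01.2020".
    static List<int[]> join(String doc, List<int[]> spans) {
        List<int[]> joined = new ArrayList<>();
        for (int[] span : spans) {
            if (!joined.isEmpty()) {
                int[] previous = joined.get(joined.size() - 1);
                String previousText = doc.substring(previous[0], previous[1]);
                String currentText = doc.substring(span[0], span[1]);
                // Assumption: only join spans that touch; a gap means whitespace
                // or a newline lies between them.
                if (previous[1] == span[0] && splitsNumber(previousText, currentText)) {
                    previous[1] = span[1]; // extend the previous span over this one
                    continue;
                }
            }
            joined.add(new int[] { span[0], span[1] }); // copy so the input is not modified
        }
        return joined;
    }
}
```

Because the previous span is extended in place before the next comparison, a date split in two places ("01." + "01." + "2020.") is merged back in a single pass. In a real AE the same loop would run over the Sentence annotations in the CAS and would have to remove and re-add annotations rather than edit offsets.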

Sean




________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

Hi Sean,

Thank you for your advice. We tried 'SentenceDetectorAnnotatorBIO' with the piper file changes you mentioned and found that it splits sentences on '.' only. We were actually able to get similar output from 'SentenceDetectorAnnotator' itself by using '.' as the only eosCandidate in the EOSScannerImpl class.

So, can 'SentenceDetectorAnnotatorBIO' extract sentences in some other way? One problem we face is that it splits a sentence whenever it sees a decimal point, as in 5.5, or a date separated by '.', as in 01.01.2020.

Can the AEs EolSentenceFixer and MrsDrSentenceJoiner resolve these issues, where sentences are split on decimals or '.'-separated dates? If so, what changes do we need to make in the piper file to incorporate them?

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

Thank you, Sean, for the response. Sorry that the images are not visible to you, and that we forgot to mention the version we are using, which is 4.0. Reiterating below:

The first image showed how the Sentence object looks in the CAS viewer. The second image was the list of end-of-sentence candidates in the class 'EOSScannerImpl', as below:
     private static final char[] eosCandidates = { '.', '!', ')', ']', '>', '\"', ':', ';' };
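
To illustrate what such a candidate list does, here is a small sketch that scans text for candidate end-of-sentence positions. It is not the cTAKES scanner class; the class and method names are hypothetical. As far as I understand, the candidates only mark where a boundary may be proposed, and the trained model then decides whether each one is a real boundary, so dropping a character from the list prevents any split at that character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the cTAKES scanner class itself.
final class EosCandidateSketch {

    // Returns the offsets in text at which any of the candidate characters occurs.
    static List<Integer> candidatePositions(String text, char[] candidates) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (char candidate : candidates) {
                if (text.charAt(i) == candidate) {
                    positions.add(i);
                    break; // one hit per offset is enough
                }
            }
        }
        return positions;
    }
}
```

For "Patient was taking Paracetamol (650 mg) thrice daily", a candidate list containing ')' yields one candidate position, at the closing parenthesis; a list containing only '.' yields none, so that sentence cannot be split there.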



So any modification to sentence extraction affects every downstream module, right? We will definitely look into the AEs you mentioned. You mean that we should try adding the AEs EolSentenceFixer and MrsDrSentenceJoiner, which would refine the sentence extraction, right?
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other texts like ')' have caused improper splits.
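
As a rough illustration of such a check for ')', the predicate below flags a split that probably fell in the middle of a sentence. The class name, the method name and the lowercase-start heuristic are all assumptions made for this sketch; they are not taken from MrsDrSentenceJoiner.

```java
// Hypothetical helper; the lowercase heuristic is an assumption for illustration.
final class ParenJoinSketch {

    // True when a sentence ends with ')' and the following one begins with a
    // lowercase letter, which suggests the split fell in the middle of a sentence.
    static boolean looksLikeBadSplit(String current, String next) {
        String head = current.trim();
        String tail = next.trim();
        return !head.isEmpty()
                && head.charAt(head.length() - 1) == ')'
                && !tail.isEmpty()
                && Character.isLowerCase(tail.charAt(0));
    }
}
```

A joiner AE could use a test like this in place of the person-title check and merge the two Sentences when it returns true. The heuristic will miss cases where the continuation starts with a digit or a capitalized drug name, so it would need tuning against real notes.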


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean





RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Posted by Ab...@cognizant.com.

Hi Team,

A gentle reminder on the below request. Any advice would be of great help.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Ayyub, Abad (Cognizant)
Sent: Tuesday, June 16, 2020 9:27 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Hi Team,

Thank you for that valuable input. We also saw a model that can be trained for sentence identification while going through SentenceDetector AE. So we wanted to check that aspect also where, are there chances of improving the result by training the .model file used in sentence detector. We also saw a main method available in that AE which generates a model from the training data(where training data is passed as a file). Pls. advise us on the below points

 1.Could you pls. advise us on how a training data can be created . Is there any sample or documentation we can refer to create the training data.
2.Once a model is created do we need to replace or update the existing model with the new one?
3. Shall we use the same TrainingParameters for generating the model where Algorithm is MAXENT and Number of iterations and cutoff  are 100,5 respectively. Pls. enlighten us on what each signify when creating model.

Kindly excuse for  some of the basic questions regarding data training since our knowledge in NLP is  very minimal .

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Miller, Timothy <Ti...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 11:03 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

[External]


Hi Abad,
I've been following the thread but don't have much to add on top of what Sean's saying. The BIO version has one major benefit, in that it allows sentences to wrap newlines. But it does seem to break on Mr. and Dr. unfortunately. The solution is to create more training data but it's hard to get people excited about that. The next best solution is along the lines of what Sean suggested, to use post-processing to fix mistakes.
Tim

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 1:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

* External Email - Caution *


Hi Abad,

I can't say anything about Timothy Miller's availability.  He is on the ctakes dev mailing list so he may respond if he feels it is necessary.  He is quite busy with a lot of groundbreaking work, but I wanted to make sure that he got credit for the ..BIO annotator.

The piper file would be just as it was before for the Sentence..BIO with the classifier specified.
That would be followed by the lines

add EolSentenceFixer
add MrsDrSentenceJoiner
add AbadsNewDigitJoiner

where AbadsNewDigitJoiner is a custom AE using the logic of MrsDr.. that checks for digits before and after the dot (eg "5.5") instead of checking for a person title before the dot (eg "Mrs.")

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:50 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you for that quick response Sean :). So you mean to say we can add a new custom AE using the similar logic in MrsDr... and refer it in the piper file, in that case do we need to again mention the classifier jar path as   "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar".

Also is Timothy Miller available to help us on the issues with ' SentenceDetectorAnnotatorBIO ' where sentences are splitted on decimals or dates separated with '.'. I hope you guys are safe and doing well during this lock down. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but as an AI trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, a copy of MrsDr.. with lines ~62
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.

Sean




________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Sean,

Thank you for your advise and we tried using the 'SentenceDetectorAnnotatorBIO' along with the changes required in piper files as you mentioned and we could find that its splitting the sentences based on '.'  only ,  Actually we were able to get similar o/p by using the  'SentenceDetectorAnnotator' itself by just using '.' as the only eosCandidate in the EOSScannerImpl class.

So will 'SentenceDetectorAnnotatorBIO'  be able to extract sentences using some other way. Like some problems we face are the ''SentenceDetectorAnnotatorBIO' ' is splitting the sentence whenever it sees a decimal point like 5.5 or a date where separated using '.' like 01.01.2020.

Can the AE's EolSentenceFixer & MrsDrSentenceJoiner  be able to resolve our above issues where sentences are splitted on encountering decimals or '.' separated dates. If it can what are the changes that we need to do in the piper file to incorporate the same.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you Sean for the response. Sorry that the image are not visible for you and forgot to mention the version we are using which is version 4.0. Reiterating it as below

First image was how the Sentence Object looks like using CAS viewer Second Image was the list of EndOfSentence Candidate like in the class ‘EOSScannerImpl’as below
     private static final char [] eosCandidates={ ‘.’, ‘!’,’)’,’]’, ‘>’, ‘/’’’,’:’, ‘;’};



So any modification to SentenceExtractor have impacts on every other downstream modules right? We will definitely have a look into the AE's you mentioned and  you mean to say , that to try adding the AE's  EolSentenceFixer, MrsDrSentenceJoiner which would refine the sentence extraction right?.
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other texts like ')' have causes improper splits.


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean




RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Posted by Ab...@cognizant.com.

Hi Team,

A gentle reminder on the request below. Any advice would be of great help.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Ayyub, Abad (Cognizant)
Sent: Tuesday, June 16, 2020 9:27 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Hi Team,

Thank you for that valuable input. While going through the SentenceDetector AE, we also saw a model that can be trained for sentence identification, so we wanted to check whether the results could be improved by retraining the .model file used by the sentence detector. We also saw a main method in that AE that generates a model from training data (where the training data is passed as a file). Please advise us on the points below.

1. Could you please advise us on how the training data can be created? Is there any sample or documentation we can refer to when creating it?
2. Once a model is created, do we need to replace or update the existing model with the new one?
3. Should we use the same TrainingParameters for generating the model, where the algorithm is MAXENT and the number of iterations and the cutoff are 100 and 5 respectively? Please explain what each of these signifies when creating a model.

Kindly excuse some of the basic questions about training, since our knowledge of NLP is very minimal.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Miller, Timothy <Ti...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 11:03 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]


Hi Abad,
I've been following the thread but don't have much to add on top of what Sean's saying. The BIO version has one major benefit, in that it allows sentences to wrap newlines. But it does seem to break on Mr. and Dr. unfortunately. The solution is to create more training data but it's hard to get people excited about that. The next best solution is along the lines of what Sean suggested, to use post-processing to fix mistakes.
Tim

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 1:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]


Hi Abad,

I can't say anything about Timothy Miller's availability.  He is on the ctakes dev mailing list so he may respond if he feels it is necessary.  He is quite busy with a lot of groundbreaking work, but I wanted to make sure that he got credit for the ..BIO annotator.

The piper file would be just as it was before for the Sentence..BIO with the classifier specified.
That would be followed by the lines

add EolSentenceFixer
add MrsDrSentenceJoiner
add AbadsNewDigitJoiner

where AbadsNewDigitJoiner is a custom AE using the logic of MrsDr.. that checks for digits before and after the dot (eg "5.5") instead of checking for a person title before the dot (eg "Mrs.")

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:50 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]


Thank you for the quick response, Sean :). So you mean we can add a new custom AE using logic similar to MrsDr... and refer to it in the piper file? In that case, do we need to specify the classifier jar path again, as "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar"?

Also, is Timothy Miller available to help us with the issues in 'SentenceDetectorAnnotatorBIO', where sentences are split on decimals or on dates separated with '.'? I hope you guys are safe and doing well during this lockdown. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]


Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but with an AI-trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, make a copy of MrsDr.. and change the condition at around line 62 from
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && i < sentenceCount - 1
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.
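
As a rough illustration of that check, here is a minimal standalone sketch that works on plain strings rather than on cTAKES Sentence annotations. The class name DigitJoinSketch and the methods isSplitNumber and joinSplitNumbers are made up for this example and are not part of cTAKES; a real AE would apply the same test to the covered text of neighbouring Sentences and re-annotate the spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: plain strings stand in for cTAKES Sentence annotations.
class DigitJoinSketch {

   // True when 'text' ends with "<digit>." and 'next' starts with a digit,
   // e.g. "Dose was 5." followed by "5 mg daily."
   static boolean isSplitNumber( String text, String next ) {
      return text.length() > 1
             && text.charAt( text.length() - 1 ) == '.'
             && Character.isDigit( text.charAt( text.length() - 2 ) )
             && !next.isEmpty()
             && Character.isDigit( next.charAt( 0 ) );
   }

   // Re-joins neighbouring fragments that were split inside a decimal or a dotted date.
   static List<String> joinSplitNumbers( List<String> sentences ) {
      List<String> joined = new ArrayList<>();
      for ( String sentence : sentences ) {
         int last = joined.size() - 1;
         if ( last >= 0 && isSplitNumber( joined.get( last ), sentence ) ) {
            // No space is inserted: the split is assumed to fall directly after the dot.
            joined.set( last, joined.get( last ) + sentence );
         } else {
            joined.add( sentence );
         }
      }
      return joined;
   }
}
```

Because each fragment is compared against the already-joined text, a date split into three pieces ("01." / "01." / "2020.") is stitched back together in one pass. The check is deliberately simple, so a sentence that really ends in a number and is followed by one starting with a digit would also be joined; the newline test from the MrsDr.. condition could be kept to reduce that risk.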

Sean




________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]


Hi Sean,

Thank you for your advice. We tried using 'SentenceDetectorAnnotatorBIO' with the piper file changes you mentioned, and found that it splits sentences on '.' only. Actually, we were able to get similar output with 'SentenceDetectorAnnotator' itself, just by using '.' as the only eosCandidate in the EOSScannerImpl class.

So can 'SentenceDetectorAnnotatorBIO' extract sentences in some other way? For example, one problem we face is that 'SentenceDetectorAnnotatorBIO' splits the sentence whenever it sees a decimal point, as in 5.5, or a date separated with '.', as in 01.01.2020.

Can the AEs EolSentenceFixer and MrsDrSentenceJoiner resolve the above issues, where sentences are split on decimals or '.'-separated dates? If so, what changes do we need to make in the piper file to incorporate them?

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]


Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]


Thank you for the response, Sean. Sorry that the images are not visible to you; we also forgot to mention that the version we are using is 4.0. Reiterating them below:

The first image showed how the Sentence object looks in the CAS viewer. The second image was the list of end-of-sentence candidates in the class 'EOSScannerImpl', as below:
     private static final char[] eosCandidates = { '.', '!', ')', ']', '>', '\"', ':', ';' };

So any modification to SentenceExtractor would impact every downstream module, right? We will definitely have a look at the AEs you mentioned. You mean that we should try adding the AEs EolSentenceFixer and MrsDrSentenceJoiner, which would refine the sentence extraction, right?
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Posted by Ab...@cognizant.com.

Hi Team,

Thank you for that valuable input. While going through the SentenceDetector AE, we also saw a model that can be trained for sentence identification, so we wanted to check whether the results could be improved by retraining the .model file used by the sentence detector. We also saw a main method in that AE that generates a model from training data (where the training data is passed as a file). Please advise us on the points below.

1. Could you please advise us on how the training data can be created? Is there any sample or documentation we can refer to when creating it?
2. Once a model is created, do we need to replace or update the existing model with the new one?
3. Should we use the same TrainingParameters for generating the model, where the algorithm is MAXENT and the number of iterations and the cutoff are 100 and 5 respectively? Please explain what each of these signifies when creating a model.

Kindly excuse some of the basic questions about training, since our knowledge of NLP is very minimal.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Miller, Timothy <Ti...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 11:03 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

[External]


Hi Abad,
I've been following the thread but don't have much to add on top of what Sean's saying. The BIO version has one major benefit, in that it allows sentences to wrap newlines. But it does seem to break on Mr. and Dr. unfortunately. The solution is to create more training data but it's hard to get people excited about that. The next best solution is along the lines of what Sean suggested, to use post-processing to fix mistakes.
Tim

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 1:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

* External Email - Caution *


Hi Abad,

I can't say anything about Timothy Miller's availability.  He is on the ctakes dev mailing list so he may respond if he feels it is necessary.  He is quite busy with a lot of groundbreaking work, but I wanted to make sure that he got credit for the ..BIO annotator.

The piper file would be just as it was before for the Sentence..BIO with the classifier specified.
That would be followed by the lines

add EolSentenceFixer
add MrsDrSentenceJoiner
add AbadsNewDigitJoiner

where AbadsNewDigitJoiner is a custom AE using the logic of MrsDr.. that checks for digits before and after the dot (eg "5.5") instead of checking for a person title before the dot (eg "Mrs.")

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:50 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you for that quick response Sean :). So you mean to say we can add a new custom AE using the similar logic in MrsDr... and refer it in the piper file, in that case do we need to again mention the classifier jar path as   "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar".

Also is Timothy Miller available to help us on the issues with ' SentenceDetectorAnnotatorBIO ' where sentences are splitted on decimals or dates separated with '.'. I hope you guys are safe and doing well during this lock down. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but as an AI trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, a copy of MrsDr.. with lines ~62
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.

Sean




________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Sean,

Thank you for your advise and we tried using the 'SentenceDetectorAnnotatorBIO' along with the changes required in piper files as you mentioned and we could find that its splitting the sentences based on '.'  only ,  Actually we were able to get similar o/p by using the  'SentenceDetectorAnnotator' itself by just using '.' as the only eosCandidate in the EOSScannerImpl class.

So will 'SentenceDetectorAnnotatorBIO'  be able to extract sentences using some other way. Like some problems we face are the ''SentenceDetectorAnnotatorBIO' ' is splitting the sentence whenever it sees a decimal point like 5.5 or a date where separated using '.' like 01.01.2020.

Can the AE's EolSentenceFixer & MrsDrSentenceJoiner  be able to resolve our above issues where sentences are splitted on encountering decimals or '.' separated dates. If it can what are the changes that we need to do in the piper file to incorporate the same.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you, Sean, for the response. Sorry that the images are not visible to you. We also forgot to mention that we are using version 4.0. To reiterate:

The first image showed what the Sentence object looks like in the CAS viewer. The second image was the list of end-of-sentence candidates in the class 'EOSScannerImpl', as below:
     private static final char [] eosCandidates={ ‘.’, ‘!’,’)’,’]’, ‘>’, ‘/’’’,’:’, ‘;’};
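
To make the effect of the candidate list concrete, here is a small sketch that treats every candidate character as a sentence boundary. This is a simplification for illustration only: the thread does not describe how the real detector decides among candidates, the names (NaiveEosSplitSketch, splitAtEveryCandidate) are hypothetical, and the one candidate that is garbled in the list above is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; this is not the cTAKES EOSScannerImpl.
final class NaiveEosSplitSketch {

   // Candidates copied from the list above, minus the entry that is garbled there.
   static final char[] CANDIDATES = { '.', '!', ')', ']', '>', ':', ';' };

   private NaiveEosSplitSketch() {
   }

   private static boolean isCandidate( final char c, final char[] candidates ) {
      for ( final char candidate : candidates ) {
         if ( c == candidate ) {
            return true;
         }
      }
      return false;
   }

   /**
    * Ends a piece at every candidate character and trims surrounding whitespace.
    */
   static List<String> splitAtEveryCandidate( final String text, final char[] candidates ) {
      final List<String> pieces = new ArrayList<>();
      int start = 0;
      for ( int i = 0; i < text.length(); i++ ) {
         if ( isCandidate( text.charAt( i ), candidates ) ) {
            final String piece = text.substring( start, i + 1 ).trim();
            if ( !piece.isEmpty() ) {
               pieces.add( piece );
            }
            start = i + 1;
         }
      }
      final String tail = text.substring( start ).trim();
      if ( !tail.isEmpty() ) {
         pieces.add( tail );
      }
      return pieces;
   }

   public static void main( final String[] args ) {
      final String dose = "Patient was taking Paracetamol (650 mg) thrice daily";
      // With ')' in the list the text breaks after "(650 mg)".
      System.out.println( splitAtEveryCandidate( dose, CANDIDATES ) );
      // With '.' alone the same text stays whole, but decimals now break.
      System.out.println( splitAtEveryCandidate( dose, new char[]{ '.' } ) );
      System.out.println( splitAtEveryCandidate( "Dose was 5.5 mg.", new char[]{ '.' } ) );
   }
}
```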



So any modification to the sentence detector affects every downstream module, right? We will definitely look into the AEs you mentioned. Do you mean that adding the AEs EolSentenceFixer and MrsDrSentenceJoiner would refine the sentence extraction?
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other texts like ')' have caused improper splits.


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean




RE: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Posted by Ab...@cognizant.com.

Hi Team,

Thank you for that valuable input. While going through the SentenceDetector AE, we also saw that it uses a model that can be trained for sentence identification. So we wanted to check whether the results could be improved by retraining the .model file used by the sentence detector. We also saw a main method in that AE that generates a model from training data (passed in as a file). Please advise us on the points below.

1. How can the training data be created? Is there a sample or documentation we can refer to?
2. Once a model is created, do we need to replace or update the existing model with the new one?
3. Should we use the same TrainingParameters for generating the model, where the algorithm is MAXENT and the number of iterations and the cutoff are 100 and 5 respectively? Please explain what each of these signifies when creating a model.

Kindly excuse the basic questions about training; our knowledge of NLP is minimal.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Miller, Timothy <Ti...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 11:03 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

[External]


Hi Abad,
I've been following the thread but don't have much to add on top of what Sean's saying. The BIO version has one major benefit, in that it allows sentences to wrap newlines. But it does seem to break on Mr. and Dr. unfortunately. The solution is to create more training data but it's hard to get people excited about that. The next best solution is along the lines of what Sean suggested, to use post-processing to fix mistakes.
Tim

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 1:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

* External Email - Caution *


Hi Abad,

I can't say anything about Timothy Miller's availability.  He is on the ctakes dev mailing list so he may respond if he feels it is necessary.  He is quite busy with a lot of groundbreaking work, but I wanted to make sure that he got credit for the ..BIO annotator.

The piper file would be just as it was before for the Sentence..BIO with the classifier specified.
That would be followed by the lines

add EolSentenceFixer
add MrsDrSentenceJoiner
add AbadsNewDigitJoiner

where AbadsNewDigitJoiner is a custom AE using the logic of MrsDr.. that checks for digits before and after the dot (eg "5.5") instead of checking for a person title before the dot (eg "Mrs.")
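
A joiner of that kind could look roughly like the sketch below, written against plain character offsets rather than cTAKES Sentence annotations so that it is self-contained. The names (SpanJoinSketch, joinSplitNumbers) are hypothetical, and the requirement that the two pieces touch with no gap between them is an added assumption that does not come from the MrsDr.. logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; a real AE would work on Sentence annotations in the CAS.
final class SpanJoinSketch {

   private SpanJoinSketch() {
   }

   private static boolean endsWithDigitDot( final String text ) {
      final int length = text.length();
      return length > 1
             && text.charAt( length - 1 ) == '.'
             && Character.isDigit( text.charAt( length - 2 ) );
   }

   private static boolean startsWithDigit( final String text ) {
      return !text.isEmpty() && Character.isDigit( text.charAt( 0 ) );
   }

   /**
    * @param docText full document text
    * @param spans   sentence spans as {begin, end} offsets, end exclusive, in document order
    * @return new spans in which pieces split inside a number such as "5.5" are merged
    */
   static List<int[]> joinSplitNumbers( final String docText, final List<int[]> spans ) {
      final List<int[]> joined = new ArrayList<>();
      for ( final int[] span : spans ) {
         if ( !joined.isEmpty() ) {
            final int[] previous = joined.get( joined.size() - 1 );
            final String previousText = docText.substring( previous[ 0 ], previous[ 1 ] );
            final String text = docText.substring( span[ 0 ], span[ 1 ] );
            // Assumption: only merge pieces that touch, so "took 5. 2 days" is left alone.
            if ( previous[ 1 ] == span[ 0 ]
                 && endsWithDigitDot( previousText )
                 && startsWithDigit( text ) ) {
               previous[ 1 ] = span[ 1 ];
               continue;
            }
         }
         // Copy the span so the caller's arrays are never modified.
         joined.add( new int[]{ span[ 0 ], span[ 1 ] } );
      }
      return joined;
   }
}
```

Because each merged span is compared with the next piece in turn, a date broken into several pieces ("01." + "01." + "2020.") collapses back into one span.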

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:50 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you for that quick response, Sean :). So you mean we can add a new custom AE using logic similar to MrsDr... and refer to it in the piper file. In that case, do we need to specify the classifier jar path again, as "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar"?

Also, is Timothy Miller available to help us with the issues in 'SentenceDetectorAnnotatorBIO', where sentences are split on decimals or on dates separated with '.'? I hope you are all safe and doing well during this lockdown. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but with an AI-trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, a copy of MrsDr.. with the condition at about line 62 changed from
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( i < sentenceCount - 1
              && text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.

Sean





Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

Hi Abad,
I've been following the thread but don't have much to add on top of what Sean's saying. The BIO version has one major benefit, in that it allows sentences to wrap newlines. But it does seem to break on Mr. and Dr. unfortunately. The solution is to create more training data but it's hard to get people excited about that. The next best solution is along the lines of what Sean suggested, to use post-processing to fix mistakes.
Tim

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 1:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

* External Email - Caution *

Hi Abad,

I can't say anything about Timothy Miller's availability.  He is on the ctakes dev mailing list so he may respond if he feels it is necessary.  He is quite busy with a lot of groundbreaking work, but I wanted to make sure that he got credit for the ..BIO annotator.

The piper file would be just as it was before for the Sentence..BIO with the classifier specified.
That would be followed by the lines

add EolSentenceFixer
add MrsDrSentenceJoiner
add AbadsNewDigitJoiner

where AbadsNewDigitJoiner is a custom AE using the logic of MrsDr.. that checks for digits before and after the dot (eg "5.5") instead of checking for a person title before the dot (eg "Mrs.")

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:50 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *

Thank you for that quick response Sean :). So you mean to say we can add a new custom AE using the similar logic in MrsDr... and refer it in the piper file, in that case do we need to again mention the classifier jar path as   "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar".

Also is Timothy Miller available to help us on the issues with ' SentenceDetectorAnnotatorBIO ' where sentences are splitted on decimals or dates separated with '.'. I hope you guys are safe and doing well during this lock down. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028

-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]

Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but as an AI trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, a copy of MrsDr.. with lines ~62
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( i < sentenceCount - 1
              && text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.
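
As a rough sketch of the same check outside of cTAKES, the snippet below works on plain strings rather than Sentence annotations. The class and method names (DigitJoinSketch, splitInsideNumber, joinNumericSplits) are made up for illustration and are not part of cTAKES. It also assumes the two pieces were directly adjacent in the original text; a real AE would probably want to compare the end and begin offsets as well, so that a sentence that genuinely ends in a number is not glued to the next one.

```java
import java.util.ArrayList;
import java.util.List;

public class DigitJoinSketch {

    // True when `left` ends with "<digit>." and `right` starts with a digit,
    // i.e. the split probably fell inside a decimal ("5.5") or a dotted date ("01.01.2020").
    public static boolean splitInsideNumber(String left, String right) {
        int n = left.length();
        return n > 1
                && left.charAt(n - 1) == '.'
                && Character.isDigit(left.charAt(n - 2))
                && !right.isEmpty()
                && Character.isDigit(right.charAt(0));
    }

    // Walks the sentence texts in order and appends a sentence to the previous one
    // whenever the boundary between them looks like it fell inside a number.
    // Assumption: the pieces were adjacent in the source text (no whitespace dropped).
    public static List<String> joinNumericSplits(List<String> sentences) {
        List<String> joined = new ArrayList<>();
        for (String sentence : sentences) {
            int last = joined.size() - 1;
            if (last >= 0 && splitInsideNumber(joined.get(last), sentence)) {
                joined.set(last, joined.get(last) + sentence);
            } else {
                joined.add(sentence);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> split = List.of(
                "Potassium was 5.", "5 on admission.",
                "Seen on 01.", "01.", "2020.");
        System.out.println(joinNumericSplits(split));
    }
}
```

With the sample input in main, the five fragments collapse into two sentences, "Potassium was 5.5 on admission." and "Seen on 01.01.2020.", so the same digit test covers both the decimal case and the dotted-date case.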

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]


Hi Sean,

Thank you for your advice. We tried 'SentenceDetectorAnnotatorBIO' along with the piper file changes you mentioned, and found that it splits sentences on '.' only. We were actually able to get similar output from 'SentenceDetectorAnnotator' itself by using '.' as the only eosCandidate in the EOSScannerImpl class.

So can 'SentenceDetectorAnnotatorBIO' extract sentences in some other way? One problem we face is that 'SentenceDetectorAnnotatorBIO' splits the sentence whenever it sees a decimal point, as in 5.5, or a date separated with '.', as in 01.01.2020.

Can the AEs EolSentenceFixer and MrsDrSentenceJoiner resolve these issues, where sentences are split on decimals or '.'-separated dates? If so, what changes do we need to make in the piper file to incorporate them?

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028

-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]


Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]


Thank you, Sean, for the response. Sorry that the images are not visible to you; I also forgot to mention that we are using version 4.0. Restating the details below.

The first image showed how the Sentence object looks in the CAS viewer. The second image was the list of end-of-sentence candidates in the class 'EOSScannerImpl', as below:
     private static final char[] eosCandidates = { '.', '!', ')', ']', '>', '\"', ':', ';' };

So any modification to SentenceExtractor affects every downstream module, right? We will definitely look into the AEs you mentioned. You mean that we should try adding the AEs EolSentenceFixer and MrsDrSentenceJoiner, which would refine the sentence extraction, right?
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028

-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]


Hi Abad,

None of your embedded images are visible to me, so I don't have whatever information is contained within those images.

It sounds like you are using the SentenceDetectorBIO.  Very cool.

It does have a few idiosyncrasies, one of which you have identified.

There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.

EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.

MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.

You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other text, like ')', has caused improper splits.

Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.
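
To make the short-line idea behind EolSentenceFixer a bit more concrete, here is a toy sketch on plain strings. It is not the cTAKES implementation: the class name, the method name and the 40-character cutoff are all assumptions chosen for illustration, and the real AE may use a different rule to decide what counts as an intentionally short line.

```java
import java.util.ArrayList;
import java.util.List;

public class ShortLineSplitSketch {

    // Hypothetical cutoff: a line this short (or shorter) is treated as an intentional break.
    static final int MAX_SHORT_LINE = 40;

    // Splits one "lumped" sentence at every line break that follows a short line.
    // A long line is assumed to have wrapped, so the sentence continues onto the next line.
    public static List<String> splitAtShortLines(String sentence) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String[] lines = sentence.split("\n");
        for (int i = 0; i < lines.length; i++) {
            current.append(lines[i]);
            boolean lastLine = i == lines.length - 1;
            if (lastLine || lines[i].trim().length() <= MAX_SHORT_LINE) {
                // Short line or end of text: close the current sentence here.
                String text = current.toString().trim();
                if (!text.isEmpty()) {
                    parts.add(text);
                }
                current.setLength(0);
            } else {
                // Long line: treat the newline as a wrap and keep going.
                current.append(' ');
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitAtShortLines("Allergies: none\nMedications: aspirin"));
    }
}
```

A list-like note such as "Allergies: none" followed by "Medications: aspirin" comes back as two sentences, while a long line that merely wrapped onto the next line stays in one sentence.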

Sean

________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 10:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Sentence detector changes [EXTERNAL]


Hi Team,

We are trying to utilize the maximum potential of cTAKES to meet the requirements for our profile, where we have a requirement to extract the sentences from the medical document. We have seen cTAKES already providing the list of sentences in the clinical text within the object as below

[cid:image002.png@01D64027.944FD390]

We also noticed that sentences are delimited by the predefined delimiters below, which is a problem for us: a sentence is split whenever one of these tokens is encountered.

[cid:image005.jpg@01D64029.1E6AC980]

For example, "Patient was taking Paracetamol (650 mg) thrice daily" was split into two sentences (because a ')' was encountered):

1.     Patient was taking Paracetamol (650 mg)

2.     thrice daily

So we tried customizing it by removing some of the defined delimiters. With just '.' as the delimiter, we found that sentences are split whenever a '.' is encountered. Since this is a change to a core module, we would like to know whether it will affect clinical token identification, or information that is already provided such as tlink, timex, or any other critical attribute. Kindly advise.

Thanks & Regards
[cid:D3145E69-CD94-48C1-877F-5134EEAFB598]

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


Re: Sentence detector changes [EXTERNAL] [SUSPICIOUS]

Posted by "Miller, Timothy" <Ti...@childrens.harvard.edu>.

Hi Abad,
I've been following the thread but don't have much to add on top of what Sean's saying. The BIO version has one major benefit, in that it allows sentences to wrap newlines. But it does seem to break on Mr. and Dr. unfortunately. The solution is to create more training data but it's hard to get people excited about that. The next best solution is along the lines of what Sean suggested, to use post-processing to fix mistakes.
Tim


Re: Sentence detector changes [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Abad,

I can't say anything about Timothy Miller's availability.  He is on the ctakes dev mailing list so he may respond if he feels it is necessary.  He is quite busy with a lot of groundbreaking work, but I wanted to make sure that he got credit for the ..BIO annotator.

The piper file would be just as it was before for the Sentence..BIO with the classifier specified.
That would be followed by the lines

add EolSentenceFixer
add MrsDrSentenceJoiner
add AbadsNewDigitJoiner

where AbadsNewDigitJoiner is a custom AE using the logic of MrsDr.. that checks for digits before and after the dot (eg "5.5") instead of checking for a person title before the dot (eg "Mrs.")

Sean
________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:50 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]



Thank you for that quick response, Sean :). So you mean we can add a new custom AE using logic similar to MrsDr.. and reference it in the piper file? In that case, do we need to specify the classifier jar path again, as "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar"?

Also, is Timothy Miller available to help us with the issues in 'SentenceDetectorAnnotatorBIO', where sentences are split on decimals or on dates separated with '.'? I hope you guys are safe and doing well during this lockdown. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]



Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but as an AI trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, a copy of MrsDr.. with the lines around line 62 changed from
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( i < sentenceCount - 1
              && text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.

Sean




________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]



Hi Sean,

Thank you for your advice. We tried 'SentenceDetectorAnnotatorBIO' along with the piper file changes you mentioned, and found that it splits sentences on '.' only. We were actually able to get similar output from 'SentenceDetectorAnnotator' itself by using '.' as the only eosCandidate in the EOSScannerImpl class.

So can 'SentenceDetectorAnnotatorBIO' extract sentences in some other way? One problem we face is that 'SentenceDetectorAnnotatorBIO' splits the sentence whenever it sees a decimal point, as in 5.5, or a date separated with '.', as in 01.01.2020.

Can the AEs EolSentenceFixer and MrsDrSentenceJoiner resolve these issues, where sentences are split on decimals or '.'-separated dates? If so, what changes do we need to make in the piper file to incorporate them?

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]



Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you, Sean, for the response. Sorry that the images are not visible to you. I also forgot to mention the version we are using, which is 4.0. To reiterate:

The first image showed how the Sentence object looks in the CAS viewer. The second image was the list of end-of-sentence candidates in the class 'EOSScannerImpl', as below:
     private static final char[] eosCandidates = { '.', '!', ')', ']', '>', '\"', ':', ';' };



So any modification to SentenceExtractor impacts every downstream module, right? We will definitely look into the AEs you mentioned. Do you mean that adding the AEs EolSentenceFixer and MrsDrSentenceJoiner would refine the sentence extraction?
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other text, like ')', has caused improper splits.


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean



________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 10:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Team,

We are trying to utilize the maximum potential of cTAKES to meet the requirements for our profile, where we have a requirement to extract the sentences from the medical document. We have seen cTAKES already providing the list of sentences in the clinical text within the object as below

[cid:image002.png@01D64027.944FD390]


We also noticed that sentences are delimited based on the predefined delimiters below, which is a problem for our requirement: sentences are segregated whenever one of the tokens below is encountered.

[cid:image005.jpg@01D64029.1E6AC980]

For example, “Patient was taking Paracetamol (650 mg) thrice daily” was split into two sentences (because a ‘)’ was encountered):


1.     Patient was taking Paracetamol (650 mg)

2.     thrice daily
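
The split described above can be reproduced with a deliberately naive sketch. This is not the cTAKES SentenceDetector, which uses a trained model to decide at each candidate character; the class NaiveEosSplitSketch and its split method are hypothetical names used only for illustration. It breaks the text after every candidate character, which is enough to show why ')' produces two pieces and why keeping only '.' would later break decimals.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the cTAKES implementation.
final class NaiveEosSplitSketch {

    /** Breaks text after every character found in eosCandidates and trims each piece. */
    static List<String> split(String text, char[] eosCandidates) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isCandidate(text.charAt(i), eosCandidates)) {
                addIfNotBlank(sentences, text.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Whatever follows the last candidate character is the final piece.
        addIfNotBlank(sentences, text.substring(start));
        return sentences;
    }

    private static boolean isCandidate(char c, char[] candidates) {
        for (char candidate : candidates) {
            if (c == candidate) {
                return true;
            }
        }
        return false;
    }

    private static void addIfNotBlank(List<String> out, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            out.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String text = "Patient was taking Paracetamol (650 mg) thrice daily";
        // With ')' as a candidate the text falls into two pieces.
        System.out.println(split(text, new char[] {'.', '!', ')', ']', '>', ':', ';'}));
        // With '.' as the only candidate it stays whole, but a decimal is cut in two.
        System.out.println(split(text, new char[] {'.'}));
        System.out.println(split("Glucose was 5.5 today", new char[] {'.'}));
    }
}
```

Reducing the candidate list to '.' removes the parenthesis problem in this sketch, but any '.' inside a number or date then becomes a split point, which is the trade-off discussed later in the thread.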


So we tried to customize it by removing some of the defined delimiters. We tried with just ‘.’ as the delimiter and found that sentences are split whenever a ‘.’ is encountered. Since this is a change to the core module, we would like to know whether it will affect the clinical token identification process or the information already provided, such as tlink, timex, or any other critical attribute. Kindly advise.

Thanks & Regards
[cid:D3145E69-CD94-48C1-877F-5134EEAFB598]

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028




RE: Sentence detector changes [EXTERNAL]

Posted by Ab...@cognizant.com.

Thank you for that quick response, Sean :). So you mean we can add a new custom AE using logic similar to MrsDr... and reference it in the piper file? In that case, do we need to specify the classifier jar path again, as "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar"?

Also, is Timothy Miller available to help us with the 'SentenceDetectorAnnotatorBIO' issues, where sentences are split on decimals or on dates separated with '.'? I hope you are all safe and doing well during this lockdown. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but as an AI trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, make a copy of MrsDr.. and change the condition at around line 62 from
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              // Keep the original bounds check so sentences.get( i+1 ) cannot run past the last Sentence.
              && i < sentenceCount - 1
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.
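
For readers who want to try the condition outside cTAKES, here is a minimal standalone sketch of the same idea on plain strings. It is not part of the thread or of cTAKES: the class DecimalSplitJoinSketch and its methods shouldJoin and join are hypothetical, and a real AE would work on Sentence annotations and their offsets rather than on a list of strings. It also assumes the split fell immediately after the '.', so the pieces are concatenated without a separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch; a real AE would operate on Sentence annotations.
final class DecimalSplitJoinSketch {

    /** True when text ends in a digit followed by '.' and next starts with a digit. */
    static boolean shouldJoin(String text, String next) {
        return text.length() > 1
                && text.charAt(text.length() - 1) == '.'
                && Character.isDigit(text.charAt(text.length() - 2))
                && !next.isEmpty()
                && Character.isDigit(next.charAt(0));
    }

    /** Merges each piece into the previous one wherever shouldJoin holds. */
    static List<String> join(List<String> sentences) {
        List<String> joined = new ArrayList<>();
        for (String sentence : sentences) {
            int last = joined.size() - 1;
            if (last >= 0 && shouldJoin(joined.get(last), sentence)) {
                // Assumption: the split fell right after the '.', so no separator is added.
                joined.set(last, joined.get(last) + sentence);
            } else {
                joined.add(sentence);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // A decimal split in two, then a '.'-separated date split in three.
        System.out.println(join(Arrays.asList("Glucose was 5.", "5 mmol/L.")));
        System.out.println(join(Arrays.asList("Seen on 01.", "01.", "2020 for follow-up.")));
    }
}
```

Because the check compares each piece against the already-merged previous piece, a date split into three parts is rejoined in one pass. The rule is purely textual, so a genuine sentence ending in a number that is followed by a sentence starting with a digit would also be merged; whether that is acceptable depends on the data.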

Sean





RE: Sentence detector changes [EXTERNAL]

Posted by Ab...@cognizant.com.

Thank you for that quick response Sean :). So you mean to say we can add a new custom AE using the similar logic in MrsDr... and refer it in the piper file, in that case do we need to again mention the classifier jar path as   "classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar".

Also is Timothy Miller available to help us on the issues with ' SentenceDetectorAnnotatorBIO ' where sentences are splitted on decimals or dates separated with '.'. I hope you guys are safe and doing well during this lock down. Stay safe :)

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Friday, June 12, 2020 9:06 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed Sentence..BIO splitting sentences on decimals, but as an AI trained model you never quite know what might happen.

You could easily make something like the MrsDr.. that handles decimal problems.

Basically, a copy of MrsDr.. with lines ~62
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.

Sean




________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Sean,

Thank you for your advise and we tried using the 'SentenceDetectorAnnotatorBIO' along with the changes required in piper files as you mentioned and we could find that its splitting the sentences based on '.'  only ,  Actually we were able to get similar o/p by using the  'SentenceDetectorAnnotator' itself by just using '.' as the only eosCandidate in the EOSScannerImpl class.

So will 'SentenceDetectorAnnotatorBIO'  be able to extract sentences using some other way. Like some problems we face are the ''SentenceDetectorAnnotatorBIO' ' is splitting the sentence whenever it sees a decimal point like 5.5 or a date where separated using '.' like 01.01.2020.

Can the AE's EolSentenceFixer & MrsDrSentenceJoiner  be able to resolve our above issues where sentences are splitted on encountering decimals or '.' separated dates. If it can what are the changes that we need to do in the piper file to incorporate the same.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you Sean for the response. Sorry that the image are not visible for you and forgot to mention the version we are using which is version 4.0. Reiterating it as below

First image was how the Sentence Object looks like using CAS viewer Second Image was the list of EndOfSentence Candidate like in the class ‘EOSScannerImpl’as below
     private static final char [] eosCandidates={ ‘.’, ‘!’,’)’,’]’, ‘>’, ‘/’’’,’:’, ‘;’};



So any modification to SentenceExtractor have impacts on every other downstream modules right? We will definitely have a look into the AE's you mentioned and  you mean to say , that to try adding the AE's  EolSentenceFixer, MrsDrSentenceJoiner which would refine the sentence extraction right?.
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other texts like ')' have causes improper splits.


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean



________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 10:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Team,

We are trying to utilize the maximum potential of cTAKES to meet the requirements for our profile, where we have a requirement to extract the sentences from the medical document. We have seen cTAKES already providing the list of sentences in the clinical text within the object as below

[cid:image002.png@01D64027.944FD390]


We also notice that sentences are delimited based on the below predefined delimiters, which was actually a problem in our requirement where sentences were seggregated whenever one of the below tokens are encountered.

[cid:image005.jpg@01D64029.1E6AC980]

For eg: “Patient was taking Paracetamol (650 mg) thrice daily” , was splitted to two different sentences(because a ‘)’ encountered)


1.     Patient was taking Paracetamol (650 mg)

2.     thrice daily


So we tried to customize it by removing some of the defined delimiters to meet our requirement. Actually we tried with just ‘.’ As delimiter and found sentences are splitted whenever a ‘.’ Is encountered Since this is a change done at the core module , we would like to know whether this is going to impact the clinical token identification process or going to have impact on the already provided informations like tlink,timex or any other critical attribute. Kindly advice.

Thanks & Regards
[cid:D3145E69-CD94-48C1-877F-5134EEAFB598]

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.

Re: Sentence detector changes [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Abad,

The expert on SentenceDetectorAnnotatorBIO is Timothy Miller, so he might be able to weigh in on some of this.

I haven't noticed SentenceDetectorAnnotatorBIO splitting sentences on decimals, but with a trained model you never quite know what might happen.

You could easily make something like MrsDrSentenceJoiner that handles the decimal problems.

Basically, make a copy of MrsDrSentenceJoiner and change the lines at ~62
         if ( (text.endsWith( " Mr." ) || text.endsWith( " Mrs." ) || text.endsWith( " Dr." )
               || text.endsWith( " a.m." ) || text.endsWith( " p.m." )
               || text.equals( "Mr." ) || text.equals( "Mrs." ) || text.equals( "Dr." ))
              && i < sentenceCount - 1
              && !newlines.contains( sentence.getEnd() ) ) {

to something like

         if ( i < sentenceCount - 1
              && text.length() > 1
              && text.charAt( text.length()-1 ) == '.'
              && Character.isDigit( text.charAt( text.length()-2 ) )
              && !sentences.get( i+1 ).getCoveredText().isEmpty()
              && Character.isDigit( sentences.get( i+1 ).getCoveredText().charAt( 0 ) ) ) {

That if (..) could be cleaned up a little, but that should do it.
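
As a rough illustration of the check above, here is a minimal standalone sketch that works on plain strings rather than cTAKES Sentence annotations. The class name DecimalSplitJoiner and its methods are hypothetical and are not part of cTAKES. The sketch also assumes the two fragments were directly adjacent in the text (no whitespace between them); a real annotator would rebuild the Sentence spans in the JCas instead of concatenating strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DecimalSplitJoiner {

    // True when 'left' ends with "<digit>." and 'right' starts with a digit,
    // which is the pattern produced by splitting "5.5" or "01.01.2020" at the '.'.
    public static boolean isDecimalSplit(String left, String right) {
        int n = left.length();
        return n > 1
                && left.charAt(n - 1) == '.'
                && Character.isDigit(left.charAt(n - 2))
                && !right.isEmpty()
                && Character.isDigit(right.charAt(0));
    }

    // Walks the fragments in order and appends a fragment to the previous
    // (possibly already joined) one whenever the pair looks like a decimal split.
    public static List<String> joinDecimalSplits(List<String> sentences) {
        List<String> joined = new ArrayList<>();
        for (String sentence : sentences) {
            int last = joined.size() - 1;
            if (last >= 0 && isDecimalSplit(joined.get(last), sentence)) {
                joined.set(last, joined.get(last) + sentence);
            } else {
                joined.add(sentence);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> split = Arrays.asList(
                "Seen on 01.", "01.", "2020 with HbA1c 5.", "5 today.", "No complaints.");
        System.out.println(joinDecimalSplits(split));
    }
}
```

Because each fragment is compared against the already joined result, chained splits such as 01. / 01. / 2020 collapse into one sentence. The heuristic would also join a sentence that genuinely ends in a number with a following sentence that starts with one, so it is worth checking against real notes before relying on it.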

Sean




________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Friday, June 12, 2020 11:21 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Sean,

Thank you for your advice. We tried using 'SentenceDetectorAnnotatorBIO' along with the required changes in the piper file, as you mentioned, and found that it splits sentences based on '.' only. In fact, we were able to get similar output using 'SentenceDetectorAnnotator' itself by using '.' as the only eosCandidate in the EOSScannerImpl class.

So is 'SentenceDetectorAnnotatorBIO' able to extract sentences in some other way? One problem we face is that 'SentenceDetectorAnnotatorBIO' splits the sentence whenever it sees a decimal point, as in 5.5, or a date separated by '.', as in 01.01.2020.

Can the AEs EolSentenceFixer and MrsDrSentenceJoiner resolve the above issues, where sentences are split on encountering decimals or '.'-separated dates? If so, what changes do we need to make in the piper file to incorporate them?

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you, Sean, for the response. Sorry that the images are not visible to you; I also forgot to mention that we are using version 4.0. Reiterating below.

The first image showed how the Sentence object looks in the CAS viewer. The second image was the list of end-of-sentence candidates in the class ‘EOSScannerImpl’, as below:
     private static final char[] eosCandidates = { '.', '!', ')', ']', '>', '\"', ':', ';' };
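
To make the effect of such a candidate list concrete, here is a deliberately simplified sketch. It splits at every candidate character, whereas the actual detector uses a trained model to decide whether each candidate really ends a sentence. The class NaiveEosSplitter is hypothetical and is not cTAKES code, and the candidate array in main uses only the punctuation entries that are unambiguous in the list above. It only shows why having ')' among the candidates makes a split after "(650 mg)" possible at all.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveEosSplitter {

    // Splits 'text' after every character that appears in 'candidates'.
    public static List<String> split(String text, char[] candidates) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isCandidate(text.charAt(i), candidates)) {
                addTrimmed(parts, text.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Keep whatever follows the last candidate character.
        addTrimmed(parts, text.substring(start));
        return parts;
    }

    private static boolean isCandidate(char c, char[] candidates) {
        for (char candidate : candidates) {
            if (c == candidate) {
                return true;
            }
        }
        return false;
    }

    private static void addTrimmed(List<String> parts, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String text = "Patient was taking Paracetamol (650 mg) thrice daily";
        // With ')' among the candidates the text breaks after "(650 mg)".
        System.out.println(split(text, new char[] { '.', '!', ')', ']', '>', ':', ';' }));
        // With '.' as the only candidate the text stays in one piece.
        System.out.println(split(text, new char[] { '.' }));
    }
}
```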



So any modification to SentenceExtractor will have an impact on every downstream module, right? We will definitely look into the AEs you mentioned. You mean that we should try adding the AEs EolSentenceFixer and MrsDrSentenceJoiner, which would refine the sentence extraction, right?
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other text like ')' has caused improper splits.


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean



________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 10:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Team,

We are trying to make full use of cTAKES to meet the requirements for our profile, where we need to extract the sentences from a medical document. We have seen that cTAKES already provides the list of sentences in the clinical text within the object shown below:

[cid:image002.png@01D64027.944FD390]


We also noticed that sentences are delimited based on the predefined delimiters below, which is a problem for our requirement because sentences are segregated whenever one of these tokens is encountered.

[cid:image005.jpg@01D64029.1E6AC980]

For example, “Patient was taking Paracetamol (650 mg) thrice daily” was split into two different sentences (because a ‘)’ was encountered):


1.     Patient was taking Paracetamol (650 mg)

2.     thrice daily


So we tried to customize it by removing some of the defined delimiters to meet our requirement. In fact, we tried with just ‘.’ as the delimiter and found that sentences are split whenever a ‘.’ is encountered. Since this is a change made in the core module, we would like to know whether it is going to impact the clinical token identification process or the information already provided, such as tlink, timex or any other critical attribute. Kindly advise.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028




RE: Sentence detector changes [EXTERNAL]

Posted by Ab...@cognizant.com.

Hi Sean,

Thank you for your advice. We tried using 'SentenceDetectorAnnotatorBIO' along with the required changes in the piper file, as you mentioned, and found that it splits sentences based on '.' only. In fact, we were able to get similar output using 'SentenceDetectorAnnotator' itself by using '.' as the only eosCandidate in the EOSScannerImpl class.

So is 'SentenceDetectorAnnotatorBIO' able to extract sentences in some other way? One problem we face is that 'SentenceDetectorAnnotatorBIO' splits the sentence whenever it sees a decimal point, as in 5.5, or a date separated by '.', as in 01.01.2020.

Can the AEs EolSentenceFixer and MrsDrSentenceJoiner resolve the above issues, where sentences are split on encountering decimals or '.'-separated dates? If so, what changes do we need to make in the piper file to incorporate them?

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028



RE: Sentence detector changes [EXTERNAL]

Posted by Ab...@cognizant.com.

Hi Sean,

Thank you for your advise and we tried using the 'SentenceDetectorAnnotatorBIO' along with the changes required in piper files as you mentioned and we could find that its splitting the sentences based on '.'  only ,  Actually we were able to get similar o/p by using the  'SentenceDetectorAnnotator' itself by just using '.' as the only eosCandidate in the EOSScannerImpl class.

So will 'SentenceDetectorAnnotatorBIO'  be able to extract sentences using some other way. Like some problems we face are the ''SentenceDetectorAnnotatorBIO' ' is splitting the sentence whenever it sees a decimal point like 5.5 or a date where separated using '.' like 01.01.2020.

Can the AE's EolSentenceFixer & MrsDrSentenceJoiner  be able to resolve our above issues where sentences are splitted on encountering decimals or '.' separated dates. If it can what are the changes that we need to do in the piper file to incorporate the same.

Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 9:14 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean

________________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 11:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: RE: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Thank you Sean for the response. Sorry that the image are not visible for you and forgot to mention the version we are using which is version 4.0. Reiterating it as below

First image was how the Sentence Object looks like using CAS viewer Second Image was the list of EndOfSentence Candidate like in the class ‘EOSScannerImpl’as below
     private static final char [] eosCandidates={ ‘.’, ‘!’,’)’,’]’, ‘>’, ‘/’’’,’:’, ‘;’};



So any modification to SentenceExtractor have impacts on every other downstream modules right? We will definitely have a look into the AE's you mentioned and  you mean to say , that to try adding the AE's  EolSentenceFixer, MrsDrSentenceJoiner which would refine the sentence extraction right?.
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028


-----Original Message-----
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Thursday, June 11, 2020 8:20 PM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Re: Sentence detector changes [EXTERNAL]

[External]


Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other texts like ')' have causes improper splits.


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean



________________________________
From: Abad.Ayyub@cognizant.com <Ab...@cognizant.com>
Sent: Thursday, June 11, 2020 10:17 AM
To: dev@ctakes.apache.org; user@ctakes.apache.org
Subject: Sentence detector changes [EXTERNAL]

* External Email - Caution *


Hi Team,

We are trying to utilize the maximum potential of cTAKES to meet the requirements for our profile, where we have a requirement to extract the sentences from the medical document. We have seen cTAKES already providing the list of sentences in the clinical text within the object as below

[cid:image002.png@01D64027.944FD390]


We also notice that sentences are delimited based on the below predefined delimiters, which was actually a problem in our requirement where sentences were seggregated whenever one of the below tokens are encountered.

[cid:image005.jpg@01D64029.1E6AC980]

For eg: “Patient was taking Paracetamol (650 mg) thrice daily” , was splitted to two different sentences(because a ‘)’ encountered)


1.     Patient was taking Paracetamol (650 mg)

2.     thrice daily


So we tried to customize it by removing some of the defined delimiters to meet our requirement. Actually we tried with just ‘.’ As delimiter and found sentences are splitted whenever a ‘.’ Is encountered Since this is a change done at the core module , we would like to know whether this is going to impact the clinical token identification process or going to have impact on the already provided informations like tlink,timex or any other critical attribute. Kindly advice.

Thanks & Regards
[cid:D3145E69-CD94-48C1-877F-5134EEAFB598]

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028




Re: Sentence detector changes [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Thank you for clarifying the contents of the second image.  That changes everything.

You are using the original SentenceDetector.  So, somewhere in your piper file you've got:
add SentenceDetector

I was under the impression that you are using the newer alternative, SentenceDetectorAnnotatorBIO.  While the original SentenceDetector is more of a "splitter", the BIO version is more of a "lumper".

I would switch the detector and see how the results change using:
add  SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar

Don't forget to comment out the "add SentenceDetector".

After you have looked at results from the BIO version, then you can consider which better fits your data and any needs for further adjustment of Sentences.

Sean


RE: Sentence detector changes [EXTERNAL]

Posted by Ab...@cognizant.com.

Thank you, Sean, for the response. Sorry that the images are not visible to you, and that we forgot to mention the version we are using, which is 4.0. Restating the details below:

First image: how the Sentence object looks in the CAS viewer.
Second image: the list of end-of-sentence candidates in the class 'EOSScannerImpl', as below:
     private static final char[] eosCandidates = { '.', '!', ')', ']', '>', '\"', ':', ';' };

So any modification to SentenceExtractor affects every downstream module, right? We will definitely look into the AEs you mentioned. Do you mean that we should try adding the AEs EolSentenceFixer and MrsDrSentenceJoiner, which would refine the sentence extraction?
Thanks & Regards

Abad Ayyub
Vnet: 406170 | Cell : +91-9447379028
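
To illustrate why ')' in the candidate list quoted above causes the reported split, here is a minimal sketch of a splitter that ends a sentence at every candidate character. This is not the cTAKES implementation: the class NaiveEosSplitter and its methods are hypothetical, the '"' entry is an assumed reading of the garbled character in the message, and the real detector may apply further logic before accepting a candidate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the cTAKES sentence detector.
final class NaiveEosSplitter {
    // Candidate characters as listed in the thread ('"' is an assumed reading).
    static final char[] THREAD_CANDIDATES = {'.', '!', ')', ']', '>', '"', ':', ';'};
    // The reduced list the poster experimented with.
    static final char[] PERIOD_ONLY = {'.'};

    // Ends a sentence immediately after every candidate character.
    static List<String> split(String text, char[] candidates) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isCandidate(text.charAt(i), candidates)) {
                addTrimmed(sentences, text.substring(start, i + 1));
                start = i + 1;
            }
        }
        // Whatever follows the last candidate becomes the final sentence.
        addTrimmed(sentences, text.substring(start));
        return sentences;
    }

    private static boolean isCandidate(char c, char[] candidates) {
        for (char candidate : candidates) {
            if (candidate == c) {
                return true;
            }
        }
        return false;
    }

    // Skips pieces that are empty after trimming surrounding whitespace.
    private static void addTrimmed(List<String> sentences, String piece) {
        String trimmed = piece.trim();
        if (!trimmed.isEmpty()) {
            sentences.add(trimmed);
        }
    }
}
```

With THREAD_CANDIDATES, the Paracetamol example breaks after "(650 mg)". With PERIOD_ONLY it stays whole, but a text such as "Dr. Smith saw the patient." is then cut after "Dr.", which is the kind of trade-off the question is about.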




Re: Sentence detector changes [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Abad,


None of your embedded images are visible to me, so I don't have whatever information is contained within those images.


It sounds like you are using the SentenceDetectorBIO.  Very cool.


It does have a few idiosyncrasies, one of which you have identified.


There are two helper AEs in ctakes-core that might be useful for you.  They are not in the released (4.0) version of ctakes, only in ctakes trunk.


EolSentenceFixer

Re-annotates Sentences based upon short lines, preventing a Sentence from spanning over an intentional line break.

The BIO will often lump short (intentionally separated) lines into a single sentence.  This attempts to detect such intentionally short lines and split them.
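
The idea of splitting a lumped sentence at intentionally short lines can be sketched as follows. This is only an assumption about the general approach, not the actual EolSentenceFixer code: the class ShortLineSplitter, the method splitOnShortLines, and the length threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the cTAKES EolSentenceFixer.
final class ShortLineSplitter {
    // Treats a line break that follows a short line as an intentional sentence end.
    // Longer lines are assumed to be wrapped text and are joined with the next line.
    static List<String> splitOnShortLines(String lumped, int maxShortLineLength) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lumped.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(trimmed);
            if (trimmed.length() <= maxShortLineLength) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing wrapped text that never reached a short line.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }
}
```

A fixed character threshold is the simplest possible heuristic; a real fixer would likely need to look at more than line length.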


MrsDrSentenceJoiner

Joins Sentences with person titles Mr. Mrs. Dr. that have been split by SentenceDetectorBIO.


You can peek at the code in MrsDrSentenceJoiner and do something similar to repair cases in which other texts like ')' have caused improper splits.
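
A repair step of the kind suggested here might look roughly like the sketch below. It does not use the cTAKES or UIMA types and is not based on the MrsDrSentenceJoiner source: the class ParenSentenceJoiner, its Span stand-in, and the rule "previous sentence ends with ')' and the next starts with a lowercase letter" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; plain offsets stand in for Sentence annotations.
final class ParenSentenceJoiner {
    // Minimal stand-in for a sentence: character offsets into the document text.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Merges a sentence into its predecessor when the split looks improper.
    // Assumes the spans are non-overlapping and sorted by begin offset.
    static List<Span> join(String text, List<Span> sentences) {
        List<Span> joined = new ArrayList<>();
        for (Span next : sentences) {
            if (!joined.isEmpty() && shouldJoin(text, joined.get(joined.size() - 1), next)) {
                Span previous = joined.remove(joined.size() - 1);
                joined.add(new Span(previous.begin, next.end));
            } else {
                joined.add(next);
            }
        }
        return joined;
    }

    // Assumed rule: a ')' followed by a lowercase start is a mid-sentence split.
    private static boolean shouldJoin(String text, Span previous, Span next) {
        if (previous.end <= previous.begin || next.end <= next.begin) {
            return false;
        }
        char last = text.charAt(previous.end - 1);
        char first = text.charAt(next.begin);
        return last == ')' && Character.isLowerCase(first);
    }
}
```

In a real annotator the merged span would replace the two original Sentence annotations in the CAS; that part is omitted because it depends on the cTAKES version in use.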


Because Sentence boundaries are often used in downstream processing (Mentions, Relations), it is very important that they be properly assigned.


Sean


