Posted to dev@ctakes.apache.org by "Finan, Sean" <Se...@childrens.harvard.edu> on 2018/03/09 22:34:26 UTC

Re: Sentence splitter [EXTERNAL]

Hi Masoud,

There is a very nice SentenceDetectorAnnotatorBIO in ctakes-core.  It splits sentences based on features other than just the newline character, which appears to be what you want.

Sean
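
For readers unfamiliar with the term: a BIO sentence detector labels each unit of text as B (begins a sentence), I (inside a sentence) or O (outside any sentence), and sentence spans are then read off the label sequence. The sketch below is not the cTAKES implementation. It assumes one label per character, and the function name and the hand-built labels are hypothetical, chosen only to show how such labels decode into spans.

```python
def decode_bio(text, labels):
    """Turn per-character B/I/O labels into (begin, end) sentence spans."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    spans = []
    start = None  # begin offset of the sentence currently open, if any
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:
                spans.append((start, i))  # close the previous sentence
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        elif label == "I":
            if start is None:
                start = i  # tolerate an I that has no preceding B
        else:
            raise ValueError(f"unknown label: {label!r}")
    if start is not None:
        spans.append((start, len(text)))
    return spans


# Hand-built labels for two lines of the demo note (an assumption, not model output).
first, second = "Height: 144 cm", "Current Weight: 45 kg"
text = first + "\n" + second
labels = (["B"] + ["I"] * (len(first) - 1)
          + ["O"]
          + ["B"] + ["I"] * (len(second) - 1))
sentences = [text[b:e] for b, e in decode_bio(text, labels)]
```

Because the labels are predicted from features of the text rather than from newlines alone, the break points can differ from those of the default detector.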


________________________________________
From: Masoud Rouhizadeh <mr...@jhu.edu>
Sent: Friday, March 9, 2018 4:41 PM
To: dev@ctakes.apache.org
Subject: Sentence splitter [EXTERNAL]

Hello cTAKES team!

I was wondering what types of sentence splitters are available in cTAKES. The default sentence splitter does not appear to be the best one. See the output for the demo example from the cTAKES installation guide:

Dr. Nutritious Medical Nutrition Therapy for Hyperlipidemia Referral from:

Julie Tester, RD, LD, CNSD Phone contact:

(555)

555-1212 Height:

144 cm Current Weight:

45 kg Date of current weight: 02-29-2001 Admit Weight:

[...]



Thanks so much,

Masoud

----
Masoud Rouhizadeh, PhD
NLP Specialist / Software Engineer
Institute for Clinical and Translational Research
Johns Hopkins University
http://pages.jh.edu/~mrouhiz1






Re: Sentence splitter [EXTERNAL]

Posted by Tomasz Oliwa <ol...@uchicago.edu>.
Thanks for the info Sean, this is helpful.

Tomasz

________________________________________
From: Finan, Sean <Se...@childrens.harvard.edu>
Sent: Tuesday, March 13, 2018 6:21:51 PM
To: dev@ctakes.apache.org
Subject: Re: Sentence splitter [EXTERNAL]

Just FYI, a lot of tokens start to get the IN part-of-speech tag with the different sentence breaks.  For that reason we removed IN from the dictionary lookup's POS exclusion list in another project that uses the BIO detector:

// This is the same as the default list except that "IN" is not excluded
set exclusionTags="VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,LS,MD,PDT,POS,PP,PP$,PRP,PRP$,RP,TO,WDT,WP,WPS,WRB"

If things still go missing, you can simply not exclude any POS tags from lookup, which is what I do in yet another project.

Sean
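
To make the effect of the exclusion list concrete, here is a minimal sketch. It is not cTAKES code: the function name is hypothetical, and it assumes only that a token whose POS tag is on the list is skipped as a lookup candidate. The relaxed list is copied from the message above, and the default list is taken to be the same plus IN, as the comment in the snippet says. The IN tag on "aspirin" is the one reported elsewhere in this thread; the tags for "his" and "leg" are assumed for illustration.

```python
# Exclusion list copied from the message above (IN is not in it).
RELAXED_EXCLUSIONS = set(
    "VB,VBD,VBG,VBN,VBP,VBZ,CC,CD,DT,EX,LS,MD,PDT,POS,PP,PP$,"
    "PRP,PRP$,RP,TO,WDT,WP,WPS,WRB".split(",")
)
# Per the comment in the message, the default list is the same plus IN.
DEFAULT_EXCLUSIONS = RELAXED_EXCLUSIONS | {"IN"}


def lookup_candidates(tokens, excluded_tags):
    """Return the texts of tokens whose POS tag is not on the exclusion list."""
    return [text for text, pos in tokens if pos not in excluded_tags]


# "aspirin" tagged IN as reported in this thread; the other tags are assumed.
tokens = [("aspirin", "IN"), ("his", "PRP$"), ("leg", "NN")]

with_default = lookup_candidates(tokens, DEFAULT_EXCLUSIONS)
with_relaxed = lookup_candidates(tokens, RELAXED_EXCLUSIONS)
with_nothing_excluded = lookup_candidates(tokens, set())
```

With the default list "aspirin" never reaches the dictionary, which would explain a missing MedicationMention. With the relaxed list it does, and with an empty list every token is a candidate.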


________________________________________
From: Tomasz Oliwa <ol...@uchicago.edu>
Sent: Tuesday, March 13, 2018 6:14 PM
To: dev@ctakes.apache.org
Subject: Re: Sentence splitter [EXTERNAL]

Interesting: with SentenceDetectorAnnotatorBIO the WordToken "aspirin" gets partOfSpeech = "IN", whereas with the regular SentenceDetectorAnnotator it is "NN".

Looks like you were right Tim, since IN stands for preposition or subordinating conjunction as defined at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Tomasz

________________________________________
From: Miller, Timothy <Ti...@childrens.harvard.edu>
Sent: Tuesday, March 13, 2018 4:57:36 PM
To: dev@ctakes.apache.org
Subject: Re: Sentence splitter [EXTERNAL]

That sounds bizarre! I can think of two possibilities: a sentence break in the middle of the word (unlikely), or the different sentence splits confused the POS tagger, which then tagged the word "aspirin" with an excluded part of speech, such as a preposition. If you check the token annotation on the word "aspirin", you should be able to see the part-of-speech tag for that word.
Tim
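
Besides inspecting the token in the CVD, one could read the tags out of a saved XMI file. The sketch below rests on assumptions: it expects WordToken elements that carry begin, end and partOfSpeech attributes, and a Sofa element whose sofaString attribute holds the document text. The sample XMI is hand-written and heavily simplified, and the tags on "his" and "leg" are invented for illustration; only the IN tag on "aspirin" comes from this thread.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; a real XMI file contains many more elements.
SAMPLE_XMI = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
    xmlns:cas="http:///uima/cas.ecore"
    xmlns:syntax="http:///org/apache/ctakes/typesystem/type/syntax.ecore"
    xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
      sofaString="aspirin&#10;&#10;his leg"/>
  <syntax:WordToken xmi:id="20" sofa="1" begin="0" end="7" partOfSpeech="IN"/>
  <syntax:WordToken xmi:id="21" sofa="1" begin="9" end="12" partOfSpeech="PRP$"/>
  <syntax:WordToken xmi:id="22" sofa="1" begin="13" end="16" partOfSpeech="NN"/>
</xmi:XMI>"""


def pos_tags(xmi_text):
    """Return (covered text, partOfSpeech) for every WordToken element."""
    root = ET.fromstring(xmi_text)
    # ElementTree spells namespaced tags as "{namespace-uri}LocalName".
    sofa = next(el for el in root.iter() if el.tag.endswith("}Sofa"))
    text = sofa.get("sofaString")
    tags = []
    for el in root.iter():
        if el.tag.endswith("}WordToken"):
            begin, end = int(el.get("begin")), int(el.get("end"))
            tags.append((text[begin:end], el.get("partOfSpeech")))
    return tags


tags = pos_tags(SAMPLE_XMI)
```

Running this on the output of both pipelines and comparing the two lists would show which tokens changed tags after the sentence detector was swapped.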

________________________________________
From: Tomasz Oliwa <ol...@uchicago.edu>
Sent: Tuesday, March 13, 2018 5:34 PM
To: dev@ctakes.apache.org
Subject: Re: Sentence splitter [EXTERNAL]

Hi,

I tested SentenceDetectorAnnotatorBIO in cTAKES 4.0.0, simply by replacing SentenceDetectorAnnotator.xml with SentenceDetectorAnnotatorBIO.xml in AggregatePlaintextFastUMLSProcessor.xml.

While it seemed to work, I noticed that in one example an IdentifiedAnnotation was not found that was found for the same input with the plain SentenceDetectorAnnotator.xml.

Could somebody check this, please? Run the cTAKES CVD with the following input (without the quotation marks):

"
aspirin

his leg
"

On the machine where I tested this, the MedicationMention does not show up with SentenceDetectorAnnotatorBIO, but it does with SentenceDetectorAnnotator.

________________________________________
From: Masoud Rouhizadeh <mr...@jhu.edu>
Sent: Tuesday, March 13, 2018 3:02:35 PM
To: dev@ctakes.apache.org
Subject: Re: Sentence splitter [EXTERNAL]

Hi Sean,

Thank you for the pointer. I was able to run the SentenceDetectorAnnotatorBIO from ctakes-core. The results are much better than those of the SentenceDetectorAnnotator, but I still see some issues, such as “Dr.” being split off as a separate sentence (most likely because of the period after the abbreviation). Is there a way to define an abbreviation list for SentenceDetectorAnnotatorBIO so that it treats such a period as abbreviation-final rather than sentence-final?

Thanks again,
Masoud
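
The thread does not say whether SentenceDetectorAnnotatorBIO accepts an abbreviation list. One possible workaround, sketched here under that uncertainty, is a post-processing step that merges a sentence into the following one when it ends with a known abbreviation. Nothing below is a cTAKES feature: the abbreviation list, the function name and the (begin, end) span representation are all hypothetical.

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms."}  # hypothetical, extend as needed


def merge_abbreviation_splits(text, spans, abbreviations=ABBREVIATIONS):
    """Merge a sentence span into the next one if it ends with an abbreviation.

    `spans` is a list of (begin, end) character offsets in document order.
    """
    merged = []
    for begin, end in spans:
        if merged:
            prev_begin, prev_end = merged[-1]
            prev_words = text[prev_begin:prev_end].split()
            if prev_words and prev_words[-1] in abbreviations:
                # The previous "sentence" stopped at an abbreviation: extend it.
                merged[-1] = (prev_begin, end)
                continue
        merged.append((begin, end))
    return merged


text = "Dr. Nutritious Medical Nutrition Therapy for Hyperlipidemia"
spans = [(0, 3), (4, len(text))]  # "Dr." split off as its own sentence
result = merge_abbreviation_splits(text, spans)
```

The obvious cost is that a sentence that genuinely ends with one of the listed abbreviations would be merged with its successor, so the list should stay short and specific to the notes being processed.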




