
Posted to dev@ctakes.apache.org by "Masanz, James J." <Ma...@mayo.edu> on 2013/08/26 18:05:21 UTC

apostrophe and sentence detector

The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) is sometimes taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating model didn't.

The training data used for the recently rebuilt model contains only 7 lines that end with an apostrophe (single quote).

It has >100K occurrences of 's.

It has >175K occurrences of the ' character in all.
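
For anyone who wants to run the same kind of counts on their own training file, here is a minimal sketch. It assumes the training data is plain text with one sentence per line; the function name and the sample lines are made up for illustration and are not part of cTAKES.

```python
def apostrophe_stats(lines):
    """Count apostrophe usage in one-sentence-per-line training data."""
    # Drop only the line terminator so a trailing apostrophe stays visible.
    stripped = [line.rstrip("\r\n") for line in lines]
    return {
        # Lines whose last character is an apostrophe.
        "lines_ending_with_apostrophe": sum(1 for s in stripped if s.endswith("'")),
        # Plain substring count of 's, so it also counts an opening quote before an s.
        "apostrophe_s": sum(s.count("'s") for s in stripped),
        # Every apostrophe, wherever it occurs.
        "apostrophes_total": sum(s.count("'") for s in stripped),
    }


# Hypothetical sample standing in for the real training file.
sample = [
    "The aspirin isn't stopping Don's pain.\n",
    "He described it as 'dull'\n",
    "No acute distress.\n",
]
stats = apostrophe_stats(sample)
print(stats)
```

With a real file, passing the open file object (for example `open(path, encoding="utf-8")`) in place of `sample` should work the same way, since the function only iterates over lines.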

The place I noticed this is in testfakenote.txt.xml in ctakes-regression-test.

The word "Broca's" used to have a ContractionToken but since a sentence is now ending on the apostrophe, the apostrophe is getting annotated as a PunctuationToken.

Since I don't see anything obviously wrong with the training data, I'm considering a rule that would run after the sentence detector model and rejoin any sentence split that occurs at an ' when the apostrophe is immediately followed by a letter (any letter, not just an s) and immediately preceded by a non-whitespace character.

Some examples that currently split incorrectly, using a vertical bar to show where the sentence detector splits them:
The patient also was concerned about a small lesion in his Broca'|s area|
Broca'|s|
Isn'|t|
The pain isn'|t preventing Don'|s daily walks.|

Some examples that currently split correctly:
The aspirin isn't stopping Don's pain.|
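
A rough sketch of the rejoin rule described above, written in Python rather than against the actual cTAKES classes. It assumes sentences are available as (begin, end) character offsets into the document text; `rejoin_apostrophe_splits` and `parse_bars` are hypothetical helper names, and `parse_bars` exists only to turn the vertical-bar examples into offsets.

```python
def is_apostrophe_split(text, pos):
    """True if a sentence boundary at offset pos falls right after an apostrophe
    that is preceded by non-whitespace and followed by a letter."""
    return (
        2 <= pos < len(text)
        and text[pos - 1] == "'"
        and not text[pos - 2].isspace()
        and text[pos].isalpha()
    )


def rejoin_apostrophe_splits(text, spans):
    """Merge adjacent sentence spans that were split at an apostrophe."""
    merged = []
    for begin, end in spans:
        if merged:
            prev_begin, prev_end = merged[-1]
            # Only merge when the next sentence starts exactly where the
            # previous one ended, i.e. the letter directly follows the '.
            if prev_end == begin and is_apostrophe_split(text, prev_end):
                merged[-1] = (prev_begin, end)
                continue
        merged.append((begin, end))
    return merged


def parse_bars(marked):
    """Turn 'Broca'|s|' style notation into (text, spans)."""
    text = marked.replace("|", "")
    spans, begin, pos = [], 0, 0
    for ch in marked:
        if ch == "|":
            spans.append((begin, pos))
            begin = pos
        else:
            pos += 1
    return text, spans


text, spans = parse_bars("The pain isn'|t preventing Don'|s daily walks.|")
print(rejoin_apostrophe_splits(text, spans))
```

On the examples listed above this collapses each wrongly split sentence back into a single span. A split such as "He said 'no'| Then he left.|" is left alone, because the apostrophe there is followed by a space rather than a letter.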

Anyone have any other suggestions?

-- James

Re: apostrophe and sentence detector

Posted by Karthik Sarma <ks...@ksarma.com>.

Hah, indeed





--
Karthik Sarma
UCLA Medical Scientist Training Program Class of 20??
Member, UCLA Medical Imaging & Informatics Lab
Member, CA Delegation to the House of Delegates of the American Medical
Association
ksarma@ksarma.com
gchat: ksarma@gmail.com
linkedin: www.linkedin.com/in/ksarma


On Tue, Aug 27, 2013 at 6:57 AM, Wu, Stephen T., Ph.D. <Wu...@mayo.edu> wrote:

> On 8/26/13 9:00 PM, "Karthik Sarma" <ks...@ksarma.com> wrote:
>
> >The structure of
> >clinical records vary dramatically between institutions (and, of course,
> >even between departments at a single institution). I've found that I have
> >to remain vigilant about the quality of sentence detection in just about
> >everything I run.
>
> | sed 's/sentence detection/each component/g'
>
>

Re: apostrophe and sentence detector

Posted by "Wu, Stephen T., Ph.D." <Wu...@mayo.edu>.

On 8/26/13 9:00 PM, "Karthik Sarma" <ks...@ksarma.com> wrote:

>The structure of
>clinical records vary dramatically between institutions (and, of course,
>even between departments at a single institution). I've found that I have
>to remain vigilant about the quality of sentence detection in just about
>everything I run.

| sed 's/sentence detection/each component/g'

Re: apostrophe and sentence detector

Posted by Karthik Sarma <ks...@ksarma.com>.

I'd have to disagree that it is a subset of the "English language" found in
books -- for one thing, one finds a great many more sentence fragments and
lists in clinical records. I have no doubt that training on Gutenberg would
yield a reliable sentence detector, but I fear that sentence detector would
be unlikely to perform much better than the existing one.

To be honest, I've started to develop more and more concern about some of
the models used and the training data that was used. The structure of
clinical records varies dramatically between institutions (and, of course,
even between departments at a single institution). I've found that I have
to remain vigilant about the quality of sentence detection in just about
everything I run. This might be unavoidable, but perhaps what we need is an
annotated set of clinical documents culled from a variety of institutions.
Probably pie in the sky, though ;)

Karthik







On Mon, Aug 26, 2013 at 12:22 PM, John Green <jo...@gmail.com> wrote:

> Karthik, well said. There are many differences. I wonder, what do you
> think about the logical division of the two sets? Do they share domain? Is
> one a subset of the other? I would propose that it wouldnt be unreasonable
> to think of clinical notes as being a subset of the english language. It
> seems to me that gutenberg is fairly good average of that english language
> so the superset could contribute to the recognition of the subset.

RE: apostrophe and sentence detector

Posted by John Green <jo...@gmail.com>.

Karthik, well said. There are many differences. I wonder, what do you think about the logical division of the two sets? Do they share a domain? Is one a subset of the other? I would propose that it wouldn't be unreasonable to think of clinical notes as a subset of the English language. It seems to me that Gutenberg is a fairly good average of the English language, so the superset could contribute to the recognition of the subset.

    
      


JG

On Mon, Aug 26, 2013 at 2:07 PM, Masanz, James J. <Ma...@mayo.edu>
wrote:

> The corpus used for cTAKES sentence detection is a combination of some Mayo Clinic clinical notes that were manually separated into sentences, combined with the Penn Treebank (wall street journal)
> -- James

RE: apostrophe and sentence detector

Posted by "Masanz, James J." <Ma...@mayo.edu>.

The corpus used for cTAKES sentence detection is a combination of some Mayo Clinic clinical notes that were manually separated into sentences and the Penn Treebank (Wall Street Journal).

-- James

-----Original Message-----
From: dev-return-1889-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1889-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of John Green
Sent: Monday, August 26, 2013 11:46 AM
To: dev@ctakes.apache.org
Subject: Re: apostrophe and sentence detector

Just out of curiosity, how was the training data originally built? I mean, who separated the lines? By hand? Regex?

Question two: has anyone made attempts at adding Project Gutenberg to the training data for things like sentence detection? Wide variety of punctuation in the years a lot of those books were written.

Trying to piece together how it all works,
JG


Re: apostrophe and sentence detector

Posted by Karthik Sarma <ks...@ksarma.com>.

Hmm, one problem there is that medical records tend to be punctuated
completely differently from normal text in my experience.







On Mon, Aug 26, 2013 at 9:46 AM, John Green <jo...@gmail.com> wrote:

> Just out of curiosity, how was the training data originally built? I mean,
> who separated the lines? By hand? Regex?
>
>
>
>
>
>     Question two: has anyone made attempts at adding project gutenberg to
> the training data for things like sentence detection? Wide variety of
> punctuation in the years a lot of those books were written.
>
>
>
>
>
>     Trying to piece together how it all works,
>
>     JG
>
>
>
>
>
>     —
> Sent from Mailbox for iPhone

Re: apostrophe and sentence detector

Posted by John Green <jo...@gmail.com>.

Just out of curiosity, how was the training data originally built? I mean, who separated the lines? By hand? Regex?

Question two: has anyone made attempts at adding Project Gutenberg to the training data for things like sentence detection? Wide variety of punctuation in the years a lot of those books were written.

Trying to piece together how it all works,
JG

On Mon, Aug 26, 2013 at 12:35 PM, Tim Miller
<ti...@childrens.harvard.edu> wrote:

> Ah, so we might suspect that some of those 7 lines in the file were 
> indeed followed by newlines in the original training data. In the 
> absence of more/better training data which would help us learn this I 
> think it would be reasonable to restore the list of sentence-breaking 
> characters to not include apostrophe. Seems like it is rare for a 
> sentence to end on it, and my preference is to accidentally call 2 
> sentences one sentence, rather than splitting one sentence in the 
> middle. I think it's probably better for downstream processing.
> Just my .02,
> Tim

RE: apostrophe and sentence detector

Posted by "Masanz, James J." <Ma...@mayo.edu>.

The 7 lines I referred to as "ending with apostrophe" do indeed have an apostrophe followed immediately by a newline.

In the training data it is indeed very rare for a sentence to end on an apostrophe: 7 out of >400K sentences.
I second your suggestion of removing the apostrophe from the list of sentence-breaking characters. It is straightforward and cleaner. Thanks.
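
To illustrate why this is the simpler fix, here is a small sketch of how a set of sentence-breaking characters limits where a model can split at all. The character sets and the `candidate_breaks` helper are hypothetical stand-ins, not the actual cTAKES configuration or API.

```python
# Hypothetical sets of sentence-breaking characters, for illustration only.
EOS_WITH_APOSTROPHE = frozenset(".?!'")
EOS_WITHOUT_APOSTROPHE = EOS_WITH_APOSTROPHE - {"'"}


def candidate_breaks(text, eos_chars):
    """Offsets just after each sentence-breaking character.

    A model that only scores these candidates can never split anywhere else.
    """
    return [i + 1 for i, ch in enumerate(text) if ch in eos_chars]


text = "The pain isn't preventing Don's daily walks."
with_apostrophe = candidate_breaks(text, EOS_WITH_APOSTROPHE)
without_apostrophe = candidate_breaks(text, EOS_WITHOUT_APOSTROPHE)
print(with_apostrophe, without_apostrophe)
```

With the apostrophe in the set, the model gets a chance to split after "isn'" and "Don'"; without it, the only candidate is the end of the sentence. The trade-off is that a sentence that really does end on an apostrophe mid-line would be merged with the one after it, which fits the stated preference for under-splitting over over-splitting.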

-----Original Message-----
From: dev-return-1887-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1887-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Tim Miller
Sent: Monday, August 26, 2013 11:35 AM
To: dev@ctakes.apache.org
Subject: Re: apostrophe and sentence detector

Ah, so we might suspect that some of those 7 lines in the file were 
indeed followed by newlines in the original training data. In the 
absence of more/better training data which would help us learn this I 
think it would be reasonable to restore the list of sentence-breaking 
characters to not include apostrophe. Seems like it is rare for a 
sentence to end on it, and my preference is to accidentally call 2 
sentences one sentence, rather than splitting one sentence in the 
middle. I think it's probably better for downstream processing.
Just my .02,
Tim


Re: apostrophe and sentence detector

Posted by Tim Miller <ti...@childrens.harvard.edu>.

Ah, so we might suspect that some of those 7 lines in the file were 
indeed followed by newlines in the original training data. In the 
absence of more/better training data which would help us learn this I 
think it would be reasonable to restore the list of sentence-breaking 
characters to not include apostrophe. Seems like it is rare for a 
sentence to end on it, and my preference is to accidentally call 2 
sentences one sentence, rather than splitting one sentence in the 
middle. I think it's probably better for downstream processing.
Just my .02,
Tim

On 08/26/2013 12:29 PM, Masanz, James J. wrote:
> The training data is one sentence per line.
> That's how you feed data to the sentence detector.

RE: apostrophe and sentence detector

Posted by "Masanz, James J." <Ma...@mayo.edu>.

The training data is one sentence per line.
That's how you feed data to the sentence detector.

-----Original Message-----
From: dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org [mailto:dev-return-1884-Masanz.James=mayo.edu@ctakes.apache.org] On Behalf Of Tim Miller
Sent: Monday, August 26, 2013 11:12 AM
To: dev@ctakes.apache.org
Subject: Re: apostrophe and sentence detector


On 08/26/2013 12:05 PM, Masanz, James J. wrote:
> The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) is sometimes taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating model didn't.
>
> The training data used for the recently rebuilt model only contains only 7 lines that end with an apostrophe (single quote)
Do you mean 7 sentences that end in a single apostrophe or 7 lines? The 
sentence detector will currently break on newlines no matter what, so 
the important number is how many sentences end mid-line with an 
apostrophe, right?
Tim

Re: apostrophe and sentence detector

Posted by Tim Miller <ti...@childrens.harvard.edu>.

On 08/26/2013 12:05 PM, Masanz, James J. wrote:
> The recently rebuilt sentence detector (currently in trunk and the 3.1.0 branch) is sometimes taking the apostrophe as a sentence break where the ctakes-3.0.0-incubating model didn't.
>
> The training data used for the recently rebuilt model only contains only 7 lines that end with an apostrophe (single quote)
Do you mean 7 sentences that end in a single apostrophe or 7 lines? The 
sentence detector will currently break on newlines no matter what, so 
the important number is how many sentences end mid-line with an 
apostrophe, right?
Tim