Posted to dev@ctakes.apache.org by "Masanz, James J." <Ma...@mayo.edu> on 2013/02/05 21:03:39 UTC

[DISCUSS] FW: [jira] [Created] (CTAKES-145) inconsistent handling of upper ascii

cTAKES pipelines written to accept CDA input (CDA is a specific XML format) create a plaintext view and replace every character outside basic ASCII with a blank. All the main processing is then done on that plaintext view.

cTAKES pipelines written to accept plaintext do not replace upper ASCII characters (like the degree symbol used here: °C).
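
To make the difference concrete, here is a minimal sketch of the kind of replacement the CDA path is described as doing. It is not the actual ClinicalNotePreProcessor code; the class and method names (NonAsciiBlanker, blankNonAscii) are made up for illustration, and the only assumption is that each character above 127 becomes a single blank.

```java
final class NonAsciiBlanker {

    // Hypothetical helper, not taken from cTAKES: replaces every char
    // above 127 with a blank, one for one, so the text length (and any
    // character offset into it) stays the same.
    static String blankNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c > 127 ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example text from CTAKES-145, with the non-ASCII characters
        // written as Unicode escapes so the source file encoding does not matter.
        String input = "His name is G\u00EBrman. Temp is 98 \u00B0C taken on the forehead";
        System.out.println(blankNonAscii(input));
    }
}
```

A plaintext pipeline would pass the same input through unchanged, so downstream annotators see "Gërman" in one case and "G rman" in the other.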

I created the JIRA issue this morning to track this. 
I propose having cTAKES, by default, accept UTF-8, not just (basic) ASCII, even when the input is CDA. Going beyond a single-byte character set should not affect any of the offset processing cTAKES does.
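
A small sketch of why the offsets should be unaffected, assuming (this is an assumption, not something verified against the cTAKES code) that annotation offsets are indexes into a Java String rather than byte positions in the input file. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class OffsetDemo {

    // Offset as a Java String index (counted in chars), which is the
    // assumed basis for annotation begin/end values.
    static int charOffset(String text, String needle) {
        return text.indexOf(needle);
    }

    // Offset of the same position counted in UTF-8 bytes, shown only
    // to contrast with the char-based offset.
    static int utf8ByteOffset(String text, int charOffset) {
        return text.substring(0, charOffset)
                   .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "Temp is 98 \u00B0C taken";
        int chars = charOffset(text, "taken");
        int bytes = utf8ByteOffset(text, chars);
        System.out.println("char offset: " + chars + ", UTF-8 byte offset: " + bytes);
    }
}
```

The two numbers differ as soon as a multi-byte character appears before the position, so the proposal is safe only as long as nothing in the pipeline counts bytes instead of chars.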

One consideration is that none of the training data used for the sentence detector, part of speech tagger or chunker included such characters.

What other considerations can people think of?

Any objections?

-- James

> -----Original Message-----
> From: ctakes-notifications-return-287-
> Masanz.James=mayo.edu@incubator.apache.org [mailto:ctakes-notifications-
> return-287-Masanz.James=mayo.edu@incubator.apache.org] On Behalf Of James
> Joseph Masanz (JIRA)
> Sent: Tuesday, February 05, 2013 10:36 AM
> To: ctakes-notifications@incubator.apache.org
> Subject: [jira] [Created] (CTAKES-145) inconsistent handling of upper
> ascii
> 
> James Joseph Masanz created CTAKES-145:
> ------------------------------------------
> 
>              Summary: inconsistent handling of upper ascii
>                  Key: CTAKES-145
>                  URL: https://issues.apache.org/jira/browse/CTAKES-145
>              Project: cTAKES
>           Issue Type: Task
>           Components: ctakes-preprocessor
>     Affects Versions: future enhancement
>             Reporter: James Joseph Masanz
>             Priority: Minor
> 
> 
> Currently cTAKES handles characters above ASCII 127 differently depending
> on whether the pipeline processes CDA (Clinical Document Architecture
> XML) or expects plain text.
> 
> The CDA pipelines, as an early step, create a plaintext view that has each
> upper ASCII character replaced by a blank.
> 
> The plaintext pipelines do not do anything special for upper ascii
> characters.
> 
> Example input text for plaintext, to show this behavior:
> His name is Gërman. Temp is 98 °C taken on the forehead
> 
> Need to decide if it is OK for this inconsistent behavior or if we should
> change one or the other to make them consistent.
> 
> See ClinicalNotePreProcessor.java
> 
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators For more information on JIRA, see:
> http://www.atlassian.com/software/jira

Re: [DISCUSS] FW: [jira] [Created] (CTAKES-145) inconsistent handling of upper ascii

Posted by Steven Bethard <st...@Colorado.EDU>.

On Feb 5, 2013, at 1:03 PM, "Masanz, James J." <Ma...@mayo.edu> wrote:
> I propose having cTAKES, by default, accept UTF8 - not just (basic) ASCII

Yes please. Anything that replaces characters instead of using the correct encoding is just a bug waiting to happen later.

> One consideration is that none of the training data used for the sentence detector, part of speech tagger or chunker included such characters.

Might be worth running the current models over such text just to make sure things don't break horribly. I wouldn't expect them to, but you never know…

Steve

Re: [DISCUSS] FW: [jira] [Created] (CTAKES-145) inconsistent handling of upper ascii

Posted by Andy McMurry <mc...@gmail.com>.

> One consideration is that none of the training data used for the sentence detector, part of speech tagger or chunker included such characters.

In a sense, cTAKES supports only the formats that it was trained with. 
I suggest alerting the user if a non-supported format is detected. 

It's a hard problem to wrap your head around as an end user. 
All the more difficult when the user is not even aware of the non-standard character translation issues. 

Ideally, cTAKES would work as normal when the same format is in use as the training set. 
If an unsupported format is detected, log warnings (once per document) with a pointer to how to "correct" the issue. 
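
A rough sketch of what such a once-per-document warning could look like, using only java.util.logging. Nothing here is existing cTAKES API: the class name, method names, and message wording are placeholders, and "unsupported" is assumed to mean "contains a character above 127".

```java
import java.util.logging.Logger;

final class NonAsciiWarner {

    private static final Logger LOG =
            Logger.getLogger(NonAsciiWarner.class.getName());

    // Returns the index of the first char above 127, or -1 if the
    // text is entirely basic ASCII.
    static int firstNonAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 127) {
                return i;
            }
        }
        return -1;
    }

    // Logs at most one warning per call (i.e. per document) and
    // returns true if a warning was logged.
    static boolean warnIfNonAscii(String documentId, String text) {
        int idx = firstNonAscii(text);
        if (idx < 0) {
            return false;
        }
        LOG.warning(String.format(
                "Document %s contains characters outside basic ASCII "
                + "(first at offset %d, U+%04X); the models were not trained on such text.",
                documentId, idx, (int) text.charAt(idx)));
        return true;
    }
}
```

Stopping at the first hit keeps the log to one line per document, and reporting the offset and code point gives the user something specific to look for.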

--Andy 

