Posted to notifications@ctakes.apache.org by "James Joseph Masanz (JIRA)" <ji...@apache.org> on 2013/02/05 21:05:13 UTC

[jira] [Commented] (CTAKES-145) inconsistent handling of upper ascii

    [ https://issues.apache.org/jira/browse/CTAKES-145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571677#comment-13571677 ] 

James Joseph Masanz commented on CTAKES-145:
--------------------------------------------


cTAKES pipelines written to accept CDA input (CDA is a specific XML format) create a plaintext view and replace every character outside basic (7-bit) ASCII with a blank. All the main processing is then done on that plaintext view.

cTAKES pipelines written to accept plaintext do not replace upper ASCII characters (like the degree symbol used here: °C).
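
To make the difference concrete, here is a minimal sketch of the two behaviors. It is not the actual ClinicalNotePreProcessor code: the class name and the blankNonAscii helper are hypothetical, and it assumes each character above 127 is replaced by exactly one blank.

```java
class UpperAsciiSketch {
    // Hypothetical helper (not a cTAKES method): mimics the CDA-style behavior
    // described above by replacing every char above 127 with a single blank.
    // Assumption: one blank per replaced char, so the text length is unchanged.
    static String blankNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c > 127 ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example text from the issue; \u00eb is e-diaeresis, \u00b0 is the degree sign.
        String input = "His name is G\u00ebrman. Temp is 98 \u00b0C taken on the forehead";
        // What a CDA-style pipeline would process under the assumption above.
        System.out.println(blankNonAscii(input));
        // What a plaintext pipeline would process: the text as given.
        System.out.println(input);
    }
}
```

With this sketch the name and the temperature unit lose a character in the CDA-style view, while the plaintext view keeps them.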

I created the JIRA issue this morning to track this. 
I propose having cTAKES, by default, accept UTF-8 (not just basic ASCII) even when the input is CDA. Characters from a single-byte character set should not affect any of the offset processing cTAKES does.
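
A rough way to check the offset point, assuming offsets are counted in Java String positions (UTF-16 chars) rather than in encoded bytes. That is an assumption of this sketch, as are the class and method names:

```java
import java.nio.charset.StandardCharsets;

class OffsetSketch {
    // Length in Java chars (UTF-16 code units), which String-based offsets use.
    static int charLength(String text) {
        return text.length();
    }

    // Length in bytes once the same text is encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "Temp is 98 \u00b0C"; // \u00b0 is the degree sign
        // The degree sign takes one Java char but two UTF-8 bytes, so char-based
        // offsets after it do not shift, while byte-based offsets would.
        System.out.println("chars: " + charLength(text));
        System.out.println("UTF-8 bytes: " + utf8Length(text));
        System.out.println("offset of 'C': " + text.indexOf('C'));
    }
}
```

Characters in the Latin-1 range, such as ° and ë, each fit in one Java char, so char-based offsets line up whether or not they are replaced by blanks. Characters outside the Basic Multilingual Plane take two chars and would need separate attention.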

One consideration is that none of the training data used for the sentence detector, part-of-speech tagger, or chunker included such characters.

What other considerations can people think of?

Any objections?

-- James


                
> inconsistent handling of upper ascii 
> -------------------------------------
>
>                 Key: CTAKES-145
>                 URL: https://issues.apache.org/jira/browse/CTAKES-145
>             Project: cTAKES
>          Issue Type: Task
>          Components: ctakes-preprocessor
>    Affects Versions: future enhancement
>            Reporter: James Joseph Masanz
>            Priority: Minor
>
> Currently cTAKES handles characters above ASCII 127 differently depending on whether the pipeline processes CDA (Clinical Document Architecture XML) or expects plain text.
> The CDA pipelines, as an early step, create a plaintext view in which each upper ASCII character is replaced by a blank.
> The plaintext pipelines do not do anything special for upper ASCII characters.
> Example input text for plaintext, to show this behavior: 
> His name is Gërman. Temp is 98 °C taken on the forehead
> Need to decide whether this inconsistent behavior is OK or whether we should change one or the other to make them consistent.
> See ClinicalNotePreProcessor.java
