Posted to notifications@ctakes.apache.org by "Sean Finan (Jira)" <ji...@apache.org> on 2022/12/30 23:30:00 UTC

[jira] [Closed] (CTAKES-155) SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters

     [ https://issues.apache.org/jira/browse/CTAKES-155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Finan closed CTAKES-155.
-----------------------------
      Assignee: Sean Finan
    Resolution: Workaround

There are newer sectionizers that can be used instead of that old engine.

> SimpleSegmentWithTagsAnnotator assumes all section names are 5 characters
> -------------------------------------------------------------------------
>
>                 Key: CTAKES-155
>                 URL: https://issues.apache.org/jira/browse/CTAKES-155
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-core
>    Affects Versions: 3.0-incubating
>            Reporter: Steven Bethard
>            Assignee: Sean Finan
>            Priority: Major
>             Fix For: future enhancement
>
>
> The code in SimpleSegmentWithTagsAnnotator is a bit hard to follow, but I believe it assumes all sections are 5 characters long here:
> {code:java}
> 	fileReader.read(sectIdArr, 0, 5);
> {code}
> As a result, when the section name is longer than 5 characters, part of the section heading (e.g., for a 6-letter section name, the final "]") is left in the text of the next section. This causes, for example, the dependency parser to choke:
> {code:java}
> Caused by: java.lang.NullPointerException
> 	at clear.pos.PosEnLib.isNoun(PosEnLib.java:56)
> 	at clear.morph.MorphEnAnalyzer.getException(MorphEnAnalyzer.java:273)
> 	at clear.morph.MorphEnAnalyzer.getLemma(MorphEnAnalyzer.java:247)
> {code}
> I would fix this but:
> (1) There are no tests for SimpleSegmentWithTagsAnnotator, and its documentation actually says "Creates a single segment annotation that spans the entire document", which is simply untrue, so I'm not really sure what this annotator is intended to do.
> (2) Even if I make some assumptions about what it's intended to do, the code is written in an extremely brittle fashion, and I'm afraid to make changes to that. For what it's worth, here's what I think the annotator should really look like:
> {code:java}
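>   // Imports assumed for core UIMA and the cTAKES 3.x type system; adjust to your versions.
>   import java.util.regex.Matcher;
>   import java.util.regex.Pattern;
>   import org.apache.ctakes.typesystem.type.textspan.Segment;
>   import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
>   import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
>   import org.apache.uima.jcas.JCas;
>
>   public static class SegmentsFromBracketedSectionTagsAnnotator extends JCasAnnotator_ImplBase {
>     // group(1) = start tag, group(2) = section id, group(3) = end tag; DOTALL lets a section span lines.
>     private static final Pattern SECTION_PATTERN =
>         Pattern.compile("(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])", Pattern.DOTALL);
>     @Override
>     public void process(JCas jCas) throws AnalysisEngineProcessException {
>       Matcher matcher = SECTION_PATTERN.matcher(jCas.getDocumentText());
>       while (matcher.find()) {
>         Segment segment = new Segment(jCas);
>         // The segment covers only the text between the tags, whatever the id length.
>         segment.setBegin(matcher.start() + matcher.group(1).length());
>         segment.setEnd(matcher.end() - matcher.group(3).length());
>         segment.setId(matcher.group(2));
>         segment.addToIndexes();
>       }
>     }
>   }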
> {code}
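
For anyone who wants to check the pattern proposed in the issue without setting up a UIMA pipeline, here is a minimal sketch that applies the same regular expression to plain strings. The class name SectionTagDemo, the helper sections(), the "id|body" output format, and the sample document are all hypothetical and exist only for illustration; the tag syntax is taken from the regex above, not from any cTAKES specification. It is not the actual annotator, only a way to see that the extracted body does not depend on the length of the section id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionTagDemo {
  // Same pattern as in the issue description: group(1) = start tag,
  // group(2) = section id, group(3) = end tag.
  static final Pattern SECTION_PATTERN = Pattern.compile(
      "(\\[start section id=\"?(.*?)\"?\\]).*?(\\[end section id=\"?(.*?)\"?\\])",
      Pattern.DOTALL);

  /** Returns one "id|body" string per matched section (hypothetical helper). */
  static List<String> sections(String text) {
    List<String> out = new ArrayList<>();
    Matcher m = SECTION_PATTERN.matcher(text);
    while (m.find()) {
      // Offsets are computed the same way as setBegin/setEnd in the proposed annotator.
      int begin = m.start() + m.group(1).length();
      int end = m.end() - m.group(3).length();
      out.add(m.group(2) + "|" + text.substring(begin, end));
    }
    return out;
  }

  public static void main(String[] args) {
    // Made-up sample input with a 5-character and a 6-character section id.
    String doc = "[start section id=\"20104\"]Five.[end section id=\"20104\"]\n"
               + "[start section id=\"SIXLTR\"]Six.[end section id=\"SIXLTR\"]";
    for (String s : sections(doc)) {
      System.out.println(s);
    }
  }
}
```

Because the offsets come from the lengths of the matched tags rather than from a fixed character count, no part of a longer heading should leak into the section text. Note that the pattern does not require the id in the end tag to equal the id in the start tag.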


