Posted to notifications@ctakes.apache.org by "Ewan Mellor (JIRA)" <ji...@apache.org> on 2018/08/16 21:19:00 UTC

[jira] [Created] (CTAKES-520) SentenceDetectorAnnotatorBIO token scanning performance issues

Ewan Mellor created CTAKES-520:
----------------------------------

             Summary: SentenceDetectorAnnotatorBIO token scanning performance issues
                 Key: CTAKES-520
                 URL: https://issues.apache.org/jira/browse/CTAKES-520
             Project: cTAKES
          Issue Type: Improvement
          Components: ctakes-core
    Affects Versions: 4.0.0
            Reporter: Ewan Mellor


SentenceDetectorAnnotatorBIO iterates over every character in the Segment and classifies it as Begin, Inside, or Outside a Sentence.  To do this, it needs to know the tokens immediately before and after the current character.

It currently finds these tokens afresh for each character: starting from the current character, it scans forwards and backwards for whitespace until it has found the boundaries of the tokens on either side of the current position.  This is very wasteful.  While the current index steps through a word, the tokens do not change, since we're still within the same word.  Also, since we only ever move forwards through the text, we never need to scan for the previous token, because we have already seen it.

(I found this bug with a pathological case: a "document" consisting of a single word that was a megabyte long.  When the word length is not bounded, the current algorithm is quadratic in the word length rather than linear, because it scans the length of the word for each character.)
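
To make the cost concrete, here is a minimal sketch of the per-character rescan described above.  It is not the cTAKES code: the class NaiveTokenScan, its methods, and the charsExamined counter are hypothetical names for illustration, and a "token" is assumed to be a maximal run of non-whitespace characters.

```java
// Hypothetical illustration only; not the actual SentenceDetectorAnnotatorBIO code.
class NaiveTokenScan {
    /** Characters examined so far; present only to make the cost visible. */
    static long charsExamined = 0;

    /** Returns the first whitespace-delimited token that starts after pos, or "" if none. */
    static String nextToken(String text, int pos) {
        int i = pos;
        // Skip the rest of the word that pos sits in (no-op if pos is on whitespace).
        while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
            i++;
            charsExamined++;
        }
        // Skip the whitespace gap before the next token.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
            charsExamined++;
        }
        int start = i;
        // Read the next token itself.
        while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
            i++;
            charsExamined++;
        }
        return text.substring(start, i);
    }

    /** Rescans from every position, as described above, and returns the total work done. */
    static long scanAll(String text) {
        charsExamined = 0;
        for (int pos = 0; pos < text.length(); pos++) {
            nextToken(text, pos);
        }
        return charsExamined;
    }
}
```

For a single word of n characters with no whitespace, the scan from position i examines n - i characters, so scanAll examines n(n+1)/2 characters in total.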

Patch attached.  It fixes the problem by keeping track of the current word boundary and only scanning for the next token once we have reached that boundary.  The previous token is simply carried over from the previous iteration.
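
The idea can be sketched roughly as follows.  This is not the attached patch: the class IncrementalTokenScan and its methods are hypothetical names, and tokens are assumed to be maximal runs of non-whitespace characters.  The sketch caches the previous and next tokens and only scans ahead when the current position enters a new word, so each character is examined a bounded number of times.

```java
// Hypothetical illustration only; not the attached patch.
class IncrementalTokenScan {
    static int skipWhitespace(String text, int i) {
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
        return i;
    }

    static int skipWord(String text, int i) {
        while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
        return i;
    }

    /** Returns, for each position, the pair {previous token, next token} ("" if absent). */
    static String[][] tokensAround(String text) {
        int n = text.length();
        String[][] out = new String[n][];
        String prev = "";
        String current = "";
        int currentEnd = -1; // exclusive end of the word containing pos, -1 if none
        int nextStart = skipWhitespace(text, 0);
        int nextEnd = skipWord(text, nextStart);
        String next = text.substring(nextStart, nextEnd);

        for (int pos = 0; pos < n; pos++) {
            if (pos == currentEnd) {
                // Just stepped off a word: it becomes the previous token, no rescan needed.
                prev = current;
                currentEnd = -1;
            }
            if (pos == nextStart) {
                // Just entered what used to be the next token: scan ahead once for its successor.
                current = next;
                currentEnd = nextEnd;
                nextStart = skipWhitespace(text, nextEnd);
                nextEnd = skipWord(text, nextStart);
                next = text.substring(nextStart, nextEnd);
            }
            // Within a word or a whitespace gap, the cached tokens are reused unchanged.
            out[pos] = new String[] { prev, next };
        }
        return out;
    }
}
```

For the input "ab cd ef", every position inside "cd" gets the pair {"ab", "ef"}, and the forward scan for "ef" happens only once, on entering "cd".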



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)