Posted to dev@uima.apache.org by "Vadym Oliinyk (JIRA)" <de...@uima.apache.org> on 2014/11/20 21:09:34 UTC

[jira] [Created] (UIMA-4115) TikaAnnotator: incorrect order of tags processing

Vadym Oliinyk created UIMA-4115:
-----------------------------------

             Summary: TikaAnnotator: incorrect order of tags processing
                 Key: UIMA-4115
                 URL: https://issues.apache.org/jira/browse/UIMA-4115
             Project: UIMA
          Issue Type: Bug
          Components: addons
    Affects Versions: 2.3.1Addons
            Reporter: Vadym Oliinyk


org.apache.uima.tika.MarkupAnnotator outputs incorrect content due to a bug in org.apache.uima.tika.MarkupHandler. The problem is located in the end-element event handler: the MarkupHandler#endElement method should close opened tags by removing them from the stack (the last added tag should be removed first when the corresponding end tag is found). But the current implementation removes start elements beginning from the first opened element, which results in incorrect text spans being annotated by the processor.

The fix is trivial:
in MarkupHandler#endElement, replace startedAnnotations.iterator() with
startedAnnotations.descendingIterator().
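
To illustrate the difference, here is a minimal standalone sketch. It is not the actual MarkupHandler code: the class TagStackSketch, the Open record-like type and the close() helper are hypothetical, and it assumes startedAnnotations is a java.util.Deque to which opened tags are appended at the tail. With two nested <div> elements, the forward iterator pairs the inner </div> with the outer start offset, while the descending iterator pairs it with the inner one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

public class TagStackSketch {

    /** Hypothetical stand-in for an opened annotation: tag name plus start offset. */
    static final class Open {
        final String tag;
        final int begin;

        Open(String tag, int begin) {
            this.tag = tag;
            this.begin = begin;
        }
    }

    /**
     * Removes the first entry matching the tag, in the order the iterator
     * yields entries, and returns its start offset (-1 if none matches).
     */
    static int close(Iterator<Open> it, String tag) {
        while (it.hasNext()) {
            Open o = it.next();
            if (o.tag.equals(tag)) {
                it.remove();
                return o.begin;
            }
        }
        return -1;
    }

    /** Assumed state after reading "<div>...<div>": both tags appended at the tail. */
    static Deque<Open> nestedDivs() {
        Deque<Open> d = new ArrayDeque<>();
        d.addLast(new Open("div", 0));   // outer <div> opened at offset 0
        d.addLast(new Open("div", 10));  // inner <div> opened at offset 10
        return d;
    }

    public static void main(String[] args) {
        // Oldest first: the inner </div> is paired with the outer start.
        System.out.println(close(nestedDivs().iterator(), "div"));           // prints 0
        // Newest first: the inner </div> is paired with the inner start.
        System.out.println(close(nestedDivs().descendingIterator(), "div")); // prints 10
    }
}
```

If the real collection adds new elements at the head instead of the tail, the two iteration orders are swapped, so the direction should be checked against how startedAnnotations is filled.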




Re: [jira] [Created] (UIMA-4115) TikaAnnotator: incorrect order of tags processing

Posted by Marshall Schor <ms...@schor.com>.

Tommaso - could you take a look?

-Marshall
