Posted to dev@tika.apache.org by "Jeremy Anderson (Commented) (JIRA)" <ji...@apache.org> on 2011/10/03 20:01:37 UTC
[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] 

Jeremy Anderson commented on TIKA-733:
--------------------------------------

The problem is also present in the older 0.9 release.

I looked at the document as you suggested. It is corrupt/malformed in the sense that it contains more closing braces '}' than opening braces '{'.
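
As a rough illustration of how that imbalance could be checked, the sketch below counts group-closing braces that have no matching opener. It is not Tika code: the class and method names are made up for this example, and it only skips backslash-escaped characters (it makes no attempt to handle \bin payloads).

```java
/** Hypothetical helper, not part of Tika: counts '}' with no matching '{'. */
class RtfBraceCheck {

    static int countUnmatchedClosers(String rtf) {
        int depth = 0;      // groups currently open
        int unmatched = 0;  // closers seen while no group was open
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                i++;        // skip the escaped character, e.g. \{ or \}
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (depth == 0) {
                    unmatched++;
                } else {
                    depth--;
                }
            }
        }
        return unmatched;
    }
}
```

A result greater than zero means the input has at least one '}' that closes nothing, which is the condition described above.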

That said, the text in this document still appears to be extractable with the patch I submitted, which skips restoring the group state once the list of saved group states is empty.

My knowledge of the RTF format is rather limited, but is there perhaps a better compromise that would let the parser return the text it is able to extract and log a warning when it encounters malformed RTF?
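
A minimal sketch of that kind of compromise, assuming a stack of saved group states like the one the stack trace points at: an unmatched '}' is counted and logged instead of being allowed to throw. The GroupStateStack class, its method names, and the use of java.util.logging are assumptions for illustration only, not Tika's actual API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.logging.Logger;

/** Hypothetical stand-in for a parser's saved group states; not Tika code. */
class GroupStateStack<S> {
    private static final Logger LOG = Logger.getLogger(GroupStateStack.class.getName());

    private final Deque<S> saved = new ArrayDeque<>();
    private S current;
    private int unmatchedClosers;

    /** The initial state must be non-null (ArrayDeque rejects null elements). */
    GroupStateStack(S initial) {
        this.current = initial;
    }

    /** Called on '{': remember the current state and enter the new group. */
    void open(S nextState) {
        saved.addLast(current);
        current = nextState;
    }

    /** Called on '}': restore the enclosing state, or warn and carry on if there is none. */
    boolean close() {
        if (saved.isEmpty()) {
            unmatchedClosers++;
            LOG.warning("Malformed RTF: '}' without matching '{' (occurrence "
                    + unmatchedClosers + ")");
            return false;
        }
        current = saved.removeLast();
        return true;
    }

    S current() {
        return current;
    }

    int unmatchedClosers() {
        return unmatchedClosers;
    }
}
```

With this shape the caller keeps whatever text it has already extracted, and unmatchedClosers() gives it a way to report afterwards that the input was malformed.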

I have about 20 files in my load set that hit this failure. I haven't had time to investigate all of them yet to see whether they all fail because of the same brace mismatch or, for the corrupt ones, to determine how much of the extractable text is affected by the fix I submitted.

Note that both WordPad and MS Word open these files without issue, though that's to be expected. I suspect they simply ignore the final block in these cases. In fact, after opening the failing document and re-saving it in WordPad, the final partial block is indeed truncated from the file.


Looking closer at the file in a text editor, the extra final block appears to be a partial replication of the final valid ending block in the file.  Perhaps an appropriate way to automatically handle these partially corrupted RTFs is to:
* detect whether the file has more group ends than group starts, and when it does
   * check whether the final block is a partial replication of the prior one
   * and if so, just ignore the final one.
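
The steps above could be sketched roughly as follows. Everything here is an assumption for illustration and not Tika code: the RtfTrailingBlock class is hypothetical, and the "partial replication" test is a crude guess that compares lines after blanking out numeric parameters and lets the first trailing line match only the end of an earlier line, since it may be cut off at the front.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Tika: inspects text that follows the root group. */
class RtfTrailingBlock {

    /** Index just past the '}' that closes the root group, or -1 if it never closes. */
    static int endOfRootGroup(String rtf) {
        int depth = 0;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                i++;                       // skip an escaped character such as \{ or \}
            } else if (c == '{') {
                depth++;
            } else if (c == '}' && depth > 0 && --depth == 0) {
                return i + 1;
            }
        }
        return -1;
    }

    /** Assumed normalization: trim, and replace digit runs so \cf1 and \cf2 compare equal. */
    static String normalize(String line) {
        return line.trim().replaceAll("[0-9]+", "#");
    }

    /** True if every non-blank tail line (braces stripped) matches some body line. */
    static boolean tailLooksLikeReplica(String body, String tail) {
        Set<String> bodyLines = new HashSet<>();
        for (String line : body.split("\\R")) {
            bodyLines.add(normalize(line));
        }
        boolean first = true;
        boolean sawContent = false;
        for (String line : tail.replace("}", "").replace("{", "").split("\\R")) {
            String n = normalize(line);
            if (n.isEmpty()) {
                continue;
            }
            sawContent = true;
            boolean ok = bodyLines.contains(n);
            if (!ok && first) {
                // The first tail line may be truncated at the front: accept a suffix match.
                for (String b : bodyLines) {
                    if (b.endsWith(n)) {
                        ok = true;
                        break;
                    }
                }
            }
            if (!ok) {
                return false;
            }
            first = false;
        }
        return sawContent;
    }
}
```

If endOfRootGroup returns a position before the end of the input and tailLooksLikeReplica accepts the remainder, a caller could parse only rtf.substring(0, end) and ignore the rest, which is roughly what re-saving in WordPad appears to do.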


Last lines of the corrupted file:
...
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                
> [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
> ------------------------------------------------------------------
>
>                 Key: TIKA-733
>                 URL: https://issues.apache.org/jira/browse/TIKA-733
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Assignee: Michael McCandless
>              Labels: patch
>             Fix For: 1.0
>
>         Attachments: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch
>
>
> Parsing some RTF documents attempts to perform a removeLast() on the groupStates() list when the list is empty.  I added a check that skips this logic when the list is empty, so the group state is not restored in that case. Text extraction now completes without further downstream errors.
> Unable to include sample file due to sensitive nature of file contents.
> StackTrace (TIKA-0.9)
> Caused by: java.util.NoSuchElementException
> 	at java.util.LinkedList.remove(LinkedList.java:788)
> 	at java.util.LinkedList.removeLast(LinkedList.java:144)
> 	at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
> 	at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
> 	at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 45 more
