Posted to dev@tika.apache.org by "Jeremy Anderson (Created) (JIRA)" <ji...@apache.org> on 2011/09/28 04:12:45 UTC

[jira] [Created] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

[PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
------------------------------------------------------------------

                 Key: TIKA-733
                 URL: https://issues.apache.org/jira/browse/TIKA-733
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
            Reporter: Jeremy Anderson
             Fix For: 1.0
         Attachments: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch

Parsing some RTF documents attempts to call removeLast() on the groupStates list when the list is empty.  I added a check that skips this logic when the list is empty, so the group state is not restored in that case. Text extraction now completes without further downstream errors.
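
As a rough illustration of the check described above (not the attached patch itself), here is a minimal Java sketch of a group stack that guards the removeLast() call. Only LinkedList.removeLast() and the method name processGroupEnd() come from the stack trace; the class names GroupStateSketch and GroupStackSketch, the processGroupStart() method, the bold flag, and the unmatchedCloses counter are hypothetical and exist only for this example.

```java
import java.util.LinkedList;

// Hypothetical stand-in for the per-group formatting state; not Tika's class.
final class GroupStateSketch {
    final boolean bold;

    GroupStateSketch(boolean bold) {
        this.bold = bold;
    }
}

// Hypothetical sketch of the save/restore logic around '{' and '}'.
final class GroupStackSketch {
    private final LinkedList<GroupStateSketch> groupStates = new LinkedList<>();
    private GroupStateSketch current = new GroupStateSketch(false);
    private int unmatchedCloses = 0;

    // '{': save the current state and start a nested group that inherits it.
    void processGroupStart() {
        groupStates.add(current);
        current = new GroupStateSketch(current.bold);
    }

    // '}': restore the saved state, but only if one was saved.
    void processGroupEnd() {
        if (groupStates.isEmpty()) {
            // Malformed input: more '}' than '{'. Skip the restore
            // instead of letting removeLast() throw NoSuchElementException.
            unmatchedCloses++;
            return;
        }
        current = groupStates.removeLast();
    }

    int unmatchedCloses() {
        return unmatchedCloses;
    }

    int depth() {
        return groupStates.size();
    }
}
```

Calling processGroupEnd() once more than processGroupStart() leaves the stack empty and only increments the counter, which is the behavior the description above relies on.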

Unable to include sample file due to sensitive nature of file contents.

StackTrace (TIKA-0.9)

Caused by: java.util.NoSuchElementException
	at java.util.LinkedList.remove(LinkedList.java:788)
	at java.util.LinkedList.removeLast(LinkedList.java:144)
	at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
	at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
	at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	... 45 more

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] 

Jeremy Anderson commented on TIKA-733:
--------------------------------------

The problem is also present in the older 0.9 release.

Looking at the document as you suggested, it is corrupt/malformed in the sense that it contains more closing braces '}' than opening braces '{'.

That said, the text contained within the document still appears to be extractable using the patch I submitted, which ignores the group state once the list is empty.

My knowledge of the RTF format is rather limited, but is there perhaps a better compromise that would allow the parser to return the text it is able to get and maybe log a warning when a malformed RTF is encountered?
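
To make the "warn and keep going" idea concrete, here is a minimal Java sketch that scans RTF text, counts closing braces that have no matching opening brace, and logs a warning for each one instead of failing. It is an assumption-laden illustration, not Tika code: the class name LenientBraceScanSketch, the method countUnmatchedCloses, and the logger name are hypothetical, and the scan only skips backslash-escaped characters (it does not handle binary \bin data).

```java
import java.util.logging.Logger;

// Hypothetical helper; illustrates lenient handling of unbalanced braces.
final class LenientBraceScanSketch {
    private static final Logger LOG = Logger.getLogger("rtf.sketch");

    // Returns the number of '}' characters that had no matching '{'.
    static int countUnmatchedCloses(String rtf) {
        int depth = 0;
        int unmatched = 0;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                // Skip the next character so escaped \{ \} \\ are not counted.
                i++;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (depth == 0) {
                    unmatched++;
                    LOG.warning("Unmatched '}' at offset " + i + "; ignoring it");
                } else {
                    depth--;
                }
            }
        }
        return unmatched;
    }
}
```

A caller could treat a non-zero result as "malformed but possibly still extractable" and decide separately whether to surface the warning.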

I have about 20 files in my load set that have hit this failure.  I haven't had time to investigate all of them yet to see whether they all fail because of the same mismatched braces and, where they are corrupt, to determine how much of the extractable text is affected by the fix I submitted.

Note that both WordPad and MS Word are able to open these files without issue, though that's to be expected.  I suspect they also just ignore the final block in these cases.  In fact, after opening the failed document and re-saving it in WordPad, the final partial block does indeed get truncated from the file.


Looking closer at the file in a text editor, the offending extra block at the end appears to be a partial replication of the final valid ending block in the file.  Perhaps an appropriate way to automatically handle these partially corrupted RTFs is to:
* detect whether they have more ending blocks than starting ones, and if so
   * check whether the final one is a partial replication of the prior one
   * and if it is, just ignore the final one.
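
The steps above could be approximated with a simpler rule, sketched below in Java: stop reading at the '}' that closes the outermost group and drop whatever follows. This is a swapped-in simplification, not the heuristic proposed above (it does not check whether the trailing block replicates the prior one) and not what the submitted patch does. The class name RtfTailTrimSketch and the method trimAfterTopLevelGroup are hypothetical, and the scan assumes braces are only ever escaped with a backslash.

```java
// Hypothetical helper; drops anything after the outermost RTF group closes.
final class RtfTailTrimSketch {

    // Returns the input cut just after the '}' that closes the outermost
    // group, or the input unchanged if that point is never reached.
    static String trimAfterTopLevelGroup(String rtf) {
        int depth = 0;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                // Skip the escaped character (\{ \} \\) or the first
                // letter of a control word; neither affects the depth.
                i++;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (depth > 0 && --depth == 0) {
                    return rtf.substring(0, i + 1);
                }
            }
        }
        return rtf;
    }
}
```

Applied to a file shaped like the one below, this would keep everything up to the first standalone '}' and discard the trailing partial block, which matches what re-saving in WordPad appeared to do.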


Last lines of the corrupted file:
...
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                
> [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
> ------------------------------------------------------------------
>
>                 Key: TIKA-733
>                 URL: https://issues.apache.org/jira/browse/TIKA-733
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Assignee: Michael McCandless
>              Labels: patch
>             Fix For: 1.0
>
>         Attachments: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch
>
>
> Parsing some RTF documents attempt to perform a removeLast() on the groupStates() list when the list is empty.  Added a check to not perform the logic when the list is empty, thus causing the restore group state to not be performed. Text extraction now completes without further down-stream errors.
> Unable to include sample file due to sensitive nature of file contents.
> StackTrace (TIKA-0.9)
> Caused by: java.util.NoSuchElementException
> 	at java.util.LinkedList.remove(LinkedList.java:788)
> 	at java.util.LinkedList.removeLast(LinkedList.java:144)
> 	at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
> 	at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
> 	at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 45 more


        

[jira] [Resolved] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved TIKA-733.
-------------------------------------

    Resolution: Fixed

Thanks Jeremy!
                

        

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119999#comment-13119999 ] 

Michael McCandless commented on TIKA-733:
-----------------------------------------

Thank you Jeremy!  Keep the patches coming ;)
                

        

[jira] [Updated] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Anderson updated TIKA-733:
---------------------------------

    Attachment: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch

Patch file
                

        

[jira] [Issue Comment Edited] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] 

Jeremy Anderson edited comment on TIKA-733 at 10/3/11 6:02 PM:
---------------------------------------------------------------

The problem is also present in the older 0.9 release.

Looking at the document as you suggested, it is corrupt/malformed in the sense that it contains more closing braces '}' than opening braces '{'.

That said, the text contained within the document still appears to be extractable using the patch I submitted, which ignores the group state once the list is empty.

My knowledge of the RTF format is rather limited, but is there perhaps a better compromise that would allow the parser to return the text it is able to get and maybe log a warning when a malformed RTF is encountered?

I have about 20 files in my load set that have hit this failure.  I haven't had time to investigate all of them yet to see whether they all fail because of the same mismatched braces and, where they are corrupt, to determine how much of the extractable text is affected by the fix I submitted.

Note that both WordPad and MS Word are able to open these files without issue, though that's to be expected.  I suspect they also just ignore the final block in these cases.  In fact, after opening the failed document and re-saving it in WordPad, the final partial block does indeed get truncated from the file.

Looking closer at the file in a text editor, the offending extra block at the end appears to be a partial replication of the final valid ending block in the file.  Perhaps an appropriate way to automatically handle these partially corrupted RTFs is to:
* detect whether they have more ending blocks than starting ones, and if so
* check whether the final one is a partial replication of the prior one
* and if it is, just ignore the final one.


Last lines of the corrupted file:
...
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                
                  

        

[jira] [Issue Comment Edited] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] 

Jeremy Anderson edited comment on TIKA-733 at 10/3/11 6:03 PM:
---------------------------------------------------------------

The problem is also present in the older 0.9 release.

Looking at the document as you suggested, it is corrupt/malformed in the sense that it contains more closing braces '}' than opening braces '{'.

That said, the text contained within the document still appears to be extractable using the patch I submitted, which ignores the group state once the list is empty.

My knowledge of the RTF format is rather limited, but is there perhaps a better compromise that would allow the parser to return the text it is able to get and maybe log a warning when a malformed RTF is encountered?

I have about 20 files in my load set that have hit this failure.  I haven't had time to investigate all of them yet to see whether they all fail because of the same mismatched braces and, where they are corrupt, to determine how much of the extractable text is affected by the fix I submitted.

Note that both WordPad and MS Word are able to open these files without issue, though that's to be expected.  I suspect they also just ignore the final block in these cases.  In fact, after opening the failed document and re-saving it in WordPad, the final partial block does indeed get truncated from the file.

Looking closer at the file in a text editor, the offending extra block at the end appears to be a partial replication of the final valid ending block in the file.  Perhaps an appropriate way to automatically handle these partially corrupted RTFs is to:
* detect whether they have more ending blocks than starting ones, and if so
* check whether the final one is a partial replication of the prior one
* and if it is, just ignore the final one.


Last lines of the corrupted file:
...
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                
                  

        

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119815#comment-13119815 ] 

Jeremy Anderson commented on TIKA-733:
--------------------------------------

Cool beans!! 

Thanks for your attention to it.  Yeah, I checked 18 of the other files experiencing this error, and all have corruption similar to the first one, although the amount of data in the final block varies widely, from a few characters to none.

Still, the patch I already submitted does appear to work for getting the text out of each of these corrupted documents.

Thanks again for adding it to the trunk.

                

        

[jira] [Assigned] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Michael McCandless (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-733:
---------------------------------------

    Assignee: Michael McCandless
    
> [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException
> ------------------------------------------------------------------
>
>                 Key: TIKA-733
>                 URL: https://issues.apache.org/jira/browse/TIKA-733
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>            Reporter: Jeremy Anderson
>            Assignee: Michael McCandless
>              Labels: patch
>             Fix For: 1.0
>
>         Attachments: TIKA-733-rtf_TextExtractor_processGroupEnd-NoSuchElementException.patch
>
>
> Parsing some RTF documents attempts to perform a removeLast() on the groupStates() list when the list is empty.  Added a check that skips this logic when the list is empty, so the group state is not restored in that case. Text extraction now completes without further downstream errors.
> Unable to include sample file due to sensitive nature of file contents.
> StackTrace (TIKA-0.9)
> Caused by: java.util.NoSuchElementException
> 	at java.util.LinkedList.remove(LinkedList.java:788)
> 	at java.util.LinkedList.removeLast(LinkedList.java:144)
> 	at org.apache.tika.parser.rtf.TextExtractor.processGroupEnd(TextExtractor.java:1010)
> 	at org.apache.tika.parser.rtf.TextExtractor.extract(TextExtractor.java:352)
> 	at org.apache.tika.parser.rtf.RTFParser.parse(RTFParser.java:53)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 45 more


        

[jira] [Issue Comment Edited] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] 

Jeremy Anderson edited comment on TIKA-733 at 10/3/11 6:04 PM:
---------------------------------------------------------------

The problem is also present in the older 0.9 release.

Looking at the document as you suggested, the document is corrupt/malformed in the sense that it contains more closing brackets '}' than opening brackets '{'.


That said, the text contained within the document still appears to be extractable using the patch I submitted, which ignores the group state once it is empty.

My knowledge of the RTF format is rather limited, but is there perhaps a better compromise that would allow the parser to return the text it is able to get and maybe log a warning when a malformed RTF is encountered?

I have about 20 files in my load set that have hit this failure.  I haven't had time to investigate all of them yet to see whether they all fail because of the same mismatched braces and, when they are corrupt, to determine how much of the extractable text is affected by the fix I submitted.

Note that both WordPad and MS Word are able to open these files without issue, though that's to be expected.  I expect that they may also just ignore the final block in these cases.  In fact, after opening the failed document and resaving it in WordPad, the final partial block does indeed get truncated from the file.

Looking closer at the file in a text editor, the culprit final extra block appears to be a partial replication of the final valid ending block in the file.  Perhaps an appropriate fix for automatically handling these partially corrupted RTFs is to:

* detect whether they have more ending blocks than starting blocks, and if so
* check whether the final one is a partial replication of the prior one
* and if so, just ignore the final one.
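
As a rough illustration of the first step in the list above (detecting more ending blocks than starting blocks), here is a minimal Java sketch. It is not part of the attached patch and is not Tika code; the class name RtfBraceCheck and the method scan are hypothetical. It assumes the whole RTF is available as a String and that any character following a backslash (as in \{, \} and \\) is not a group delimiter. It does not handle raw braces inside \bin data.

```java
// Hypothetical helper, not part of Tika: counts group delimiters in RTF text.
final class RtfBraceCheck {

    /**
     * Returns {opens, closes, firstSurplusClose}, where firstSurplusClose is
     * the offset of the first '}' that has no matching '{', or -1 if none.
     */
    static int[] scan(String rtf) {
        int opens = 0;
        int closes = 0;
        int depth = 0;
        int firstSurplusClose = -1;
        for (int i = 0; i < rtf.length(); i++) {
            char c = rtf.charAt(i);
            if (c == '\\') {
                // Skip the character after a backslash so that the escaped
                // literals \{ and \} are not counted as group delimiters.
                i++;
            } else if (c == '{') {
                opens++;
                depth++;
            } else if (c == '}') {
                closes++;
                if (depth == 0) {
                    // A close with nothing left to pop: the malformed case.
                    if (firstSurplusClose < 0) {
                        firstSurplusClose = i;
                    }
                } else {
                    depth--;
                }
            }
        }
        return new int[] {opens, closes, firstSurplusClose};
    }
}
```

If the third value is not -1, everything from the end of the last balanced group up to that offset is the candidate "extra block" that the remaining two steps would examine.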


Last lines of the corrupted file:
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                

        

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116331#comment-13116331 ] 

Michael McCandless commented on TIKA-733:
-----------------------------------------

Hmm, it makes me a little nervous just blindly not popping the group
state once it's empty since this could be masking a more serious bug.

I.e., it's possible we are not correctly tokenizing the open / close
group tokens.

The other explanation is that the RTF doc is corrupt (has too many
closing } vs open {).

Can you look at the doc and figure out if it's corrupt?

Does this RTF document work with older versions of Tika (before
TIKA-683 was committed)?
                

        

[jira] [Commented] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119462#comment-13119462 ] 

Michael McCandless commented on TIKA-733:
-----------------------------------------

Actually, I think we should just commit your patch: it's harmless for non-corrupt RTF docs, and for corrupt ones (with this particular corruption) it will make a best effort to extract what text it can.

I only wanted to confirm that you were hitting this because of document corruption and not a bug in how the new RTF parser tokenizes open/close groups.  Thanks!
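
For readers without access to the attachment, the kind of guard being discussed can be sketched roughly as follows. This is an illustration only, not the attached patch and not the actual TextExtractor code: the class GroupTracker, its GroupState type, and the unmatchedCloses counter are all hypothetical. The sketch uses an ArrayDeque, whose removeLast() throws NoSuchElementException on an empty deque just as LinkedList.removeLast() does in the stack trace.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for a parser that saves state on '{' and restores it on '}'.
final class GroupTracker {

    // Hypothetical per-group state; a real parser would track much more.
    static final class GroupState {
        final String font;

        GroupState(String font) {
            this.font = font;
        }
    }

    private final Deque<GroupState> saved = new ArrayDeque<>();
    private GroupState current = new GroupState("default");
    private int unmatchedCloses = 0;

    // Called for '{': save the current state and start a new one.
    void groupStart(String font) {
        saved.addLast(current);
        current = new GroupState(font);
    }

    // Called for '}': restore the saved state, if there is one.
    void groupEnd() {
        if (saved.isEmpty()) {
            // Surplus '}' in a corrupt document: keep the current state
            // instead of letting removeLast() throw NoSuchElementException.
            unmatchedCloses++;
            return;
        }
        current = saved.removeLast();
    }

    String currentFont() {
        return current.font;
    }

    int unmatchedCloses() {
        return unmatchedCloses;
    }
}
```

For a well-formed document the isEmpty() branch is never taken, so behaviour is unchanged. For a document with surplus closing braces, the counter gives a caller something to report as a warning if that is wanted.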
                

        

[jira] [Issue Comment Edited] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] 

Jeremy Anderson edited comment on TIKA-733 at 10/3/11 6:01 PM:
---------------------------------------------------------------

The problem is also present in the older 0.9 release.

Looking at the document as you suggested, the document is corrupt/malformed in the sense that it contains more closing brackets '}' than opening brackets '{'.

That said, the text contained within the document still appears to be extractable using the patch I submitted, which ignores the group state once it is empty.

My knowledge of the RTF format is rather limited, but is there perhaps a better compromise that would allow the parser to return the text it is able to get and maybe log a warning when a malformed RTF is encountered?

I have about 20 files in my load set that have hit this failure.  I haven't had time to investigate all of them yet to see whether they all fail because of the same mismatched braces and, when they are corrupt, to determine how much of the extractable text is affected by the fix I submitted.

Note that both WordPad and MS Word are able to open these files without issue, though that's to be expected.  I expect that they may also just ignore the final block in these cases.  In fact, after opening the failed document and resaving it in WordPad, the final partial block does indeed get truncated from the file.

Looking closer at the file in a text editor, the culprit final extra block appears to be a partial replication of the final valid ending block in the file.  Perhaps an appropriate fix for automatically handling these partially corrupted RTFs is to:

* detect whether they have more ending blocks than starting blocks, and if so
   * check whether the final one is a partial replication of the prior one
   * and if so, just ignore the final one.


Last lines of the corrupted file:
...
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                

        

[jira] [Issue Comment Edited] (TIKA-733) [PATCH] RTF TextExtractor processGroupEnd() NoSuchElementException

Posted by "Jeremy Anderson (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119441#comment-13119441 ] 

Jeremy Anderson edited comment on TIKA-733 at 10/3/11 6:06 PM:
---------------------------------------------------------------

(Sorry, I can't seem to get the post to maintain my newline characters :( )

The problem is also present in the older 0.9 release.


Looking at the document as you suggested, the document is corrupt/malformed in the sense that it contains more closing brackets '}' than opening brackets '{'.


That said, the text contained within the document still appears to be extractable using the patch I submitted, which ignores the group state once it is empty.

My knowledge of the RTF format is rather limited, but is there perhaps a better compromise that would allow the parser to return the text it is able to get and maybe log a warning when a malformed RTF is encountered?

I have about 20 files in my load set that have hit this failure.  I haven't had time to investigate all of them yet to see whether they all fail because of the same mismatched braces and, when they are corrupt, to determine how much of the extractable text is affected by the fix I submitted.

Note that both WordPad and MS Word are able to open these files without issue, though that's to be expected.  I expect that they may also just ignore the final block in these cases.  In fact, after opening the failed document and resaving it in WordPad, the final partial block does indeed get truncated from the file.

Looking closer at the file in a text editor, the culprit final extra block appears to be a partial replication of the final valid ending block in the file.  Perhaps an appropriate fix for automatically handling these partially corrupted RTFs is to:

* detect whether they have more ending blocks than starting blocks, and if so
* check whether the final one is a partial replication of the prior one
* and if so, just ignore the final one.



Last lines of the corrupted file:
\pard\li360____VALID RTF FILE TEXT _____\line\par
\pard\par
\pard\fi-1800\li1800\tx1800\cf1\f0\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf1\b0\f0\par
}
 0\li1800\tx1800\cf2\f2\fs20\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\pard\fi-1800\li1800\tx1800\cf2\b0\f2\par
\pard\cf0\b\f3\par
\par
}
 
                