Posted to dev@tika.apache.org by "Jerome Lacoste (Created) (JIRA)" <ji...@apache.org> on 2011/12/17 11:36:30 UTC

[jira] [Created] (TIKA-815) Tika parsers should handle failures more gracefully

Tika parsers should handle failures more gracefully
---------------------------------------------------

                 Key: TIKA-815
                 URL: https://issues.apache.org/jira/browse/TIKA-815
             Project: Tika
          Issue Type: Test
          Components: parser
    Affects Versions: 1.0
            Reporter: Jerome Lacoste


We encountered an OOM while parsing a Word document. We will report the failure to POI.

This raises the question about the general robustness of the parsers.

We've written a little test tool that reproduces the aforementioned OOM and other potential issues, which will be reported to the individual parsers. It's the responsibility of the parsers to handle those failures gracefully.

Yet it's easy to write generic tools at the Tika level to run this kind of test.

So we are also submitting this issue here to start a discussion on what role Tika should have when it comes to validating its parsers.

Code here: https://github.com/lacostej/tika-hardener
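
As an illustration of the kind of generic check described above, here is a minimal sketch in plain Java. It is not the tika-hardener code and it does not use Tika's API: StreamParser, Outcome, TOY_PARSER, runGuarded, truncate and flipBytes are all hypothetical names made up for this example. The idea is simply to feed a parser damaged copies of a valid input and classify how it fails. Note that catching an OutOfMemoryError in-process is best-effort only; the JVM may already be in a bad state by then.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ParserHardeningSketch {

    /** Hypothetical stand-in for a parser entry point; this is NOT Tika's Parser interface. */
    interface StreamParser {
        void parse(InputStream in) throws Exception;
    }

    enum Outcome { OK, EXPECTED_FAILURE, UNEXPECTED_FAILURE, FATAL_ERROR, TIMEOUT }

    /** Toy parser: the first byte declares the payload length, the rest is the payload. */
    static final StreamParser TOY_PARSER = in -> {
        int declared = in.read();
        if (declared < 0) {
            throw new IOException("empty input");
        }
        // readFully throws EOFException (an IOException) if the payload is cut short.
        new DataInputStream(in).readFully(new byte[declared]);
    };

    /** Runs one parse on a worker thread and classifies the result. */
    static Outcome runGuarded(StreamParser parser, byte[] input, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(task -> {
            Thread worker = new Thread(task, "guarded-parse");
            worker.setDaemon(true); // a hung parse must not keep the JVM alive
            return worker;
        });
        Future<Outcome> result = executor.submit(() -> {
            try {
                parser.parse(new ByteArrayInputStream(input));
                return Outcome.OK;
            } catch (IOException e) {
                return Outcome.EXPECTED_FAILURE;   // a graceful, declared failure
            } catch (Exception e) {
                return Outcome.UNEXPECTED_FAILURE; // e.g. an unchecked exception leaking out
            } catch (Error e) {
                return Outcome.FATAL_ERROR;        // e.g. OutOfMemoryError, StackOverflowError
            }
        });
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);
            return Outcome.TIMEOUT;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.UNEXPECTED_FAILURE;
        } catch (ExecutionException e) {
            return Outcome.UNEXPECTED_FAILURE;
        } finally {
            executor.shutdownNow();
        }
    }

    /** Returns a copy of the input cut down to at most the given length. */
    static byte[] truncate(byte[] data, int length) {
        return Arrays.copyOf(data, Math.min(length, data.length));
    }

    /** Returns a copy with bytes inverted at pseudo-random positions (positions may repeat). */
    static byte[] flipBytes(byte[] data, int count, long seed) {
        byte[] copy = data.clone();
        Random random = new Random(seed);
        for (int i = 0; i < count && copy.length > 0; i++) {
            copy[random.nextInt(copy.length)] ^= (byte) 0xFF;
        }
        return copy;
    }

    public static void main(String[] args) {
        byte[] valid = {4, 10, 20, 30, 40};
        for (int length = 0; length <= valid.length; length++) {
            Outcome outcome = runGuarded(TOY_PARSER, truncate(valid, length), 1000);
            System.out.println("truncated to " + length + " bytes: " + outcome);
        }
        System.out.println("one byte flipped: " + runGuarded(TOY_PARSER, flipBytes(valid, 1, 42L), 1000));
    }
}
```

In a real harness the parser under test would replace TOY_PARSER, and anything other than OK or EXPECTED_FAILURE would be a candidate for a bug report against that parser.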

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-815) Tika parsers should handle failures more gracefully

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171516#comment-13171516 ] 

Nick Burch commented on TIKA-815:
---------------------------------

FYI, Tika does provide the ForkParser for cases where you want to ensure that parsing can't affect the parent application.
                

        

[jira] [Commented] (TIKA-815) Tika parsers should handle failures more gracefully

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175243#comment-13175243 ] 

Nick Burch commented on TIKA-815:
---------------------------------

For people with strong stability requirements, we provide the ForkParser.

For everyone else, we suggest they report bugs when they hit issues, and ideally help work with us + the upstream libraries to fix things :)
                

        

[jira] [Resolved] (TIKA-815) Tika parsers should handle failures more gracefully

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-815.
--------------------------------

    Resolution: Duplicate

Resolving this as a duplicate of all the followup issues mentioned above.
                

        

[jira] [Commented] (TIKA-815) Tika parsers should handle failures more gracefully

Posted by "Jerome Lacoste (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174827#comment-13174827 ] 

Jerome Lacoste commented on TIKA-815:
-------------------------------------

Agreed. Yet improving the default parsers might still be a good idea.

Note: I haven't yet managed to get the ForkParser working, so I will mail the user list.
                

        

[jira] [Commented] (TIKA-815) Tika parsers should handle failures more gracefully

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404513#comment-13404513 ] 

Jukka Zitting commented on TIKA-815:
------------------------------------

Would you be interested in contributing the tika-hardener codebase to Tika itself? It would make a great addition to our existing test suite.
                

        

[jira] [Commented] (TIKA-815) Tika parsers should handle failures more gracefully

Posted by "Jerome Lacoste (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175403#comment-13175403 ] 

Jerome Lacoste commented on TIKA-815:
-------------------------------------

> For people with strong stability requirements, we provide the ForkParser

To get the ForkParser to work, I've had to make 6 patches... And I haven't yet stress tested it. That makes me wary of using it in production!

Please fix TIKA-808, TIKA-827 (optional), TIKA-828, TIKA-829, TIKA-830, TIKA-831 in that order.

0001-TIKA-808-tika-doesn-t-parse-PDF-file.-The-issue-is-c.patch
0002-TIKA-827-try-to-report-something-if-the-exception-is.patch  (optional)
0003-TIKA-828-make-sure-the-exceptions-thrown-by-TaggedIn.patch
0004-TIKA-829-make-sure-tika-identifies-invalid-arguments.patch
0005-TIKA-830-Tike.parseToString-caused-ForkParser-to-try.patch
0006-TIKA-830-refactor-tests-for-clarity.patch
0007-TIKA-831-fix-for-errors-not-being-reported-properly-.patch

Thanks

                

        

[jira] [Commented] (TIKA-815) Tika parsers should handle failures more gracefully

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418425#comment-13418425 ] 

Chris A. Mattmann commented on TIKA-815:
----------------------------------------

Hi Jerome: what do you think about contributing the Tika hardener? I'm +1 to Jukka's suggestion on that and we'd love to have you helping out and appreciate your contributions so far!
                