Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2019/01/03 17:45:00 UTC

[jira] [Comment Edited] (TIKA-2787) Make WriteLimitReachedException public and not subclass of SAXException

    [ https://issues.apache.org/jira/browse/TIKA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733276#comment-16733276 ] 

Tim Allison edited comment on TIKA-2787 at 1/3/19 5:44 PM:
-----------------------------------------------------------

bq. That will probably work better than having the exception public.

I agree generally, and I'm grateful for your example above. Within the RecursiveParserWrapper (and somewhere else?), we have special handling for WriteLimitReachedException even though we don't necessarily have access to the WriteOutContentHandler (it may be wrapped in another handler). So, for this use case, it would be useful to have a public exception.
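
To illustrate the kind of check I have in mind, here is a minimal sketch, not actual Tika code. The class WriteLimitCheck and the method isWriteLimitReached are made up for the example, and the nested WriteLimitReachedException is only a stand-in for a hypothetical public version of the exception. With a public type, a caller could walk the cause chain instead of needing a reference to the handler that threw:

```java
import org.xml.sax.SAXException;

public class WriteLimitCheck {

    // Hypothetical stand-in for a public write-limit exception; not the real Tika class.
    public static class WriteLimitReachedException extends SAXException {
        public WriteLimitReachedException(String message) {
            super(message);
        }
    }

    // Walks the cause chain so the check still works when another handler
    // or parser has wrapped the original exception.
    public static boolean isWriteLimitReached(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 100) { // bound guards against cyclic cause chains
            if (t instanceof WriteLimitReachedException) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    public static void main(String[] args) {
        SAXException wrapped =
                new SAXException("wrapped", new WriteLimitReachedException("limit reached"));
        System.out.println(isWriteLimitReached(wrapped));                   // prints true
        System.out.println(isWriteLimitReached(new SAXException("other"))); // prints false
    }
}
```

This only works if the exception type is visible to the caller, which is the point of making it public.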

I've been wondering if we might consider having a TikaSAXException (extends SAXException, or maybe RuntimeException?) that we can use for what are actually parse exceptions but are thrown in code that overrides methods declared to throw only SAXException. If we went this route, we could have WriteLimitReachedException extend TikaSAXException.
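
A rough sketch of that hierarchy, assuming the SAXException variant. Nothing here exists in Tika as written: TikaSaxExceptionSketch, LimitedHandler and extract are invented for the example, and the exception classes are only illustrations of the proposal:

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TikaSaxExceptionSketch {

    // Hypothetical base type for Tika-specific problems raised inside SAX callbacks.
    public static class TikaSAXException extends SAXException {
        public TikaSAXException(String message) {
            super(message);
        }
    }

    // Hypothetical public write-limit exception built on that base type.
    public static class WriteLimitReachedException extends TikaSAXException {
        public WriteLimitReachedException(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    // Toy handler: keeps at most `limit` characters, then signals the limit.
    public static class LimitedHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        public LimitedHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                text.append(ch, start, room); // keep what still fits
                throw new WriteLimitReachedException(limit);
            }
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Feeds the input to the handler and returns whatever text was kept.
    public static String extract(String input, int limit) {
        LimitedHandler handler = new LimitedHandler(limit);
        char[] chars = input.toCharArray();
        try {
            handler.characters(chars, 0, chars.length);
        } catch (WriteLimitReachedException e) {
            // Expected: the text gathered so far is still usable.
        } catch (SAXException e) {
            throw new IllegalStateException("Unexpected SAX failure", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("hello world", 5)); // prints hello
    }
}
```

Because the exception is still a SAXException, the overriding characters() method can throw it without changing the ContentHandler signature. If the base type extended RuntimeException instead, it would be unchecked and callers would not be forced to handle it.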


was (Author: tallison@mitre.org):
bq. That will probably work better than having the exception public.

I agree generally, and I'm grateful for your example above. Within the RecursiveParserWrapper (and somewhere else?), we have special handling for WriteLimitReachedException even though we don't necessarily have access to the WriteOutContentHandler (it may be wrapped in another handler). So, for this use case, it would be useful to have a public exception.

I've been wondering if we might consider having a TikaSAXException (extends SAXException) that we can use for what are actually parse exceptions but are thrown in code that overrides methods declared to throw only SAXException. If we went this route, we could have WriteLimitReachedException extend TikaSAXException.

> Make WriteLimitReachedException public and not subclass of SAXException
> -----------------------------------------------------------------------
>
>                 Key: TIKA-2787
>                 URL: https://issues.apache.org/jira/browse/TIKA-2787
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.19.1
>            Reporter: Dmitry Goldenberg
>            Priority: Major
>
> The idea behind being able to set a limit on text extraction is to be able to get up to N characters extracted back. We just got tripped up by the fact that Tika throws an exception once the limit has been reached.
> This, in and of itself, is not a major hindrance, especially since the error message itself clearly states that the extracted text is, "however, available".
> OK, but why is WriteLimitReachedException private? Why not make it public so it can be caught explicitly when the parse() method is called? And why not add it to the signature of the parse method? I don't think it should extend SAXException, either; just cleanly throw it as is.
> Right now, our code makes this cumbersome adjustment around the condition:
> {code:java}
> ContentHandler handler = new BodyContentHandler(limit); // <-- e.g. set to 1000000
> try {
>     parser.parse(dataStream, handler, metadata, parseCtx);
> } catch (IOException | TikaException ex) {
>     throw ex;
> } catch (SAXException ex) {
>     String message = (ex.getMessage() == null) ? "" : ex.getMessage();
>     if (!message.contains("Your document contained more than")) {
>         throw new TikaException("Tika error has occurred.", ex);
>     } else {
>         log.warn("TE limit reached on file {}.", filePath);
>     }
> }
> // Keep the extracted text regardless of WriteLimitReachedException
> String text = handler.toString();
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)