Posted to dev@tika.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2018/09/13 17:16:00 UTC

[jira] [Commented] (TIKA-2627) Exception thrown when max string length is reached

    [ https://issues.apache.org/jira/browse/TIKA-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613814#comment-16613814 ] 

Dmitry Goldenberg commented on TIKA-2627:
-----------------------------------------

I agree, something is definitely wrong here. The whole point of the limit is to drop any excess text.

 
{code:java}
// In org/apache/tika/sax/WriteOutContentHandler
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (writeLimit == -1 || writeCount + length <= writeLimit) {
            super.characters(ch, start, length);
            writeCount += length;
        } else {
            super.characters(ch, start, writeLimit - writeCount);
            writeCount = writeLimit;
            throw new WriteLimitReachedException(
                    "Your document contained more than " + writeLimit
                    + " characters, and so your requested limit has been"
                    + " reached. To receive the full text of the document,"
                    + " increase your limit. (Text up to the limit is"
                    + " however available).", tag);
        }
    }
{code}
This should not throw; at most, it should log a warning and keep going.
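
To make the suggestion concrete, here is a minimal sketch of the "drop the excess and keep going" behavior. It is not Tika code: TruncatingHandler and its methods are hypothetical names, and it buffers text in a StringBuilder on top of the JDK's DefaultHandler rather than delegating to a wrapped handler the way WriteOutContentHandler does.

```java
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not part of Tika): keeps at most writeLimit
 * characters and silently drops the rest instead of throwing.
 */
class TruncatingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // -1 means unlimited, as in the snippet above
    private boolean truncated = false;

    TruncatingHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (writeLimit == -1) {
            text.append(ch, start, length);
            return;
        }
        int remaining = writeLimit - text.length();
        if (length <= remaining) {
            text.append(ch, start, length);
        } else {
            // Keep only what still fits, then ignore everything after it.
            if (remaining > 0) {
                text.append(ch, start, remaining);
            }
            truncated = true; // a real handler might log a warning here, once
        }
    }

    boolean isTruncated() {
        return truncated;
    }

    String getText() {
        return text.toString();
    }
}
```

A caller would check isTruncated() after parsing instead of catching an exception. One trade-off to keep in mind: with this approach the parser still walks the rest of the document, whereas throwing stops parsing as soon as the limit is hit, which may be why the current code throws.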

> Exception thrown when max string length is reached
> --------------------------------------------------
>
>                 Key: TIKA-2627
>                 URL: https://issues.apache.org/jira/browse/TIKA-2627
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Windows 2012 R2
> Java 1.8.0_151
>            Reporter: Caleb Ott
>            Priority: Major
>         Attachments: ExceptionStacktrace.txt
>
>
> I have set the max string length and expected tika to parse up to that limit then return me the text. However, for certain files it appears that once that limit is reached, instead of returning the text parsed so far, it is throwing an exception.
> It looks like the WriteLimitReachedException is being wrapped in another exception which is why it is not being caught.
> Attached is the stack trace I am getting.


