Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2010/03/18 10:30:27 UTC

[jira] Created: (TIKA-388) Don't trust streams that claim mark support

Don't trust streams that claim mark support
-------------------------------------------

                 Key: TIKA-388
                 URL: https://issues.apache.org/jira/browse/TIKA-388
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Jukka Zitting
            Priority: Minor


As seen on tika-dev@ and in JCR-2576, there are some InputStream implementations that claim to support the mark feature, but lose the mark as soon as the end of stream has been reached. There's no way for a client to detect such behaviour, so it's probably best for Tika to always use BufferedInputStream to wrap incoming streams when mark support is needed. This may cause one layer of extra buffering, but avoids problems with such broken streams.
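To make the failure mode concrete, here is a minimal sketch rather than Tika's actual code. The class names MarkSafeStreams and LosesMarkAtEof and the helpers withReliableMark and peek are hypothetical, invented for this illustration. LosesMarkAtEof simulates a stream that reports markSupported() == true but can no longer reset() once end of stream has been seen; wrapping it in java.io.BufferedInputStream sidesteps the problem because the wrapper serves reset() from its own buffer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MarkSafeStreams {

    /** Hypothetical broken stream: claims mark support, loses the mark at EOF. */
    static final class LosesMarkAtEof extends FilterInputStream {
        private boolean markLost = false;

        LosesMarkAtEof(byte[] data) {
            super(new ByteArrayInputStream(data));
        }

        @Override
        public int read() throws IOException {
            int b = super.read();
            if (b == -1) markLost = true;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int r = super.read(buf, off, len);
            if (r == -1) markLost = true;
            return r;
        }

        @Override
        public synchronized void reset() throws IOException {
            if (markLost) throw new IOException("mark lost at end of stream");
            super.reset();
        }
    }

    /** Always wrap, without consulting markSupported() on the incoming stream. */
    static InputStream withReliableMark(InputStream in) {
        return new BufferedInputStream(in);
    }

    /** Reads up to n bytes (e.g. for type detection), then rewinds to the mark. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r == -1) break; // end of stream reached before n bytes
                total += r;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.US_ASCII);
        try {
            peek(new LosesMarkAtEof(data), 16);
        } catch (IOException e) {
            System.out.println("unwrapped: " + e.getMessage());
        }
        InputStream safe = withReliableMark(new LosesMarkAtEof(data));
        System.out.println("wrapped: " + new String(peek(safe, 16), StandardCharsets.US_ASCII));
    }
}
```

The cost is the extra buffering layer mentioned above; the benefit is that the caller no longer depends on how faithfully the underlying stream implements mark/reset.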

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-388) Don't trust streams that claim mark support

Posted by "Daan de Wit (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847509#action_12847509 ] 

Daan de Wit commented on TIKA-388:
----------------------------------

I did not test it, and it might be a premature optimization, but wouldn't it be better to check if the stream is already a BufferedInputStream?
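A rough sketch of the check suggested here, with the hypothetical names BufferOnce and ensureBuffered (this is not code from Tika):

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

final class BufferOnce {

    /**
     * Returns the stream unchanged if it is already a BufferedInputStream,
     * otherwise wraps it. Avoids a second layer of buffering.
     */
    static InputStream ensureBuffered(InputStream in) {
        if (in instanceof BufferedInputStream) {
            return in;
        }
        return new BufferedInputStream(in);
    }
}
```

One caveat: instanceof also accepts subclasses of BufferedInputStream, which could override mark/reset. Comparing in.getClass() with BufferedInputStream.class would be the stricter variant if that is a concern.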


[jira] Commented: (TIKA-388) Don't trust streams that claim mark support

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846937#action_12846937 ] 

Chris A. Mattmann commented on TIKA-388:
----------------------------------------

+1! I've run into this issue myself, and the overhead IMHO is worth it for the ease of use...


[jira] Resolved: (TIKA-388) Don't trust streams that claim mark support

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-388.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.7
         Assignee: Jukka Zitting

As of revision 925217, the AutoDetectParser wraps every incoming stream in a BufferedInputStream, regardless of whether the stream claims mark support. Resolving as fixed.
