Posted to dev@tika.apache.org by "Klaus v. Einem (Created) (JIRA)" <ji...@apache.org> on 2012/03/22 13:04:22 UTC

[jira] [Created] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

HtmlParser sometimes(!) throws IOException while determining Html-Encoding
--------------------------------------------------------------------------

                 Key: TIKA-881
                 URL: https://issues.apache.org/jira/browse/TIKA-881
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.0
         Environment: Windows7, JDK1.5, JDK1.6
            Reporter: Klaus v. Einem


Attention, this is an ugly one: a non-deterministic bug. It fails only about 1 time in 10.

java.io.IOException: Resetting to invalid mark
	at java.io.BufferedInputStream.reset(Unknown Source)
	at org.apache.tika.io.ProxyInputStream.reset(ProxyInputStream.java:168)
	at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:92)
	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:188)

In the getEncoding() method: so that the input stream can be re-read, the current read position is marked and a readlimit (the maximum number of bytes that may be read before the mark position is invalidated) is given.

So far so good, but then an InputStreamReader comes into play. When you check its API doc, you see this:
 * ...
 * To enable the efficient conversion of bytes to characters, more bytes may
 * be read ahead from the underlying stream than are necessary to satisfy the
 * current read operation.
 * ...

Please note the word "may". When this read-ahead happens, the subsequent reset() on the stream throws the exception because the mark position has been invalidated (the number of bytes read exceeds the readlimit).
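
The failure mode described above can be reproduced in isolation. The following is only a minimal sketch, not the Tika code: the class and method names are made up for illustration, and the tiny 16-byte buffer and readlimit are assumptions chosen so that the limit is easy to exceed. It marks a BufferedInputStream, consumes bytes either directly or through an InputStreamReader, and then reports whether reset() still works.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from Tika.
class MarkResetDemo {

    // Marks the stream with the given readlimit, consumes bytesToRead bytes
    // one at a time, then reports whether reset() still works.
    static boolean resetWorksAfterDirectRead(byte[] data, int readLimit, int bytesToRead)
            throws IOException {
        // Buffer size equals readLimit, so the buffer cannot retain more
        // than readLimit bytes after the mark.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), readLimit);
        in.mark(readLimit);
        for (int i = 0; i < bytesToRead; i++) {
            in.read();
        }
        return tryReset(in);
    }

    // Same setup, but a single character is read through an InputStreamReader,
    // which is allowed to pull more bytes from the stream than it needs.
    static boolean resetWorksAfterReaderRead(byte[] data, int readLimit) throws IOException {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), readLimit);
        in.mark(readLimit);
        Reader reader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        reader.read();
        return tryReset(in);
    }

    // Returns true if reset() succeeds, false if it throws an IOException
    // (for example "Resetting to invalid mark").
    private static boolean tryReset(InputStream in) {
        try {
            in.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64];
        System.out.println("16 bytes read directly:       "
                + resetWorksAfterDirectRead(data, 16, 16));
        System.out.println("17 bytes read directly:       "
                + resetWorksAfterDirectRead(data, 16, 17));
        // The outcome here depends on how far the reader reads ahead.
        System.out.println("1 char via InputStreamReader: "
                + resetWorksAfterReaderRead(data, 16));
    }
}
```

Reading directly, the caller controls exactly how many bytes are consumed. With the reader in between, the number of bytes pulled from the stream is up to the reader, which is what makes the readlimit hard to honor.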



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Ken Krugler (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-881:
--------------------------------

    Assignee: Ken Krugler
    
> HtmlParser sometimes(!) throws IOException while determining Html-Encoding
> --------------------------------------------------------------------------
>
>                 Key: TIKA-881
>                 URL: https://issues.apache.org/jira/browse/TIKA-881
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.0
>         Environment: Windows7, JDK1.5, JDK1.6
>            Reporter: Klaus v. Einem
>            Assignee: Ken Krugler
>              Labels: stability
>         Attachments: BugfixHtmlParser.java, HtmlParser.java
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>


        

[jira] [Issue Comment Edited] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Klaus v. Einem (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235527#comment-13235527 ] 

Klaus v. Einem edited comment on TIKA-881 at 3/22/12 1:17 PM:
--------------------------------------------------------------

BugfixHtmlParser.java: This is my workaround. Sorry, the comments are in German. The key is: no InputStreamReader, no cry! Read into a *byte* array and decode it afterwards with the String constructor.
                
      was (Author: v.einem):
    This is my Solution... Sorry, Comments are in German. The Key is: No InputStreamReader, no Cry! Reading a *bytes* array and decoding (afterwards) with the String constructor.
                  

        

[jira] [Issue Comment Edited] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Klaus v. Einem (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235527#comment-13235527 ] 

Klaus v. Einem edited comment on TIKA-881 at 3/22/12 1:25 PM:
--------------------------------------------------------------

BugfixHtmlParser.java: This is my workaround. Sorry, the comments are in German. The key is: no InputStreamReader, no cry! Read into a *byte* array and decode it afterwards with the String constructor.

To get this up and running, you have to copy the two source files HtmlHandler.java and XHTMLDowngradeHandler from the Tika sources (package: org.apache.tika.parser.html) into the package where BugfixHtmlParser.java lives. Why? Because they are package-private.
                
      was (Author: v.einem):
    BugfixHtmlParser.java: This is my Workaround... Sorry, Comments are in German. The Key is: No InputStreamReader, no Cry! Reading a *bytes* array and decoding (afterwards) with the String constructor.
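
For readers without access to the attachment, the idea of the workaround can be sketched roughly as follows. This is not the attached BugfixHtmlParser.java; the class name, the method name peekHead, and the choice of US-ASCII in the example are assumptions made for illustration. The point is that at most `limit` bytes are read after mark(limit), so the mark cannot be invalidated, and decoding happens afterwards on the byte array.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of Tika.
class HeadBytesDemo {

    // Reads up to `limit` bytes from the start of the stream, rewinds the
    // stream, and returns the bytes decoded with the given charset.
    static String peekHead(InputStream in, int limit, Charset charset) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("mark/reset not supported by this stream");
        }
        in.mark(limit);
        byte[] head = new byte[limit];
        int total = 0;
        // Never request more than the bytes still allowed by the readlimit.
        while (total < limit) {
            int n = in.read(head, total, limit - total);
            if (n == -1) {
                break; // end of stream before the limit was reached
            }
            total += n;
        }
        in.reset();
        // Decode afterwards, without an InputStreamReader in between.
        return new String(head, 0, total, charset);
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(html));
        System.out.println(peekHead(in, 16, StandardCharsets.US_ASCII));
        // The stream is back at its start, so a parser can read it from the beginning.
        System.out.println((char) in.read());
    }
}
```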
                  

        

[jira] [Assigned] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-881:
--------------------------------

    Assignee:     (was: Ken Krugler)

Looks like an InputStream issue, not something with HtmlParser. Input streams should get "wrapped" by Tika such that a reset() will always work.
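
The comment does not spell out what such wrapping would look like. Purely as an illustration, and explicitly not Tika's actual stream wrapper, here is one hypothetical way to keep reset() safe while still using an InputStreamReader: hand the reader a bounded view of the stream that reports end-of-stream after `limit` bytes, so that any read-ahead cannot go past the readlimit. All class and method names below are invented for this sketch.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
class BoundedPeekDemo {

    // Exposes at most `limit` bytes of the wrapped stream, then reports EOF.
    static final class BoundedInputStream extends FilterInputStream {
        private int remaining;

        BoundedInputStream(InputStream in, int limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = super.read();
            if (b != -1) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            // Cap the request so the reader cannot pull more than `remaining` bytes.
            int n = super.read(buf, off, Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }

        @Override
        public void close() {
            // Intentionally leaves the underlying stream open.
        }
    }

    // Decodes up to `limit` bytes as US-ASCII through a reader, then rewinds `in`.
    static String peekChars(InputStream in, int limit) throws IOException {
        in.mark(limit);
        Reader reader = new InputStreamReader(
                new BoundedInputStream(in, limit), StandardCharsets.US_ASCII);
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[256];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        in.reset();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789abcdefghijklmnopqrstuvwxyz"
                .getBytes(StandardCharsets.US_ASCII);
        // Buffer size and limit are both 16, the tightest case for the mark.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data), 16);
        System.out.println(peekChars(in, 16));
        System.out.println((char) in.read());
    }
}
```

Whether capping the reader like this or avoiding the reader altogether is preferable is a design choice; the sketch only shows that the read-ahead can be contained.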
                

        

[jira] [Updated] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Klaus v. Einem (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaus v. Einem updated TIKA-881:
--------------------------------

    Attachment: HtmlParser.java

OK, this is the 100% original source code with the bugfix included.
                

        

[jira] [Commented] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Ken Krugler (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235701#comment-13235701 ] 

Ken Krugler commented on TIKA-881:
----------------------------------

Hi Klaus - thanks for debugging this. I'll take a look at your patch over the next few days.
                

        

[jira] [Updated] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Klaus v. Einem (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaus v. Einem updated TIKA-881:
--------------------------------

    Attachment: BugfixHtmlParser.java

This is my solution. Sorry, the comments are in German. The key is: no InputStreamReader, no cry! Read into a *byte* array and decode it afterwards with the String constructor.
                

        

[jira] [Commented] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432218#comment-13432218 ] 

Ken Krugler commented on TIKA-881:
----------------------------------

I've asked Jukka to look into this. From my email to tika-dev:

{quote}
The fix that Klaus provided avoids using reset() on the input stream.

But I thought that Tika tries to wrap streams such that a reset() will work properly, as otherwise auto detection of content can fail.

I haven't had to dig into all of the tricky issues around stream management, so I'm hoping you can take a look at Klaus's report and provide commentary.
{quote}
                

        

[jira] [Issue Comment Edited] (TIKA-881) HtmlParser sometimes(!) throws IOException while determining Html-Encoding

Posted by "Klaus v. Einem (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235562#comment-13235562 ] 

Klaus v. Einem edited comment on TIKA-881 at 3/22/12 1:15 PM:
--------------------------------------------------------------

HtmlParser.java: This is the 100% original source code with the bugfix included.
                
      was (Author: v.einem):
    OK, this is 100% original sourcecode with Bugfix included.
                  