Posted to dev@tika.apache.org by "Andrew Khoury (JIRA)" <ji...@apache.org> on 2010/05/27 23:09:41 UTC

[jira] Created: (TIKA-434) Bug in TagSoup causes IOException

Bug in TagSoup causes IOException
---------------------------------

                 Key: TIKA-434
                 URL: https://issues.apache.org/jira/browse/TIKA-434
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.6
            Reporter: Andrew Khoury


When uploading documents to a Jackrabbit 2.1 repository, the following exception was thrown.  It looks like a bug in TagSoup 1.2 (a search of the TagSoup Yahoo group suggests that it may be caused by '&' characters in the HTML being parsed):
27.05.2010 14:57:18 *WARN * LazyTextExtractorField: Failed to extract text from a binary property (LazyTextExtractorField.java, line 180)
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.html.HtmlParser@eba477
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
       at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
       at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
       at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
       at java.util.concurrent.FutureTask.run(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
       at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: Pushback buffer overflow
       at java.io.PushbackReader.unread(Unknown Source)
       at org.ccil.cowan.tagsoup.HTMLScanner.unread(HTMLScanner.java:274)
       at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:487)
       at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
       at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:177)
       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
       ... 10 more
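
For context on the root cause at the bottom of the trace: java.io.PushbackReader throws this IOException when more characters are pushed back than its buffer can hold. The following is only a minimal sketch of that JDK behaviour, not TagSoup's code. The class name PushbackOverflowDemo and the input string are made up for illustration, and the one-character buffer is the JDK default, chosen to make the failure easy to trigger; the buffer size TagSoup actually uses is not shown in this report.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Hypothetical demo class; it does not use TagSoup at all.
class PushbackOverflowDemo {

    // Reads two characters, then tries to push both back into a reader
    // whose pushback buffer holds only one character (the JDK default).
    // Returns the message of the resulting IOException, or null if none.
    static String overflowMessage() {
        PushbackReader reader = new PushbackReader(new StringReader("&\rx"));
        try {
            int first = reader.read();   // '&'
            int second = reader.read();  // '\r'
            reader.unread(second);       // fills the one-character buffer
            reader.unread(first);        // no room left, so this throws
            return null;
        } catch (IOException e) {
            // Reading from a StringReader does not fail here, so the only
            // IOException expected is the one raised by the second unread().
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(overflowMessage());
    }
}
```

A scanner that needs to back up over several characters at once must therefore size its pushback buffer for the longest such sequence, which is presumably where HTMLScanner.unread runs into trouble here.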

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-434) Bug in TagSoup causes IOException

Posted by "Andrew Khoury (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Khoury updated TIKA-434:
-------------------------------

    Comment: was deleted

(was: Sample html file that causes the error.)




[jira] Commented: (TIKA-434) Bug in TagSoup causes IOException

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876454#action_12876454 ] 

Jukka Zitting commented on TIKA-434:
------------------------------------

I came up with a fairly simple patch [1] that seems to solve this. I'll see what we can do to push out an official release with this fix.

[1] http://github.com/jukka/tagsoup/commit/9cfe7b48745173faafa419f540538a0b6309b699




[jira] Updated: (TIKA-434) Bug in TagSoup causes IOException

Posted by "Andrew Khoury (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Khoury updated TIKA-434:
-------------------------------

    Attachment: breezycove.html

Sample html file that causes the error.




[jira] Updated: (TIKA-434) Bug in TagSoup causes IOException

Posted by "Andrew Khoury (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Khoury updated TIKA-434:
-------------------------------

    Attachment:     (was: breezycove.html)




[jira] Updated: (TIKA-434) Bug in TagSoup causes IOException

Posted by "Andrew Khoury (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Khoury updated TIKA-434:
-------------------------------

    Attachment: html_to_reproduce_issue.html




[jira] Commented: (TIKA-434) Bug in TagSoup causes IOException

Posted by "Andrew Khoury (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872388#action_12872388 ] 

Andrew Khoury commented on TIKA-434:
------------------------------------

Here are some related posts from the TagSoup user groups:
http://groups.google.com/group/tagsoup-friends/browse_thread/thread/751d271c107a24a9#

http://webcache.googleusercontent.com/search?q=cache:M2F_jS2hLVwJ:tech.groups.yahoo.com/group/tagsoup-friends/message/1250+%22Yes,+it+should+be+handled+%28and+returned+as+a+raw+%26,+to+be+escaped%22+on+output+as+%26amp%3B&cd=1&hl=en&ct=clnk&gl=us

Evidently the bug occurs when the document contains an '&' followed by a carriage return (CR).  When all CRs are transliterated to line feeds (LFs), TagSoup parses the document properly.

As TagSoup has no official bug tracker or release tracking system, there is no way to know when this bug will be fixed.  That is why I'm submitting it here: it causes parsing to fail in Apache Tika.
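
Until a fixed TagSoup is available, the CR-to-LF transliteration mentioned above could be applied to the input before it reaches the parser. The following is only a rough sketch of that workaround under my own assumptions: the class CrToLfReader and its normalize helper are hypothetical names, not part of Tika or TagSoup, and I have not verified this against the sample file. Note that it replaces each CR character for character, so a CRLF pair becomes two LFs.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical wrapper: replaces every carriage return with a line feed
// as characters are read from the underlying reader.
class CrToLfReader extends FilterReader {

    CrToLfReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '\r' ? '\n' : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == '\r') {
                buf[i] = '\n';
            }
        }
        return n;
    }

    // Convenience helper for trying the wrapper on an in-memory string.
    static String normalize(String html) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new CrToLfReader(new StringReader(html))) {
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In practice the wrapper would sit around whatever Reader is handed to the HTML parser, so the parser never sees an '&' directly followed by a CR.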

