Posted to dev@tika.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/14 15:12:59 UTC

[jira] [Updated] (TIKA-676) Boilerpipe fails

     [ https://issues.apache.org/jira/browse/TIKA-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated TIKA-676:
-------------------------------

    Comment: was deleted

(was: BTW, I also confirmed that Boilerpipe 1.2.0 fixes an EmptyStackException on other pages:

{code}
2011-07-14 14:18:39,635 ERROR tika.TikaParser - Error parsing http://www.botje.nl
java.util.EmptyStackException
        at java.util.Stack.peek(Stack.java:85)
        at java.util.Stack.pop(Stack.java:67)
        at org.apache.nutch.parse.tika.DOMBuilder.endElement(DOMBuilder.java:349)
        at org.apache.tika.parser.html.BoilerpipeContentHandler.endDocument(BoilerpipeContentHandler.java:315)
        at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
        at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:212)
        at org.apache.tika.sax.TextContentHandler.endDocument(TextContentHandler.java:57)
        at org.apache.tika.sax.ContentHandlerDecorator.endDocument(ContentHandlerDecorator.java:115)
        at org.ccil.cowan.tagsoup.Parser.eof(Parser.java:639)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:589)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
        at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:115)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.lang.Thread.run(Thread.java:662)
{code})

> Boilerpipe fails
> ----------------
>
>                 Key: TIKA-676
>                 URL: https://issues.apache.org/jira/browse/TIKA-676
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>             Fix For: 1.0
>
>
> This is apparently a [boilerpipe issue|http://code.google.com/p/boilerpipe/issues/detail?id=24], which they fixed in the [Web API edition|http://boilerpipe-web.appspot.com/].
> {code}
> $ curl --fail -L http://thisrecording.com/the-past | java -jar tika-app-0.9.jar -T
>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>                                  Dload  Upload   Total   Spent    Left  Speed
> 100 65688    0 65688    0     0  17650      0 --:--:--  0:00:03 --:--:-- 18698
> Exception in thread "main" org.xml.sax.SAXException: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to boilerpipe again
> 100  128k    0  128k    0     0  32019      0 --:--:--  0:00:04 --:--:-- 33735
> 	at de.l3s.boilerpipe.sax.CommonTagActions$2.start(CommonTagActions.java:108)
> 	at de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.startElement(BoilerpipeHTMLContentHandler.java:169)
> 	at org.apache.tika.parser.html.BoilerpipeContentHandler.startElement(BoilerpipeContentHandler.java:195)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:237)
> 	at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:279)
> 	at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:197)
> 	at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:135)
> 	at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
> 	at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:61)
> 	at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
> 	at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
> 	at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
> 	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:565)
> 	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
> 	at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:198)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> 	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
> 	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:288)
> 	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:94)
> {code}
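
For illustration, the condition named in the SAXException above (an A element opened while another A is still open) can be approximated with a plain SAX handler. This is only a sketch, not boilerpipe's or Tika's actual code: the class NestedAnchorCheck and its method hasNestedAnchors are hypothetical names, and it assumes well-formed XHTML input rather than the TagSoup-cleaned stream Tika uses.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of boilerpipe or Tika.
public class NestedAnchorCheck {

    // Returns true if the well-formed XHTML fragment has an <a> inside another <a>.
    public static boolean hasNestedAnchors(String xhtml) throws Exception {
        final boolean[] nested = {false};
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // number of <a> elements currently open

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("a".equalsIgnoreCase(qName)) {
                    if (depth > 0) {
                        nested[0] = true; // a second <a> opened before the first closed
                    }
                    depth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("a".equalsIgnoreCase(qName) && depth > 0) {
                    depth--;
                }
            }
        };
        // The default factory is not namespace-aware, so qName carries the tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return nested[0];
    }

    public static void main(String[] args) throws Exception {
        // prints "true": the second <a> starts inside the first
        System.out.println(hasNestedAnchors(
                "<p><a href=\"x\">one <a href=\"y\">two</a></a></p>"));
        // prints "false": the anchors are siblings
        System.out.println(hasNestedAnchors(
                "<p><a href=\"x\">one</a> <a href=\"y\">two</a></p>"));
    }
}
```

A check like this could be run over a page before handing it to boilerpipe, to see whether nested anchors are what triggers the failure.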

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira