Posted to user@nutch.apache.org by cybercouf <cy...@free.fr> on 2007/03/06 17:21:23 UTC
Re: [SOLVED] Nutch 0.8.1 not parsing XHTML using XML (even
mime.type.magic off)
OK, I finally found the cause:
- even though the Content-Type was "text/html", Nutch suggested "text/xml" because of
the ".xml" file extension
- and parse-plugins.xml mapped mimeType "text/xml" to parse-text (it now maps to
parse-html, as in the NUTCH-418 patch)
That solved my problem. Is there any danger in using parse-html to parse XHTML
content (since I didn't see a specific XHTML parser)?
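
For anyone hitting the same thing, here is a rough sketch of what the changed mapping in parse-plugins.xml might look like. The element names follow the stock parse-plugins.xml layout; the exact plugin list and order are an assumption and may differ from what the NUTCH-418 patch actually contains.

```xml
<!-- parse-plugins.xml (fragment, illustrative only) -->
<mimeType name="text/xml">
  <!-- listed first so it is tried first; parse-text used to be here -->
  <plugin id="parse-html" />
  <!-- assumed fallback; keep or drop depending on your plugin.includes -->
  <plugin id="parse-rss" />
</mimeType>
```

The plugins listed here also have to be enabled in plugin.includes, otherwise the mapping has no effect.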
cybercouf wrote:
>
> I saw the jira report about this problem (bug NUTCH-275), and applied the
> same configuration, but it's still not working.
>
> mime-types.xml
> ---------------------
> <mime-type name="text/xml"
> description="Extensible Markup Language File">
> <ext>xml</ext><ext>xsl</ext>
> <!--magic offset="0" value="<?xml"/-->
> </mime-type>
>
> nutch-default.xml
> ------------------------
> <name>mime.type.magic</name>
> <value>false</value>
>
> nutch-site.xml
> --------------------
> <name>mime.type.magic</name>
> <value>false</value>
> [...]
> <name>plugin.includes</name>
> <value>parse-(text|html|rss [...]
>
>
> the target webpage is like:
> <?xml version="1.0"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html>
> <head>
>
> so nutch parse it using parse-text plugin, so no outlinks...
>
> hadoop.log
> ----------------
> 2007-03-05 18:10:41,671 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,702 DEBUG parse.ParseUtil - Parsing
> [http://bmw.mobi/bmw/mobi/handler/0/nn/idx.xml] with
> [org.apache.nutch.parse.text.TextParser@1649b44]
> 2007-03-05 18:10:41,734 ERROR parse.OutlinkExtractor - getOutlinks
> java.net.MalformedURLException: unknown protocol: font-family
>
> and in the segment dump i can see:
> Outlinks: 1
> outlink: toUrl: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
> anchor:
>
>
> reading the jira report the bug should be fixed, so what's wrong with me?
>