Posted to user@nutch.apache.org by cybercouf <cy...@free.fr> on 2007/03/06 17:21:23 UTC

Re: [SOLVED] Nutch 0.8.1 not parsing XHTML using XML (even mime.type.magic off)

OK, I finally found the cause:

- even though the Content-Type was "text/html", Nutch suggested "text/xml"
because of the ".xml" file extension
- and parse-plugins.xml was calling parse-text for mimeType "text/xml" (now
parse-html, as in patch NUTCH-418)

So my problem is solved. Is there any danger in using parse-html to parse XHTML
content (since I didn't see a specific XHTML parser)?
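
For reference, here is a minimal sketch of what the text/xml entry in
parse-plugins.xml looks like with parse-html listed first. It assumes the usual
mimeType/plugin element layout of that file. The plugin order below is my own
illustration, not a copy of the NUTCH-418 patch, and the fallback plugins are
simply the ones named in the hadoop.log warnings.

```xml
<!-- parse-plugins.xml (sketch): as far as I understand, plugins are tried in the order listed -->
<mimeType name="text/xml">
    <!-- parse-html first, so XHTML fetched from a ".xml" URL goes through the HTML parser -->
    <plugin id="parse-html" />
    <!-- assumed fallbacks, taken from the plugins mentioned in the log warnings -->
    <plugin id="parse-rss" />
    <plugin id="parse-text" />
</mimeType>
```

The plugins referenced here presumably also need to be enabled in
plugin.includes (nutch-site.xml), otherwise the mapping cannot take effect.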



cybercouf wrote:
> 
> I saw the jira report about this problem (bug NUTCH-275), and applied the
> same configuration, but it's still not working.
> 
> mime-types.xml
> ---------------------
>     <mime-type name="text/xml"
>                description="Extensible Markup Language File">
>         <ext>xml</ext><ext>xsl</ext>
>         <!--magic offset="0" value="&lt;?xml"/-->
>     </mime-type>
> 
> nutch-default.xml
> ------------------------
> <name>mime.type.magic</name>
>   <value>false</value>
> 
> nutch-site.xml
> --------------------
> <name>mime.type.magic</name>
>   <value>false</value>
>  [...]
> <name>plugin.includes</name>
>     <value>parse-(text|html|rss [...]
> 
> 
> the target web page looks like this:
> <?xml version="1.0"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html>
>   <head>
> 
> so Nutch parses it with the parse-text plugin, and no outlinks are extracted...
> 
> hadoop.log
> ----------------
> 2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,671 WARN  parse.ParserFactory - ParserFactory:Plugin:
> org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via
> parse-plugins.xml, but its plugin.xml file does not claim to support
> contentType: text/xml
> 2007-03-05 18:10:41,702 DEBUG parse.ParseUtil - Parsing
> [http://bmw.mobi/bmw/mobi/handler/0/nn/idx.xml] with
> [org.apache.nutch.parse.text.TextParser@1649b44]
> 2007-03-05 18:10:41,734 ERROR parse.OutlinkExtractor - getOutlinks
> java.net.MalformedURLException: unknown protocol: font-family
> 
> and in the segment dump I can see:
> Outlinks: 1
>   outlink: toUrl: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
> anchor: 
> 
> 
> According to the JIRA report the bug should be fixed, so what am I doing wrong?
> 
