Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/21 03:14:30 UTC

[jira] Updated: (NUTCH-275) Fetcher not parsing XHTML-pages at all

     [ http://issues.apache.org/jira/browse/NUTCH-275?page=all ]

Stefan Neufeind updated NUTCH-275:
----------------------------------

    Description: 
The server reports the page as "text/html", so I expected it to be processed as HTML.
But something (I guess) evaluated the document's headers and re-labeled it as "text/xml" (why not text/xhtml?).

For some reason no plugin can be found for parsing text/xml (why doesn't TextParser feel responsible?).

Links inside this document are NOT indexed at all, so crawling of this website effectively stops here.
Oddly, the DTD files referenced in the document head do seem to be treated as valid links by the fetcher, and as such are fetched in the next round (if the urlfilter allows them).


060521 025018 fetching http://www.secreturl.something/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 10000
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019  map 0%  reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
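For context, the two ParserFactory warnings above refer to the mime-type mapping in conf/parse-plugins.xml and the plugin whitelist in the plugin.includes property. A rough sketch of what the relevant entries might look like follows; the exact plugin ids and the regex value are illustrative assumptions, not the shipped defaults:

```xml
<!-- conf/parse-plugins.xml: maps a contentType to candidate parser plugins.
     The log suggests text/xml is mapped to the text and RSS parsers here
     (illustrative ids, assumed from the log messages). -->
<parse-plugins>
  <mimeType name="text/xml">
    <plugin id="parse-text" />
    <plugin id="parse-rss" />
  </mimeType>
</parse-plugins>

<!-- conf/nutch-site.xml: a mapped plugin is only activated if the
     plugin.includes regex matches it; adding parse-rss to the regex would
     address the second warning. The value below is a hypothetical example. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|index-basic|query-(basic|site|url)</value>
</property>
```

Both warnings must be resolved for a parse of text/xml to succeed: the mapping names the candidates, and plugin.includes (plus the plugin's own plugin.xml contentType declaration) decides whether a candidate is actually usable.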

  was:
The server reports the page as "text/html", so I expected it to be processed as HTML.
But something (I guess) evaluated the document's headers and re-labeled it as "text/xml" (why not text/xhtml?).

For some reason no plugin can be found for parsing text/xml (why doesn't TextParser feel responsible?).

Links inside this document are NOT indexed at all, so crawling of this website effectively stops here.
Oddly, the DTD files referenced in the document head do seem to be treated as valid links by the fetcher, and as such are fetched in the next round (if the urlfilter allows them).


060521 025018 fetching http://www.speedpartner.de/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 10000
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019  map 0%  reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 


> Fetcher not parsing XHTML-pages at all
> --------------------------------------
>
>          Key: NUTCH-275
>          URL: http://issues.apache.org/jira/browse/NUTCH-275
>      Project: Nutch
>         Type: Bug

>     Versions: 0.8-dev
>  Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
>     Reporter: Stefan Neufeind


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira