Posted to dev@nutch.apache.org by "Stefan Neufeind (JIRA)" <ji...@apache.org> on 2006/05/21 03:01:31 UTC
[jira] Created: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Fetcher not parsing XHTML-pages at all
--------------------------------------
Key: NUTCH-275
URL: http://issues.apache.org/jira/browse/NUTCH-275
Project: Nutch
Type: Bug
Versions: 0.8-dev
Environment: problem with nightly-2006-05-20; worked fine with same website on 0.7.2
Reporter: Stefan Neufeind
The server reports the page as "text/html", so I thought it would be processed as HTML.
But something, I guess, evaluated the headers of the document and re-labeled it as "text/xml" (why not text/xhtml?).
For some reason no plugin can be found for parsing text/xml (why doesn't TextParser feel responsible?).
Links inside this document are NOT indexed at all, so crawling of this website actually stops here.
Funny thing: for some magical reason the DTD files referenced in the header seem to be valid links for the fetcher, and as such are indexed in the next round (if the urlfilter allows).
060521 025018 fetching http://www.speedpartner.de/
060521 025018 http.proxy.host = null
060521 025018 http.proxy.port = 8080
060521 025018 http.timeout = 10000
060521 025018 http.content.limit = 65536
060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060521 025018 fetcher.server.delay = 1000
060521 025018 http.max.delays = 1000
060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to contentType text/xml via parse-plugins.xml, but
its plugin.xml file does not claim to support contentType: text/xml
060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser mapped to contentType text/xml via parse-plugins.xml, but
not enabled via plugin.includes in nutch-default.xml
060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
060521 025019 map 0% reduce 0%
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s,
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12414476 ]
Stefan Neufeind commented on NUTCH-275:
---------------------------------------
Maybe XHTML is just a special case here? In general I guess mime-magic is a good idea, but could it be extended to differentiate xml and xhtml?
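The differentiation asked about here could in principle be done by sniffing a little further than the "<?xml" prolog. A minimal, hypothetical sketch of that idea (this is NOT Nutch's actual magic detector; the function name and fallback are invented for illustration):

```python
# Hypothetical sketch: tell XHTML apart from generic XML by looking past
# the "<?xml" prolog. Not Nutch code; names invented for illustration.

def guess_xml_flavor(head: bytes) -> str:
    """Guess a more specific content type from the first bytes of a document."""
    text = head.decode("latin-1", errors="replace").lower()
    # XHTML gives itself away by its namespace or its html doctype.
    if "http://www.w3.org/1999/xhtml" in text or "<!doctype html" in text:
        return "application/xhtml+xml"
    # Anything else starting with an XML declaration is generic XML.
    if text.lstrip().startswith(b"<?xml".decode()):
        return "text/xml"
    # Otherwise keep whatever the server said (here: plain HTML).
    return "text/html"

prolog = (b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
          b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
          b'"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
print(guess_xml_flavor(prolog))  # application/xhtml+xml
```

With such a check, a page like the one in this report would be labeled application/xhtml+xml instead of text/xml, so an HTML-capable parser could claim it.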
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Posted by "Stefan Groschupf (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12414456 ]
Stefan Groschupf commented on NUTCH-275:
----------------------------------------
Should we switch off mime.type.magic by default?
Several people have reported the same problem.
[jira] Resolved: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-275?page=all ]
Jerome Charron resolved NUTCH-275:
----------------------------------
Fix Version: 0.8-dev
Resolution: Fixed
Magic guessing removed for xml content-type.
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12412835 ]
Jerome Charron commented on NUTCH-275:
--------------------------------------
This problem has already been reported by Doug: http://mail-archive.com/nutch-dev%40lucene.apache.org/msg03474.html
It is related to magic-based content-type guessing.
Nothing has been decided about this yet, but I should work on it.
Workarounds:
* deactivate mime-type magic resolution (mime.type.magic = false)
* or remove the <magic offset="0" ... > line in mime-types.xml
Thanks for opening a JIRA issue about this.
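The first workaround is a plain property override. A sketch of what that could look like, assuming the usual Nutch convention of overriding nutch-default.xml values in conf/nutch-site.xml (the file location is general Nutch convention, not stated in this thread):

```xml
<!-- conf/nutch-site.xml: override the default from nutch-default.xml -->
<property>
  <name>mime.type.magic</name>
  <value>false</value>
  <description>Disable magic-based content-type guessing
  (workaround for NUTCH-275).</description>
</property>
```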
[jira] Updated: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-275?page=all ]
Stefan Neufeind updated NUTCH-275:
----------------------------------
Description: identical to the original report, except that the fetch log line now reads "fetching http://www.secreturl.something/" in place of "fetching http://www.speedpartner.de/".
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12415116 ]
Jerome Charron commented on NUTCH-275:
--------------------------------------
> could it be extended to differentiate xml and xhtml
Yes, I have a new version based on the freedesktop specification that has been sitting on my disk for a while.
I don't want to commit it before the 0.8 release... probably for 0.9.
That version handles xml / xhtml / html documents better.
For now, I think the best solution is to remove the magic detection for xml, simply by removing the <magic offset="0" ... > line for the xml content type in mime-types.xml.
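The entry to edit would look roughly like the following. The exact element layout and attributes are an assumption about mime-types.xml; only the <magic offset="0" ... > element itself is quoted from this thread:

```xml
<!-- mime-types.xml: illustrative shape of the text/xml entry -->
<mime-type name="text/xml">
  <ext>xml</ext>
  <!-- Removing the line below stops "<?xml ...>" prologs from
       re-labeling server-reported text/html as text/xml. -->
  <magic offset="0" type="string" value="&lt;?xml"/>
</mime-type>
```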
[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12412659 ]
Stefan Neufeind commented on NUTCH-275:
---------------------------------------
I've found out that the first line is actually what causes the problem. Without it, the file is parsed as HTML.
- But why can't the XML be parsed at all (not even by TextParser)?
- And AFAIK that header is valid as-is (I've been told so, and the W3C validator does not complain either):
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" lang="de">
<head>