Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/10/13 18:39:13 UTC

Re: injector in nutch-1.4

Hi,

This is most likely a URL filter issue. Check all of your URL filters. There's 
also a test program for URL filtering; try it out.

http://wiki.apache.org/nutch/CommandLineOptions

Cheers,

ps. Moved to user@nutch as it's more appropriate there.

> I have a problem running the injector in nutch-1.4 on Hadoop; the same
> command works fine with nutch-1.3. As you can see, the list of URLs is
> loaded from HDFS correctly (Map input records=66906), but there are no
> records in the map output. Could broken filtering be the problem?
> 
> ponto:(crawler)runtime/deploy>bin/nutch inject /czcrawl/db /czcrawl/seeds
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: starting at 2011-10-13
> 17:56:25
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: crawlDb: /czcrawl/db
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: urlDir: /czcrawl/seeds
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: Converting injected
> urls to crawl db entries.
> 11/10/13 17:56:28 INFO mapred.FileInputFormat: Total input paths to
> process : 1
> 11/10/13 17:56:29 INFO mapred.JobClient: Running job: job_201110091645_0032
> 11/10/13 17:56:30 INFO mapred.JobClient:  map 0% reduce 0%
> 11/10/13 17:56:52 INFO mapred.JobClient:  map 50% reduce 0%
> 11/10/13 17:56:53 INFO mapred.JobClient:  map 100% reduce 0%
> 11/10/13 17:57:05 INFO mapred.JobClient:  map 100% reduce 100%
> 11/10/13 17:57:10 INFO mapred.JobClient: Job complete:
> job_201110091645_0032
> 11/10/13 17:57:10 INFO mapred.JobClient: Counters: 27
> 11/10/13 17:57:10 INFO mapred.JobClient:   Job Counters
> 11/10/13 17:57:10 INFO mapred.JobClient:     Launched reduce tasks=1
> 11/10/13 17:57:10 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20455
> 11/10/13 17:57:10 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Total time spent by all
> maps waiting after reserving slots (ms)=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Rack-local map tasks=1
> 11/10/13 17:57:10 INFO mapred.JobClient:     Launched map tasks=2
> 11/10/13 17:57:10 INFO mapred.JobClient:     Data-local map tasks=1
> 11/10/13 17:57:10 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10356
> 11/10/13 17:57:10 INFO mapred.JobClient:   File Input Format Counters
> 11/10/13 17:57:10 INFO mapred.JobClient:     Bytes Read=1283144
> 11/10/13 17:57:10 INFO mapred.JobClient:   File Output Format Counters
> 11/10/13 17:57:10 INFO mapred.JobClient:     Bytes Written=86
> 11/10/13 17:57:10 INFO mapred.JobClient:   FileSystemCounters
> 11/10/13 17:57:10 INFO mapred.JobClient:     FILE_BYTES_READ=6
> 11/10/13 17:57:10 INFO mapred.JobClient:     HDFS_BYTES_READ=1283358
> 11/10/13 17:57:10 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=89486
> 11/10/13 17:57:10 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=86
> 11/10/13 17:57:10 INFO mapred.JobClient:   Map-Reduce Framework
> 11/10/13 17:57:10 INFO mapred.JobClient:     Map output materialized
> bytes=12
> 11/10/13 17:57:10 INFO mapred.JobClient:     Map input records=66906
> 11/10/13 17:57:10 INFO mapred.JobClient:     Reduce shuffle bytes=6
> 11/10/13 17:57:10 INFO mapred.JobClient:     Spilled Records=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Map output bytes=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Map input bytes=1280141
> 11/10/13 17:57:10 INFO mapred.JobClient:     Combine input records=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     SPLIT_RAW_BYTES=214
> 11/10/13 17:57:10 INFO mapred.JobClient:     Reduce input records=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Reduce input groups=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Combine output records=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Reduce output records=0
> 11/10/13 17:57:10 INFO mapred.JobClient:     Map output records=0
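
The symptom in the quoted log is that Map input records is 66906 while Map
output records is 0, which is what you would expect if every URL is rejected
inside the map step. A minimal Python sketch for spotting that pattern in
JobClient output follows. The helper names (parse_counters,
all_records_dropped) are hypothetical, and the sketch assumes each counter
sits on a single unwrapped line, unlike the mail-wrapped log above.

```python
import re

# Matches counter lines such as
# "... INFO mapred.JobClient:     Map input records=66906"
COUNTER_RE = re.compile(r"mapred\.JobClient:\s+(?P<name>[^=]+)=(?P<value>\d+)\s*$")


def parse_counters(log_lines):
    """Collect 'name=value' counters from JobClient log lines into a dict."""
    counters = {}
    for line in log_lines:
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


def all_records_dropped(counters):
    """True when the mappers read records but emitted none."""
    return (counters.get("Map input records", 0) > 0
            and counters.get("Map output records", 0) == 0)


# Two counter lines taken from the log quoted above, unwrapped.
sample = [
    "11/10/13 17:57:10 INFO mapred.JobClient:     Map input records=66906",
    "11/10/13 17:57:10 INFO mapred.JobClient:     Map output records=0",
]
counters = parse_counters(sample)
print(all_records_dropped(counters))
```

A True result only says that nothing survived the map step; it does not say
which filter or normalizer rejected the URLs.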

Re: SOLVED: injector in nutch-1.4

Posted by Radim Kolar <hs...@sendmail.cz>.
The error was caused by an incorrect entry in domain-urlfilter: I had 
".cz" there, and it should be only "cz".

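To illustrate why the leading dot matters, here is a minimal Python sketch of
exact-match domain filtering. It is not the Nutch implementation. It assumes
the filter compares the URL's suffix, domain and host against the configured
entries as exact strings, and it takes the last label as the suffix and the
last two labels as the domain, which is a simplification. The function name
passes_domain_filter is hypothetical.

```python
from urllib.parse import urlparse


def passes_domain_filter(url, allowed_entries):
    """Return True if the URL's suffix, domain, or host is listed exactly."""
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    suffix = labels[-1]             # e.g. "cz"
    domain = ".".join(labels[-2:])  # e.g. "root.cz" (simplified assumption)
    candidates = {suffix, domain, host}
    return bool(candidates & set(allowed_entries))


seed = "http://www.root.cz"
print(passes_domain_filter(seed, [".cz"]))  # entry with the leading dot
print(passes_domain_filter(seed, ["cz"]))   # entry without the dot
```

Under these assumptions the ".cz" entry never equals any candidate string, so
every seed is rejected, while "cz" matches the suffix.
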
Re: SOLVED: injector in nutch-1.4

Posted by Markus Jelsma <ma...@openindex.io>.
I just did and confirmed index-basic has no relevance to the crawl db. Here's 
a piece of log output for injector and crawl db reader. There are only two 
registered plugins, protocol-http and lib-http. After injection the crawldb 
has 1 entry which is the same URL as in my seed list.


2011-10-14 15:30:03,683 INFO  crawl.Injector - Injector: starting at 
2011-10-14 15:30:03
2011-10-14 15:30:03,684 INFO  crawl.Injector - Injector: crawlDb: 
crawl/crawldb
2011-10-14 15:30:03,684 INFO  crawl.Injector - Injector: urlDir: urls
2011-10-14 15:30:03,684 INFO  crawl.Injector - Injector: Converting injected 
urls to crawl db entries.
2011-10-14 15:30:04,041 INFO  plugin.PluginRepository - Plugins: looking in: 
/home/markus/projects/apache/nutch/trunk/runtime/local/plugins
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository - Plugin Auto-activation 
mode: [true]
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository - Registered Plugins:
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         the nutch core 
extension points (nutch-extensionpoints)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         HTTP Framework 
(lib-http)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Http Protocol 
Plug-in (protocol-http)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository - Registered 
Extension-Points:
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch URL 
Normalizer (org.apache.nutch.net.URLNormalizer)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch Protocol 
(org.apache.nutch.protocol.Protocol)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch Segment 
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch URL 
Filter (org.apache.nutch.net.URLFilter)
2011-10-14 15:30:04,131 INFO  plugin.PluginRepository -         Nutch Indexing 
Filter (org.apache.nutch.indexer.IndexingFilter)
2011-10-14 15:30:04,132 INFO  plugin.PluginRepository -         HTML Parse 
Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-10-14 15:30:04,132 INFO  plugin.PluginRepository -         Nutch Content 
Parser (org.apache.nutch.parse.Parser)
2011-10-14 15:30:04,132 INFO  plugin.PluginRepository -         Nutch Scoring 
(org.apache.nutch.scoring.ScoringFilter)
2011-10-14 15:30:04,946 INFO  crawl.Injector - Injector: Merging injected urls 
into crawl db.
2011-10-14 15:30:05,160 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2011-10-14 15:30:06,104 INFO  crawl.Injector - Injector: finished at 
2011-10-14 15:30:06, elapsed: 00:00:02
2011-10-14 15:30:08,727 INFO  crawl.CrawlDbReader - CrawlDb statistics start: 
crawl/crawldb/
2011-10-14 15:30:08,836 WARN  mapred.JobClient - Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: 
crawl/crawldb/
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - TOTAL urls: 1
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - retry 0:    1
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - min score:  1.0
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - avg score:  1.0
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - max score:  1.0
2011-10-14 15:30:10,052 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):    
1
2011-10-14 15:30:10,053 INFO  crawl.CrawlDbReader - CrawlDb statistics: done



On Friday 14 October 2011 15:23:00 Radim Kolar wrote:
> Try it yourself. In 1.4, remove index-basic from the list of included
> plugins, then run nutch inject in Hadoop mode, and you will get 0 rows in
> the first map output.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: SOLVED: injector in nutch-1.4

Posted by Radim Kolar <hs...@sendmail.cz>.
Try it yourself. In 1.4, remove index-basic from the list of included 
plugins, then run nutch inject in Hadoop mode, and you will get 0 rows in 
the first map output.

Re: SOLVED: injector in nutch-1.4

Posted by Markus Jelsma <ma...@openindex.io>.
That makes no sense. The injector and indexer code are completely separate. 
Did you dump the crawl db after injection? 


On Friday 14 October 2011 13:36:24 Radim Kolar wrote:
> On 14.10.2011 13:16, Radim Kolar wrote:
> > If you don't have the "index-basic" plugin included, nothing gets injected.
> 
> It works in Nutch 1.3 without index-basic, but not in 1.4.


Re: SOLVED: injector in nutch-1.4

Posted by Radim Kolar <hs...@sendmail.cz>.
On 14.10.2011 13:16, Radim Kolar wrote:
> If you don't have the "index-basic" plugin included, nothing gets injected.
It works in Nutch 1.3 without index-basic, but not in 1.4.

SOLVED: injector in nutch-1.4

Posted by Radim Kolar <hs...@sendmail.cz>.
If you don't have the "index-basic" plugin included, nothing gets injected.

Re: injector in nutch-1.4

Posted by Radim Kolar <hs...@sendmail.cz>.
On 14.10.2011 10:31, Markus Jelsma wrote:
> Index and parse filter checkers do not use URL filtering.
How can I check why URLs are not injected into the db, then?

Re: injector in nutch-1.4

Posted by Markus Jelsma <ma...@openindex.io>.
Index and parse filter checkers do not use URL filtering.

> Hi Radim,
> 
> Please see the final log output
> 
> 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but not enabled via
> plugin.includes in nutch-default.xml
> 
> Please try adding parse-html and re-running the indexchecker.
> 
> On Fri, Oct 14, 2011 at 5:18 AM, Radim Kolar <hs...@sendmail.cz> wrote:
> > Hi,
> > 
> > > This is most likely a URL filter issue. Check all URL filters. There's
> > > also a test program for URL filtering. Try it out.
> > 
> > This is indexchecker output for one URL. Is this URL filtered or not? I
> > don't know how to interpret the output.
> > 
> > ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
> > 
> > 11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching:
> > http://www.root.cz
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in:
> > /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation
> > mode: [true]
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         the nutch core
> > extension points (nutch-extensionpoints)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL
> > Normalizer (urlnormalizer-regex)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Basic URL
> > Normalizer (urlnormalizer-basic)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Tika Parser
> > Plug-in (parse-tika)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Domain URL Filter
> > (urlfilter-domain)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTTP Framework
> > (lib-http)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> > (urlfilter-regex)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> > Framework (lib-regex-filter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Http Protocol
> > Plug-in (protocol-http)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered
> > Extension-Points:
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL
> > Normalizer (org.apache.nutch.net.URLNormalizer)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Protocol
> > (org.apache.nutch.protocol.Protocol)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Segment
> > Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL Filter
> > (org.apache.nutch.net.URLFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Indexing
> > Filter (org.apache.nutch.indexer.IndexingFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTML Parse Filter
> > (org.apache.nutch.parse.HtmlParseFilter)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Content
> > Parser (org.apache.nutch.parse.Parser)
> > 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Scoring
> > (org.apache.nutch.scoring.ScoringFilter)
> > 11/10/14 06:01:00 INFO http.Http: http.accept.language =
> > en-us,en-gb,en;q=0.7,*;q=0.3
> > 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing:
> > http://www.root.cz
> > 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType:
> > application/xhtml+xml
> > 11/10/14 06:01:02 INFO conf.Configuration: found resource
> > parse-plugins.xml at
> > file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
> > 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
> > org.apache.nutch.parse.html.HtmlParser mapped to contentType
> > application/xhtml+xml via parse-plugins.xml, but not enabled via
> > plugin.includes in nutch-default.xml

Re: injector in nutch-1.4

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Radim,

Please see the final log output

11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType
application/xhtml+xml via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml

Please try adding parse-html and re-running the indexchecker.


On Fri, Oct 14, 2011 at 5:18 AM, Radim Kolar <hs...@sendmail.cz> wrote:

>
> Hi,
>
> > This is most likely a URL filter issue. Check all URL filters. There's
> > also a test program for URL filtering. Try it out.
>
> This is indexchecker output for one URL. Is this URL filtered or not? I
> don't know how to interpret the output.
>
> ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
>
> 11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching:
> http://www.root.cz
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         the nutch core
> extension points (nutch-extensionpoints)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL
> Normalizer (urlnormalizer-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Basic URL
> Normalizer (urlnormalizer-basic)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Tika Parser Plug-in
> (parse-tika)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Domain URL Filter
> (urlfilter-domain)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTTP Framework
> (lib-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> (urlfilter-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter
> Framework (lib-regex-filter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Http Protocol
> Plug-in (protocol-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Segment Merge
> Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 11/10/14 06:01:00 INFO http.Http: http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing:
> http://www.root.cz
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType:
> application/xhtml+xml
> 11/10/14 06:01:02 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
> 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but not enabled via
> plugin.includes in nutch-default.xml
>
>


-- 
*Lewis*

Re: injector in nutch-1.4

Posted by Radim Kolar <hs...@sendmail.cz>.
Hi,

> This is most likely a URL filter issue. Check all URL filters. There's also a
> test program for URL filtering. Try it out.

This is indexchecker output for one URL. Is this URL filtered or not? I don't know how to interpret the output.

ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz

11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching: 
http://www.root.cz
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in: 
/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation 
mode: [true]
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
11/10/14 06:01:00 INFO plugin.PluginRepository:         the nutch core 
extension points (nutch-extensionpoints)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL 
Normalizer (urlnormalizer-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Basic URL 
Normalizer (urlnormalizer-basic)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Tika Parser 
Plug-in (parse-tika)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Domain URL 
Filter (urlfilter-domain)
11/10/14 06:01:00 INFO plugin.PluginRepository:         HTTP Framework 
(lib-http)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter 
(urlfilter-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Regex URL Filter 
Framework (lib-regex-filter)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Http Protocol 
Plug-in (protocol-http)
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Extension-Points:
11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL 
Normalizer (org.apache.nutch.net.URLNormalizer)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Protocol 
(org.apache.nutch.protocol.Protocol)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Segment 
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch URL Filter 
(org.apache.nutch.net.URLFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Indexing 
Filter (org.apache.nutch.indexer.IndexingFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository:         HTML Parse 
Filter (org.apache.nutch.parse.HtmlParseFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Content 
Parser (org.apache.nutch.parse.Parser)
11/10/14 06:01:00 INFO plugin.PluginRepository:         Nutch Scoring 
(org.apache.nutch.scoring.ScoringFilter)
11/10/14 06:01:00 INFO http.Http: http.accept.language = 
en-us,en-gb,en;q=0.7,*;q=0.3
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing: 
http://www.root.cz
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType: 
application/xhtml+xml
11/10/14 06:01:02 INFO conf.Configuration: found resource 
parse-plugins.xml at 
file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin: 
org.apache.nutch.parse.html.HtmlParser mapped to contentType 
application/xhtml+xml via parse-plugins.xml, but not enabled via 
plugin.includes in nutch-default.xml