Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/10/13 18:39:13 UTC
Re: injector in nutch-1.4
Hi,
This is most likely a URL filter issue. Check all your URL filters. There is
also a test program for URL filtering; try it out.
http://wiki.apache.org/nutch/CommandLineOptions
Cheers,
ps. Moved to user@nutch as it's more appropriate there.
> I have problems running the injector in nutch-1.4 on hadoop; the same
> command works fine with nutch-1.3. As you can see, the list of URLs is
> loaded from HDFS correctly (Map input records=66906), but there are no
> records in the map output. Could it be a problem with broken filtering?
>
> ponto:(crawler)runtime/deploy>bin/nutch inject /czcrawl/db /czcrawl/seeds
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: starting at 2011-10-13
> 17:56:25
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: crawlDb: /czcrawl/db
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: urlDir: /czcrawl/seeds
> 11/10/13 17:56:25 INFO crawl.Injector: Injector: Converting injected
> urls to crawl db entries.
> 11/10/13 17:56:28 INFO mapred.FileInputFormat: Total input paths to
> process : 1
> 11/10/13 17:56:29 INFO mapred.JobClient: Running job: job_201110091645_0032
> 11/10/13 17:56:30 INFO mapred.JobClient: map 0% reduce 0%
> 11/10/13 17:56:52 INFO mapred.JobClient: map 50% reduce 0%
> 11/10/13 17:56:53 INFO mapred.JobClient: map 100% reduce 0%
> 11/10/13 17:57:05 INFO mapred.JobClient: map 100% reduce 100%
> 11/10/13 17:57:10 INFO mapred.JobClient: Job complete: job_201110091645_0032
> 11/10/13 17:57:10 INFO mapred.JobClient: Counters: 27
> 11/10/13 17:57:10 INFO mapred.JobClient: Job Counters
> 11/10/13 17:57:10 INFO mapred.JobClient: Launched reduce tasks=1
> 11/10/13 17:57:10 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=20455
> 11/10/13 17:57:10 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Total time spent by all
> maps waiting after reserving slots (ms)=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Rack-local map tasks=1
> 11/10/13 17:57:10 INFO mapred.JobClient: Launched map tasks=2
> 11/10/13 17:57:10 INFO mapred.JobClient: Data-local map tasks=1
> 11/10/13 17:57:10 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10356
> 11/10/13 17:57:10 INFO mapred.JobClient: File Input Format Counters
> 11/10/13 17:57:10 INFO mapred.JobClient: Bytes Read=1283144
> 11/10/13 17:57:10 INFO mapred.JobClient: File Output Format Counters
> 11/10/13 17:57:10 INFO mapred.JobClient: Bytes Written=86
> 11/10/13 17:57:10 INFO mapred.JobClient: FileSystemCounters
> 11/10/13 17:57:10 INFO mapred.JobClient: FILE_BYTES_READ=6
> 11/10/13 17:57:10 INFO mapred.JobClient: HDFS_BYTES_READ=1283358
> 11/10/13 17:57:10 INFO mapred.JobClient: FILE_BYTES_WRITTEN=89486
> 11/10/13 17:57:10 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=86
> 11/10/13 17:57:10 INFO mapred.JobClient: Map-Reduce Framework
> 11/10/13 17:57:10 INFO mapred.JobClient: Map output materialized
> bytes=12
> 11/10/13 17:57:10 INFO mapred.JobClient: Map input records=66906
> 11/10/13 17:57:10 INFO mapred.JobClient: Reduce shuffle bytes=6
> 11/10/13 17:57:10 INFO mapred.JobClient: Spilled Records=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Map output bytes=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Map input bytes=1280141
> 11/10/13 17:57:10 INFO mapred.JobClient: Combine input records=0
> 11/10/13 17:57:10 INFO mapred.JobClient: SPLIT_RAW_BYTES=214
> 11/10/13 17:57:10 INFO mapred.JobClient: Reduce input records=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Reduce input groups=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Combine output records=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Reduce output records=0
> 11/10/13 17:57:10 INFO mapred.JobClient: Map output records=0
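For anyone reading these counters later: the telling pair is Map input records=66906 against Map output records=0, which means the mappers read every seed line and emitted none. Below is a minimal Python sketch for spotting that pattern in a saved JobClient log. The function names are made up for illustration, and the regex only assumes the "name=value" layout visible in the log above.

```python
import re

# Matches "name=value" counter pairs as printed by the Hadoop JobClient log.
# Wrapped continuation lines without "INFO mapred.JobClient:" are ignored.
COUNTER_RE = re.compile(r"INFO mapred\.JobClient:\s+(.+?)=(\d+)\s*$")


def parse_counters(log_text):
    """Return a dict of counter name -> int from JobClient log lines."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def all_records_dropped(counters):
    """True when the mappers read input but emitted no records at all."""
    read = counters.get("Map input records", 0)
    written = counters.get("Map output records", 0)
    return read > 0 and written == 0


sample = """\
11/10/13 17:57:10 INFO mapred.JobClient: Map input records=66906
11/10/13 17:57:10 INFO mapred.JobClient: Map output bytes=0
11/10/13 17:57:10 INFO mapred.JobClient: Map output records=0
"""
counters = parse_counters(sample)
print(all_records_dropped(counters))  # True
```

When this check comes back true for an inject job, the input was read fine and something in the map step (normalizers or URL filters, most likely) rejected every URL.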
Re: SOLVED: injector in nutch-1.4
Posted by Radim Kolar <hs...@sendmail.cz>.
The error was caused by an incorrect entry in domain-urlfilter: I had
".cz" there, and it should be only "cz".
Re: SOLVED: injector in nutch-1.4
Posted by Markus Jelsma <ma...@openindex.io>.
I just did, and confirmed that index-basic has no relevance to the crawl db.
Here's a piece of log output from the injector and the crawl db reader. There
are only two registered plugins, protocol-http and lib-http. After injection
the crawldb has 1 entry, which is the same URL as in my seed list.
2011-10-14 15:30:03,683 INFO crawl.Injector - Injector: starting at
2011-10-14 15:30:03
2011-10-14 15:30:03,684 INFO crawl.Injector - Injector: crawlDb:
crawl/crawldb
2011-10-14 15:30:03,684 INFO crawl.Injector - Injector: urlDir: urls
2011-10-14 15:30:03,684 INFO crawl.Injector - Injector: Converting injected
urls to crawl db entries.
2011-10-14 15:30:04,041 INFO plugin.PluginRepository - Plugins: looking in:
/home/markus/projects/apache/nutch/trunk/runtime/local/plugins
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Plugin Auto-activation
mode: [true]
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Registered Plugins:
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - HTTP Framework
(lib-http)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Http Protocol
Plug-in (protocol-http)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Registered
Extension-Points:
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2011-10-14 15:30:04,131 INFO plugin.PluginRepository - Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2011-10-14 15:30:04,132 INFO plugin.PluginRepository - HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2011-10-14 15:30:04,132 INFO plugin.PluginRepository - Nutch Content
Parser (org.apache.nutch.parse.Parser)
2011-10-14 15:30:04,132 INFO plugin.PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2011-10-14 15:30:04,946 INFO crawl.Injector - Injector: Merging injected urls
into crawl db.
2011-10-14 15:30:05,160 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2011-10-14 15:30:06,104 INFO crawl.Injector - Injector: finished at
2011-10-14 15:30:06, elapsed: 00:00:02
2011-10-14 15:30:08,727 INFO crawl.CrawlDbReader - CrawlDb statistics start:
crawl/crawldb/
2011-10-14 15:30:08,836 WARN mapred.JobClient - Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - Statistics for CrawlDb:
crawl/crawldb/
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - TOTAL urls: 1
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - retry 0: 1
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - min score: 1.0
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - avg score: 1.0
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - max score: 1.0
2011-10-14 15:30:10,052 INFO crawl.CrawlDbReader - status 1 (db_unfetched):
1
2011-10-14 15:30:10,053 INFO crawl.CrawlDbReader - CrawlDb statistics: done
On Friday 14 October 2011 15:23:00 Radim Kolar wrote:
> try it yourself. in 1.4 remove index-basic from list of included
> plugins, then run nutch inject in hadoop mode and you will get 0 rows on
> first map output.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: SOLVED: injector in nutch-1.4
Posted by Radim Kolar <hs...@sendmail.cz>.
Try it yourself. In 1.4, remove index-basic from the list of included
plugins, then run nutch inject in hadoop mode, and you will get 0 rows on
the first map output.
Re: SOLVED: injector in nutch-1.4
Posted by Markus Jelsma <ma...@openindex.io>.
That makes no sense. The injector and indexer code is completely separated.
Did you dump the crawl db after injection?
On Friday 14 October 2011 13:36:24 Radim Kolar wrote:
> On 14.10.2011 13:16, Radim Kolar wrote:
> > If you dont have "index-basic" plugin included, nothing gets injected
>
> it works in nutch 1.3 without index-basic but not in 1.4
Re: SOLVED: injector in nutch-1.4
Posted by Radim Kolar <hs...@sendmail.cz>.
On 14.10.2011 13:16, Radim Kolar wrote:
> If you don't have the "index-basic" plugin included, nothing gets injected
It works in nutch 1.3 without index-basic, but not in 1.4.
SOLVED: injector in nutch-1.4
Posted by Radim Kolar <hs...@sendmail.cz>.
If you don't have the "index-basic" plugin included, nothing gets injected.
Re: injector in nutch-1.4
Posted by Radim Kolar <hs...@sendmail.cz>.
On 14.10.2011 10:31, Markus Jelsma wrote:
> Index and parse filter checkers do not use URL filtering.
How can I check why URLs are not injected into the db, then?
Re: injector in nutch-1.4
Posted by Markus Jelsma <ma...@openindex.io>.
Index and parse filter checkers do not use URL filtering.
Re: injector in nutch-1.4
Posted by lewis john mcgibbney <le...@gmail.com>.
Hi Radim,
Please see the final log output:
11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType
application/xhtml+xml via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml
Please try adding parse-html and re-running the indexchecker.
On Fri, Oct 14, 2011 at 5:18 AM, Radim Kolar <hs...@sendmail.cz> wrote:
>
> Hi,
>
> This is most likely an URL filter issue. Check all URL filters. There's
> also a
> test program for URL filtering. Try it out.
>
> This is indexchecker output for one URL. Is this URL filtered or not? I
> don't know how to interpret output
>
> ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
>
> 11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching:
> http://www.root.cz
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in:
> /tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation
> mode: [true]
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
> 11/10/14 06:01:00 INFO plugin.PluginRepository: the nutch core
> extension points (nutch-extensionpoints)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL
> Normalizer (urlnormalizer-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Basic URL
> Normalizer (urlnormalizer-basic)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Tika Parser Plug-in
> (parse-tika)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Domain URL Filter
> (urlfilter-domain)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: HTTP Framework
> (lib-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter
> (urlfilter-regex)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter
> Framework (lib-regex-filter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Http Protocol
> Plug-in (protocol-http)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Registered
> Extension-Points:
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL
> Normalizer (org.apache.nutch.net.URLNormalizer)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Protocol
> (org.apache.nutch.protocol.Protocol)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Segment Merge
> Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Filter
> (org.apache.nutch.net.URLFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Indexing
> Filter (org.apache.nutch.indexer.IndexingFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: HTML Parse Filter
> (org.apache.nutch.parse.HtmlParseFilter)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Content
> Parser (org.apache.nutch.parse.Parser)
> 11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Scoring
> (org.apache.nutch.scoring.ScoringFilter)
> 11/10/14 06:01:00 INFO http.Http: http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing:
> http://www.root.cz
> 11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType:
> application/xhtml+xml
> 11/10/14 06:01:02 INFO conf.Configuration: found resource parse-plugins.xml
> at file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
> 11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
> org.apache.nutch.parse.html.HtmlParser mapped to contentType
> application/xhtml+xml via parse-plugins.xml, but not enabled via
> plugin.includes in nutch-default.xml
>
>
--
*Lewis*
Re: injector in nutch-1.4
Posted by Radim Kolar <hs...@sendmail.cz>.
Hi,
> This is most likely a URL filter issue. Check all URL filters. There's also a
> test program for URL filtering. Try it out.
This is the indexchecker output for one URL. Is this URL filtered or not? I don't know how to interpret the output.
ponto:(crawler)runtime/deploy>bin/nutch indexchecker http://www.root.cz
11/10/14 06:01:00 INFO indexer.IndexingFiltersChecker: fetching:
http://www.root.cz
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/plugins
11/10/14 06:01:00 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Plugins:
11/10/14 06:01:00 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL
Normalizer (urlnormalizer-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository: Basic URL
Normalizer (urlnormalizer-basic)
11/10/14 06:01:00 INFO plugin.PluginRepository: Tika Parser
Plug-in (parse-tika)
11/10/14 06:01:00 INFO plugin.PluginRepository: Domain URL
Filter (urlfilter-domain)
11/10/14 06:01:00 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter
(urlfilter-regex)
11/10/14 06:01:00 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Http Protocol
Plug-in (protocol-http)
11/10/14 06:01:00 INFO plugin.PluginRepository: Registered Extension-Points:
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Content
Parser (org.apache.nutch.parse.Parser)
11/10/14 06:01:00 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
11/10/14 06:01:00 INFO http.Http: http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: parsing:
http://www.root.cz
11/10/14 06:01:02 INFO indexer.IndexingFiltersChecker: contentType:
application/xhtml+xml
11/10/14 06:01:02 INFO conf.Configuration: found resource
parse-plugins.xml at
file:/tmp/hadoop-crawler/hadoop-unjar3406850446948112163/parse-plugins.xml
11/10/14 06:01:02 WARN parse.ParserFactory: ParserFactory: Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType
application/xhtml+xml via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml