Posted to user@nutch.apache.org by Haggai R <ha...@gmail.com> on 2012/02/09 09:16:59 UTC

WARN regex.RegexURLNormalizer: Can't load the default rules! during Nutch Crawl

Hi,

I'm new to Nutch. When I tried to run a crawl I got the following exception,
and I would appreciate any pointers on how to solve it:

12/02/09 10:00:25 WARN crawl.Crawl: solrUrl is not set, indexing will be skipped...
12/02/09 10:00:25 INFO crawl.Crawl: crawl started in: crawl-20120209100025
12/02/09 10:00:25 INFO crawl.Crawl: rootUrlDir = urls
12/02/09 10:00:25 INFO crawl.Crawl: threads = 10
12/02/09 10:00:25 INFO crawl.Crawl: depth = 5
12/02/09 10:00:25 INFO crawl.Crawl: solrUrl=null
12/02/09 10:00:25 INFO crawl.Injector: Injector: starting at 2012-02-09 10:00:25
12/02/09 10:00:25 INFO crawl.Injector: Injector: crawlDb: crawl-20120209100025/crawldb
12/02/09 10:00:25 INFO crawl.Injector: Injector: urlDir: urls
12/02/09 10:00:25 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
12/02/09 10:00:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/02/09 10:00:26 INFO mapred.FileInputFormat: Total input paths to process : 1
12/02/09 10:00:26 INFO mapred.JobClient: Running job: job_local_0001
12/02/09 10:00:26 INFO mapred.FileInputFormat: Total input paths to process : 1
12/02/09 10:00:26 INFO mapred.MapTask: numReduceTasks: 1
12/02/09 10:00:26 INFO mapred.MapTask: io.sort.mb = 100
12/02/09 10:00:26 INFO mapred.MapTask: data buffer = 79691776/99614720
12/02/09 10:00:26 INFO mapred.MapTask: record buffer = 262144/327680
12/02/09 10:00:26 INFO plugin.PluginRepository: Plugins: looking in: D:\workspace\Gilad\plugins
12/02/09 10:00:26 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
12/02/09 10:00:26 INFO plugin.PluginRepository: Registered Plugins:
12/02/09 10:00:26 INFO plugin.PluginRepository:     Regex URL Filter (urlfilter-regex)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Tika Parser Plug-in (parse-tika)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Html Parse Plug-in (parse-html)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Regex URL Filter Framework (lib-regex-filter)
12/02/09 10:00:26 INFO plugin.PluginRepository:     the nutch core extension points (nutch-extensionpoints)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Pass-through URL Normalizer (urlnormalizer-pass)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Regex URL Normalizer (urlnormalizer-regex)
12/02/09 10:00:26 INFO plugin.PluginRepository:     CyberNeko HTML Parser (lib-nekohtml)
12/02/09 10:00:26 INFO plugin.PluginRepository:     HTTP Framework (lib-http)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Anchor Indexing Filter (index-anchor)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Basic Indexing Filter (index-basic)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Basic URL Normalizer (urlnormalizer-basic)
12/02/09 10:00:26 INFO plugin.PluginRepository:     OPIC Scoring Plug-in (scoring-opic)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Http Protocol Plug-in (protocol-http)
12/02/09 10:00:26 INFO plugin.PluginRepository: Registered Extension-Points:
12/02/09 10:00:26 INFO plugin.PluginRepository:     Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
12/02/09 10:00:26 INFO plugin.PluginRepository:     HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Nutch URL Filter (org.apache.nutch.net.URLFilter)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Nutch Protocol (org.apache.nutch.protocol.Protocol)
12/02/09 10:00:26 INFO plugin.PluginRepository:     Nutch Content Parser (org.apache.nutch.parse.Parser)
12/02/09 10:00:26 INFO conf.Configuration: regex-normalize.xml not found
12/02/09 10:00:26 WARN regex.RegexURLNormalizer: Can't load the default rules!
12/02/09 10:00:26 INFO conf.Configuration: regex-urlfilter.txt not found
12/02/09 10:00:26 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:48)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at java.lang.reflect.Method.invoke(Method.java:600)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 10 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:48)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
    at java.lang.reflect.Method.invoke(Method.java:600)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 13 more
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:73)
    at java.io.BufferedReader.<init>(BufferedReader.java:88)
    at java.io.BufferedReader.<init>(BufferedReader.java:103)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:180)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:156)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
    at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:57)
    at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:72)
    ... 18 more
12/02/09 10:00:27 INFO mapred.JobClient:  map 0% reduce 0%
12/02/09 10:00:27 INFO mapred.JobClient: Job complete: job_local_0001
12/02/09 10:00:27 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at Try.main(Try.java:13)


Thank you,

Haggai

Re: WARN regex.RegexURLNormalizer: Can't load the default rules! during Nutch Crawl

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Haggai,

Which version of Nutch are you using?
It appears that regex-urlfilter.txt is missing from your configuration
directory; your log also reports regex-normalize.xml as not found. A
template for the filter file is here:

http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
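
Since the crawl is started from your own main class (Try.main in the trace),
it may be worth checking what your classpath can actually see before running
it. Below is a minimal sketch, not anything shipped with Nutch: ConfCheck and
its method names are made up for illustration, and it assumes that the
configuration files are looked up by name through the class loader, which is
what the "not found" lines in the log suggest. It also shows why the trace
ends in a NullPointerException: wrapping a null Reader in a BufferedReader
fails in the Reader constructor, the same frames as at the bottom of the
trace.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or Hadoop.
final class ConfCheck {

    /** Returns the resource names that the given class loader cannot resolve. */
    static List<String> findMissing(ClassLoader loader, String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            URL url = loader.getResource(name);
            if (url == null) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Shows that wrapping a null Reader fails with a NullPointerException. */
    static boolean nullReaderThrowsNpe() {
        try {
            new BufferedReader((Reader) null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        // The two file names below are taken from the "not found" log lines.
        List<String> missing =
                findMissing(loader, "regex-urlfilter.txt", "regex-normalize.xml");
        if (missing.isEmpty()) {
            System.out.println("Both files are visible on the classpath.");
        } else {
            System.out.println("Not on the classpath: " + missing);
        }
        System.out.println("null Reader causes NPE: " + nullReaderThrowsNpe());
    }
}
```

If this reports either file as missing when run with the same classpath as
your Try class, adding the directory that holds your Nutch conf files to that
classpath is the first thing I would try.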


-- 
*Lewis*