Posted to user@nutch.apache.org by Arthur Yarwood <ar...@fubaby.com> on 2015/07/10 00:15:50 UTC
Filtering at index time (with a different regex-urlfilter.txt from crawl)
I'm attempting with Nutch 1.10 to filter the URLs I crawl differently
from the URLs I index (into Solr). Usual scenario: seed sites tend to
follow the format home page -> list page -> product page. I have the
home pages as my seeds, but only want to index and search the deep leaf
product pages.
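As an illustration of the desired effect (plain shell standing in for Nutch's regex filter; the host and the /product/ pattern are made-up examples, not from my actual sites):

```shell
# Illustration only (not Nutch itself): keep deep product pages,
# drop the home page and the intermediate list pages.
printf '%s\n' \
  'http://shop.example.com/' \
  'http://shop.example.com/list/shoes' \
  'http://shop.example.com/product/red-shoe' \
| grep -E '^https?://[^/]+/product/.+'
```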
Going for the approach of having a fairly open regex-urlfilter.txt at
crawl time to ensure all pages are crawled, while trying to apply a more
selective regex-urlfilter.txt file at index time, by specifying
-Durlfilter.regex.file=myfile.txt on the command line.
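For reference, a selective index-time filter file might look like this (the patterns are hypothetical; regex-urlfilter rules are `+`/`-` prefixed Java regexes, and the first matching rule wins):

```
# regex-urlfilter-index.txt (sketch): accept only deep product pages
+^https?://[^/]+/product/.+
# reject everything else
-.
```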
First problem:-
bin/crawl needed some tweaking to add this -Durlfilter.regex.file Java
property only to the index call, rather than to all calls to nutch.
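A minimal sketch of that tweak (the variable names are my assumptions, not actual bin/crawl code, and `echo` stands in for the real invocation):

```shell
# Hypothetical sketch of the bin/crawl change: the extra -D property is
# built separately and passed only to the index step, not to every
# nutch call in the script.
INDEX_JAVA_OPTS="-Durlfilter.regex.file=regex-urlfilter-index.txt"
SEGMENT=crawl/segments/20150709222749
# echo stands in for the real bin/nutch invocation
echo bin/nutch index $INDEX_JAVA_OPTS crawl/linkdb "$SEGMENT" -filter -normalize
```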
Second problem:-
bin/nutch index ... -filter -normalize
doesn't appear to be a valid command line:-
/opt/nutch/bin/nutch index -Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt -Dsolr.server.url=http://localhost:8983/solr/HFH_Test1 /opt/nutch/crawl/linkdb /opt/nutch/crawl/segments/20150709222749 -filter -normalize

Indexer: starting at 2015-07-09 22:28:06
Indexer: deleting gone documents: false
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)

Error running:
/opt/nutch/bin/nutch index -Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt -Dsolr.server.url=http://localhost:8983/solr/HFH_Test1 /opt/nutch/crawl/linkdb /opt/nutch/crawl/segments/20150709222749 -filter -normalize
Failed with exit value 255.
The command worked if I removed the '-filter -normalize' arguments. But
of course nowt was filtered. Were these args really the issue, or were
they indeed used, and did I then hit some subsequent blocker?
Then noticed that Patch NUTCH-1300 appears to address this by adding
config options:-
public static final String INDEXER_DELETE = "indexer.delete";
+ public static final String URL_FILTERING = "indexer.url.filters";
+ public static final String URL_NORMALIZING = "indexer.url.normalizers";
I've stuck these in my nutch-site.xml, but the command-line call to
bin/nutch index still informs me they're false.
Indexer: URL filtering: false
Indexer: URL normalizing: false
Bit odd, as specifying them on the command line above seems to set them
to true...
So, how does one enable url filtering/normalizing within the nutch index
command?
Third problem:
Gave up on the index pluggable interface, and attempted to swap the
index command in bin/crawl to:-
bin/nutch solrindex ...
However, no matter where I add the argument -Durlfilter.regex.file=myfile,
it falls over after misinterpreting the argument order.
Do you have a concrete example of how to call bin/nutch solrindex with
-Durlfilter.regex.file=myfile ?
Thanks!
--
Arthur Yarwood
http://www.fubaby.com
Re: Filtering at index time (with a different regex-urlfilter.txt from crawl)
Posted by Arthur Yarwood <ar...@fubaby.com>.
On 10/07/2015 12:42, Sebastian Nagel wrote:
> In principle your approach should work. But like all config files, the
> indexing URL filter file is loaded from the classpath. An absolute path
> does not work:
> ...
> -Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt
> If the file is properly deployed to $NUTCH_HOME/conf/ in local mode, or
> contained in the job file for distributed mode, it should be enough to
> specify the pure filename:
> ... -Durlfilter.regex.file=regex-urlfilter-index.txt
Good spot! Yep, removing the absolute path and just using the filename
seems to have fixed it and indexing into Solr worked without error.
Many thanks!
--
Arthur Yarwood
http://www.fubaby.com
Re: Filtering at index time (with a different regex-urlfilter.txt from crawl)
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arthur,
In principle your approach should work. But like all config files, the
indexing URL filter file is loaded from the classpath. An absolute path
does not work:
...
-Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt
If the file is properly deployed to $NUTCH_HOME/conf/ in local mode, or
contained in the job file for distributed mode, it should be enough to
specify the pure filename:
... -Durlfilter.regex.file=regex-urlfilter-index.txt
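Putting that together with the command from earlier in this thread, the corrected invocation would look like the sketch below (paths and the Solr URL are the ones quoted in this thread; it assumes regex-urlfilter-index.txt sits in $NUTCH_HOME/conf/):

```
bin/nutch index -Durlfilter.regex.file=regex-urlfilter-index.txt \
  -Dsolr.server.url=http://localhost:8983/solr/HFH_Test1 \
  /opt/nutch/crawl/linkdb /opt/nutch/crawl/segments/20150709222749 \
  -filter -normalize
```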
> Then noticed that Patch NUTCH-1300 appears to address this by adding
> config options:-
> public static final String INDEXER_DELETE = "indexer.delete";
> + public static final String URL_FILTERING = "indexer.url.filters";
> + public static final String URL_NORMALIZING = "indexer.url.normalizers";
> I've stuck these in my nutch-site.xml, but the command-line call to
> bin/nutch index still informs me they're false.
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
These properties are only used to pass the command-line options into the
MapReduce jobs. They are always overwritten by the command-line options
(true if -filter is given, false otherwise), so setting them in
nutch-site.xml has no effect.
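That precedence can be illustrated with a tiny shell sketch (a hypothetical stand-in, mirroring the indexer's argument handling only in spirit):

```shell
# Sketch: the indexer's filtering flag comes solely from the command
# line; whatever nutch-site.xml says is overwritten either way.
filter_enabled() {
  enabled=false
  for arg in "$@"; do
    [ "$arg" = "-filter" ] && enabled=true
  done
  echo "$enabled"
}
filter_enabled crawl/linkdb seg -filter   # prints true
filter_enabled crawl/linkdb seg           # prints false
```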
Cheers,
Sebastian
2015-07-10 0:24 GMT+02:00 Arthur Yarwood <ar...@fubaby.com>:
> If it helps, here's the full stack trace from trying to call:
> bin/nutch index .... -filter -normalize
> [full stack trace and quoted original message snipped; see Arthur's post below]
Re: Filtering at index time (with a different regex-urlfilter.txt from crawl)
Posted by Arthur Yarwood <ar...@fubaby.com>.
If it helps, here's the full stack trace from trying to call:
bin/nutch index .... -filter -normalize
2015-07-09 23:19:02,727 INFO indexer.IndexingJob - Indexer: starting at 2015-07-09 23:19:02
2015-07-09 23:19:02,802 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
2015-07-09 23:19:02,802 INFO indexer.IndexingJob - Indexer: URL filtering: true
2015-07-09 23:19:02,802 INFO indexer.IndexingJob - Indexer: URL normalizing: true
2015-07-09 23:19:03,152 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-07-09 23:19:03,152 INFO indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication
2015-07-09 23:19:03,154 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/nutch/crawl/linkdb
2015-07-09 23:19:03,154 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: /opt/nutch/crawl/segments/20150709231809
2015-07-09 23:19:03,257 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-09 23:19:03,678 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2015-07-09 23:19:04,064 WARN mapred.LocalJobRunner - job_local1744190808_0001
java.lang.Exception: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 10 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 15 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 18 more
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.BufferedReader.<init>(BufferedReader.java:94)
    at java.io.BufferedReader.<init>(BufferedReader.java:109)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:204)
    at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:176)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
    at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:30)
    at org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:99)
    ... 23 more
2015-07-09 23:19:04,535 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
On 09/07/2015 23:15, Arthur Yarwood wrote:
> [original message quoted in full; snipped, see the top of this thread]
--
Arthur Yarwood
http://www.fubaby.com