Posted to user@nutch.apache.org by Arthur Yarwood <ar...@fubaby.com> on 2015/07/10 00:15:50 UTC

Filtering at index time (with a different regex-urlfilter.txt from crawl)

I'm attempting with Nutch 1.10 to filter the urls I crawl 
differently from the urls I index (into Solr). The usual scenario for 
seed sites tends to be of the format home page -> list page -> product 
page. I have the home pages as my seeds, but only want to index and 
search the deep leaf product pages.

I'm going for the approach of having a fairly open regex-urlfilter.txt 
at crawl time to ensure all pages are crawled, while applying a more 
selective regex-urlfilter.txt file at index time, by specifying 
-Durlfilter.regex.file=myfile.txt on the command line.
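To make the split concrete, the two filter files might look something like this (a minimal sketch with a hypothetical site layout and file names, not my actual rules):

```shell
# Sketch: a permissive crawl-time filter plus a selective index-time filter.
# Site layout and rules are hypothetical.
workdir=$(mktemp -d) && cd "$workdir"

# Crawl-time filter: open enough that list pages are followed to reach products.
cat > regex-urlfilter.txt <<'EOF'
# skip common binary formats
-\.(gif|jpg|png|pdf|zip)$
# accept everything else on the seed site
+^https?://www\.example\.com/
EOF

# Index-time filter: only deep product pages reach Solr; reject the rest.
cat > regex-urlfilter-index.txt <<'EOF'
+^https?://www\.example\.com/products/[^/]+\.html$
-.
EOF
```

The last rule in the index-time file is a catch-all reject, so anything not explicitly accepted is dropped before it reaches the index writer.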

First problem:-
bin/crawl needed some tweaking to only add this -Durlfilter.regex.file 
Java property to the index call, rather than to all calls to nutch.
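The tweak was roughly along these lines (a sketch, not the actual bin/crawl diff; the variable and function names here are made up):

```shell
# Sketch of the bin/crawl change: keep a separate option set that is
# appended only when building the index command, so the other nutch
# invocations never see the index-time filter file.
COMMON_OPTIONS="-D mapred.reduce.tasks=2"
INDEX_ONLY_OPTIONS="-Durlfilter.regex.file=regex-urlfilter-index.txt"

build_index_cmd() {
  # only the index step gets INDEX_ONLY_OPTIONS
  echo "bin/nutch index $COMMON_OPTIONS $INDEX_ONLY_OPTIONS $*"
}

build_fetch_cmd() {
  echo "bin/nutch fetch $COMMON_OPTIONS $*"
}

build_index_cmd crawl/linkdb crawl/segments/20150709222749
```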

Second problem:-
bin/nutch index ... -filter -normalize
doesn't appear to be a valid command line:-

/opt/nutch/bin/nutch index 
-Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt 
-Dsolr.server.url=http://localhost:8983/solr/HFH_Test1 
/opt/nutch/crawl/linkdb /opt/nutch/crawl/segments/20150709222749 -filter 
-normalize
Indexer: starting at 2015-07-09 22:28:06
Indexer: deleting gone documents: false
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
     solr.server.url : URL of the SOLR instance (mandatory)
     solr.commit.size : buffer size when sending to SOLR (default 1000)
     solr.mapping.file : name of the mapping file for fields (default 
solrindex-mapping.xml)
     solr.auth : use authentication (default false)
     solr.auth.username : username for authentication
     solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)

Error running:
   /opt/nutch/bin/nutch index 
-Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt 
-Dsolr.server.url=http://localhost:8983/solr/HFH_Test1 
/opt/nutch/crawl/linkdb /opt/nutch/crawl/segments/20150709222749 -filter 
-normalize
Failed with exit value 255.

Command worked if I removed the '-filter -normalize' arguments. But of 
course nowt was filtered. Were these args really the issue, or were they 
indeed used, but I then hit some subsequent blocker?

Then I noticed that patch NUTCH-1300 appears to address this by adding 
config options:-
    public static final String INDEXER_DELETE = "indexer.delete";
+  public static final String URL_FILTERING = "indexer.url.filters";
+  public static final String URL_NORMALIZING = "indexer.url.normalizers";

I've stuck these in my nutch-site.xml, but the command-line call to 
bin/nutch index still informs me they're false.
Indexer: URL filtering: false
Indexer: URL normalizing: false

Bit odd, as specifying them on the command line above seems to set them 
to true...

So, how does one enable url filtering/normalizing within the nutch index 
command?

Third problem:
Gave up on the index pluggable interface, and attempted to swap the 
index command in bin/crawl to:-
bin/nutch solrindex ...
However, no matter where I add the argument -Durlfilter.regex.file=myfile, 
it falls over after misinterpreting the argument order.
Do you have a concrete example of how to call bin/nutch solrindex with 
-Durlfilter.regex.file=myfile ?
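For anyone reading the archive: my understanding is that the -D properties have to come immediately after the command name, before the Solr URL and the other positional arguments. A hedged sketch of the ordering I'd expect for the old solrindex syntax (paths and the collection name are the ones from this thread; I haven't verified this exact form against a live install):

```shell
# Illustrative argument order only; run this against a real Nutch install.
# -D properties first, then positional args, then the boolean flags.
cmd="bin/nutch solrindex -Durlfilter.regex.file=regex-urlfilter-index.txt"
cmd="$cmd http://localhost:8983/solr/HFH_Test1 crawl/crawldb"
cmd="$cmd -linkdb crawl/linkdb crawl/segments/20150709222749 -filter -normalize"
echo "$cmd"
```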

Thanks!

-- 
Arthur Yarwood
http://www.fubaby.com


Re: Filtering at index time (with a different regex-urlfilter.txt from crawl)

Posted by Arthur Yarwood <ar...@fubaby.com>.
On 10/07/2015 12:42, Sebastian Nagel wrote:
> principally your approach should work. But like all config files, the
> indexing url filter file is loaded from the class path. An absolute
> path does not work:
>    ...
> -Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt
> If the file is properly deployed to $NUTCH_HOME/conf/ in local mode,
> or contained in the job file for distributed mode, it should be enough
> to specify the pure filename:
>    ... -Durlfilter.regex.file=regex-urlfilter-index.txt
Good spot! Yep, removing the absolute path and just using the filename 
seems to have fixed it and indexing into Solr worked without error.
Many thanks!

-- 
Arthur Yarwood
http://www.fubaby.com


Re: Filtering at index time (with a different regex-urlfilter.txt from crawl)

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Arthur,

principally your approach should work. But like all config files, the
indexing url filter file is loaded from the class path. An absolute path
does not work:
  ...
-Durlfilter.regex.file=/opt/nutch/bin/../conf/regex-urlfilter-index.txt
If the file is properly deployed to $NUTCH_HOME/conf/ in local mode, or
contained in the job file for distributed mode, it should be enough to
specify the pure filename:
  ... -Durlfilter.regex.file=regex-urlfilter-index.txt
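That would also explain the NullPointerException deep in RegexURLFilterBase.readRules: the filter asks for the resource by name, gets null back, and the Reader constructor throws. As a rough analogy of how a classpath-style lookup behaves (directory names and the helper are purely illustrative):

```shell
# Analogy only: a classpath lookup searches configured directories for a
# bare resource name; it never resolves absolute filesystem paths, so an
# absolute path "finds" nothing (Nutch then ends up with a null Reader).
NUTCH_HOME=$(mktemp -d)
mkdir -p "$NUTCH_HOME/conf"
echo '+.' > "$NUTCH_HOME/conf/regex-urlfilter-index.txt"

load_conf_resource() {
  name=$1
  for dir in "$NUTCH_HOME/conf" "$NUTCH_HOME/plugins"; do
    if [ -f "$dir/$name" ]; then
      cat "$dir/$name"
      return 0
    fi
  done
  return 1   # resource not found
}
```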


> Then noticed that Patch NUTCH-1300 appears to address this by adding
config options:-
>   public static final String INDEXER_DELETE = "indexer.delete";
>+  public static final String URL_FILTERING = "indexer.url.filters";
>+  public static final String URL_NORMALIZING = "indexer.url.normalizers";
> I've stuck these in my nutch-site.xml, but the command-line call to
> bin/nutch index still informs me they're false.
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
These properties are only used to pass command-line options to the
MapReduce jobs.
They are always overwritten by the command-line options (true if -filter,
false otherwise).
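Sketched in shell for illustration (a paraphrase of the option handling in IndexingJob, not the actual Java code):

```shell
# Paraphrase: the indexer derives indexer.url.filters / indexer.url.normalizers
# purely from the presence of the flags on each run, so values set in
# nutch-site.xml never survive.
parse_index_flags() {
  filter=false
  normalize=false
  for arg in "$@"; do
    case "$arg" in
      -filter)    filter=true ;;
      -normalize) normalize=true ;;
    esac
  done
}

parse_index_flags -filter -normalize
echo "Indexer: URL filtering: $filter"
echo "Indexer: URL normalizing: $normalize"
```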

Cheers,
Sebastian


2015-07-10 0:24 GMT+02:00 Arthur Yarwood <ar...@fubaby.com>:

> If it helps, here's the full stack trace from trying to call:
> bin/nutch index .... -filter -normalize

Re: Filtering at index time (with a different regex-urlfilter.txt from crawl)

Posted by Arthur Yarwood <ar...@fubaby.com>.
If it helps, here's the full stack trace from trying to call:
bin/nutch index .... -filter -normalize


2015-07-09 23:19:02,727 INFO  indexer.IndexingJob - Indexer: starting at 
2015-07-09 23:19:02
2015-07-09 23:19:02,802 INFO  indexer.IndexingJob - Indexer: deleting 
gone documents: false
2015-07-09 23:19:02,802 INFO  indexer.IndexingJob - Indexer: URL 
filtering: true
2015-07-09 23:19:02,802 INFO  indexer.IndexingJob - Indexer: URL 
normalizing: true
2015-07-09 23:19:03,152 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-07-09 23:19:03,152 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
     solr.server.url : URL of the SOLR instance (mandatory)
     solr.commit.size : buffer size when sending to SOLR (default 1000)
     solr.mapping.file : name of the mapping file for fields (default 
solrindex-mapping.xml)
     solr.auth : use authentication (default false)
     solr.auth.username : username for authentication
     solr.auth.password : password for authentication


2015-07-09 23:19:03,154 INFO  indexer.IndexerMapReduce - 
IndexerMapReduce: crawldb: /opt/nutch/crawl/linkdb
2015-07-09 23:19:03,154 INFO  indexer.IndexerMapReduce - 
IndexerMapReduces: adding segment: /opt/nutch/crawl/segments/20150709231809
2015-07-09 23:19:03,257 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes 
where applicable
2015-07-09 23:19:03,678 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2015-07-09 23:19:04,064 WARN  mapred.LocalJobRunner - 
job_local1744190808_0001
java.lang.Exception: java.lang.RuntimeException: Error in configuring object
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: Error in configuring object
     at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
     at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
     at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
     at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
     at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
     ... 10 more
Caused by: java.lang.RuntimeException: Error in configuring object
     at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
     at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
     at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
     ... 15 more
Caused by: java.lang.reflect.InvocationTargetException
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
     at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
     at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
     at java.lang.reflect.Method.invoke(Method.java:606)
     at 
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
     ... 18 more
Caused by: java.lang.NullPointerException
     at java.io.Reader.<init>(Reader.java:78)
     at java.io.BufferedReader.<init>(BufferedReader.java:94)
     at java.io.BufferedReader.<init>(BufferedReader.java:109)
     at 
org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:204)
     at 
org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:176)
     at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
     at 
org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
     at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:30)
     at 
org.apache.nutch.indexer.IndexerMapReduce.configure(IndexerMapReduce.java:99)
     ... 23 more
2015-07-09 23:19:04,535 ERROR indexer.IndexingJob - Indexer: 
java.io.IOException: Job failed!
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)




On 09/07/2015 23:15, Arthur Yarwood wrote:
> I'm attempting with Nutch 1.10 to try and filter the urls I crawl
> differently to the urls I index (into Solr).


-- 
Arthur Yarwood
http://www.fubaby.com