Posted to user@nutch.apache.org by Murat Ali Bayir <mu...@agmlab.com> on 2006/05/22 13:50:56 UTC

WhiteListBlackList

Hi, I have a problem when using black/white-list URL filtering. I have two directories for filtering,
called NegativeURLS and PositiveURLS

*****************************************************************************************
in NegativeURLS, I have
www.hurriyet.com.tr

in PositiveURLS, I have
www.milliyet.com.tr

*****************************************************************************************
In the input directory for Crawl operation, I have
www.hurriyet.com.tr
www.milliyet.com.tr

I run the following commands from shell.

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black

Then I run inject, generate, and fetch. After that I run the following:

$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/

Finally I run GenericReader and print the output; it contains the URLs that are in the blacklist.
What could the problem be?






Re: Run-Time Error

Posted by Dennis Kubes <nu...@dragonflymc.com>.
On the launcher, under classpath, you will need to add the directory above 
plugins.  Make sure this is set on the Eclipse launcher, though; setting it 
on the project won't help.
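If the launcher classpath alone does not do it, pointing Nutch at the plugins directory from the configuration usually has the same effect. Below is a sketch of a nutch-site.xml fragment; the absolute path and the exact plugin list are examples to adapt to your own checkout, not values from this thread:

```xml
<!-- Sketch for nutch-site.xml; adjust the path to your own checkout. -->
<property>
  <name>plugin.folders</name>
  <!-- Absolute path to the directory that contains the plugin folders. -->
  <value>C:/nutch-0.8/plugins</value>
</property>

<property>
  <name>plugin.includes</name>
  <!-- The URL filter plugin must match this pattern, otherwise URLFilters
       fails with "org.apache.nutch.net.URLFilter not found". -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```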

TDLN wrote:
> Did you add the plugins directory to your classpath and does it
> contain all of your plugins?
>
> Rgrds, Thomas
>
> On 5/23/06, Murat Ali Bayir <mu...@agmlab.com> wrote:
>> Hi everybody, I am running Nutch 0.8 under Windows using Eclipse.
>> I got the following error.  I added the conf directory to my classpath,
>> and I changed
>> nutch-site.xml to add the regex-url filter there. What can be the reason
>> for the following error?
>>
>> java.lang.RuntimeException:
>> org.apache.nutch.net.URLFilter not found.
>>         at
>> org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
>>         at
>> org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
>>         at
>> org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>>         at
>> org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>>         at
>> org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>>         at
>> org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
>>         at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
>> Exception in thread "main" java.io.IOException: Job
>> failed!
>>         at
>> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>>         at
>> org.apache.nutch.crawl.Injector.inject(Injector.java:130)
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
>>
>>
>>
>>
>>

Re: Run-Time Error

Posted by TDLN <di...@gmail.com>.
Did you add the plugins directory to your classpath and does it
contain all of your plugins?

Rgrds, Thomas

On 5/23/06, Murat Ali Bayir <mu...@agmlab.com> wrote:
> Hi everybody, I am running Nutch 0.8 under Windows using Eclipse.
> I got the following error.  I added the conf directory to my classpath,
> and I changed
> nutch-site.xml to add the regex-url filter there. What can be the reason
> for the following error?
>
> java.lang.RuntimeException:
> org.apache.nutch.net.URLFilter not found.
>         at
> org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
>         at
> org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
>         at
> org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>         at
> org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
>         at
> org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
>         at
> org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
> Exception in thread "main" java.io.IOException: Job
> failed!
>         at
> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
>         at
> org.apache.nutch.crawl.Injector.inject(Injector.java:130)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
>
>
>
>
>

Changing db data

Posted by Bogdan Kecman <bo...@alteray.com>.
Hi,
I'm writing a small utility to amend the data in the Nutch database. I managed
to read the Nutch database, and I can also delete a document from the database, but
is there a way to change the value of a field in the Nutch db?

If you can just point me in the right direction; I have spent a lot of time reading
the Lucene and Nutch APIs. I can create a db from scratch and add data, but I cannot
change anything... Any ideas?

10x in advance
Bogdan


Run-Time Error

Posted by Murat Ali Bayir <mu...@agmlab.com>.
Hi everybody, I am running Nutch 0.8 under Windows using Eclipse.
I got the following error.  I added the conf directory to my classpath, 
and I changed
nutch-site.xml to add the regex-url filter there. What can be the reason
for the following error?

java.lang.RuntimeException:
org.apache.nutch.net.URLFilter not found.
	at
org.apache.nutch.net.URLFilters.<init>(URLFilters.java:47)
	at
org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:55)
	at
org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
	at
org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
	at
org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:389)
	at
org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
	at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)
Exception in thread "main" java.io.IOException: Job
failed!
	at
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
	at
org.apache.nutch.crawl.Injector.inject(Injector.java:130)
	at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)





Re: WhiteListBlackList

Posted by Murat Ali Bayir <mu...@agmlab.com>.
Marko Bauhardt wrote:

>
> Am 22.05.2006 um 13:50 schrieb Murat Ali Bayir:
>
>> Hi, I have a problem when using black/white-list URL filtering. I 
>> have two directories for filtering,
>> called NegativeURLS and PositiveURLS
>>
>> *****************************************************************************************
>> in NegativeURLS, I have
>> www.hurriyet.com.tr
>>
>> in PositiveURLS, I have www.milliyet.com.tr
>>
>> *****************************************************************************************
>> In the input directory for the crawl operation, I have
>> www.hurriyet.com.tr
>> www.milliyet.com.tr
>>
>> I run the following commands from the shell.
>>
>> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
>>
>> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
>>
>> Then I run inject, generate, and fetch. After that I run the following:
>> $ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
>>
>> Finally I run GenericReader and print the output; it contains the
>> URLs that are in the blacklist.
>> What could the problem be?
>
>
> The black/white list works only in the update process (BWUpdateDb),
> not during fetching or generating. Only the white URLs are updated
> into the crawldb.
>
> Is only www.hurriyet.com.tr in your crawldb, or are there other HTML
> pages from this host? And what is the status of these URLs
> (STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?
>
> Marko
>

The crawldb contains the following:

http://hurriyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null

http://milliyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null


Both of them are DB_unfetched.
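For reference, a dump like the one above can also be produced with the stock CrawlDbReader through the bin/nutch wrapper. A sketch, assuming Nutch 0.8 command names and using the `<crawldb>` placeholder from this thread:

```shell
# Overall status counts for the whole crawldb
$ ./nutch readdb <crawldb> -stats

# Dump every entry (URL, status, fetch time, score) as plain text
$ ./nutch readdb <crawldb> -dump crawldb-dump

# Show the CrawlDatum for a single URL
$ ./nutch readdb <crawldb> -url http://milliyet.com.tr/
```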

The positive URL is http://milliyet.com.tr;
it is in ~/URL/PositiveURLS/Positive.txt

The negative URL is http://hurriyet.com.tr;
it is in ~/URL/NegativeURLS/Negative.txt

I run the following inject commands:

 ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
 ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black

After the fetch command with the parsing option,

I run the following:

$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/


Any suggestions for the two DB_unfetched entries? I expected one of them to be fetched.


Re: WhiteListBlackList

Posted by Marko Bauhardt <mb...@media-style.com>.
Am 22.05.2006 um 13:50 schrieb Murat Ali Bayir:

> Hi, I have a problem when using black/white-list URL filtering.
> I have two directories for filtering
> called NegativeURLS and PositiveURLS
>
> *****************************************************************************************
> in NegativeURLS, I have
> www.hurriyet.com.tr
>
> in PositiveURLS, I have www.milliyet.com.tr
>
> *****************************************************************************************
> In the input directory for the crawl operation, I have
> www.hurriyet.com.tr
> www.milliyet.com.tr
>
> I run the following commands from the shell.
>
> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
>
> $ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
>
> Then I run inject, generate, and fetch. After that I run the following:
> $ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
>
> Finally I run GenericReader and print the output; it contains the
> URLs that are in the blacklist.
> What could the problem be?

The black/white list works only in the update process (BWUpdateDb),
not during fetching or generating. Only the white URLs are updated
into the crawldb.

Is only www.hurriyet.com.tr in your crawldb, or are there other HTML
pages from this host? And what is the status of these URLs
(STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?

Marko