Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/11/24 23:13:15 UTC

need to override refetch intervals

In order to achieve the most timely crawling of news sites, I want to be able to manipulate the refetch intervals and scores in the crawl db. I thought I could accomplish that by re-injecting the urls that should be re-fetched most often. According to the documentation, it seems I should be able to do that using the db.injector.overwrite property. However, it does not actually work for me.


Here is the injection command I use:
$NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=10 -D db.injector.overwrite=true -D db.fetch.interval.default=1800 /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt

After re-injecting, I inspect the crawldb dump and see that the intervals and scores have not been overwritten. I have also tried db.injector.overwrite=true, with similar results.

I suspect that my db.fetch.interval.default does not affect existing urls. Is there any way to change the refetch intervals of existing urls?



For a test case, one could inject a few of the following urls, crawl several iterations, and then inject all of them. The result should be that all of them have the 1800 interval.

http://mobile.reuters.com/
http://mobile.reuters.com/business
http://mobile.reuters.com/finance
http://mobile.reuters.com/news/entertainment
http://mobile.reuters.com/news/entertainment/arts
http://mobile.reuters.com/news/environment
http://mobile.reuters.com/news/health
http://mobile.reuters.com/news/lifestyle
http://mobile.reuters.com/news/oddlyEnough
http://mobile.reuters.com/news/science
http://mobile.reuters.com/news/sports
http://mobile.reuters.com/news/technology
http://mobile.reuters.com/news/us
http://mobile.reuters.com/news/world
http://mobile.reuters.com/politics
http://www.reuters.com/subjects/healthcare
https://www.reuters.com/
https://www.reuters.com/energy-environment
https://www.reuters.com/finance
https://www.reuters.com/money
https://www.reuters.com/news/entertainment
https://www.reuters.com/news/health
https://www.reuters.com/news/technology
https://www.reuters.com/news/world
https://www.reuters.com/politics

Re: need to override refetch intervals

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

> http://mobile.reuters.com/	nutch.score=100	nutch.fetchInterval=1800

works (make sure you have tabs as separators).

Of course, if the URLs are already in CrawlDb you need to "overwrite" them.

   nutch inject  ...   -overwrite
      -D db.injector.overwrite=true does not work because it's overwritten by
      -overwrite or is set to false if -overwrite is absent ;(

or "update"

   nutch inject  ...   -update
     (-update will only overwrite the fetch interval if it's not the default,
      otherwise it preserves the fetch interval which might have been changed adaptively)
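
A minimal sketch of the steps above, assuming a POSIX shell. The seed file location (seeds/reuters.txt, overridable through SEED_FILE) is a placeholder rather than part of any real setup, and the inject and readdb commands are left as comments so the script runs without a Nutch installation. It writes the seed lines with real TAB separators and checks them before re-injecting:

```shell
#!/bin/sh
# Hypothetical location for the seed file; adjust to your own layout.
SEED_FILE="${SEED_FILE:-seeds/reuters.txt}"
mkdir -p "$(dirname "$SEED_FILE")"

# Write one line per URL: the URL, then each metadata field, separated by TABs.
# printf expands \t to a real TAB, so no editor can silently turn it into spaces.
for url in http://mobile.reuters.com/ http://mobile.reuters.com/business; do
  printf '%s\tnutch.score=100\tnutch.fetchInterval=1800\n' "$url"
done > "$SEED_FILE"

# Sanity check: every line must have exactly three TAB-separated fields.
if awk -F '\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }' "$SEED_FILE"; then
  echo "seed file OK"
else
  echo "seed file has lines without proper TAB separators" >&2
fi

# Re-inject (not run here; the paths are the ones from the original post):
#   bin/nutch inject /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt -overwrite
# or, to keep an adaptively changed interval when the injected one is the default:
#   bin/nutch inject /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt -update
# Afterwards a single entry can be inspected with:
#   bin/nutch readdb /crawls/news0/data/crawldb -url http://mobile.reuters.com/
```

If the check reports a problem, the usual cause is spaces where TABs were intended, in which case the metadata is not parsed as separate fields.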

Best,
Sebastian



Re: need to override refetch intervals

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
I also tried including metadata in the seeds file (TAB-delimited) as follows.


http://mobile.reuters.com/      nutch.score=100 nutch.fetchInterval=1800
http://mobile.reuters.com/business      nutch.score=100 nutch.fetchInterval=1800


So, I am still looking for a way to manipulate the refetch intervals and scores in the crawl db.

