Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/11/24 23:13:15 UTC
need to override refetch intervals
In order to achieve the most timely crawling of news sites, I want to be able to manipulate the refetch intervals and scores in the crawl db. I thought I could accomplish that by re-injecting the urls that should be re-fetched most often. According to the documentation, it seems I should be able to do that using the db.injector.overwrite property. However, it does not actually work for me.
Here is the injection command I use:
$NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=10 -D db.injector.overwrite=true -D db.fetch.interval.default=1800 /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt
After re-injecting, I inspect the crawldb dump and see that the intervals and scores have not been overwritten. I have also tried db.injector.update=true, with similar results.
I suspect that my db.fetch.interval.default does not affect existing urls. Is there any way to change the refetch intervals of existing urls?
For a test case, one could inject a few of the following urls, crawl several iterations, and then inject all of them. The result should be that all of them have the 1800 interval.
http://mobile.reuters.com/
http://mobile.reuters.com/business
http://mobile.reuters.com/finance
http://mobile.reuters.com/news/entertainment
http://mobile.reuters.com/news/entertainment/arts
http://mobile.reuters.com/news/environment
http://mobile.reuters.com/news/health
http://mobile.reuters.com/news/lifestyle
http://mobile.reuters.com/news/oddlyEnough
http://mobile.reuters.com/news/science
http://mobile.reuters.com/news/sports
http://mobile.reuters.com/news/technology
http://mobile.reuters.com/news/us
http://mobile.reuters.com/news/world
http://mobile.reuters.com/politics
http://www.reuters.com/subjects/healthcare
https://www.reuters.com/
https://www.reuters.com/energy-environment
https://www.reuters.com/finance
https://www.reuters.com/money
https://www.reuters.com/news/entertainment
https://www.reuters.com/news/health
https://www.reuters.com/news/technology
https://www.reuters.com/news/world
https://www.reuters.com/politics
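One way to run the check described above is to filter a plain-text CrawlDb dump (as produced by `nutch readdb <crawldb> -dump <out_dir>`) for records whose retry interval differs from the expected value. The following is only a sketch: it assumes each dump record starts with a line of the form `<url><TAB>Version: N` and contains a line `Retry interval: <seconds> seconds (...)`. The function name `check_intervals` and the two-record sample are hypothetical and exist only to exercise the filter.

```shell
#!/bin/sh
# Sketch: print "<url> <interval>" for every record in a CrawlDb text dump
# whose retry interval differs from EXPECTED (seconds).
# Assumed record layout: "<url>\tVersion: N" followed by a line
# "Retry interval: <seconds> seconds (<days> days)".
EXPECTED="${EXPECTED:-1800}"

check_intervals() {
  awk -F '\t' -v want="$EXPECTED" '
    /\tVersion: /       { url = $1 }              # remember the current URL
    /^Retry interval: / { split($0, f, " ")       # f[3] holds the seconds
                          if (f[3] != want) print url, f[3] }
  '
}

# Hypothetical sample records, NOT real crawl output.
sample_dump() {
  printf 'http://mobile.reuters.com/\tVersion: 7\n'
  printf 'Retry interval: 1800 seconds (0 days)\n'
  printf 'http://mobile.reuters.com/business\tVersion: 7\n'
  printf 'Retry interval: 2592000 seconds (30 days)\n'
}

mismatches=$(sample_dump | check_intervals)
echo "$mismatches"
```

Against a real dump, the part files written by readdb would be piped into `check_intervals` instead of `sample_dump`; an empty result would mean every URL has the expected interval.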
Re: need to override refetch intervals
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,
> http://mobile.reuters.com/ nutch.score=100 nutch.fetchInterval=1800
works (make sure you have tabs as separators).
Of course, if the URLs are already in CrawlDb you need to "overwrite" them:
  nutch inject ... -overwrite
(-D db.injector.overwrite=true does not work because the property is overwritten by
the -overwrite command-line flag, or set to false if -overwrite is absent ;( )
or "update" them:
  nutch inject ... -update
(-update will only overwrite the fetch interval if it's not the default,
otherwise it preserves the fetch interval, which might have been changed adaptively.)
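To make the tab requirement concrete, here is a minimal sketch that writes the seed lines with real TAB characters and checks the field count before re-injecting. The scratch directory and variable names are assumptions for illustration; the CrawlDb path is the one used earlier in this thread. The script only prints the inject command instead of running it, since in deploy mode the seed path would have to be readable by the Hadoop job.

```shell
#!/bin/sh
# Sketch: build a TAB-delimited seeds file and sanity-check it.
# SEED_DIR is a hypothetical scratch location.
SEED_DIR=$(mktemp -d)
SEED_FILE="$SEED_DIR/reuters.txt"
CRAWLDB=/crawls/news0/data/crawldb

# printf with \t guarantees real TABs between the URL and each metadata pair.
: > "$SEED_FILE"
for url in http://mobile.reuters.com/ http://mobile.reuters.com/business; do
  printf '%s\tnutch.score=100\tnutch.fetchInterval=1800\n' "$url" >> "$SEED_FILE"
done

# Count lines that do not split into exactly 3 TAB-separated fields.
bad_lines=$(awk -F '\t' 'NF != 3 { n++ } END { print n + 0 }' "$SEED_FILE")
echo "lines without exactly 3 TAB-separated fields: $bad_lines"

# Print (rather than run) the re-injection command; note that the
# -overwrite flag follows the two positional arguments.
echo "nutch inject $CRAWLDB $SEED_FILE -overwrite"
```

If the count is not 0, the separators were most likely converted to spaces by an editor or by copy and paste.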
Best,
Sebastian
On 11/27/2017 09:23 PM, Michael Coffey wrote:
> I also tried including metadata in the seeds file (TAB-delimited) as follows.
>
> http://mobile.reuters.com/ nutch.score=100 nutch.fetchInterval=1800
> http://mobile.reuters.com/business nutch.score=100 nutch.fetchInterval=1800
>
> So, I am still looking for a way to manipulate the refetch intervals and scores in the crawl db.
Re: need to override refetch intervals
Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
I also tried including metadata in the seeds file (TAB-delimited) as follows.
http://mobile.reuters.com/ nutch.score=100 nutch.fetchInterval=1800
http://mobile.reuters.com/business nutch.score=100 nutch.fetchInterval=1800
So, I am still looking for a way to manipulate the refetch intervals and scores in the crawl db.