Posted to user@nutch.apache.org by Sergey A Volkov <se...@gmail.com> on 2011/10/14 14:48:47 UTC

crawldb modifications.

Hi!

Is there any good way to modify all crawldb records? (e.g. drop score or 
force refetch).

I'm currently using Nutch 1.2 and, as far as I can see, the only way to do 
this is to write my own MapReduce job for every modification, or to change 
the CrawlDb updater and write my own extension point.
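
For what it's worth, such a one-off pass over the db is fairly small. A rough,
untested sketch of what it might look like (the class name, paths, and the
idea of swapping the output in as the new `current` directory are my own
assumptions, not an existing Nutch tool; assumes Nutch 1.2 and the old Hadoop
mapred API on the classpath):

```java
// Rough sketch of a one-off CrawlDb rewrite job. Untested; assumes the
// Nutch 1.2 / Hadoop 0.20 "mapred" API. Class name and paths are invented.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;

public class ResetCrawlDbJob {

  /** Identity map, except for the fields we want to force. */
  public static class ResetMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, CrawlDatum> output, Reporter reporter)
        throws IOException {
      datum.setScore(0.0f);                           // drop the score
      datum.setFetchTime(System.currentTimeMillis()); // make it due now
      output.collect(url, datum);
    }
  }

  /** Usage: ResetCrawlDbJob <crawldb> <outputdir> */
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(ResetCrawlDbJob.class);
    job.setJobName("reset-crawldb");
    FileInputFormat.addInputPath(job, new Path(args[0], "current"));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    job.setMapperClass(ResetMapper.class);
    // the default identity reducer keeps keys sorted for the MapFile output
    JobClient.runJob(job);
  }
}
```

After the job finishes, the output directory would be swapped in as the new 
`current` (keeping a backup of the old one).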

Sergey Volkov.

Re: SOLVED: crawldb modifications.

Posted by Sergey A Volkov <se...@gmail.com>.
Thanks.

I will try to patch CrawlDbFilter for both problems.

On 10/14/2011 05:35 PM, Markus Jelsma wrote:
>
> On Friday 14 October 2011 15:30:25 Sergey A Volkov wrote:
>> Thanks for your quick reply.
>>
>> I will try to use scoreupdater next time=)
> Keep in mind that it relies on the WebGraph program. Another quick fix would
> be to patch CrawlDbFilter to reset the score based on the presence of some
> configuration setting.
>
>> Unfortunately -addDays would not work for me because I want to refetch
>> only specific domains, not the whole db (my first question was not
>> correct). Another problem with -addDays and FetchSchedule is that I have
>> to use a generate.topN smaller than the size of the part to refetch
>> (there are some time restrictions on index updates), so I can't determine
>> when to stop using -addDays.
> If you only want to generate fetch lists for specific domains, you can use a
> custom domain URL filter with the generator.
>
> Be careful when using a filter for the generator together with DB updating,
> as you'll lose all the filtered URLs then.
>
>> On Fri 14 Oct 2011 04:52:33 PM MSK, Markus Jelsma wrote:
>>> There are no tools for resetting the score, but it would not be hard to
>>> modify an existing tool for that, e.g. WebGraph's scoreupdater tool. You
>>> can force a refetch by using the -addDays switch with the generator tool.
>>> It'll add numDays to the current time, so records that are not yet due
>>> for fetch are also generated.
>>>
>>> On Friday 14 October 2011 14:48:47 Sergey A Volkov wrote:
>>>> Hi!
>>>>
>>>> Is there any good way to modify all crawldb records? (e.g. drop score or
>>>> force refetch).
>>>>
>>>> I'm currently using Nutch 1.2 and, as far as I can see, the only way to
>>>> do this is to write my own MapReduce job for every modification, or to
>>>> change the CrawlDb updater and write my own extension point.
>>>>
>>>> Sergey Volkov.


Re: crawldb modifications.

Posted by Markus Jelsma <ma...@openindex.io>.

On Friday 14 October 2011 15:30:25 Sergey A Volkov wrote:
> Thanks for your quick reply.
> 
> I will try to use scoreupdater next time=)

Keep in mind that it relies on the WebGraph program. Another quick fix would 
be to patch CrawlDbFilter to reset the score based on the presence of some 
configuration setting.
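
Such a patch could be driven by a new property in nutch-site.xml, along these 
lines (the property name is invented for illustration, not a real Nutch 
setting; the patched CrawlDbFilter would read it with job.getBoolean(...) and 
call datum.setScore(0.0f) when it is set):

```xml
<!-- Hypothetical switch for a patched CrawlDbFilter; not a real Nutch
     property, just an illustration of the suggestion above. -->
<property>
  <name>crawldb.score.reset</name>
  <value>true</value>
</property>
```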

> 
> Unfortunately -addDays would not work for me because I want to refetch
> only specific domains, not the whole db (my first question was not
> correct). Another problem with -addDays and FetchSchedule is that I have
> to use a generate.topN smaller than the size of the part to refetch
> (there are some time restrictions on index updates), so I can't determine
> when to stop using -addDays.

If you only want to generate fetch lists for specific domains, you can use a 
custom domain URL filter with the generator.

Be careful when using a filter for the generator together with DB updating, 
as you'll lose all the filtered URLs then.
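
If the urlfilter-domain plugin is available in your version, its filter file 
can simply list the domains to keep, something like (file name and contents 
illustrative only):

```
# conf/domain-urlfilter.txt -- accept URLs from these domains only
example.com
example.org
```

Enable a filter like this only for the generate run; if your version's 
updatedb supports -noFilter, use it there so the filtered URLs are not 
dropped from the crawldb.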

> 
> On Fri 14 Oct 2011 04:52:33 PM MSK, Markus Jelsma wrote:
> > There are no tools for resetting the score, but it would not be hard to
> > modify an existing tool for that, e.g. WebGraph's scoreupdater tool. You
> > can force a refetch by using the -addDays switch with the generator tool.
> > It'll add numDays to the current time, so records that are not yet due
> > for fetch are also generated.
> > 
> > On Friday 14 October 2011 14:48:47 Sergey A Volkov wrote:
> >> Hi!
> >> 
> >> Is there any good way to modify all crawldb records? (e.g. drop score or
> >> force refetch).
> >> 
> >> I'm currently using Nutch 1.2 and, as far as I can see, the only way to
> >> do this is to write my own MapReduce job for every modification, or to
> >> change the CrawlDb updater and write my own extension point.
> >> 
> >> Sergey Volkov.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: crawldb modifications.

Posted by Sergey A Volkov <se...@gmail.com>.
Thanks for your quick reply.

I will try to use scoreupdater next time=)

Unfortunately -addDays would not work for me because I want to refetch 
only specific domains, not the whole db (my first question was not correct).
Another problem with -addDays and FetchSchedule is that I have to use a 
generate.topN smaller than the size of the part to refetch (there are some 
time restrictions on index updates), so I can't determine when to stop 
using -addDays.
On Fri 14 Oct 2011 04:52:33 PM MSK, Markus Jelsma wrote:
> There are no tools for resetting the score, but it would not be hard to
> modify an existing tool for that, e.g. WebGraph's scoreupdater tool. You can
> force a refetch by using the -addDays switch with the generator tool. It'll
> add numDays to the current time, so records that are not yet due for fetch
> are also generated.
>
> On Friday 14 October 2011 14:48:47 Sergey A Volkov wrote:
>> Hi!
>>
>> Is there any good way to modify all crawldb records? (e.g. drop score or
>> force refetch).
>>
>> I'm currently using Nutch 1.2 and, as far as I can see, the only way to
>> do this is to write my own MapReduce job for every modification, or to
>> change the CrawlDb updater and write my own extension point.
>>
>> Sergey Volkov.
>



Re: crawldb modifications.

Posted by Markus Jelsma <ma...@openindex.io>.
There are no tools for resetting the score, but it would not be hard to modify 
an existing tool for that, e.g. WebGraph's scoreupdater tool. You can force a 
refetch by using the -addDays switch with the generator tool. It'll add 
numDays to the current time, so records that are not yet due for fetch are 
also generated.
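
As a concrete example, assuming the common crawl/crawldb and crawl/segments 
layout (paths and numbers are illustrative, not from the thread):

```shell
# Select everything that becomes due within the next 30 days, i.e. run the
# generator as if the clock were 30 days ahead; -topN caps the list size.
bin/nutch generate crawl/crawldb crawl/segments -addDays 30 -topN 50000
```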

On Friday 14 October 2011 14:48:47 Sergey A Volkov wrote:
> Hi!
> 
> Is there any good way to modify all crawldb records? (e.g. drop score or
> force refetch).
> 
> I'm currently using Nutch 1.2 and, as far as I can see, the only way to
> do this is to write my own MapReduce job for every modification, or to
> change the CrawlDb updater and write my own extension point.
> 
> Sergey Volkov.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350