Posted to user@nutch.apache.org by "christoph-maximilian.pfluegler@stud.uni-bamberg.de" <ch...@stud.uni-bamberg.de> on 2008/01/10 14:04:33 UTC

Problem with recrawl

Hi there,

I'm currently having some weird problems with my recrawl procedure (Nutch 0.9).

The situation is the following:

First, I crawl a couple of domains. Then I start a separate crawl seeded with pages resulting from the first crawl, and finally I merge these two crawls.
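
In case it matters, the merge step is roughly the following (the directory names here are just placeholders, not my real paths):

    # merge the two crawldbs, the segments and the linkdbs into one crawl
    bin/nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb
    bin/nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*
    bin/nutch mergelinkdb merged/linkdb crawl1/linkdb crawl2/linkdb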

What I basically want to achieve now is to frequently update (refetch!) the crawl resulting from the merge procedure without adding new URLs to it. The problem is that while executing the recrawl procedure, Nutch fetches/indexes new URLs I don't want in my crawl. The -noAdditions parameter doesn't help me, because the crawldb apparently already contains more URLs than are actually indexed (and these URLs get injected right at the start). So my next approach was to use regex-urlfilter.txt, but somehow the recrawl procedure doesn't consider this file (all new URLs are fetched/indexed anyway). The parameters were depth 1 and adddays 1. If somebody knows how to limit the Nutch recrawl procedure, please let me know.
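
For reference, my recrawl script is essentially the usual generate/fetch/updatedb/index sequence; one round (depth 1) looks roughly like this, with placeholder paths and assuming the -noAdditions option is available in this version:

    # generate a fetchlist from the existing crawldb
    # (-adddays makes URLs due for fetching that many days earlier)
    bin/nutch generate merged/crawldb merged/segments -adddays 1
    segment=`ls -d merged/segments/* | tail -1`
    # fetch, then update the crawldb without adding newly discovered URLs
    bin/nutch fetch $segment
    bin/nutch updatedb merged/crawldb $segment -noAdditions
    # rebuild the linkdb and index the refetched segment
    bin/nutch invertlinks merged/linkdb -dir merged/segments
    bin/nutch index merged/indexes-new merged/crawldb merged/linkdb $segment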

The second problem I face is with the adddays parameter. A recrawl with depth 0 and adddays 31 doesn't make Nutch refetch the URLs. If I change the depth to 1, I run into the problems described above, but Nutch doesn't refetch the 'original' pages either.

So, does anybody know how to solve these problems?

Thanks for your help!

Regards,
Chris 



Re: Problem with recrawl

Posted by Susam Pal <su...@gmail.com>.
Hi,

It seems you don't want to use the URLs generated from the crawldb
itself. In that case the "bin/nutch freegen" command would be helpful
to you.
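
If I remember the usage correctly, you give it a directory of plain
text files listing the URLs to fetch and a segments directory to write
to, something like this (paths are just examples):

    # urls/ holds text files with one URL per line
    bin/nutch freegen urls merged/segments

That way the fetchlist comes from your own list instead of from the
crawldb.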

If you are using the "bin/nutch crawl" command, the
'conf/regex-urlfilter.txt' file wouldn't be used. The
'conf/crawl-urlfilter.txt' file would be used instead.
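
So if the goal is to restrict the recrawl to the domains you already
have, you could put rules along these lines into
'conf/crawl-urlfilter.txt' (the domain names below are only
placeholders):

    # accept URLs on my own domains only
    +^http://([a-z0-9]*\.)*firstdomain.de/
    +^http://([a-z0-9]*\.)*seconddomain.de/
    # reject everything else
    -.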

Regards,
Susam Pal
