
Posted to user@nutch.apache.org by "christoph-maximilian.pfluegler@stud.uni-bamberg.de" <ch...@stud.uni-bamberg.de> on 2007/12/18 12:35:29 UTC

adding domain to recrawl

Hi there,


I have the following problem to solve:

I have already crawled a couple of domains and can recrawl them regularly. But what if I want to add further domains to my crawl later on?

I can imagine two solutions:

1. Add the new domain to the "crawldb" so that it is picked up during the recrawl process. My doubt about this approach is that I probably cannot specify the crawl depth and a crawl filter.

2. (which I would prefer): Crawl the new domain as usual and merge this crawl into the existing one. The problem with this solution is that the merge-crawl script provided on the Nutch homepage merges two crawls into a NEW one. This is a problem because the "injection" of the new domain would happen while the system is running, so changing the corresponding property file is not possible (usually a Tomcat restart is required for the changes to take effect?). So the question is whether there is a way to merge a new crawl into an EXISTING one.

I would appreciate your help a lot!

Regards,
Chris

Re: adding domain to recrawl

Posted by Susam Pal <su...@gmail.com>.

For point (1), isn't the "bin/nutch freegen" command enough for what you want?
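
A rough sketch of how that could look, assuming the usual Nutch 1.x style command line. The paths (crawl/crawldb, crawl/segments, new_urls) and the helper names run and add_domain are made up for illustration, and the script only prints the commands unless DRY_RUN=0 is set. Depending on the Nutch version and configuration, a separate parse step may be needed between fetch and updatedb.

```shell
#!/bin/sh
# Sketch only: feed a new domain into an existing crawl via freegen.
# All paths and function names below are illustrative assumptions.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"      # existing crawldb (assumed location)
SEGMENTS="${SEGMENTS:-crawl/segments}"   # existing segments dir (assumed location)
URL_DIR="${URL_DIR:-new_urls}"           # dir with text files listing the new seed URLs
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

add_domain() {
    # Generate a fetch segment directly from the seed files;
    # -filter applies the configured URL filters to the seeds.
    run "$NUTCH" freegen "$URL_DIR" "$SEGMENTS" -filter

    # Segment names are timestamps, so the newest one sorts last.
    if [ "$DRY_RUN" = "1" ]; then
        segment="$SEGMENTS/<newest>"
    else
        segment="$SEGMENTS/$(ls "$SEGMENTS" | sort | tail -n 1)"
    fi

    # Fetch the new segment, then fold the results into the existing crawldb.
    run "$NUTCH" fetch "$segment"
    run "$NUTCH" updatedb "$CRAWLDB" "$segment"
}
```

Calling add_domain once covers a single round for the new seeds. After updatedb the newly discovered links are in the existing crawldb, so the regular recrawl should pick them up from there.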

Regards,
Susam Pal
