Posted to user@nutch.apache.org by nu...@joergsandl.com on 2009/06/09 17:20:47 UTC

After test -> how to crawl WWW continuously?

Hi

Nutch is installed and tested successfully.
How do I get my web server to crawl the WWW continuously?
I guess I will have to start a script from crontab every minute?
How does such a script work, and what needs to go in it?
I tried, but it doesn't work because of the paths (I just
put the test routine in a file).

Many thx
Jo

Re: After test -> how to crawl WWW continuously?

Posted by nu...@joergsandl.com.

I run it with bash
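
The message `Syntax error: Bad for loop variable` is what dash (the default `/bin/sh` on Debian and Ubuntu) typically prints for a C-style `for ((...))` loop, so it may be worth confirming which interpreter the script actually starts under. A minimal sketch, assuming the script is the `nutchcrawler.sh` from the error output; the helper name `script_interpreter` is made up for illustration:

```shell
#!/bin/sh
# Print the interpreter named on a script's first line (its shebang),
# or "none" when the file does not start with "#!".
# script_interpreter is a hypothetical helper, not part of Nutch.
script_interpreter() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        '#!'*) printf '%s\n' "${first_line#??}" ;;  # drop the leading "#!"
        *)     printf 'none\n' ;;
    esac
}

# Example: script_interpreter ./nutchcrawler.sh
```

If this prints `/bin/sh`, or if the script is launched as `sh nutchcrawler.sh` (which ignores the first line altogether), the bash-only loop can fail on systems where `sh` is not bash. Running `bash ./nutchcrawler.sh` or changing the first line to `#!/bin/bash` would force bash.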


Quoting Susam Pal <su...@gmail.com>:

> On Wed, Jun 10, 2009 at 12:29 AM, <nu...@joergsandl.com> wrote:
>> I used the recrawl script, but it caused errors:
>>
>> runbot: ./nutchcrawler.sh could not find environment variable NUTCH_HOME
>> runbot: NUTCH_HOME=/nutch has been set by the script
>> runbot: ./nutchcrawler.sh could not find environment variable NUTCH_HOME
>> runbot: CATALINA_HOME=/usr/local/tomcat has been set by the script
>> ----- Inject (Step 1 of 8) -----
>> Injector: starting
>> Injector: crawlDb: crawl/crawldb
>> Injector: urlDir: urls
>> Injector: Converting injected urls to crawl db entries.
>> Injector: Merging injected urls into crawl db.
>> Injector: done
>> ----- Generate, Fetch, Parse, Update (Step 2 of 8) -----
>> ./nutchcrawler.sh: 56: Syntax error: Bad for loop variable
>>
>> --------------------LINE 56 is-----------------
>> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
>> for((i=0; i < $depth; i++))
>> do
>> ------------------------------------------------
>
> This statement should run fine in bash. What shell are you using to
> run this? I never tested this script for anything other than bash. If
> you are using some other shell, you might want to rewrite the loop
> using a while or something else that works for your shell.
>
> Regards,
> Susam Pal
>

Re: After test -> how to crawl WWW continuously?

Posted by Susam Pal <su...@gmail.com>.

On Wed, Jun 10, 2009 at 12:29 AM, <nu...@joergsandl.com> wrote:
> I used the recrawl script, but it caused errors:
>
> runbot: ./nutchcrawler.sh could not find environment variable NUTCH_HOME
> runbot: NUTCH_HOME=/nutch has been set by the script
> runbot: ./nutchcrawler.sh could not find environment variable NUTCH_HOME
> runbot: CATALINA_HOME=/usr/local/tomcat has been set by the script
> ----- Inject (Step 1 of 8) -----
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> ----- Generate, Fetch, Parse, Update (Step 2 of 8) -----
> ./nutchcrawler.sh: 56: Syntax error: Bad for loop variable
>
> --------------------LINE 56 is-----------------
> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> for((i=0; i < $depth; i++))
> do
> ------------------------------------------------

This statement should run fine in bash. What shell are you using to
run this? I never tested this script for anything other than bash. If
you are using some other shell, you might want to rewrite the loop
using a while or something else that works for your shell.

Regards,
Susam Pal
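
One possible rewrite of the loop that avoids the bash-only `for ((...))` syntax and uses only POSIX `sh` constructs. This is a sketch, not the Wiki script itself: the `run_rounds` wrapper and the `echo` body are placeholders, and the real script would run its generate, fetch, parse, and update commands inside the loop.

```shell
#!/bin/sh
# Portable counterpart of: for((i=0; i < $depth; i++))
# run_rounds is a hypothetical wrapper used only to keep the sketch self-contained.
run_rounds() {
    depth=$1
    i=0
    while [ "$i" -lt "$depth" ]; do
        # Placeholder for the generate / fetch / parse / update commands.
        echo "round $i"
        i=$((i + 1))
    done
}

# Example: run_rounds 2
```

The counter is incremented with `$((...))` arithmetic expansion, which POSIX shells such as dash support, so the loop behaves the same under `sh` and `bash`.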

Re: After test -> how to crawl WWW continuously?

Posted by nu...@joergsandl.com.

I used the recrawl script, but it caused errors:

runbot: ./nutchcrawler.sh could not find environment variable NUTCH_HOME
runbot: NUTCH_HOME=/nutch has been set by the script
runbot: ./nutchcrawler.sh could not find environment variable NUTCH_HOME
runbot: CATALINA_HOME=/usr/local/tomcat has been set by the script
----- Inject (Step 1 of 8) -----
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
----- Generate, Fetch, Parse, Update (Step 2 of 8) -----
./nutchcrawler.sh: 56: Syntax error: Bad for loop variable

--------------------LINE 56 is-----------------
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
------------------------------------------------

What went wrong?


Quoting Raymond Balmès <ra...@gmail.com>:

> There is a "recrawl" script on the Wiki, but a crawl usually takes hours or
> even days, so why would you want to start one every minute?
> If you are looking to crawl the entire web, you will probably need a couple
> of servers, plus plenty of disk space and bandwidth.
>
> I think you need to describe in more detail what kind of crawl you want to
> do.
>
> -Ray-
>
> 2009/6/9 <nu...@joergsandl.com>
>
>> Hi
>>
>> Nutch is installed and tested successfully.
>> How do I get my web server to crawl the WWW continuously?
>> I guess I will have to start a script from crontab every minute?
>> How does such a script work, and what needs to go in it?
>> I tried, but it doesn't work because of the paths (I just put the
>> test routine in a file).
>>
>> Many thx
>> Jo
>>
>

Re: After test -> how to crawl WWW continuously?

Posted by Raymond Balmès <ra...@gmail.com>.

There is a "recrawl" script on the Wiki, but a crawl usually takes hours or
even days, so why would you want to start one every minute?
If you are looking to crawl the entire web, you will probably need a couple
of servers, plus plenty of disk space and bandwidth.

I think you need to describe in more detail what kind of crawl you want to
do.

-Ray-
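
Since a crawl can run for hours, a cron-driven setup would need to avoid starting a second crawl while the first is still running, and cron's minimal environment means paths should be absolute. A minimal sketch under those assumptions: the `run_once` helper, the lock directory, the wrapper path, and the schedule are all illustrative, while `NUTCH_HOME=/nutch` and `nutchcrawler.sh` come from the error output posted in this thread.

```shell
#!/bin/sh
# Run a command only if no earlier run still holds the lock.
# mkdir either creates the directory or fails atomically, so it serves as a simple lock.
# run_once is a hypothetical helper, not part of Nutch.
run_once() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still active, skipping"
        return 0
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Assumed usage inside a wrapper script that cron calls:
#   NUTCH_HOME=/nutch; export NUTCH_HOME
#   cd "$NUTCH_HOME" && run_once /tmp/nutch-crawl.lock bash /path/to/nutchcrawler.sh
# with a crontab line such as (schedule and wrapper path are illustrative):
#   0 2 * * * /path/to/wrapper.sh >> /tmp/nutch-crawl.log 2>&1
```

The `cd` matters because the script output shows relative paths (`crawl/crawldb`, `urls`). If a crawl is killed, the lock directory has to be removed by hand; `flock` from util-linux is a sturdier alternative where it is available.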

2009/6/9 <nu...@joergsandl.com>

> Hi
>
> Nutch is installed and tested successfully.
> How do I get my web server to crawl the WWW continuously?
> I guess I will have to start a script from crontab every minute?
> How does such a script work, and what needs to go in it?
> I tried, but it doesn't work because of the paths (I just put the
> test routine in a file).
>
> Many thx
> Jo
>