Posted to user@nutch.apache.org by BELLINI ADAM <mb...@msn.com> on 2009/12/01 17:05:39 UTC

RE: recrawl.sh stopped at depth 7/10 without error

Hi,

Any ideas, guys?

Thanks

> From: mbellil@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: recrawl.sh stopped at depth 7/10 without error
> Date: Fri, 27 Nov 2009 20:11:12 +0000
> 
> 
> 
> Hi,
> 
> This is the main loop of my recrawl.sh:
> 
> 
> do
> 
>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN \
>       -adddays $adddays
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
>   fi
>   segment=`ls -d $crawl/segments/* | tail -1`
> 
>   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>     echo "runbot: Deleting segment $segment."
>     rm $RMARGS $segment
>     continue
>   fi
> 
>   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> 
> done
> 
> echo "----- Merge Segments (Step 3 of $steps) -----"
> 
> 
> 
> In my log file I never find the message "----- Merge Segments (Step 3 of $steps) -----", so something is breaking the loop and stopping the process.
> 
> I don't understand why it stops at depth 7 without any errors.
> 
> 
> > From: mbellil@msn.com
> > To: nutch-user@lucene.apache.org
> > Subject: recrawl.sh stopped at depth 7/10 without error
> > Date: Wed, 25 Nov 2009 15:43:33 +0000
> > 
> > 
> > 
> > Hi,
> > 
> > I'm running recrawl.sh and it stops every time at depth 7/10 without any error, but when I run bin/crawl with the same crawl-urlfilter and the same seeds file it finishes smoothly in 1h50.
> > 
> > I checked the hadoop.log and I don't find any error there... I just find the last URL it was parsing.
> > Does fetching or crawling have a timeout?
> > My recrawl takes 2 hours before it stops. I set the fetch interval to 24 hours and I'm running generate with adddays = 1.
> > 
> > best regards
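
For anyone hitting the same silent stop, one way to narrow it down is to log the exit status of every Nutch step inside the loop quoted above, so the log shows exactly which command ends the run. A minimal sketch, assuming a bash script; the helper name and the debug log path are only illustrative:

# Hypothetical helper: run one Nutch step, record its exit code in a debug
# log, and return that code so the existing `if [ $? -ne 0 ]` checks still work.
run_step () {
  echo "runbot: running: $*" >> "$crawl/recrawl-debug.log"
  "$@"
  local status=$?
  echo "runbot: exit=$status for: $*" >> "$crawl/recrawl-debug.log"
  return $status
}

# Inside the loop, wrap each step, for example:
#   run_step $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN -adddays $adddays
#   run_step $NUTCH_HOME/bin/nutch fetch $segment -threads $threads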

RE: recrawl.sh stopped at depth 7/10 without error

Posted by BELLINI ADAM <mb...@msn.com>.

I fixed it by putting it in crontab, and now I can sleep without thinking about it :)
Thank you very much
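
For reference, a crontab entry for this kind of job might look like the sketch below; the schedule, script path, and log path are only illustrative, not taken from this thread:

# Run the recrawl every night at 02:00, detached from any terminal;
# stdout and stderr both go into a log file (paths are hypothetical).
0 2 * * * /opt/nutch/recrawl.sh >> /var/log/nutch/recrawl.log 2>&1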



> Date: Mon, 7 Dec 2009 12:03:25 -0500
> From: ptomblin@gmail.com
> To: nutch-user@lucene.apache.org
> Subject: RE: recrawl.sh stopped at depth 7/10 without error
> 
> Try starting it with nohup.  'man nohup' for details.

RE: recrawl.sh stopped at depth 7/10 without error

Posted by BELLINI ADAM <mb...@msn.com>.
Yes, I've just tested nohup and it works :)
Thanks to all







Re: recrawl.sh stopped at depth 7/10 without error

Posted by MilleBii <mi...@gmail.com>.
As an alternative to crontab, I use the nohup command to keep my jobs running.
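
A typical nohup launch for a job like this looks like the sketch below; the script name and log path are only illustrative:

# Start the recrawl immune to the hangup signal sent when the terminal closes;
# stdout and stderr both go into the log file.
nohup ./recrawl.sh > recrawl.log 2>&1 &
# Keep the PID around so the job can be checked or stopped later.
echo $! > recrawl.pid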



-- 
-MilleBii-

RE: recrawl.sh stopped at depth 7/10 without error

Posted by BELLINI ADAM <mb...@msn.com>.
Thanks, Fuad, for the info... yes, I was just closing my laptop without exiting the SSH session.
But now I have it running from my cron and it didn't stop :)
Thanks again





RE: recrawl.sh stopped at depth 7/10 without error

Posted by Fuad Efendi <fu...@efendi.ca>.
>crawl.log 2>&1 &

You forgot 2>&1... that redirects the error output too.

Also, you need to close the SSH session _politely_ by executing "exit".
Without it, the pipe is broken and the OS will kill the process.
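
Putting both points together, the launch would look roughly like the sketch below (based on the command quoted earlier in the thread; adjust paths and arguments as needed):

# Send stdout and stderr to the log and run the crawl in the background:
./bin/nutch crawl urls -dir crawl -depth 10 > crawl.log 2>&1 &
# Then leave the SSH session cleanly, as suggested above:
exit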


Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca
Data Mining, Vertical Search





RE: recrawl.sh stopped at depth 7/10 without error

Posted by Paul Tomblin <pt...@gmail.com>.
Try starting it with nohup.  'man nohup' for details.

-- Sent from my Palm Prē

RE: recrawl.sh stopped at depth 7/10 without error

Posted by BELLINI ADAM <mb...@msn.com>.


Hi,

Maybe I found my problem. It's not a Nutch issue: I believed that when running the crawl command as a background process, closing my console would not stop it, but it seems closing the console really does kill the process.


I launched the process like this: ./bin/nutch crawl urls -dir crawl -depth 10 > crawl.log &

But even with the '&' character, closing my console kills the process.

Thanks
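
Besides nohup and cron, which come up elsewhere in this thread, two other standard ways to keep a background job alive after the terminal closes are sketched below; the commands are illustrative, not taken from the thread:

# If the crawl is already running as a background job of the current shell,
# detach it so the shell does not forward the hangup signal to it on logout:
disown -h %1

# Or start it in its own session, detached from the controlling terminal:
setsid ./bin/nutch crawl urls -dir crawl -depth 10 > crawl.log 2>&1 &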


Re: recrawl.sh stopped at depth 7/10 without error

Posted by yangfeng <ye...@gmail.com>.
I still want to know the reason.
