Posted to user@nutch.apache.org by BELLINI ADAM <mb...@msn.com> on 2009/12/03 22:27:15 UTC
db.fetch.interval.default
hi,
I'm crawling my intranet and I have set db.fetch.interval.default to 5 hours, but it seems it doesn't work correctly:
<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).
  </description>
</property>
The first crawl, when the crawl directory $crawl doesn't exist yet, takes only about 2 hours (with depth=10).
But when performing the recrawl with the recrawl.sh script (with the crawldb full), it takes about 2 hours for each depth!
When I checked the log file I found that one URL was fetched several times. So is my 5-hour db.fetch.interval.default working correctly?
Why is it refetching the same URLs several times at each depth (depth=10)? I understood that the fetch time stored for a page would prevent a refetch as long as the 5 hours have not elapsed yet.
Please, can you explain how db.fetch.interval.default works? Should I use only one depth, since I already have all the URLs in the crawldb?
I'm using this recrawl script:
depth=10
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment=`ls -d $crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth `expr $i + 1` failed."
echo "runbot: Deleting segment $segment."
rm $RMARGS $segment
continue
fi
echo " ----- Updating Dadatabase ( $steps) -----"
$NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
done
echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
rm -rf $crawl/segments
mv $crawl/MERGEDsegments $crawl/segments
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
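A quick way to check whether an entry is really due for a refetch between rounds is to read it back from the crawldb. A minimal sketch, using the same $crawl and $NUTCH_HOME variables as the script above (the URL is only a placeholder):

# overall crawldb statistics: counts per status (db_fetched, db_unfetched, ...)
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -stats

# print a single entry's status, fetch time and retry interval, to verify
# that the 18000-second interval was actually stored for it
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -url http://intranet.example.com/some/page.html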
Re: db.fetch.interval.default
Posted by reinhard schwab <re...@aon.at>.
The entry here has status db_unfetched: it has not been fetched yet.
Are you sure you don't have crawldb entries with a retry interval of 0 seconds?

grep Retry crawldump | grep -v "18000"
BELLINI ADAM schrieb:
> hi,
> I dumped the database, and this is what I found:
>
> Status: 1 (db_unfetched)
> Fetch time: Thu Dec 03 15:53:24 EST 2009
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 18000 seconds (0 days)
> Score: 2.0549393
> Signature: null
> Metadata:
>
> So if this URL is encountered several times within 2 hours, does the "0 days" mean it will be fetched several times?
> Will it not look at the 18000 seconds?
>
> Thanks
RE: db.fetch.interval.default
Posted by BELLINI ADAM <mb...@msn.com>.
hi,
I dumped the database, and this is what I found:
Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:
So if this URL is encountered several times within 2 hours, does the "0 days" mean it will be fetched several times?
Will it not look at the 18000 seconds?
Thanks
> Date: Thu, 3 Dec 2009 22:39:29 +0100
> From: reinhard.schwab@aon.at
> To: nutch-user@lucene.apache.org
> Subject: Re: db.fetch.interval.default
>
> hi,
>
> I have identified one source of such a problem and opened an issue in Jira.
> You can apply this patch and check whether it solves your problem.
>
> https://issues.apache.org/jira/browse/NUTCH-774
>
> By the way, you can also check your crawldb for such items, where the retry
> interval is set to 0: just dump the crawldb and search for it.
>
> regards
> reinhard
Re: db.fetch.interval.default
Posted by reinhard schwab <re...@aon.at>.
hi,
I have identified one source of such a problem and opened an issue in Jira.
You can apply this patch and check whether it solves your problem.
https://issues.apache.org/jira/browse/NUTCH-774
By the way, you can also check your crawldb for such items, where the retry interval
is set to 0: just dump the crawldb and search for it.
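A minimal sketch of that check from the command line, assuming the crawldb sits at $crawl/crawldb and the dump is written as text part files (adjust the paths to your setup):

# dump the crawldb to a local directory, then look for entries whose
# retry interval differs from the configured 18000 seconds
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -dump crawldump
grep "Retry interval" crawldump/part-* | grep -v "18000"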
regards
reinhard