Posted to user@nutch.apache.org by BELLINI ADAM <mb...@msn.com> on 2009/12/03 22:27:15 UTC
db.fetch.interval.default
hi,
I'm crawling my intranet and I have set db.fetch.interval.default to 5 hours, but it seems it doesn't work correctly:
<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).
  </description>
</property>
The first crawl, when the crawl directory $crawl doesn't exist yet, takes only about 2 hours (with depth=10).
But when performing the recrawl with the recrawl.sh script (with the crawldb full), it takes about 2 hours for each depth!
When I checked the log file I found that one URL was fetched several times. So is my 5-hour db.fetch.interval.default working correctly?
Why is it refetching the same URLs several times at each depth (depth=10)? I understood that the fetch time stored for a page would prevent a refetch as long as the 5 hours have not elapsed yet.
Please, can you explain how db.fetch.interval.default works? Should I use only one depth, since I already have all the URLs in the crawldb?
I'm using this recrawl script:
depth=10
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment=`ls -d $crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth `expr $i + 1` failed."
echo "runbot: Deleting segment $segment."
rm $RMARGS $segment
continue
fi
echo " ----- Updating Dadatabase ( $steps) -----"
$NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
done
echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
rm -rf $crawl/segments
mv $crawl/MERGEDsegments $crawl/segments
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
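A quick way to check whether an entry is really due for a refetch between rounds is to read it back from the crawldb. A minimal sketch, using the same $crawl and $NUTCH_HOME variables as the script above (the URL is only a placeholder):

# overall crawldb statistics: counts per status (db_fetched, db_unfetched, ...)
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -stats

# print a single entry's status, fetch time and retry interval, to verify
# that the 18000-second interval was actually stored for it
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -url http://intranet.example.com/some/page.html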
Re: db.fetch.interval.default
Posted by reinhard schwab <re...@aon.at>.
The entry here has status db_unfetched: it has not been fetched yet.
Are you sure you don't have crawldb entries with a retry interval of 0 seconds?

grep Retry crawldump | grep -v "18000"
BELLINI ADAM schrieb:
> hi,
> I dumped the database, and this is what I found:
>
> Status: 1 (db_unfetched)
> Fetch time: Thu Dec 03 15:53:24 EST 2009
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 18000 seconds (0 days)
> Score: 2.0549393
> Signature: null
> Metadata:
>
> So if this URL is encountered several times within 2 hours, does the "0 days" mean it will be fetched several times?
> Will it not look at the 18000 seconds?
>
> Thanks
RE: db.fetch.interval.default
Posted by BELLINI ADAM <mb...@msn.com>.
hi,
I dumped the database, and this is what I found:
Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:
So if this URL is encountered several times within 2 hours, does the "0 days" mean it will be fetched several times?
Will it not look at the 18000 seconds?
Thanks
> Date: Thu, 3 Dec 2009 22:39:29 +0100
> From: reinhard.schwab@aon.at
> To: nutch-user@lucene.apache.org
> Subject: Re: db.fetch.interval.default
>
> hi,
>
> I have identified one source of such a problem and opened an issue in Jira.
> You can apply this patch and check whether it solves your problem.
>
> https://issues.apache.org/jira/browse/NUTCH-774
>
> By the way, you can also check your crawldb for such items, where the retry
> interval is set to 0: just dump the crawldb and search for it.
>
> regards
> reinhard
Re: db.fetch.interval.default
Posted by reinhard schwab <re...@aon.at>.
hi,
I have identified one source of such a problem and opened an issue in Jira.
You can apply this patch and check whether it solves your problem.
https://issues.apache.org/jira/browse/NUTCH-774
By the way, you can also check your crawldb for such items, where the retry interval
is set to 0: just dump the crawldb and search for it.
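A minimal sketch of that check from the command line, assuming the crawldb sits at $crawl/crawldb and the dump is written as text part files (adjust the paths to your setup):

# dump the crawldb to a local directory, then look for entries whose
# retry interval differs from the configured 18000 seconds
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -dump crawldump
grep "Retry interval" crawldump/part-* | grep -v "18000"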
regards
reinhard