Posted to user@nutch.apache.org by xiao yang <ya...@gmail.com> on 2009/12/16 20:21:04 UTC

Re: difference in time between an initial crawl and recrawl with a full crawldb

It depends on your crawldb size and the number of URLs you fetch.
The crawldb stores the URLs already fetched and those still to be fetched. When you
recrawl with the separate commands, you first read data from the crawldb and
generate the list of URLs to be fetched in that round.
An initial crawl first injects the seed URLs into the crawldb, and then starts
the same process as a recrawl.
The initial crawl fetches for a number of rounds according to the depth
parameter. In each round, new URLs parsed from the fetched pages are
added to the crawldb and are used in the next "generate" phase.
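
A minimal sketch of one such round with the separate commands (the crawl/ paths
and the thread count are placeholders, not values taken from this thread):

  # inject is only needed once, for the initial crawl
  bin/nutch inject crawl/crawldb urls
  # one generate/fetch/update round; an initial crawl repeats this "depth" times
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`      # newest segment
  bin/nutch fetch $segment -threads 10
  bin/nutch updatedb crawl/crawldb $segment       # new outlinks feed the next generate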

Thanks!
Xiao

On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
>
> hi,
>
> i just want to know the difference between a first initial crawl and a recrawl using the fetch, generate, update commands
> is there a diffence in time between using an initial crawl every time (by deleting the crawl_folder ) and using a recrawl without deleting the initial crawl_folder
>

Prune DFS Index

Posted by Patricio Galeas <pg...@yahoo.de>.
Hello,

I'm using a Hadoop-Nutch configuration on a single node.
Now I have some unwanted URLs in my index and I would like to prune them using the PruneIndexTool, but it seems to work only if I use an index exported to the local file system.

Is it possible to run the PruneIndexTool on HDFS?
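
One possible workaround, sketched here under assumptions (the crawl/index path, the
local copy location, and invoking the tool by its class name are not confirmed anywhere
in this thread): copy the index out of HDFS, prune the local copy, and copy it back.

  # copy the Lucene index from HDFS to the local file system
  hadoop fs -copyToLocal crawl/index /tmp/nutch-index
  # prune the local copy (pass whatever options PruneIndexTool's usage message documents)
  bin/nutch org.apache.nutch.tools.PruneIndexTool /tmp/nutch-index
  # replace the index in HDFS with the pruned copy
  hadoop fs -rmr crawl/index
  hadoop fs -copyFromLocal /tmp/nutch-index crawl/index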


Thanks
Patricio

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.
thx very much for the explanations,

so in my case I really don't have a choice... I just have to delete the crawldb every time we get rid of some intranet pages, until the day someone can tell me how to delete documents from the crawldb :)




> Date: Thu, 17 Dec 2009 09:00:56 +0100
> Subject: Re: difference in time between an initial crawl and recrawl with a 	full crawldb
> From: millebii@gmail.com
> To: nutch-user@lucene.apache.org
> 
> Difficult to answer what will take more, because it depends a lot on the use
> case:
> 
> If you start from an empty crawldb you might need several
> generate/fetch/update/invertlink cycles to get back the complete set of URLs
> to crawl.
> If you have a full database the different steps take a little bit longer,
> but you can recrawl in one pass, probably two because you need to discover
> new pages first before getting them in your next fetch cycle.
> 
> So you have to try in your case.
> 
> Deleted pages: there is a prune command that allows to remove pages from the
> index, however I think since you are using solrindex it won't really help.
> Pages which don't exist anymore will stay in the segment until you delete
> the segment & re-fetch.
> I don't know that you can delete pages from the crawldb, good question.
> 
> Incremental crawl : I call that the common situation where you run
> consecutive cycles generate/fetch/updatedb/invertlink exploring you crawldb
> and adding the results to your indexed segments list.
> 
> 
> 
> 2009/12/17 BELLINI ADAM <mb...@msn.com>
> 
> >
> > hi
> > i will answer some of your question and just tell me if i'm on the right
> > way :
> > you said :
> >
> > 1-....I suspect that you want to say I start with a crawl command and later
> > on do incremental steps by hand. ...
> >
> > -yes it's exactly what i mean.
> >
> > 2... although it depends on your steps.
> > - and I droped the index step because i use solrindex
> >
> >
> > 3-.... Deleting the crawl folder is weird because it means you start from
> > scratch
> > everytime... and on incremental crawl it won't work...
> >
> > - and that's what i want to understand, when you say : it means you start
> > from scratch...so runing incremental steps with an empty crawldb (since i
> > deleted the crawl folder) will take more time than doing it with an initial
> > full crawldb ? and what will make this to take  more time ? is that the
> > inject, generate , fecth or update command ?? i guessed that the inject
> > command will not take that much time!
> >
> > i'm asking that because in our intranet they add and delete pages every
> > day, and when they delete pages i have to purge the index and the crawldb to
> > get rid of those urls (i'm deleting  the crawl folder and the index every
> > time they delete pages from the intranet)...so i asked my self if it will
> > take more time to start with an empty crawldb or not ! if it will take the
> > same time so i could delete every day the crawl folder and the index rather
> > than waiting for them to ask me to delete those urls from the index.
> >
> > but mabe is there a way to delete documents (by their urls) from the
> > crawldb and from solr index ??
> >
> > plz what do you mean by  incremental crawls ?
> >
> > thx a lot
> >
> >
> >
> > > Well,
> > >
> > > Doing a crawl of depth 10 or 10 times a loop of individual commands will
> > > give you essentially the same results (bare in mind it does not use the
> > same
> > > file for url filtering).
> > >
> > > I don't know what you guys call "initial crawl", I suspect that you want
> > to
> > > say I start with a crawl command and later on do incremental steps by
> > hand.
> > > I don't think it is a good idea, or make sure you are using the same url
> > > filtering file content.
> > >
> > > For the rest there is no real difference between one stop shop crawl
> > command
> > > and individual commands... although it depends on your steps.
> > > For instance I dropped the segment merge steps which is usually blowing
> > up
> > > the ressources & the time available.
> > >
> > > Deleting the crawl folder is weird because it means you start from
> > scratch
> > > everytime... and on incremental crawl it won't work
> > >
> > > I suggest you do the commands manually one by one, look at the crawl &/or
> > > link db after the different steps and you will get a feeling how it
> > works.
> > > Basically the crawl command is not quite meant to be an "initial crawl"
> > > command, but rather a handy one command thing. Actually I think it does
> > not
> > > work for incremental crawls use cases.
> > >
> > >
> > >
> > > 2009/12/16 BELLINI ADAM <mb...@msn.com>
> > >
> > > >
> > > > in my case i didnt noticed that....but mabe recrawling with a full
> > crawldb
> > > > seems to be more quick than the initial crawl...but i needed someone
> > tell me
> > > > i'm right or not, mabe with some metrics
> > > >
> > > >
> > > >
> > > > > Subject: RE: difference in time between an initial crawl and recrawl
> > with
> > > > a full crawldb
> > > > > Date: Wed, 16 Dec 2009 14:55:08 -0500
> > > > > From: Vijaya_Peters@sra.com
> > > > > To: nutch-user@lucene.apache.org
> > > > >
> > > > > My experience has been that, when I delete the crawldb and do a crawl
> > > > > again, it seems to concatenate the urls so the same file gets fetched
> > > > > over and over again.
> > > > >
> > > > > Vijaya Peters
> > > > > SRA International, Inc.
> > > > > 4350 Fair Lakes Court North
> > > > > Room 4004
> > > > > Fairfax, VA  22033
> > > > > Tel:  703-502-1184
> > > > >
> > > > > www.sra.com
> > > > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > > > consecutive years
> > > > > P Please consider the environment before printing this e-mail
> > > > > This electronic message transmission contains information from SRA
> > > > > International, Inc. which may be confidential, privileged or
> > > > > proprietary.  The information is intended for the use of the
> > individual
> > > > > or entity named above.  If you are not the intended recipient, be
> > aware
> > > > > that any disclosure, copying, distribution, or use of the contents of
> > > > > this information is strictly prohibited.  If you have received this
> > > > > electronic information in error, please notify us immediately by
> > > > > telephone at 866-584-2143.
> > > > >
> > > > > -----Original Message-----
> > > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > > Sent: Wednesday, December 16, 2009 2:21 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: Re: difference in time between an initial crawl and recrawl
> > > > > with a full crawldb
> > > > >
> > > > > It depends on your crawldb size, and the number of urls you fetch.
> > > > > Crawldb stores the urls fetched and to be fetched. When you recrawl
> > > > > with seperated command, first you will read data from crawldb and
> > > > > generate the urls will be fetched this round.
> > > > > An initial crawl first injects seed urls into crawldb, and then start
> > > > > the process the same with recrawl.
> > > > > The initial crawl fetchs for a number of rounds according the depth
> > > > > parameter. For each round, new urls parsed from fetched pages will be
> > > > > added to the crawldb, and will be used in the "generate" phase.
> > > > >
> > > > > Thanks!
> > > > > Xiao
> > > > >
> > > > > On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
> > wrote:
> > > > > >
> > > > > > hi,
> > > > > >
> > > > > > i just want to know the difference between a first initial crawl
> > and a
> > > > > recrawl using the fetch, generate, update commands
> > > > > > is there a diffence in time between using an initial crawl every
> > time
> > > > > (by deleting the crawl_folder ) and using a recrawl without deleting
> > the
> > > > > initial crawl_folder
> > > > > >
> > >
> > >
> > >
> > >
> > > --
> > > -MilleBii-
> >
> >
> 
> 
> 
> -- 
> -MilleBii-
 		 	   		  

Re: difference in time between an initial crawl and recrawl with a full crawldb

Posted by MilleBii <mi...@gmail.com>.
Difficult to answer which will take more time, because it depends a lot on the use
case:

If you start from an empty crawldb you might need several
generate/fetch/update/invertlink cycles to get back the complete set of URLs
to crawl.
If you have a full database the different steps take a little bit longer,
but you can recrawl in one pass, probably two, because you need to discover
new pages first before getting them in your next fetch cycle.

So you have to try it in your case.

Deleted pages: there is a prune command that allows you to remove pages from the
index; however, since you are using solrindex I think it won't really help.
Pages which don't exist anymore will stay in the segment until you delete
the segment & re-fetch.
I don't know whether you can delete pages from the crawldb, good question.

Incremental crawl: I call that the common situation where you run
consecutive generate/fetch/updatedb/invertlink cycles, exploring your crawldb
and adding the results to your indexed segments list.
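
A sketch of one such incremental cycle, for reference (the crawl/ directory layout,
thread count and Solr URL are placeholders):

  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment -threads 10
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb $segment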



2009/12/17 BELLINI ADAM <mb...@msn.com>

>
> hi
> i will answer some of your question and just tell me if i'm on the right
> way :
> you said :
>
> 1-....I suspect that you want to say I start with a crawl command and later
> on do incremental steps by hand. ...
>
> -yes it's exactly what i mean.
>
> 2... although it depends on your steps.
> - and I droped the index step because i use solrindex
>
>
> 3-.... Deleting the crawl folder is weird because it means you start from
> scratch
> everytime... and on incremental crawl it won't work...
>
> - and that's what i want to understand, when you say : it means you start
> from scratch...so runing incremental steps with an empty crawldb (since i
> deleted the crawl folder) will take more time than doing it with an initial
> full crawldb ? and what will make this to take  more time ? is that the
> inject, generate , fecth or update command ?? i guessed that the inject
> command will not take that much time!
>
> i'm asking that because in our intranet they add and delete pages every
> day, and when they delete pages i have to purge the index and the crawldb to
> get rid of those urls (i'm deleting  the crawl folder and the index every
> time they delete pages from the intranet)...so i asked my self if it will
> take more time to start with an empty crawldb or not ! if it will take the
> same time so i could delete every day the crawl folder and the index rather
> than waiting for them to ask me to delete those urls from the index.
>
> but mabe is there a way to delete documents (by their urls) from the
> crawldb and from solr index ??
>
> plz what do you mean by  incremental crawls ?
>
> thx a lot
>
>
>
> > Well,
> >
> > Doing a crawl of depth 10 or 10 times a loop of individual commands will
> > give you essentially the same results (bare in mind it does not use the
> same
> > file for url filtering).
> >
> > I don't know what you guys call "initial crawl", I suspect that you want
> to
> > say I start with a crawl command and later on do incremental steps by
> hand.
> > I don't think it is a good idea, or make sure you are using the same url
> > filtering file content.
> >
> > For the rest there is no real difference between one stop shop crawl
> command
> > and individual commands... although it depends on your steps.
> > For instance I dropped the segment merge steps which is usually blowing
> up
> > the ressources & the time available.
> >
> > Deleting the crawl folder is weird because it means you start from
> scratch
> > everytime... and on incremental crawl it won't work
> >
> > I suggest you do the commands manually one by one, look at the crawl &/or
> > link db after the different steps and you will get a feeling how it
> works.
> > Basically the crawl command is not quite meant to be an "initial crawl"
> > command, but rather a handy one command thing. Actually I think it does
> not
> > work for incremental crawls use cases.
> >
> >
> >
> > 2009/12/16 BELLINI ADAM <mb...@msn.com>
> >
> > >
> > > in my case i didnt noticed that....but mabe recrawling with a full
> crawldb
> > > seems to be more quick than the initial crawl...but i needed someone
> tell me
> > > i'm right or not, mabe with some metrics
> > >
> > >
> > >
> > > > Subject: RE: difference in time between an initial crawl and recrawl
> with
> > > a full crawldb
> > > > Date: Wed, 16 Dec 2009 14:55:08 -0500
> > > > From: Vijaya_Peters@sra.com
> > > > To: nutch-user@lucene.apache.org
> > > >
> > > > My experience has been that, when I delete the crawldb and do a crawl
> > > > again, it seems to concatenate the urls so the same file gets fetched
> > > > over and over again.
> > > >
> > > > Vijaya Peters
> > > > SRA International, Inc.
> > > > 4350 Fair Lakes Court North
> > > > Room 4004
> > > > Fairfax, VA  22033
> > > > Tel:  703-502-1184
> > > >
> > > > www.sra.com
> > > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > > consecutive years
> > > > P Please consider the environment before printing this e-mail
> > > > This electronic message transmission contains information from SRA
> > > > International, Inc. which may be confidential, privileged or
> > > > proprietary.  The information is intended for the use of the
> individual
> > > > or entity named above.  If you are not the intended recipient, be
> aware
> > > > that any disclosure, copying, distribution, or use of the contents of
> > > > this information is strictly prohibited.  If you have received this
> > > > electronic information in error, please notify us immediately by
> > > > telephone at 866-584-2143.
> > > >
> > > > -----Original Message-----
> > > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > > Sent: Wednesday, December 16, 2009 2:21 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: Re: difference in time between an initial crawl and recrawl
> > > > with a full crawldb
> > > >
> > > > It depends on your crawldb size, and the number of urls you fetch.
> > > > Crawldb stores the urls fetched and to be fetched. When you recrawl
> > > > with seperated command, first you will read data from crawldb and
> > > > generate the urls will be fetched this round.
> > > > An initial crawl first injects seed urls into crawldb, and then start
> > > > the process the same with recrawl.
> > > > The initial crawl fetchs for a number of rounds according the depth
> > > > parameter. For each round, new urls parsed from fetched pages will be
> > > > added to the crawldb, and will be used in the "generate" phase.
> > > >
> > > > Thanks!
> > > > Xiao
> > > >
> > > > On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
> wrote:
> > > > >
> > > > > hi,
> > > > >
> > > > > i just want to know the difference between a first initial crawl
> and a
> > > > recrawl using the fetch, generate, update commands
> > > > > is there a diffence in time between using an initial crawl every
> time
> > > > (by deleting the crawl_folder ) and using a recrawl without deleting
> the
> > > > initial crawl_folder
> > > > >
> >
> >
> >
> >
> > --
> > -MilleBii-
>
>



-- 
-MilleBii-

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.
hi
I will answer some of your questions; just tell me if I'm on the right track.
you said:

1- ....I suspect that you want to say I start with a crawl command and later on do incremental steps by hand. ...

- yes, that's exactly what I mean.

2- ... although it depends on your steps.
- and I dropped the index step because I use solrindex.


3- .... Deleting the crawl folder is weird because it means you start from scratch
every time... and on incremental crawl it won't work...

- and that's what I want to understand. When you say "it means you start from scratch": will running the incremental steps with an empty crawldb (since I deleted the crawl folder) take more time than doing it with an initial full crawldb? And what will make it take more time: the inject, generate, fetch or update command? I guessed that the inject command would not take that much time!

I'm asking because on our intranet they add and delete pages every day, and when they delete pages I have to purge the index and the crawldb to get rid of those URLs (I delete the crawl folder and the index every time they delete pages from the intranet)... so I asked myself whether it will take more time to start with an empty crawldb or not. If it takes the same time, I could delete the crawl folder and the index every day rather than waiting for them to ask me to delete those URLs from the index.

but maybe is there a way to delete documents (by their URLs) from the crawldb and from the Solr index??
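
For the Solr side only, a sketch of the kind of request that can remove a document by
URL through Solr's standard XML update handler (the Solr URL and the "url" field name
are assumptions based on a default Nutch Solr setup; nothing comparable for the crawldb
is shown in this thread):

  # delete one document whose url field matches, then commit
  curl 'http://localhost:8983/solr/update?commit=true' \
       -H 'Content-Type: text/xml' \
       --data-binary '<delete><query>url:"http://intranet.example.com/old-page.html"</query></delete>'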

please, what do you mean by "incremental crawls"?

thx a lot



> Well,
> 
> Doing a crawl of depth 10 or 10 times a loop of individual commands will
> give you essentially the same results (bare in mind it does not use the same
> file for url filtering).
> 
> I don't know what you guys call "initial crawl", I suspect that you want to
> say I start with a crawl command and later on do incremental steps by hand.
> I don't think it is a good idea, or make sure you are using the same url
> filtering file content.
> 
> For the rest there is no real difference between one stop shop crawl command
> and individual commands... although it depends on your steps.
> For instance I dropped the segment merge steps which is usually blowing up
> the ressources & the time available.
> 
> Deleting the crawl folder is weird because it means you start from scratch
> everytime... and on incremental crawl it won't work
> 
> I suggest you do the commands manually one by one, look at the crawl &/or
> link db after the different steps and you will get a feeling how it works.
> Basically the crawl command is not quite meant to be an "initial crawl"
> command, but rather a handy one command thing. Actually I think it does not
> work for incremental crawls use cases.
> 
> 
> 
> 2009/12/16 BELLINI ADAM <mb...@msn.com>
> 
> >
> > in my case i didnt noticed that....but mabe recrawling with a full crawldb
> > seems to be more quick than the initial crawl...but i needed someone tell me
> > i'm right or not, mabe with some metrics
> >
> >
> >
> > > Subject: RE: difference in time between an initial crawl and recrawl with
> > a full crawldb
> > > Date: Wed, 16 Dec 2009 14:55:08 -0500
> > > From: Vijaya_Peters@sra.com
> > > To: nutch-user@lucene.apache.org
> > >
> > > My experience has been that, when I delete the crawldb and do a crawl
> > > again, it seems to concatenate the urls so the same file gets fetched
> > > over and over again.
> > >
> > > Vijaya Peters
> > > SRA International, Inc.
> > > 4350 Fair Lakes Court North
> > > Room 4004
> > > Fairfax, VA  22033
> > > Tel:  703-502-1184
> > >
> > > www.sra.com
> > > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > > consecutive years
> > > P Please consider the environment before printing this e-mail
> > > This electronic message transmission contains information from SRA
> > > International, Inc. which may be confidential, privileged or
> > > proprietary.  The information is intended for the use of the individual
> > > or entity named above.  If you are not the intended recipient, be aware
> > > that any disclosure, copying, distribution, or use of the contents of
> > > this information is strictly prohibited.  If you have received this
> > > electronic information in error, please notify us immediately by
> > > telephone at 866-584-2143.
> > >
> > > -----Original Message-----
> > > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > > Sent: Wednesday, December 16, 2009 2:21 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: difference in time between an initial crawl and recrawl
> > > with a full crawldb
> > >
> > > It depends on your crawldb size, and the number of urls you fetch.
> > > Crawldb stores the urls fetched and to be fetched. When you recrawl
> > > with seperated command, first you will read data from crawldb and
> > > generate the urls will be fetched this round.
> > > An initial crawl first injects seed urls into crawldb, and then start
> > > the process the same with recrawl.
> > > The initial crawl fetchs for a number of rounds according the depth
> > > parameter. For each round, new urls parsed from fetched pages will be
> > > added to the crawldb, and will be used in the "generate" phase.
> > >
> > > Thanks!
> > > Xiao
> > >
> > > On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
> > > >
> > > > hi,
> > > >
> > > > i just want to know the difference between a first initial crawl and a
> > > recrawl using the fetch, generate, update commands
> > > > is there a diffence in time between using an initial crawl every time
> > > (by deleting the crawl_folder ) and using a recrawl without deleting the
> > > initial crawl_folder
> > > >
> 
> 
> 
> 
> -- 
> -MilleBii-
 		 	   		  

Re: difference in time between an initial crawl and recrawl with a full crawldb

Posted by MilleBii <mi...@gmail.com>.
Well,

Doing a crawl of depth 10, or 10 passes of a loop of individual commands, will
give you essentially the same results (bear in mind the crawl command does not use the same
file for url filtering).

I don't know what you guys call an "initial crawl"; I suspect that you want to
say "I start with a crawl command and later on do incremental steps by hand."
I don't think that is a good idea, or at least make sure you are using the same url
filtering file content.

For the rest there is no real difference between the one-stop-shop crawl command
and individual commands... although it depends on your steps.
For instance I dropped the segment merge step, which usually blows up
the resources & the time available.

Deleting the crawl folder is weird because it means you start from scratch
every time... and for incremental crawls it won't work.

I suggest you run the commands manually one by one, look at the crawl and/or
link db after the different steps, and you will get a feeling for how it works.
Basically the crawl command is not really meant to be an "initial crawl"
command, but rather a handy one-command thing. Actually I think it does not
work for incremental crawl use cases.
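
For comparison, a rough sketch of the two approaches (the urls directory, crawl
directory, depth and topN values are placeholders):

  # one-stop-shop crawl command
  bin/nutch crawl urls -dir crawl -depth 10 -topN 1000

  # roughly equivalent loop of individual commands
  bin/nutch inject crawl/crawldb urls
  for i in 1 2 3 4 5 6 7 8 9 10; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
  done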



2009/12/16 BELLINI ADAM <mb...@msn.com>

>
> in my case i didnt noticed that....but mabe recrawling with a full crawldb
> seems to be more quick than the initial crawl...but i needed someone tell me
> i'm right or not, mabe with some metrics
>
>
>
> > Subject: RE: difference in time between an initial crawl and recrawl with
> a full crawldb
> > Date: Wed, 16 Dec 2009 14:55:08 -0500
> > From: Vijaya_Peters@sra.com
> > To: nutch-user@lucene.apache.org
> >
> > My experience has been that, when I delete the crawldb and do a crawl
> > again, it seems to concatenate the urls so the same file gets fetched
> > over and over again.
> >
> > Vijaya Peters
> > SRA International, Inc.
> > 4350 Fair Lakes Court North
> > Room 4004
> > Fairfax, VA  22033
> > Tel:  703-502-1184
> >
> > www.sra.com
> > Named to FORTUNE's "100 Best Companies to Work For" list for 10
> > consecutive years
> > P Please consider the environment before printing this e-mail
> > This electronic message transmission contains information from SRA
> > International, Inc. which may be confidential, privileged or
> > proprietary.  The information is intended for the use of the individual
> > or entity named above.  If you are not the intended recipient, be aware
> > that any disclosure, copying, distribution, or use of the contents of
> > this information is strictly prohibited.  If you have received this
> > electronic information in error, please notify us immediately by
> > telephone at 866-584-2143.
> >
> > -----Original Message-----
> > From: xiao yang [mailto:yangxiao9901@gmail.com]
> > Sent: Wednesday, December 16, 2009 2:21 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: difference in time between an initial crawl and recrawl
> > with a full crawldb
> >
> > It depends on your crawldb size, and the number of urls you fetch.
> > Crawldb stores the urls fetched and to be fetched. When you recrawl
> > with seperated command, first you will read data from crawldb and
> > generate the urls will be fetched this round.
> > An initial crawl first injects seed urls into crawldb, and then start
> > the process the same with recrawl.
> > The initial crawl fetchs for a number of rounds according the depth
> > parameter. For each round, new urls parsed from fetched pages will be
> > added to the crawldb, and will be used in the "generate" phase.
> >
> > Thanks!
> > Xiao
> >
> > On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
> > >
> > > hi,
> > >
> > > i just want to know the difference between a first initial crawl and a
> > recrawl using the fetch, generate, update commands
> > > is there a diffence in time between using an initial crawl every time
> > (by deleting the crawl_folder ) and using a recrawl without deleting the
> > initial crawl_folder
> > >




-- 
-MilleBii-

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.
in my case I didn't notice that.... but maybe recrawling with a full crawldb is quicker than the initial crawl... I needed someone to tell me whether I'm right or not, maybe with some metrics.



> Subject: RE: difference in time between an initial crawl and recrawl with a full crawldb
> Date: Wed, 16 Dec 2009 14:55:08 -0500
> From: Vijaya_Peters@sra.com
> To: nutch-user@lucene.apache.org
> 
> My experience has been that, when I delete the crawldb and do a crawl
> again, it seems to concatenate the urls so the same file gets fetched
> over and over again.
> 
> Vijaya Peters
> SRA International, Inc.
> 4350 Fair Lakes Court North
> Room 4004
> Fairfax, VA  22033
> Tel:  703-502-1184
> 
> www.sra.com
> Named to FORTUNE's "100 Best Companies to Work For" list for 10
> consecutive years
> P Please consider the environment before printing this e-mail
> This electronic message transmission contains information from SRA
> International, Inc. which may be confidential, privileged or
> proprietary.  The information is intended for the use of the individual
> or entity named above.  If you are not the intended recipient, be aware
> that any disclosure, copying, distribution, or use of the contents of
> this information is strictly prohibited.  If you have received this
> electronic information in error, please notify us immediately by
> telephone at 866-584-2143.
> 
> -----Original Message-----
> From: xiao yang [mailto:yangxiao9901@gmail.com] 
> Sent: Wednesday, December 16, 2009 2:21 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: difference in time between an initial crawl and recrawl
> with a full crawldb
> 
> It depends on your crawldb size, and the number of urls you fetch.
> Crawldb stores the urls fetched and to be fetched. When you recrawl
> with seperated command, first you will read data from crawldb and
> generate the urls will be fetched this round.
> An initial crawl first injects seed urls into crawldb, and then start
> the process the same with recrawl.
> The initial crawl fetchs for a number of rounds according the depth
> parameter. For each round, new urls parsed from fetched pages will be
> added to the crawldb, and will be used in the "generate" phase.
> 
> Thanks!
> Xiao
> 
> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
> >
> > hi,
> >
> > i just want to know the difference between a first initial crawl and a
> recrawl using the fetch, generate, update commands
> > is there a diffence in time between using an initial crawl every time
> (by deleting the crawl_folder ) and using a recrawl without deleting the
> initial crawl_folder
> >

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by "Peters, Vijaya" <Vi...@sra.com>.
My experience has been that, when I delete the crawldb and do a crawl
again, it seems to concatenate the urls so the same file gets fetched
over and over again.

Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA  22033
Tel:  703-502-1184

www.sra.com
Named to FORTUNE's "100 Best Companies to Work For" list for 10
consecutive years
Please consider the environment before printing this e-mail
This electronic message transmission contains information from SRA
International, Inc. which may be confidential, privileged or
proprietary.  The information is intended for the use of the individual
or entity named above.  If you are not the intended recipient, be aware
that any disclosure, copying, distribution, or use of the contents of
this information is strictly prohibited.  If you have received this
electronic information in error, please notify us immediately by
telephone at 866-584-2143.

-----Original Message-----
From: xiao yang [mailto:yangxiao9901@gmail.com] 
Sent: Wednesday, December 16, 2009 2:21 PM
To: nutch-user@lucene.apache.org
Subject: Re: difference in time between an initial crawl and recrawl
with a full crawldb

It depends on your crawldb size, and the number of urls you fetch.
Crawldb stores the urls fetched and to be fetched. When you recrawl
with seperated command, first you will read data from crawldb and
generate the urls will be fetched this round.
An initial crawl first injects seed urls into crawldb, and then start
the process the same with recrawl.
The initial crawl fetchs for a number of rounds according the depth
parameter. For each round, new urls parsed from fetched pages will be
added to the crawldb, and will be used in the "generate" phase.

Thanks!
Xiao

On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
>
> hi,
>
> i just want to know the difference between a first initial crawl and a
recrawl using the fetch, generate, update commands
> is there a diffence in time between using an initial crawl every time
(by deleting the crawl_folder ) and using a recrawl without deleting the
initial crawl_folder
>

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.
okkkkkkkkkk :)
so I also have to set interval.max, because I haven't done that yet! right now it is 90 days.
so I will set it to 24 hours and give it a try
thx very much


> Date: Fri, 18 Dec 2009 17:45:42 +0100
> Subject: Re: difference in time between an initial crawl and recrawl with a 	full crawldb
> From: millebii@gmail.com
> To: nutch-user@lucene.apache.org
> 
> Nutch will recrawl initially every "interval.default" seconds the urls.
> 
> If it finds that the page has not changed it will increase the
> interval to a limit "interval.max"
> 
> That is, if you don't delete the whole crawldb every day. Like you seem to do.
> 
> So in your case a single incremental crawl every day is plenty enough.
>  A new page will appear the next day.
>  Pages won't go away unless you delete the segment they are in and
> recreate the index.
> The issue there is that you delete pages that are still valid also. So
> you may not want to delete segments that are younger than
> "interval.max". Balancing act.
> 
> Example :
> + interval.default 6 hours
> + interval.max 24 hours
> + keep the last 4 segment and ditch older ones
> 
> Does it make sens ?
> 
> 
> 2009/12/18, BELLINI ADAM <mb...@msn.com>:
> >
> > this is what i have in my nutch-site  (i setted db.fetch.interval.default to
> > be 6 hours and not 30 days) because in our intranet we could change pages 4
> > or 5 times in a month, so i had to do this config to be able to refetch them
> > several times a month:
> >
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>21600</value>
> >   <description>The number of seconds between re-fetches of a page (12 hours
> > = 43200).
> >   </description>
> > </property>
> >
> > and this is the config in nutch-default
> >
> > <property>
> >   <name>db.fetch.interval.default</name>
> >   <value>2592000</value>
> >   <description>The default number of seconds between re-fetches of a page
> > (30 days).
> >   </description>
> > </property>
> >
> > <property>
> >   <name>db.fetch.interval.max</name>
> >   <value>7776000</value>
> >   <description>The maximum number of seconds between re-fetches of a page
> >   (90 days). After this period every page in the db will be re-tried, no
> >   matter what is its status.
> >   </description>
> > </property>
> >
> > So if i well understood you are telling me that if my page didnt change in
> > the last crawling (after 6 hours )nutch will not refetch it till 90 days ??
> > this is what i didnt understand, if realy it is the case so i will miss
> > lotof new pages in my intranet !! since we add new outlinks to several pages
> > many times a month.
> >
> >
> > in my idea the proof that a page is fetched is the crawl.log, since i grep
> > 'fetching http://' on crawl.log i obtain all the pages beeing fetched, and
> > i'm finding every day the same pages fecthed that are not changed yet !!
> > so it is refetching them every day even they are not modified .
> >
> >
> > if you just clarify this to me it will help me so much...
> >
> > thx a lot
> >
> >
> >
> >
> >
> >
> >
> >> Date: Fri, 18 Dec 2009 09:00:18 +0100
> >> Subject: Re: difference in time between an initial crawl and recrawl with
> >> a 	full crawldb
> >> From: millebii@gmail.com
> >> To: nutch-user@lucene.apache.org
> >>
> >> Wait 30 days and  you should see the diffence ... Since settings are
> >> time based if you crawl every day or hour it doesn't matter look in
> >> nutch-default for the settings that control this part of generate
> >> fetch list
> >>
> >> 2009/12/18, BELLINI ADAM <mb...@msn.com>:
> >> >
> >> >
> >> > but i configured nutch to fetch every 6 hours, and i'm crawling every
> >> > day at
> >> > 3 am, and even pages didnt change i see them been fetched every day !!
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >> Date: Fri, 18 Dec 2009 00:04:12 +0100
> >> >> Subject: Re: difference in time between an initial crawl and recrawl
> >> >> with
> >> >> a 	full crawldb
> >> >> From: millebii@gmail.com
> >> >> To: nutch-user@lucene.apache.org
> >> >>
> >> >> Well it is somewhat more subtil... nutch will only recrawl a page
> >> >> every
> >> >> 30
> >> >> days by default, and if it finds that it did not change in the meantime
> >> >> it
> >> >> will delay even further to more than 30 days the next recrawl. After 90
> >> >> days
> >> >> everything is recrawled no matter what.
> >> >> So actually it does make a difference from scratch or not.
> >> >>
> >> >>
> >> >>
> >> >> 2009/12/17 BELLINI ADAM <mb...@msn.com>
> >> >>
> >> >> >
> >> >> > i understand now that nutch refetch pages even if they didnt change
> >> >> > ...and
> >> >> > thats why we couldnt save time ...
> >> >> > if nutch could fecth only pages that changed so we will save big
> >> >> > amout
> >> >> > of
> >> >> > time since working with a full crawldb.
> >> >> > so it doens make difference between crawling from scratch or
> >> >> > recrawling
> >> >> > having a full crawldb
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > > Date: Thu, 17 Dec 2009 16:08:38 +0800
> >> >> > > Subject: Re: difference in time between an initial crawl and
> >> >> > > recrawl
> >> >> > > with
> >> >> > a   full crawldb
> >> >> > > From: yangxiao9901@gmail.com
> >> >> > > To: nutch-user@lucene.apache.org
> >> >> > >
> >> >> > > If you crawl with "bin/nutch crawl ..." command without deleting
> >> >> > > the
> >> >> > > crawldb. The result will be the same with recrawl. It only wastes
> >> >> > > the
> >> >> > > initial injection phase and crawldb update phase, but that won't
> >> >> > > affect the final result.
> >> >> > >
> >> >> > >
> >> >> > > On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com>
> >> >> > > wrote:
> >> >> > > >
> >> >> > > > thx for the explanation,
> >> >> > > > so if i well understood using the separates commands i dont have
> >> >> > > > to
> >> >> > > > run
> >> >> > as many times as i did it in the initial crawl (with depth 10).
> >> >> > > >
> >> >> > > > in my recrawl i'm also doing it in a loop of 10 !! am i wrong
> >> >> > > > looping
> >> >> > 10 times (generateting fetching parsing updating ) ?? mabe i could
> >> >> > save
> >> >> > time
> >> >> > by doing just one loop  ! but since i have added
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > steps=10
> >> >> > > > echo "----- Inject (Step 1 of $steps) -----"
> >> >> > > > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
> >> >> > > >
> >> >> > > > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps)
> >> >> > > > -----"
> >> >> > > > for((i=0; i < $depth; i++))
> >> >> > > > do
> >> >> > > >  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> >> >> > > >
> >> >> > > >
> >> >> > > > $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
> >> >> > > >
> >> >> > > >  if [ $? -ne 0 ]
> >> >> > > >  then
> >> >> > > >    echo "runbot: Stopping at depth $depth. No more URLs to
> >> >> > > > fetch."
> >> >> > > >    break
> >> >> > > >  fi
> >> >> > > >  segment=`ls -d $crawl/segments/* | tail -1`
> >> >> > > >
> >> >> > > >  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> >> >> > > >  if [ $? -ne 0 ]
> >> >> > > >  then
> >> >> > > >    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> >> >> > > >    echo "runbot: Deleting segment $segment."
> >> >> > > >    rm $RMARGS $segment
> >> >> > > >    continue
> >> >> > > >  fi
> >> >> > > >
> >> >> > > > echo " ----- Updating Dadatabase ( $steps) -----"
> >> >> > > >
> >> >> > > >
> >> >> > > >  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> >> >> > > > done
> >> >> > > >
> >> >> > > >
> >> >> > > > thx
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > >> Date: Thu, 17 Dec 2009 03:21:04 +0800
> >> >> > > >> Subject: Re: difference in time between an initial crawl and
> >> >> > > >> recrawl
> >> >> > with a   full crawldb
> >> >> > > >> From: yangxiao9901@gmail.com
> >> >> > > >> To: nutch-user@lucene.apache.org
> >> >> > > >>
> >> >> > > >> It depends on your crawldb size, and the number of urls you
> >> >> > > >> fetch.
> >> >> > > >> Crawldb stores the urls fetched and to be fetched. When you
> >> >> > > >> recrawl
> >> >> > > >> with seperated command, first you will read data from crawldb
> >> >> > > >> and
> >> >> > > >> generate the urls will be fetched this round.
> >> >> > > >> An initial crawl first injects seed urls into crawldb, and then
> >> >> > > >> start
> >> >> > > >> the process the same with recrawl.
> >> >> > > >> The initial crawl fetchs for a number of rounds according the
> >> >> > > >> depth
> >> >> > > >> parameter. For each round, new urls parsed from fetched pages
> >> >> > > >> will
> >> >> > > >> be
> >> >> > > >> added to the crawldb, and will be used in the "generate" phase.
> >> >> > > >>
> >> >> > > >> Thanks!
> >> >> > > >> Xiao
> >> >> > > >>
> >> >> > > >> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
> >> >> > wrote:
> >> >> > > >> >
> >> >> > > >> > hi,
> >> >> > > >> >
> >> >> > > >> > i just want to know the difference between a first initial
> >> >> > > >> > crawl
> >> >> > > >> > and
> >> >> > a recrawl using the fetch, generate, update commands
> >> >> > > >> > is there a diffence in time between using an initial crawl
> >> >> > > >> > every
> >> >> > time (by deleting the crawl_folder ) and using a recrawl without
> >> >> > deleting
> >> >> > the initial crawl_folder
> >> >> > > >> >
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> -MilleBii-
> >>
> >>
> >> --
> >> -MilleBii-
> 
> 
> -- 
> -MilleBii-
 		 	   		  

Re: difference in time between an initial crawl and recrawl with a full crawldb

Posted by MilleBii <mi...@gmail.com>.
Initially, Nutch will recrawl the urls every "interval.default" seconds.

If it finds that a page has not changed, it will increase the
interval, up to the limit "interval.max".

That is, if you don't delete the whole crawldb every day, like you seem to do.

So in your case a single incremental crawl every day is plenty.
 A new page will appear the next day.
 Pages won't go away unless you delete the segment they are in and
recreate the index.
The issue there is that you also delete pages that are still valid. So
you may not want to delete segments that are younger than
"interval.max". Balancing act.

Example (see the property sketch below):
+ interval.default 6 hours
+ interval.max 24 hours
+ keep the last 4 segments and ditch older ones

Does it make sense?
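
A sketch of what that example could look like as nutch-site.xml overrides (the values
are simply 6 hours and 24 hours converted to seconds; merging them into conf/nutch-site.xml
by hand is assumed):

<property>
  <name>db.fetch.interval.default</name>
  <value>21600</value><!-- 6 hours -->
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>86400</value><!-- 24 hours -->
</property>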


2009/12/18, BELLINI ADAM <mb...@msn.com>:
>
> this is what i have in my nutch-site  (i setted db.fetch.interval.default to
> be 6 hours and not 30 days) because in our intranet we could change pages 4
> or 5 times in a month, so i had to do this config to be able to refetch them
> several times a month:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>21600</value>
>   <description>The number of seconds between re-fetches of a page (12 hours
> = 43200).
>   </description>
> </property>
>
> and this is the config in nutch-default
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>2592000</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).
>   </description>
> </property>
>
> <property>
>   <name>db.fetch.interval.max</name>
>   <value>7776000</value>
>   <description>The maximum number of seconds between re-fetches of a page
>   (90 days). After this period every page in the db will be re-tried, no
>   matter what is its status.
>   </description>
> </property>
>
> So if i well understood you are telling me that if my page didnt change in
> the last crawling (after 6 hours )nutch will not refetch it till 90 days ??
> this is what i didnt understand, if realy it is the case so i will miss
> lotof new pages in my intranet !! since we add new outlinks to several pages
> many times a month.
>
>
> in my idea the proof that a page is fetched is the crawl.log, since i grep
> 'fetching http://' on crawl.log i obtain all the pages beeing fetched, and
> i'm finding every day the same pages fecthed that are not changed yet !!
> so it is refetching them every day even they are not modified .
>
>
> if you just clarify this to me it will help me so much...
>
> thx a lot
>
>
>
>
>
>
>
>> Date: Fri, 18 Dec 2009 09:00:18 +0100
>> Subject: Re: difference in time between an initial crawl and recrawl with
>> a 	full crawldb
>> From: millebii@gmail.com
>> To: nutch-user@lucene.apache.org
>>
>> Wait 30 days and  you should see the diffence ... Since settings are
>> time based if you crawl every day or hour it doesn't matter look in
>> nutch-default for the settings that control this part of generate
>> fetch list
>>
>> 2009/12/18, BELLINI ADAM <mb...@msn.com>:
>> >
>> >
>> > but i configured nutch to fetch every 6 hours, and i'm crawling every
>> > day at
>> > 3 am, and even pages didnt change i see them been fetched every day !!
>> >
>> >
>> >
>> >
>> >
>> >
>> >> Date: Fri, 18 Dec 2009 00:04:12 +0100
>> >> Subject: Re: difference in time between an initial crawl and recrawl
>> >> with
>> >> a 	full crawldb
>> >> From: millebii@gmail.com
>> >> To: nutch-user@lucene.apache.org
>> >>
>> >> Well it is somewhat more subtil... nutch will only recrawl a page
>> >> every
>> >> 30
>> >> days by default, and if it finds that it did not change in the meantime
>> >> it
>> >> will delay even further to more than 30 days the next recrawl. After 90
>> >> days
>> >> everything is recrawled no matter what.
>> >> So actually it does make a difference from scratch or not.
>> >>
>> >>
>> >>
>> >> 2009/12/17 BELLINI ADAM <mb...@msn.com>
>> >>
>> >> >
>> >> > i understand now that nutch refetch pages even if they didnt change
>> >> > ...and
>> >> > thats why we couldnt save time ...
>> >> > if nutch could fecth only pages that changed so we will save big
>> >> > amout
>> >> > of
>> >> > time since working with a full crawldb.
>> >> > so it doens make difference between crawling from scratch or
>> >> > recrawling
>> >> > having a full crawldb
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > > Date: Thu, 17 Dec 2009 16:08:38 +0800
>> >> > > Subject: Re: difference in time between an initial crawl and
>> >> > > recrawl
>> >> > > with
>> >> > a   full crawldb
>> >> > > From: yangxiao9901@gmail.com
>> >> > > To: nutch-user@lucene.apache.org
>> >> > >
>> >> > > If you crawl with "bin/nutch crawl ..." command without deleting
>> >> > > the
>> >> > > crawldb. The result will be the same with recrawl. It only wastes
>> >> > > the
>> >> > > initial injection phase and crawldb update phase, but that won't
>> >> > > affect the final result.
>> >> > >
>> >> > >
>> >> > > On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com>
>> >> > > wrote:
>> >> > > >
>> >> > > > thx for the explanation,
>> >> > > > so if i well understood using the separates commands i dont have
>> >> > > > to
>> >> > > > run
>> >> > as many times as i did it in the initial crawl (with depth 10).
>> >> > > >
>> >> > > > in my recrawl i'm also doing it in a loop of 10 !! am i wrong
>> >> > > > looping
>> >> > 10 times (generateting fetching parsing updating ) ?? mabe i could
>> >> > save
>> >> > time
>> >> > by doing just one loop  ! but since i have added
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > steps=10
>> >> > > > echo "----- Inject (Step 1 of $steps) -----"
>> >> > > > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>> >> > > >
>> >> > > > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps)
>> >> > > > -----"
>> >> > > > for((i=0; i < $depth; i++))
>> >> > > > do
>> >> > > >  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>> >> > > >
>> >> > > >
>> >> > > > $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
>> >> > > >
>> >> > > >  if [ $? -ne 0 ]
>> >> > > >  then
>> >> > > >    echo "runbot: Stopping at depth $depth. No more URLs to
>> >> > > > fetch."
>> >> > > >    break
>> >> > > >  fi
>> >> > > >  segment=`ls -d $crawl/segments/* | tail -1`
>> >> > > >
>> >> > > >  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>> >> > > >  if [ $? -ne 0 ]
>> >> > > >  then
>> >> > > >    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>> >> > > >    echo "runbot: Deleting segment $segment."
>> >> > > >    rm $RMARGS $segment
>> >> > > >    continue
>> >> > > >  fi
>> >> > > >
>> >> > > > echo " ----- Updating Dadatabase ( $steps) -----"
>> >> > > >
>> >> > > >
>> >> > > >  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
>> >> > > > done
>> >> > > >
>> >> > > >
>> >> > > > thx
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > >> Date: Thu, 17 Dec 2009 03:21:04 +0800
>> >> > > >> Subject: Re: difference in time between an initial crawl and
>> >> > > >> recrawl
>> >> > with a   full crawldb
>> >> > > >> From: yangxiao9901@gmail.com
>> >> > > >> To: nutch-user@lucene.apache.org
>> >> > > >>
>> >> > > >> It depends on your crawldb size, and the number of urls you
>> >> > > >> fetch.
>> >> > > >> Crawldb stores the urls fetched and to be fetched. When you
>> >> > > >> recrawl
>> >> > > >> with seperated command, first you will read data from crawldb
>> >> > > >> and
>> >> > > >> generate the urls will be fetched this round.
>> >> > > >> An initial crawl first injects seed urls into crawldb, and then
>> >> > > >> start
>> >> > > >> the process the same with recrawl.
>> >> > > >> The initial crawl fetchs for a number of rounds according the
>> >> > > >> depth
>> >> > > >> parameter. For each round, new urls parsed from fetched pages
>> >> > > >> will
>> >> > > >> be
>> >> > > >> added to the crawldb, and will be used in the "generate" phase.
>> >> > > >>
>> >> > > >> Thanks!
>> >> > > >> Xiao
>> >> > > >>
>> >> > > >> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
>> >> > wrote:
>> >> > > >> >
>> >> > > >> > hi,
>> >> > > >> >
>> >> > > >> > i just want to know the difference between a first initial
>> >> > > >> > crawl
>> >> > > >> > and
>> >> > a recrawl using the fetch, generate, update commands
>> >> > > >> > is there a diffence in time between using an initial crawl
>> >> > > >> > every
>> >> > time (by deleting the crawl_folder ) and using a recrawl without
>> >> > deleting
>> >> > the initial crawl_folder
>> >> > > >> >
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> -MilleBii-
>> >  		 	   		
>> > _________________________________________________________________
>> > Windows Live: Make it easier for your friends to see what you’re up to
>> > on
>> > Facebook.
>> > http://go.microsoft.com/?linkid=9691816
>>
>>
>> --
>> -MilleBii-


-- 
-MilleBii-

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.
this is what I have in my nutch-site (I set db.fetch.interval.default to 6 hours instead of 30 days), because on our intranet pages can change 4 or 5 times a month, so I had to use this config to be able to refetch them several times a month:

<property>
  <name>db.fetch.interval.default</name>
  <value>21600</value>
  <description>The number of seconds between re-fetches of a page (12 hours = 43200).
  </description>
</property>

And this is the default configuration in nutch-default.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  </description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.
  </description>
</property>

So if I understood correctly, you are telling me that if my page did not change in the last crawl (after 6 hours), Nutch will not refetch it for up to 90 days?
That is the part I do not understand: if that is really the case, I will miss a lot of new pages on my intranet, since we add new outlinks to several pages many times a month.


The way I see it, the proof that a page is fetched is crawl.log: when I grep 'fetching http://' in crawl.log I get all the pages being fetched, and every day I find the same pages being fetched even though they have not changed yet.
So Nutch is refetching them every day even when they are not modified.
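For reference, this is roughly how I also check what the crawldb actually stores for a given URL (a minimal sketch, assuming the crawl directory is crawl/, that this Nutch version supports the readdb -url and -stats options, and that the URL below is only a placeholder for one of our intranet pages):

# inspect one crawldb entry: status, fetch time, retry interval, etc.
bin/nutch readdb crawl/crawldb -url http://intranet.example.com/somepage.html

# dump overall status counts (fetched / unfetched / gone) for the whole crawldb
bin/nutch readdb crawl/crawldb -stats

Since the fetch interval is stored per URL in the crawldb, it may also be worth checking whether URLs injected before the 6-hour change still carry the old 30-day interval.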


if you just clarify this to me it will help me so much...

thx a lot







> Date: Fri, 18 Dec 2009 09:00:18 +0100
> Subject: Re: difference in time between an initial crawl and recrawl with a 	full crawldb
> From: millebii@gmail.com
> To: nutch-user@lucene.apache.org
> 
> Wait 30 days and  you should see the diffence ... Since settings are
> time based if you crawl every day or hour it doesn't matter look in
> nutch-default for the settings that control this part of generate
> fetch list
> 
> 2009/12/18, BELLINI ADAM <mb...@msn.com>:
> >
> >
> > but i configured nutch to fetch every 6 hours, and i'm crawling every day at
> > 3 am, and even pages didnt change i see them been fetched every day !!
> >
> >
> >
> >
> >
> >
> >> Date: Fri, 18 Dec 2009 00:04:12 +0100
> >> Subject: Re: difference in time between an initial crawl and recrawl with
> >> a 	full crawldb
> >> From: millebii@gmail.com
> >> To: nutch-user@lucene.apache.org
> >>
> >> Well it is somewhat more subtil... nutch will only recrawl a page  every
> >> 30
> >> days by default, and if it finds that it did not change in the meantime it
> >> will delay even further to more than 30 days the next recrawl. After 90
> >> days
> >> everything is recrawled no matter what.
> >> So actually it does make a difference from scratch or not.
> >>
> >>
> >>
> >> 2009/12/17 BELLINI ADAM <mb...@msn.com>
> >>
> >> >
> >> > i understand now that nutch refetch pages even if they didnt change
> >> > ...and
> >> > thats why we couldnt save time ...
> >> > if nutch could fecth only pages that changed so we will save big amout
> >> > of
> >> > time since working with a full crawldb.
> >> > so it doens make difference between crawling from scratch or recrawling
> >> > having a full crawldb
> >> >
> >> >
> >> >
> >> >
> >> > > Date: Thu, 17 Dec 2009 16:08:38 +0800
> >> > > Subject: Re: difference in time between an initial crawl and recrawl
> >> > > with
> >> > a   full crawldb
> >> > > From: yangxiao9901@gmail.com
> >> > > To: nutch-user@lucene.apache.org
> >> > >
> >> > > If you crawl with "bin/nutch crawl ..." command without deleting the
> >> > > crawldb. The result will be the same with recrawl. It only wastes the
> >> > > initial injection phase and crawldb update phase, but that won't
> >> > > affect the final result.
> >> > >
> >> > >
> >> > > On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com> wrote:
> >> > > >
> >> > > > thx for the explanation,
> >> > > > so if i well understood using the separates commands i dont have to
> >> > > > run
> >> > as many times as i did it in the initial crawl (with depth 10).
> >> > > >
> >> > > > in my recrawl i'm also doing it in a loop of 10 !! am i wrong
> >> > > > looping
> >> > 10 times (generateting fetching parsing updating ) ?? mabe i could save
> >> > time
> >> > by doing just one loop  ! but since i have added
> >> > > >
> >> > > >
> >> > > >
> >> > > > steps=10
> >> > > > echo "----- Inject (Step 1 of $steps) -----"
> >> > > > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
> >> > > >
> >> > > > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> >> > > > for((i=0; i < $depth; i++))
> >> > > > do
> >> > > >  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> >> > > >
> >> > > >
> >> > > > $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
> >> > > >
> >> > > >  if [ $? -ne 0 ]
> >> > > >  then
> >> > > >    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
> >> > > >    break
> >> > > >  fi
> >> > > >  segment=`ls -d $crawl/segments/* | tail -1`
> >> > > >
> >> > > >  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> >> > > >  if [ $? -ne 0 ]
> >> > > >  then
> >> > > >    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> >> > > >    echo "runbot: Deleting segment $segment."
> >> > > >    rm $RMARGS $segment
> >> > > >    continue
> >> > > >  fi
> >> > > >
> >> > > > echo " ----- Updating Dadatabase ( $steps) -----"
> >> > > >
> >> > > >
> >> > > >  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> >> > > > done
> >> > > >
> >> > > >
> >> > > > thx
> >> > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > >> Date: Thu, 17 Dec 2009 03:21:04 +0800
> >> > > >> Subject: Re: difference in time between an initial crawl and
> >> > > >> recrawl
> >> > with a   full crawldb
> >> > > >> From: yangxiao9901@gmail.com
> >> > > >> To: nutch-user@lucene.apache.org
> >> > > >>
> >> > > >> It depends on your crawldb size, and the number of urls you fetch.
> >> > > >> Crawldb stores the urls fetched and to be fetched. When you recrawl
> >> > > >> with seperated command, first you will read data from crawldb and
> >> > > >> generate the urls will be fetched this round.
> >> > > >> An initial crawl first injects seed urls into crawldb, and then
> >> > > >> start
> >> > > >> the process the same with recrawl.
> >> > > >> The initial crawl fetchs for a number of rounds according the depth
> >> > > >> parameter. For each round, new urls parsed from fetched pages will
> >> > > >> be
> >> > > >> added to the crawldb, and will be used in the "generate" phase.
> >> > > >>
> >> > > >> Thanks!
> >> > > >> Xiao
> >> > > >>
> >> > > >> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
> >> > wrote:
> >> > > >> >
> >> > > >> > hi,
> >> > > >> >
> >> > > >> > i just want to know the difference between a first initial crawl
> >> > > >> > and
> >> > a recrawl using the fetch, generate, update commands
> >> > > >> > is there a diffence in time between using an initial crawl every
> >> > time (by deleting the crawl_folder ) and using a recrawl without
> >> > deleting
> >> > the initial crawl_folder
> >> > > >> >
> >> > > >> > _________________________________________________________________
> >> > > >> > Eligible CDN College & University students can upgrade to Windows
> >> > > >> > 7
> >> > before Jan 3 for only $39.99. Upgrade now!
> >> > > >> > http://go.microsoft.com/?linkid=9691819
> >> > > >
> >> > > > _________________________________________________________________
> >> > > > Ready. Set. Get a great deal on Windows 7. See fantastic deals on
> >> > Windows 7 now
> >> > > > http://go.microsoft.com/?linkid=9691818
> >> >
> >> > _________________________________________________________________
> >> > Windows Live: Make it easier for your friends to see what you’re up to
> >> > on
> >> > Facebook.
> >> > http://go.microsoft.com/?linkid=9691816
> >>
> >>
> >>
> >>
> >> --
> >> -MilleBii-
> >  		 	   		
> > _________________________________________________________________
> > Windows Live: Make it easier for your friends to see what you’re up to on
> > Facebook.
> > http://go.microsoft.com/?linkid=9691816
> 
> 
> -- 
> -MilleBii-
 		 	   		  

Re: difference in time between an initial crawl and recrawl with a full crawldb

Posted by MilleBii <mi...@gmail.com>.
Wait 30 days and you should see the difference... Since the settings are
time based, it does not matter whether you crawl every day or every hour;
look in nutch-default.xml for the settings that control this part of the
generate/fetch-list step.
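If the goal is just to force URLs back into the fetch list sooner than the interval stored in the crawldb, the generate step also takes an -adddays option (a sketch, assuming a crawl directory named crawl/ and that your Nutch version's generate command supports this option):

# treat "now + 30 days" as the current time when selecting URLs, so pages whose
# stored interval has not elapsed yet are still put into the fetch list
bin/nutch generate crawl/crawldb crawl/segments -adddays 30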

2009/12/18, BELLINI ADAM <mb...@msn.com>:
>
>
> but i configured nutch to fetch every 6 hours, and i'm crawling every day at
> 3 am, and even pages didnt change i see them been fetched every day !!
>
>
>
>
>
>
>> Date: Fri, 18 Dec 2009 00:04:12 +0100
>> Subject: Re: difference in time between an initial crawl and recrawl with
>> a 	full crawldb
>> From: millebii@gmail.com
>> To: nutch-user@lucene.apache.org
>>
>> Well it is somewhat more subtil... nutch will only recrawl a page  every
>> 30
>> days by default, and if it finds that it did not change in the meantime it
>> will delay even further to more than 30 days the next recrawl. After 90
>> days
>> everything is recrawled no matter what.
>> So actually it does make a difference from scratch or not.
>>
>>
>>
>> 2009/12/17 BELLINI ADAM <mb...@msn.com>
>>
>> >
>> > i understand now that nutch refetch pages even if they didnt change
>> > ...and
>> > thats why we couldnt save time ...
>> > if nutch could fecth only pages that changed so we will save big amout
>> > of
>> > time since working with a full crawldb.
>> > so it doens make difference between crawling from scratch or recrawling
>> > having a full crawldb
>> >
>> >
>> >
>> >
>> > > Date: Thu, 17 Dec 2009 16:08:38 +0800
>> > > Subject: Re: difference in time between an initial crawl and recrawl
>> > > with
>> > a   full crawldb
>> > > From: yangxiao9901@gmail.com
>> > > To: nutch-user@lucene.apache.org
>> > >
>> > > If you crawl with "bin/nutch crawl ..." command without deleting the
>> > > crawldb. The result will be the same with recrawl. It only wastes the
>> > > initial injection phase and crawldb update phase, but that won't
>> > > affect the final result.
>> > >
>> > >
>> > > On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com> wrote:
>> > > >
>> > > > thx for the explanation,
>> > > > so if i well understood using the separates commands i dont have to
>> > > > run
>> > as many times as i did it in the initial crawl (with depth 10).
>> > > >
>> > > > in my recrawl i'm also doing it in a loop of 10 !! am i wrong
>> > > > looping
>> > 10 times (generateting fetching parsing updating ) ?? mabe i could save
>> > time
>> > by doing just one loop  ! but since i have added
>> > > >
>> > > >
>> > > >
>> > > > steps=10
>> > > > echo "----- Inject (Step 1 of $steps) -----"
>> > > > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>> > > >
>> > > > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
>> > > > for((i=0; i < $depth; i++))
>> > > > do
>> > > >  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>> > > >
>> > > >
>> > > > $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
>> > > >
>> > > >  if [ $? -ne 0 ]
>> > > >  then
>> > > >    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>> > > >    break
>> > > >  fi
>> > > >  segment=`ls -d $crawl/segments/* | tail -1`
>> > > >
>> > > >  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>> > > >  if [ $? -ne 0 ]
>> > > >  then
>> > > >    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>> > > >    echo "runbot: Deleting segment $segment."
>> > > >    rm $RMARGS $segment
>> > > >    continue
>> > > >  fi
>> > > >
>> > > > echo " ----- Updating Dadatabase ( $steps) -----"
>> > > >
>> > > >
>> > > >  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
>> > > > done
>> > > >
>> > > >
>> > > > thx
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >> Date: Thu, 17 Dec 2009 03:21:04 +0800
>> > > >> Subject: Re: difference in time between an initial crawl and
>> > > >> recrawl
>> > with a   full crawldb
>> > > >> From: yangxiao9901@gmail.com
>> > > >> To: nutch-user@lucene.apache.org
>> > > >>
>> > > >> It depends on your crawldb size, and the number of urls you fetch.
>> > > >> Crawldb stores the urls fetched and to be fetched. When you recrawl
>> > > >> with seperated command, first you will read data from crawldb and
>> > > >> generate the urls will be fetched this round.
>> > > >> An initial crawl first injects seed urls into crawldb, and then
>> > > >> start
>> > > >> the process the same with recrawl.
>> > > >> The initial crawl fetchs for a number of rounds according the depth
>> > > >> parameter. For each round, new urls parsed from fetched pages will
>> > > >> be
>> > > >> added to the crawldb, and will be used in the "generate" phase.
>> > > >>
>> > > >> Thanks!
>> > > >> Xiao
>> > > >>
>> > > >> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
>> > wrote:
>> > > >> >
>> > > >> > hi,
>> > > >> >
>> > > >> > i just want to know the difference between a first initial crawl
>> > > >> > and
>> > a recrawl using the fetch, generate, update commands
>> > > >> > is there a diffence in time between using an initial crawl every
>> > time (by deleting the crawl_folder ) and using a recrawl without
>> > deleting
>> > the initial crawl_folder
>> > > >> >
>> > > >> > _________________________________________________________________
>> > > >> > Eligible CDN College & University students can upgrade to Windows
>> > > >> > 7
>> > before Jan 3 for only $39.99. Upgrade now!
>> > > >> > http://go.microsoft.com/?linkid=9691819
>> > > >
>> > > > _________________________________________________________________
>> > > > Ready. Set. Get a great deal on Windows 7. See fantastic deals on
>> > Windows 7 now
>> > > > http://go.microsoft.com/?linkid=9691818
>> >
>> > _________________________________________________________________
>> > Windows Live: Make it easier for your friends to see what you’re up to
>> > on
>> > Facebook.
>> > http://go.microsoft.com/?linkid=9691816
>>
>>
>>
>>
>> --
>> -MilleBii-
>  		 	   		
> _________________________________________________________________
> Windows Live: Make it easier for your friends to see what you’re up to on
> Facebook.
> http://go.microsoft.com/?linkid=9691816


-- 
-MilleBii-

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.

But I configured Nutch to fetch every 6 hours, and I am crawling every day at 3 AM, and even though pages did not change I see them being fetched every day!






> Date: Fri, 18 Dec 2009 00:04:12 +0100
> Subject: Re: difference in time between an initial crawl and recrawl with a 	full crawldb
> From: millebii@gmail.com
> To: nutch-user@lucene.apache.org
> 
> Well it is somewhat more subtil... nutch will only recrawl a page  every 30
> days by default, and if it finds that it did not change in the meantime it
> will delay even further to more than 30 days the next recrawl. After 90 days
> everything is recrawled no matter what.
> So actually it does make a difference from scratch or not.
> 
> 
> 
> 2009/12/17 BELLINI ADAM <mb...@msn.com>
> 
> >
> > i understand now that nutch refetch pages even if they didnt change ...and
> > thats why we couldnt save time ...
> > if nutch could fecth only pages that changed so we will save big amout of
> > time since working with a full crawldb.
> > so it doens make difference between crawling from scratch or recrawling
> > having a full crawldb
> >
> >
> >
> >
> > > Date: Thu, 17 Dec 2009 16:08:38 +0800
> > > Subject: Re: difference in time between an initial crawl and recrawl with
> > a   full crawldb
> > > From: yangxiao9901@gmail.com
> > > To: nutch-user@lucene.apache.org
> > >
> > > If you crawl with "bin/nutch crawl ..." command without deleting the
> > > crawldb. The result will be the same with recrawl. It only wastes the
> > > initial injection phase and crawldb update phase, but that won't
> > > affect the final result.
> > >
> > >
> > > On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com> wrote:
> > > >
> > > > thx for the explanation,
> > > > so if i well understood using the separates commands i dont have to run
> > as many times as i did it in the initial crawl (with depth 10).
> > > >
> > > > in my recrawl i'm also doing it in a loop of 10 !! am i wrong looping
> > 10 times (generateting fetching parsing updating ) ?? mabe i could save time
> > by doing just one loop  ! but since i have added
> > > >
> > > >
> > > >
> > > > steps=10
> > > > echo "----- Inject (Step 1 of $steps) -----"
> > > > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
> > > >
> > > > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> > > > for((i=0; i < $depth; i++))
> > > > do
> > > >  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> > > >
> > > >
> > > > $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
> > > >
> > > >  if [ $? -ne 0 ]
> > > >  then
> > > >    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
> > > >    break
> > > >  fi
> > > >  segment=`ls -d $crawl/segments/* | tail -1`
> > > >
> > > >  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> > > >  if [ $? -ne 0 ]
> > > >  then
> > > >    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> > > >    echo "runbot: Deleting segment $segment."
> > > >    rm $RMARGS $segment
> > > >    continue
> > > >  fi
> > > >
> > > > echo " ----- Updating Dadatabase ( $steps) -----"
> > > >
> > > >
> > > >  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> > > > done
> > > >
> > > >
> > > > thx
> > > >
> > > >
> > > >
> > > >
> > > >> Date: Thu, 17 Dec 2009 03:21:04 +0800
> > > >> Subject: Re: difference in time between an initial crawl and recrawl
> > with a   full crawldb
> > > >> From: yangxiao9901@gmail.com
> > > >> To: nutch-user@lucene.apache.org
> > > >>
> > > >> It depends on your crawldb size, and the number of urls you fetch.
> > > >> Crawldb stores the urls fetched and to be fetched. When you recrawl
> > > >> with seperated command, first you will read data from crawldb and
> > > >> generate the urls will be fetched this round.
> > > >> An initial crawl first injects seed urls into crawldb, and then start
> > > >> the process the same with recrawl.
> > > >> The initial crawl fetchs for a number of rounds according the depth
> > > >> parameter. For each round, new urls parsed from fetched pages will be
> > > >> added to the crawldb, and will be used in the "generate" phase.
> > > >>
> > > >> Thanks!
> > > >> Xiao
> > > >>
> > > >> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
> > wrote:
> > > >> >
> > > >> > hi,
> > > >> >
> > > >> > i just want to know the difference between a first initial crawl and
> > a recrawl using the fetch, generate, update commands
> > > >> > is there a diffence in time between using an initial crawl every
> > time (by deleting the crawl_folder ) and using a recrawl without deleting
> > the initial crawl_folder
> > > >> >
> > > >> > _________________________________________________________________
> > > >> > Eligible CDN College & University students can upgrade to Windows 7
> > before Jan 3 for only $39.99. Upgrade now!
> > > >> > http://go.microsoft.com/?linkid=9691819
> > > >
> > > > _________________________________________________________________
> > > > Ready. Set. Get a great deal on Windows 7. See fantastic deals on
> > Windows 7 now
> > > > http://go.microsoft.com/?linkid=9691818
> >
> > _________________________________________________________________
> > Windows Live: Make it easier for your friends to see what you’re up to on
> > Facebook.
> > http://go.microsoft.com/?linkid=9691816
> 
> 
> 
> 
> -- 
> -MilleBii-
 		 	   		  

Re: difference in time between an initial crawl and recrawl with a full crawldb

Posted by MilleBii <mi...@gmail.com>.
Well, it is somewhat more subtle... Nutch will only recrawl a page every 30
days by default, and if it finds that the page did not change in the meantime it
will push the next recrawl even further out, beyond 30 days. After 90 days
everything is recrawled no matter what.
So it actually does make a difference whether you start from scratch or not.
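Purely as an illustration of that behaviour (the real growth factor is controlled by the fetch-schedule settings in nutch-default.xml, so the 50% back-off used here is only an assumption), the interval for an unchanged page evolves roughly like this:

interval=$(( 30 * 24 * 3600 ))   # start at the 30-day default (in seconds)
max=$(( 90 * 24 * 3600 ))        # db.fetch.interval.max: forced refetch after 90 days
for round in 1 2 3 4 5; do
  echo "unchanged after round $round: next refetch in $(( interval / 86400 )) days"
  interval=$(( interval + interval / 2 ))       # assumed 50% back-off per unchanged fetch
  [ "$interval" -gt "$max" ] && interval=$max   # never beyond the 90-day cap
done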



2009/12/17 BELLINI ADAM <mb...@msn.com>

>
> i understand now that nutch refetch pages even if they didnt change ...and
> thats why we couldnt save time ...
> if nutch could fecth only pages that changed so we will save big amout of
> time since working with a full crawldb.
> so it doens make difference between crawling from scratch or recrawling
> having a full crawldb
>
>
>
>
> > Date: Thu, 17 Dec 2009 16:08:38 +0800
> > Subject: Re: difference in time between an initial crawl and recrawl with
> a   full crawldb
> > From: yangxiao9901@gmail.com
> > To: nutch-user@lucene.apache.org
> >
> > If you crawl with "bin/nutch crawl ..." command without deleting the
> > crawldb. The result will be the same with recrawl. It only wastes the
> > initial injection phase and crawldb update phase, but that won't
> > affect the final result.
> >
> >
> > On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com> wrote:
> > >
> > > thx for the explanation,
> > > so if i well understood using the separates commands i dont have to run
> as many times as i did it in the initial crawl (with depth 10).
> > >
> > > in my recrawl i'm also doing it in a loop of 10 !! am i wrong looping
> 10 times (generateting fetching parsing updating ) ?? mabe i could save time
> by doing just one loop  ! but since i have added
> > >
> > >
> > >
> > > steps=10
> > > echo "----- Inject (Step 1 of $steps) -----"
> > > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
> > >
> > > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> > > for((i=0; i < $depth; i++))
> > > do
> > >  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> > >
> > >
> > > $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
> > >
> > >  if [ $? -ne 0 ]
> > >  then
> > >    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
> > >    break
> > >  fi
> > >  segment=`ls -d $crawl/segments/* | tail -1`
> > >
> > >  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> > >  if [ $? -ne 0 ]
> > >  then
> > >    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> > >    echo "runbot: Deleting segment $segment."
> > >    rm $RMARGS $segment
> > >    continue
> > >  fi
> > >
> > > echo " ----- Updating Dadatabase ( $steps) -----"
> > >
> > >
> > >  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> > > done
> > >
> > >
> > > thx
> > >
> > >
> > >
> > >
> > >> Date: Thu, 17 Dec 2009 03:21:04 +0800
> > >> Subject: Re: difference in time between an initial crawl and recrawl
> with a   full crawldb
> > >> From: yangxiao9901@gmail.com
> > >> To: nutch-user@lucene.apache.org
> > >>
> > >> It depends on your crawldb size, and the number of urls you fetch.
> > >> Crawldb stores the urls fetched and to be fetched. When you recrawl
> > >> with seperated command, first you will read data from crawldb and
> > >> generate the urls will be fetched this round.
> > >> An initial crawl first injects seed urls into crawldb, and then start
> > >> the process the same with recrawl.
> > >> The initial crawl fetchs for a number of rounds according the depth
> > >> parameter. For each round, new urls parsed from fetched pages will be
> > >> added to the crawldb, and will be used in the "generate" phase.
> > >>
> > >> Thanks!
> > >> Xiao
> > >>
> > >> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com>
> wrote:
> > >> >
> > >> > hi,
> > >> >
> > >> > i just want to know the difference between a first initial crawl and
> a recrawl using the fetch, generate, update commands
> > >> > is there a diffence in time between using an initial crawl every
> time (by deleting the crawl_folder ) and using a recrawl without deleting
> the initial crawl_folder
> > >> >
> > >> > _________________________________________________________________
> > >> > Eligible CDN College & University students can upgrade to Windows 7
> before Jan 3 for only $39.99. Upgrade now!
> > >> > http://go.microsoft.com/?linkid=9691819
> > >
> > > _________________________________________________________________
> > > Ready. Set. Get a great deal on Windows 7. See fantastic deals on
> Windows 7 now
> > > http://go.microsoft.com/?linkid=9691818
>
> _________________________________________________________________
> Windows Live: Make it easier for your friends to see what you’re up to on
> Facebook.
> http://go.microsoft.com/?linkid=9691816




-- 
-MilleBii-

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.
I understand now that Nutch refetches pages even if they did not change, and that is why we could not save time...
If Nutch could fetch only the pages that changed, we would save a big amount of time when working with a full crawldb.
So it does not make much difference between crawling from scratch and recrawling with a full crawldb.




> Date: Thu, 17 Dec 2009 16:08:38 +0800
> Subject: Re: difference in time between an initial crawl and recrawl with a 	full crawldb
> From: yangxiao9901@gmail.com
> To: nutch-user@lucene.apache.org
> 
> If you crawl with "bin/nutch crawl ..." command without deleting the
> crawldb. The result will be the same with recrawl. It only wastes the
> initial injection phase and crawldb update phase, but that won't
> affect the final result.
> 
> 
> On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com> wrote:
> >
> > thx for the explanation,
> > so if i well understood using the separates commands i dont have to run as many times as i did it in the initial crawl (with depth 10).
> >
> > in my recrawl i'm also doing it in a loop of 10 !! am i wrong looping 10 times (generateting fetching parsing updating ) ?? mabe i could save time by doing just one loop  ! but since i have added
> >
> >
> >
> > steps=10
> > echo "----- Inject (Step 1 of $steps) -----"
> > $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
> >
> > echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> > for((i=0; i < $depth; i++))
> > do
> >  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
> >
> >
> > $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
> >
> >  if [ $? -ne 0 ]
> >  then
> >    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
> >    break
> >  fi
> >  segment=`ls -d $crawl/segments/* | tail -1`
> >
> >  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
> >  if [ $? -ne 0 ]
> >  then
> >    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
> >    echo "runbot: Deleting segment $segment."
> >    rm $RMARGS $segment
> >    continue
> >  fi
> >
> > echo " ----- Updating Dadatabase ( $steps) -----"
> >
> >
> >  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> > done
> >
> >
> > thx
> >
> >
> >
> >
> >> Date: Thu, 17 Dec 2009 03:21:04 +0800
> >> Subject: Re: difference in time between an initial crawl and recrawl with a   full crawldb
> >> From: yangxiao9901@gmail.com
> >> To: nutch-user@lucene.apache.org
> >>
> >> It depends on your crawldb size, and the number of urls you fetch.
> >> Crawldb stores the urls fetched and to be fetched. When you recrawl
> >> with seperated command, first you will read data from crawldb and
> >> generate the urls will be fetched this round.
> >> An initial crawl first injects seed urls into crawldb, and then start
> >> the process the same with recrawl.
> >> The initial crawl fetchs for a number of rounds according the depth
> >> parameter. For each round, new urls parsed from fetched pages will be
> >> added to the crawldb, and will be used in the "generate" phase.
> >>
> >> Thanks!
> >> Xiao
> >>
> >> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
> >> >
> >> > hi,
> >> >
> >> > i just want to know the difference between a first initial crawl and a recrawl using the fetch, generate, update commands
> >> > is there a diffence in time between using an initial crawl every time (by deleting the crawl_folder ) and using a recrawl without deleting the initial crawl_folder
> >> >
> >> > _________________________________________________________________
> >> > Eligible CDN College & University students can upgrade to Windows 7 before Jan 3 for only $39.99. Upgrade now!
> >> > http://go.microsoft.com/?linkid=9691819
> >
> > _________________________________________________________________
> > Ready. Set. Get a great deal on Windows 7. See fantastic deals on Windows 7 now
> > http://go.microsoft.com/?linkid=9691818
 		 	   		  

Re: difference in time between an initial crawl and recrawl with a full crawldb

Posted by xiao yang <ya...@gmail.com>.
If you crawl with the "bin/nutch crawl ..." command without deleting the
crawldb, the result will be the same as a recrawl. It only wastes time on the
initial injection phase and an extra crawldb update phase, but that won't
affect the final result.
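Concretely, re-running the one-shot command against the existing crawl directory behaves much like the separate-command loop; a sketch, assuming the usual 1.x crawl syntax, a seed directory named urls and a crawl directory named crawl:

# injection is effectively a no-op for URLs already in the crawldb, so this just
# runs the generate/fetch/updatedb rounds again on top of the existing data
bin/nutch crawl urls -dir crawl -depth 10 -threads 10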


On Thu, Dec 17, 2009 at 3:56 AM, BELLINI ADAM <mb...@msn.com> wrote:
>
> thx for the explanation,
> so if i well understood using the separates commands i dont have to run as many times as i did it in the initial crawl (with depth 10).
>
> in my recrawl i'm also doing it in a loop of 10 !! am i wrong looping 10 times (generateting fetching parsing updating ) ?? mabe i could save time by doing just one loop  ! but since i have added
>
>
>
> steps=10
> echo "----- Inject (Step 1 of $steps) -----"
> $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>
> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> for((i=0; i < $depth; i++))
> do
>  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>
>
> $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
>
>  if [ $? -ne 0 ]
>  then
>    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>    break
>  fi
>  segment=`ls -d $crawl/segments/* | tail -1`
>
>  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>  if [ $? -ne 0 ]
>  then
>    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>    echo "runbot: Deleting segment $segment."
>    rm $RMARGS $segment
>    continue
>  fi
>
> echo " ----- Updating Dadatabase ( $steps) -----"
>
>
>  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> done
>
>
> thx
>
>
>
>
>> Date: Thu, 17 Dec 2009 03:21:04 +0800
>> Subject: Re: difference in time between an initial crawl and recrawl with a   full crawldb
>> From: yangxiao9901@gmail.com
>> To: nutch-user@lucene.apache.org
>>
>> It depends on your crawldb size, and the number of urls you fetch.
>> Crawldb stores the urls fetched and to be fetched. When you recrawl
>> with seperated command, first you will read data from crawldb and
>> generate the urls will be fetched this round.
>> An initial crawl first injects seed urls into crawldb, and then start
>> the process the same with recrawl.
>> The initial crawl fetchs for a number of rounds according the depth
>> parameter. For each round, new urls parsed from fetched pages will be
>> added to the crawldb, and will be used in the "generate" phase.
>>
>> Thanks!
>> Xiao
>>
>> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
>> >
>> > hi,
>> >
>> > i just want to know the difference between a first initial crawl and a recrawl using the fetch, generate, update commands
>> > is there a diffence in time between using an initial crawl every time (by deleting the crawl_folder ) and using a recrawl without deleting the initial crawl_folder
>> >
>> > _________________________________________________________________
>> > Eligible CDN College & University students can upgrade to Windows 7 before Jan 3 for only $39.99. Upgrade now!
>> > http://go.microsoft.com/?linkid=9691819
>
> _________________________________________________________________
> Ready. Set. Get a great deal on Windows 7. See fantastic deals on Windows 7 now
> http://go.microsoft.com/?linkid=9691818

RE: difference in time between an initial crawl and recrawl with a full crawldb

Posted by BELLINI ADAM <mb...@msn.com>.
Thanks for the explanation.
So if I understood correctly, using the separate commands I do not have to run as many rounds as I did in the initial crawl (with depth 10).

In my recrawl I am also doing a loop of 10. Am I wrong to loop 10 times (generating, fetching, parsing, updating)? Maybe I could save time by doing just one loop (see the single-round sketch after the script below). Here is what I have added:



# NOTE: $NUTCH_HOME, $crawl, $depth, $threads and $RMARGS are assumed to be set earlier in the script
steps=10
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"

  # generate the fetch list for this round
  $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d $crawl/segments/* | tail -1`

  # fetch the segment (parsing is assumed to happen during fetch, since there is no separate parse step)
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  # fold the fetched URLs and newly discovered outlinks back into the crawldb
  echo "----- Updating Database (Step 3 of $steps) -----"
  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
done
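
For comparison, the single-round variant I am thinking of would be roughly this (same variables as above, just without the depth loop):

$NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments
segment=`ls -d $crawl/segments/* | tail -1`
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
$NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment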


thx




> Date: Thu, 17 Dec 2009 03:21:04 +0800
> Subject: Re: difference in time between an initial crawl and recrawl with a 	full crawldb
> From: yangxiao9901@gmail.com
> To: nutch-user@lucene.apache.org
> 
> It depends on your crawldb size, and the number of urls you fetch.
> Crawldb stores the urls fetched and to be fetched. When you recrawl
> with seperated command, first you will read data from crawldb and
> generate the urls will be fetched this round.
> An initial crawl first injects seed urls into crawldb, and then start
> the process the same with recrawl.
> The initial crawl fetchs for a number of rounds according the depth
> parameter. For each round, new urls parsed from fetched pages will be
> added to the crawldb, and will be used in the "generate" phase.
> 
> Thanks!
> Xiao
> 
> On Wed, Dec 16, 2009 at 11:01 PM, BELLINI ADAM <mb...@msn.com> wrote:
> >
> > hi,
> >
> > i just want to know the difference between a first initial crawl and a recrawl using the fetch, generate, update commands
> > is there a diffence in time between using an initial crawl every time (by deleting the crawl_folder ) and using a recrawl without deleting the initial crawl_folder
> >
> > _________________________________________________________________
> > Eligible CDN College & University students can upgrade to Windows 7 before Jan 3 for only $39.99. Upgrade now!
> > http://go.microsoft.com/?linkid=9691819
 		 	   		  