Posted to user@nutch.apache.org by Lokkju <lo...@gmail.com> on 2005/10/21 01:52:34 UTC

Re-crawling or what?

I have searched through the mail archives, and seen this question
asked a lot, but no answer ever seems to come back.  I am going to be
using nutch against 5 sites, and I want to update the index on a
nightly basis.  Besides deleting the previous crawl, then running it
again, what method of doing nightly updates is recommended?

Thanks,
Nick

Re: Re-crawling or what?

Posted by Lokkju <lo...@gmail.com>.
It would seem that an MD5 hash would still require you to actually get
all the remote content - but I'll look at the patch, and perhaps it
can give me some ideas on using the last-modified-date and the
content-size.
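
For what it's worth, a minimal standalone sketch of that idea (plain
java.net, not the Nutch fetcher; where the previous fetch time and size
are stored is left open and simply assumed here) might look like:

import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Sketch only: ask the server, via a HEAD request, whether a page looks
 * unchanged before deciding to download the body.  It combines an
 * If-Modified-Since check with a comparison of the advertised
 * Last-Modified date and Content-Length against the previous crawl.
 */
public class RefetchCheck {
    public static boolean looksUnchanged(String url, long lastFetchTime, long lastLength)
            throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");            // headers only, no body
        conn.setIfModifiedSince(lastFetchTime);   // sends If-Modified-Since
        int status = conn.getResponseCode();
        long lastModified = conn.getLastModified();   // 0 if the header is missing
        long length = conn.getContentLengthLong();    // -1 if the header is missing
        conn.disconnect();
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return true;                          // server says nothing changed
        }
        // Fall back to comparing the advertised date and size with the last crawl.
        return lastModified != 0 && lastModified <= lastFetchTime
                && length >= 0 && length == lastLength;
    }
}

Only when this returns false would the full download (and the MD5
comparison from the patch) need to happen.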

On 10/21/05, Michael Ji <fj...@yahoo.com> wrote:
> there is a nutch patch 61
> http://issues.apache.org/jira/browse/NUTCH-61
>
> to detect the unmodified content of a target page by
> looking for its content MD5 hash value; somehow, it
> is not merged into the branch yet; I implemented patch 61
> for my local development, but no further testing yet;
>
> for the refetching, you only have to generate a new
> fetchlist---not a new db;
>
> Michael Ji,
>
> --- Lokkju <lo...@gmail.com> wrote:
>
> > Well, I guess I am looking at a few things -
> >
> > Running nightly, as I said
> > Using the last-modified-date header returned by the server to
> > determine if I even want to download the whole file - if the last
> > modified date has not changed, and the file size is the same, then I
> > can probably skip it.
> >
> > Of course, this pre-supposes that I am only updating a database - it
> > seems sort of ridiculous that currently, the only easy method of
> > recrawling a site is to create a new db.
> >
> > On 10/21/05, Michael Ji <fj...@yahoo.com> wrote:
> > > I guess you can run segmentMergeTool to merge new
> > > segments with previous one ( document with duplicated
> > > URL and content MD5 will be discarded) and then run
> > > index on it,
> > >
> > > not sure if it is the best scenario for daily
> > > refetching---just my thought based on the code I dig
> > > out,
> > >
> > > Michael Ji,
> > >
> > > --- Lokkju <lo...@gmail.com> wrote:
> > >
> > > > I have searched through the mail archives, and seen
> > > > this question asked a lot, but no answer ever seems
> > > > to come back.  I am going to be using nutch against
> > > > 5 sites, and I want to update the index on a nightly
> > > > basis.  Besides deleting the previous crawl, then
> > > > running it again, what method of doing nightly
> > > > updates is recommended?
> > > >
> > > > Thanks,
> > > > Nick

Re: Re-crawling or what?

Posted by Michael Ji <fj...@yahoo.com>.
hi Stefan:

Actually, I implemented nutch 61 in my local
development and had a discussion with Andrzej (see
the attached comments from Andrzej).

Mainly, the first difficulty Andrzej pointed out is
the repeated "deduplication". This might be solved by
calling SegmentMergeTool.java, meaning we only keep a
fresh segment and have no need to keep all the old
segments. Of course, merging segments has a cost.

But the second difficulty, the "lost segment" problem,
is exactly as Andrzej described. I see no direct
solution yet. Maybe we could rely on the robustness of
our local file system.

My wish is to use nutch 61 to save parsing time when a
page's content has not changed.

My testing experience (2 months ago) was that nutch 61
DID generate parse_data/ and parse_text/ for a page
with unchanged content (my test might be wrong).
I will run the test again to verify that as soon as I
have a bit of time.
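
Just to illustrate the kind of check being discussed (this is not the
NUTCH-61 code; the class and method names below are invented for the
sketch):

import java.security.MessageDigest;

/**
 * Illustrative guard, not the actual patch: hash the refetched bytes and
 * compare them against the digest stored from the previous crawl of the
 * same URL.
 */
public class UnchangedContentCheck {
    public static boolean contentUnchanged(byte[] fetchedBytes, byte[] previousDigest)
            throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(fetchedBytes);
        return MessageDigest.isEqual(digest, previousDigest);
    }
}

If this returned true for a refetched page, the existing parse_data/ and
parse_text/ entries could simply be kept instead of being regenerated.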

thanks,

Michael Ji,
   
(attached, my previous discussion with Andrzej )
=================================================
Unfortunately, the patches related to detecting unmodified content
will have to wait until after the release.

Here's the problem: it's quite easy to add this checking and recording
capability to all fetcher plugins, fetchlist generation and db update
tools, and I've done this in my local patches. However, after a while
I discovered a serious problem in the way Nutch currently manages
"phasing out" of old segment data. If we assume that we always refresh
after some fixed interval (30 days, or whatever), then we can safely
delete segments older than 30 days. If the interval varies, then we
could potentially be stuck with some segments holding very old (but
still valid) data. This is very inefficient, because in a single given
segment there might be only a couple of such pages left after a while,
and the rest of them would have to be removed again and again by
deduplication, because newer copies would exist in newer segments.

Moreover (and this is the worst problem), if such segments are lost,
the information in the webdb must be updated in a way that forces
refetching, even though "If-Modified-Since" or the MD5 indicates that
the page is still unchanged since the last fetch. Currently the only
way to do this is to "add days" - but if we use a variable refetch
interval then that doesn't make much sense. I think we need a better
way to track which pages are "missing" from the segments and have to
be re-fetched, or a better DB update mechanism for when we lose some
segments.

Perhaps we should extend the Page to record which segment holds the
latest version of the page? But segments don't have unique IDs now (a
directory name is too fragile and too easily changed). Related
question: in the FetchListEntry we have a "fetch" flag. I think that
after minor modifications to the FetchListTool (to generate only
entries which we are supposed to fetch) we could get rid of this flag,
or change its semantics to mean "unconditionally fetch, even if
unmodified".
====================================================
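
To make that last bookkeeping idea concrete, here is a minimal, purely
hypothetical sketch (not Nutch code, and it assumes segments carry
stable unique IDs, which they currently do not):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: remember, per URL, which segment holds the newest
 * fetched copy.  If a segment is lost, every URL that pointed at it can
 * be forced onto the next fetchlist, regardless of If-Modified-Since or
 * MD5 saying the page is unchanged.
 */
public class SegmentTracker {
    private final Map<String, String> latestSegmentByUrl = new HashMap<String, String>();

    public void recordFetch(String url, String segmentId) {
        latestSegmentByUrl.put(url, segmentId);
    }

    /** URLs whose newest copy lived in a segment that no longer exists. */
    public List<String> urlsToForceRefetch(String lostSegmentId) {
        List<String> urls = new ArrayList<String>();
        for (Map.Entry<String, String> e : latestSegmentByUrl.entrySet()) {
            if (e.getValue().equals(lostSegmentId)) {
                urls.add(e.getKey());
            }
        }
        return urls;
    }
}

With something like this, losing a segment would translate directly into
a list of URLs to put on the next fetchlist unconditionally.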

--- Stefan Groschupf <sg...@media-style.com> wrote:

> > to detect the unmodified content of a target page by
> > looking for its content MD5 hash value; somehow, it
> > is not merged into the branch yet; I implemented patch 61
> > for my local development, but no further testing yet;
>
> Michael, I really would love to see this patch in the sources,
> however Andrzej Bialecki suggested some improvements.
> Can you realize these improvements against the current sources?  I
> would vote for the improved patch and I guess a lot of other people
> would find this improved patch very useful as well.
>
> THANKS!
> Stefan

Re: Re-crawling or what?

Posted by Stefan Groschupf <sg...@media-style.com>.
> to detect the unmodified content of a target page by
> looking for its content MD5 hash value; somehow, it
> is not merged into the branch yet; I implemented patch 61
> for my local development, but no further testing yet;

Michael, I really would love to see this patch in the sources,
however Andrzej Bialecki suggested some improvements.
Can you realize these improvements against the current sources?  I
would vote for the improved patch and I guess a lot of other people
would find this improved patch very useful as well.

THANKS!
Stefan


Re: Re-crawling or what?

Posted by Michael Ji <fj...@yahoo.com>.
there is a nutch patch 61
http://issues.apache.org/jira/browse/NUTCH-61

to detect the unmodified content of a target page by
looking for its content MD5 hash value; somehow, it
is not merged into the branch yet; I implemented patch 61
for my local development, but no further testing yet;

for the refetching, you only have to generate a new
fetchlist---not a new db;

Michael Ji,

--- Lokkju <lo...@gmail.com> wrote:

> Well, I guess I am looking at a few things -
> 
> Running nightly, as I said
> Using the last-modified-date header returned by the server to
> determine if I even want to download the whole file - if the last
> modified date has not changed, and the file size is the same, then I
> can probably skip it.
>
> Of course, this pre-supposes that I am only updating a database - it
> seems sort of ridiculous that currently, the only easy method of
> recrawling a site is to create a new db.
> 
> On 10/21/05, Michael Ji <fj...@yahoo.com> wrote:
> > I guess you can run segmentMergeTool to merge new
> > segments with previous one ( document with duplicated
> > URL and content MD5 will be discarded) and then run
> > index on it,
> >
> > not sure if it is the best scenario for daily
> > refetching---just my thought based on the code I dig
> > out,
> >
> > Michael Ji,
> >
> > --- Lokkju <lo...@gmail.com> wrote:
> >
> > > I have searched through the mail archives, and seen
> > > this question asked a lot, but no answer ever seems
> > > to come back.  I am going to be using nutch against
> > > 5 sites, and I want to update the index on a nightly
> > > basis.  Besides deleting the previous crawl, then
> > > running it again, what method of doing nightly
> > > updates is recommended?
> > >
> > > Thanks,
> > > Nick

Re: Re-crawling or what?

Posted by Lokkju <lo...@gmail.com>.
Well, I guess I am looking at a few things -

Running nightly, as I said
Using the last-modified-date header returned by the server to
determine if I even want to download the whole file - if the last
modified date has not changed, and the file size is the same, then I
can probably skip it.

Of course, this pre-supposes that I am only updating a database - it
seems sort of ridiculous that currently, the only easy method of
recrawling a site is to create a new db.

On 10/21/05, Michael Ji <fj...@yahoo.com> wrote:
> I guess you can run segmentMergeTool to merge new
> segments with previous one ( document with duplicated
> URL and content MD5 will be discarded) and then run
> index on it,
>
> not sure if it is the best scenario for daily
> refetching---just my thought based on the code I dig
> out,
>
> Michael Ji,
>
> --- Lokkju <lo...@gmail.com> wrote:
>
> > I have searched through the mail archives, and seen
> > this question asked a lot, but no answer ever seems to
> > come back.  I am going to be using nutch against 5 sites,
> > and I want to update the index on a nightly basis.
> > Besides deleting the previous crawl, then running it
> > again, what method of doing nightly updates is
> > recommended?
> >
> > Thanks,
> > Nick
> >

Re: Re-crawling or what?

Posted by Michael Ji <fj...@yahoo.com>.
I guess you can run segmentMergeTool to merge new
segments with previous one ( document with duplicated
URL and content MD5 will be discarded) and then run
index on it,

not sure if it is the best scenario for daily
refetching---just my thought based on the code I dig
out,
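
As a rough illustration of what "documents with duplicated URL and
content MD5 are discarded" could mean in practice (this is not
SegmentMergeTool's actual code; the record type and the tie-breaking by
fetch time are assumptions of the sketch):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: keep just the newest record per URL and per
 *  content MD5 when merging segment data, dropping older duplicates
 *  before indexing. */
public class DedupSketch {
    public static class Record {            // made-up record, not a Nutch type
        final String url;
        final String contentMd5;
        final long fetchTime;
        Record(String url, String contentMd5, long fetchTime) {
            this.url = url;
            this.contentMd5 = contentMd5;
            this.fetchTime = fetchTime;
        }
    }

    public static List<Record> dedup(List<Record> merged) {
        Map<String, Record> newestByUrl = new HashMap<String, Record>();
        Map<String, Record> newestByMd5 = new HashMap<String, Record>();
        for (Record r : merged) {
            keepNewest(newestByUrl, r.url, r);
            keepNewest(newestByMd5, r.contentMd5, r);
        }
        // A record survives only if it is the newest for both its URL and its MD5.
        List<Record> kept = new ArrayList<Record>();
        for (Record r : merged) {
            if (newestByUrl.get(r.url) == r && newestByMd5.get(r.contentMd5) == r) {
                kept.add(r);
            }
        }
        return kept;
    }

    private static void keepNewest(Map<String, Record> map, String key, Record r) {
        Record old = map.get(key);
        if (old == null || r.fetchTime > old.fetchTime) {
            map.put(key, r);
        }
    }
}

The index would then be built over the surviving records only.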

Michael Ji,

--- Lokkju <lo...@gmail.com> wrote:

> I have searched through the mail archives, and seen
> this question asked a lot, but no answer ever seems to
> come back.  I am going to be using nutch against 5 sites,
> and I want to update the index on a nightly basis.
> Besides deleting the previous crawl, then running it
> again, what method of doing nightly updates is
> recommended?
> 
> Thanks,
> Nick
> 



	
		