Posted to user@nutch.apache.org by Mehmet Tan <me...@agmlab.com> on 2006/04/05 16:21:31 UTC

Re: Adaptive Refetch

   Andrzej,
Thanks for your response and patch, but I have a few more questions about
adaptive refetch. As far as I understand, the solution below is 'not to
overwrite some fields of the entries' in the db. Assume we applied the
adaptive refetch idea in your patch to the 0.7 version; we have the same
redirection problem there too. What do you think is the best way to solve
it in version 0.7?

Thanks...

Mehmet

Andrzej Bialecki wrote:

> Andrzej Bialecki wrote:
>
>> Mehmet Tan wrote:
>>
>>>    Hi,
>>> I want to ask a question about redirections. Correct me if I'm wrong
>>> but if a page is redirected to a page that is already in the webdb, 
>>> then the
>>> next updatedb operation will overwrite all previous info about refetch,
>>> because it is a newly created page in the fetcher whose 
>>> fetchInterval is the initial
>>> fetch interval. How does the adaptive refetch algorithm handle this 
>>> situation?
>>
>>
>> Yes, this is a bug, and it affects both the original and the patched 
>> versions - fetch interval shouldn't be blindly copied from any new 
>> CrawlDatum (this happens in CrawlDbReducer.java:86 in both versions), 
>> instead it should be initialized with the value from 
>> old.getFetchInterval(), if available. Please fix this in your 
>> version, I'll fix this in the un-patched version.
>>
>> Thanks for spotting this!
>>
>
> Please check the attached patch, it should properly copy all original 
> values first, and then only update those that are necessary.
>
>------------------------------------------------------------------------
>
>Index: CrawlDbReducer.java
>===================================================================
>--- CrawlDbReducer.java	(revision 389791)
>+++ CrawlDbReducer.java	(working copy)
>@@ -61,38 +61,38 @@
>       }
>     }
> 
>-    CrawlDatum result = null;
>+    CrawlDatum result = new CrawlDatum();
>+    // initialize with previous values, also copy metadata from old
>+    // and overlay them with new metadata
>+    if (old != null) {
>+      result.set(old);
>+      result.getMetaData().putAll(highest.getMetaData());
>+    } else {
>+      result.set(highest);
>+    }
> 
>     switch (highest.getStatus()) {                // determine new status
> 
>     case CrawlDatum.STATUS_DB_UNFETCHED:          // no new entry
>     case CrawlDatum.STATUS_DB_FETCHED:
>     case CrawlDatum.STATUS_DB_GONE:
>-      result = old;                               // use old
>+      // use old
>+      result = old;
>       break;
> 
>     case CrawlDatum.STATUS_LINKED:                // highest was link
>-      if (old != null) {                          // if old exists
>-        result = old;                             // use it
>-      } else {
>-        result = highest;                         // use new entry
>+      if (old == null) {
>         result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
>-        result.setScore(1.0f);                    // initial score is 1.0f
>       }
>-      result.setSignature(null);                  // reset the signature
>       break;
>       
>     case CrawlDatum.STATUS_FETCH_SUCCESS:         // succesful fetch
>-      result = highest;                           // use new entry
>-      if (highest.getSignature() == null) highest.setSignature(signature);
>+      if (highest.getSignature() == null) result.setSignature(signature);
>       result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
>       result.setNextFetchTime();
>       break;
> 
>     case CrawlDatum.STATUS_FETCH_RETRY:           // temporary failure
>-      result = highest;                           // use new entry
>-      if (old != null)
>-        result.setSignature(old.getSignature());  // use old signature
>       if (highest.getRetriesSinceFetch() < retryMax) {
>         result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
>       } else {
>@@ -101,9 +101,6 @@
>       break;
> 
>     case CrawlDatum.STATUS_FETCH_GONE:            // permanent failure
>-      result = highest;                           // use new entry
>-      if (old != null)
>-        result.setSignature(old.getSignature());  // use old signature
>       result.setStatus(CrawlDatum.STATUS_DB_GONE);
>       break;
> 
>@@ -111,10 +108,8 @@
>       throw new RuntimeException("Unknown status: "+highest.getStatus());
>     }
>     
>-    if (result != null) {
>-      result.setScore(result.getScore() + scoreIncrement);
>-      output.collect(key, result);
>-    }
>+    result.setScore(result.getScore() + scoreIncrement);
>+    output.collect(key, result);
>   }
> 
> }
>  
>
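
In other words, the key point of the patch is that the reducer now starts
from the old db entry, so adaptive fields such as fetchInterval survive an
update, and it only overlays what the new fetch actually changed.
Simplified (same names as in the diff above):

    CrawlDatum result = new CrawlDatum();
    if (old != null) {
      result.set(old);                                     // keep fetchInterval, score, old metadata
      result.getMetaData().putAll(highest.getMetaData());  // newer metadata wins on conflicts
    } else {
      result.set(highest);                                 // first time we see this url
    }
    // the switch on highest.getStatus() then only touches the fields that
    // really changed: status, signature, next fetch time.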


Re: Adaptive Refetch

Posted by Andrzej Bialecki <ab...@getopt.org>.
Mehmet Tan wrote:
>
>  
>   Sorry but I am not sure I could explain the problem properly.
> What I am trying to ask is this:
> You have pages A,B,C,D in webdb and then you come
> to a page E during the crawl and page E redirects you to page
> A for example. Then you create a new Page object in the fetcher
> with url A and write this to db (with updatedb). This overwrites page A
> already in db, and you lose everything you knew about page A.
>
> In version 0.8, you (correct me if I am wrong) copy the old values so as
> not to overwrite some fields. So I am trying to find out how to solve the
> above redirection problem in nutch-0.7, if we apply your adaptive refetch
> idea to nutch-0.7.

Ah, ok, I get it now.

Well, first of all in 0.7 there was no metadata to worry about, so the 
issue is simpler.

In 0.7, if you look at UpdateDatabaseTool, it clones the Page found in
fetcherOutput. This instance should be equal to the old instance (from the
older DB) plus any updates made during fetching. However, if this Page
comes from a redirect, then it will indeed contain wrong information (a
newly initialized score, see Fetcher:156). So UpdateDatabaseTool:256 should
probably use webdb.addPageIfNotPresent(newPage).
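
Roughly like this (the surrounding context is paraphrased; only the
addPageIfNotPresent call itself is the actual suggestion):

    // where UpdateDatabaseTool writes back the Page cloned from fetcherOutput:
    webdb.addPageIfNotPresent(newPage);   // an existing Page for url A keeps its score and
                                          // fetch history; only genuinely new urls get added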

When it comes to 0.8, the situation is slightly different. First, there is
a bug in Fetcher: currently it doesn't handle redirects based on parsed
content, and doesn't store this information in the segment. :/ So no harm
is done yet, but purely by accident.

Then, in CrawlDbReducer (latest revision) we copy just the old metadata,
and all other information is taken from the new CrawlDatum. It's true,
however, that if you fetched the same page twice or more in a single
segment (or even in a single updatedb batch job), then some of the entries
will read SUCCESS but could contain incomplete data (e.g. missing the
metadata that was stored in the CrawlDB and put on the fetchlist). Which
one will be picked depends on CrawlDatum.compareTo, probably the latest
(which may have come from a redirect). As we loop in CrawlDbReducer,
trying to find the "highest" status value, there can be more than one
value with the same status (SUCCESS), and we will be left with the last
one.

So, the problem still exists, we could lose some data.

A way to solve this would be to introduce CrawlDatum.SUCCESS_REDIRECTED,
with a value lower than CrawlDatum.SUCCESS. By default, we should probably
skip such entries. Optionally, we could also accumulate in the result any
metadata from all CrawlDatum.SUCCESS* pages, but there is again a danger
that some newly found pages will contain default metadata that overwrites
values coming from "legitimate" entries in a fetchlist.
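
Roughly (the constant name and numeric value below are only placeholders;
the value just has to sort below the regular SUCCESS status, and the
reducer variable names are illustrative):

    // in CrawlDatum:
    public static final byte STATUS_SUCCESS_REDIRECTED = 4;  // placeholder value, < SUCCESS

    // in CrawlDbReducer, while scanning the values for the "highest" entry:
    if (datum.getStatus() == CrawlDatum.STATUS_SUCCESS_REDIRECTED)
      continue;     // by default a synthetic redirect entry never wins over a real fetch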

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Adaptive Refetch

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Mehmet Tan wrote:
>> What I am trying to ask is this:
>> You have pages A,B,C,D in webdb and then you come
>> to a page E during the crawl and page E redirects you to page
>> A for example. Then you create a new Page object in the fetcher
>> with url A and write this to db (with updatedb). This overwrites page A
>> already in db, and you lose everything you knew about page A.
>
> Redirects are mostly invisible to Nutch.  In the case you describe, 
> the content of url E (which redirects to A) would be the same as the 
> content for A, but these would have separate entries in the CrawlDB, 
> link-graph, etc.  We do store the final url in a redirect chain so 
> that we can resolve relative references in the page, but that is not 
> used as the url for the content.  The content is always associated 
> with the first url in the redirect chain.

The problem was not conceptual, but in the implementation of 
CrawlDbReducer, where new "synthetic" CrawlDatum A' (created in response 
to a redirect) could overwrite CrawlDatum A coming from a legitimate 
entry in the fetchlist. CrawlDatum A could contain metadata coming from
previous fetches, which would be absent in CrawlDatum A', but in the end
CrawlDatum A' would probably be picked as the final version to be
committed to the DB, resulting in data loss.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Adaptive Refetch

Posted by Doug Cutting <cu...@apache.org>.
Mehmet Tan wrote:
> What I am trying to ask is this:
> You have pages A,B,C,D in webdb and then you come
> to a page E during the crawl and page E redirects you to page
> A for example. Then you create a new Page object in the fetcher
> with url A and write this to db (with updatedb). This overwrites page A
> already in db, and you lose everything you knew about page A.

Redirects are mostly invisible to Nutch.  In the case you describe, the 
content of url E (which redirects to A) would be the same as the content 
for A, but these would have separate entries in the CrawlDB, link-graph, 
etc.  We do store the final url in a redirect chain so that we can 
resolve relative references in the page, but that is not used as the url 
for the content.  The content is always associated with the first url in 
the redirect chain.

Doug

Re: Adaptive Refetch

Posted by Mehmet Tan <me...@agmlab.com>.
  
   Sorry but I am not sure I could explain the problem properly.
What I am trying to ask is this:
You have pages A,B,C,D in webdb and then you come
to a page E during the crawl and page E redirects you to page
A for example. Then you create a new Page object in the fetcher
with url A and write this to db (with updatedb). This overwrites page A
already in db, and you lose everything you knew about page A.

In version 0.8, you (correct me if I am wrong) copy the old values so as not
to overwrite some fields. So I am trying to find out how to solve the above
redirection problem in nutch-0.7, if we apply your adaptive refetch idea to
nutch-0.7.

Thanks..

Mehmet

Andrzej Bialecki wrote:

> Mehmet Tan wrote:
>
>>
>>   Andrzej,
>> Thanks for your response and patch. But I have a few more questions 
>> about
>> adaptive refetch. As far as I understood the solution below is 'not 
>> to overwrite
>> some fields of the entries' in the db. Assume we applied the adaptive 
>> refetch idea in your patch to the 0.7 version. We have the same 
>> redirection problem there too.
>> What do you think is the best way to solve this problem there in 
>> version 0.7?
>
>
> Well, you refer to two different problems:
>
> * there was a problem in CrawlDbReducer that (possibly) new values of 
> fetchInterval and fetchTime were not applied correctly to the 
> CrawlDatum to be stored in the DB. The patch contained a fix ONLY for 
> this issue.
>
> * redirection problem: I'm not sure what should be the solution, IMHO 
> it's a matter of properly setting URLFilters. If you don't allow 
> certain patterns, you should not collect such urls, no matter if they 
> come from redirection or directly from the outlinks. If you make an 
> exception for such urls, next time you generate a fetchlist or 
> updatedb these urls will be filtered out anyway.
>


Re: Adaptive Refetch

Posted by Andrzej Bialecki <ab...@getopt.org>.
Mehmet Tan wrote:
>
>   Andrzej,
> Thanks for your response and patch. But I have a few more questions about
> adaptive refetch. As far as I understood the solution below is 'not to 
> overwrite
> some fields of the entries' in the db. Assume we applied the adaptive 
> refetch idea in your patch to the 0.7 version. We have the same 
> redirection problem there too.
> What do you think is the best way to solve this problem there in 
> version 0.7?

Well, you refer to two different problems:

* there was a problem in CrawlDbReducer where new values of fetchInterval
and fetchTime were (possibly) not applied correctly to the CrawlDatum to be
stored in the DB. The patch contained a fix ONLY for this issue.

* redirection problem: I'm not sure what the solution should be; IMHO it's
a matter of properly configuring URLFilters. If you don't allow certain
patterns, you should not collect such urls, no matter whether they come
from a redirection or directly from the outlinks. And if you make an
exception for such urls, the next time you generate a fetchlist or run
updatedb they will be filtered out anyway.
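
For example, with the urlfilter-regex plugin the patterns in
conf/regex-urlfilter.txt decide which urls are accepted at all (the pattern
below is only an illustration):

    # never accept urls under this path, whether they come from an outlink
    # or from a redirect target
    -^http://www\.example\.com/private/

    # accept everything else
    +.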

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com