Posted to dev@nutch.apache.org by Cesare Zavattari <ce...@ctrl-z-bg.org> on 2012/11/15 12:33:27 UTC

setting modifiedTime in DefaultFetchSchedule

Hi all,
the AdaptiveFetchSchedule has the following line:

if (modifiedTime <= 0) modifiedTime = fetchTime;

that DefaultFetchSchedule lacks. This seems to
prevent DefaultFetchSchedule from correctly handling possible 304 responses
(modifiedTime seems to always be zero and HttpRequest.java doesn't
set the If-Modified-Since request header).

This is true for both Nutch 1.x and 2.x.

Is this the expected behaviour?
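
To make the question concrete, here is a minimal sketch of the guard quoted above and of why an unset modifiedTime means no conditional request can be sent. The class and method names are hypothetical and this is not Nutch code; it only assumes that modifiedTime and fetchTime are epoch milliseconds and that 0 means "unset".

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical illustration; not part of Nutch. */
final class ConditionalHeaderSketch {

    // HTTP date layout (IMF-fixdate), always rendered in GMT.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    /** The guard from AdaptiveFetchSchedule: keep a known value, else use fetchTime. */
    static long resolveModifiedTime(long modifiedTime, long fetchTime) {
        if (modifiedTime <= 0) modifiedTime = fetchTime;
        return modifiedTime;
    }

    /** An If-Modified-Since value exists only when a modified time is known. */
    static Optional<String> ifModifiedSince(long modifiedTime) {
        if (modifiedTime <= 0) return Optional.empty();
        return Optional.of(HTTP_DATE.format(Instant.ofEpochMilli(modifiedTime)));
    }

    public static void main(String[] args) {
        long fetchTime = 1352979207000L; // an arbitrary fetch time in epoch ms
        // Without the guard the stored value stays 0 and no header can be built.
        System.out.println(ifModifiedSince(0L).isPresent());
        // With the guard the first fetch time becomes the stored modified time.
        System.out.println(ifModifiedSince(resolveModifiedTime(0L, fetchTime)).isPresent());
    }
}
```

If the stored value is always zero, the second case never happens and the server has nothing to compare against, so it cannot answer 304.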

Thanks
Bye

-- 
Cesare

Re: setting modifiedTime in DefaultFetchSchedule

Posted by Cesare Zavattari <ce...@ctrl-z-bg.org>.

On Tue, Nov 20, 2012 at 12:12 AM, Sebastian Nagel <
wastl.nagel@googlemail.com> wrote:

> Hi Cesare,
>

Ciao Sebastian and thanks for your email.


> > modifiedTime = fetchTime;
> > instead of:
> > if (modifiedTime <= 0) modifiedTime = fetchTime;
> This will always overwrite the modified time with the time the fetch took
> place.
> I would prefer the way it's done in AdaptiveFetchSchedule:
> only set modifiedTime if it's unset (= 0).
>

Here's my problem:

- you fetch a page XXX for the first time
- modifiedTime is 0, so it's set to fetchTime
- from now on I'll get 304...
- ... until XXX changes
- after that, modifiedTime is never updated, so I'll never get a 304 again:
  the page will always be fetched (200) because it has always been modified
  since the stale If-Modified-Since date

This is why I always set modifiedTime. We could skip the update when the
status is NOTMODIFIED.

The same issue seems to affect AdaptiveFetchSchedule
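
The scenario above can be replayed with a small simulation. This is only a sketch under stated assumptions: the "server" is a single comparison against the page's last change time, the three policy names are made up for illustration, and none of this is Nutch code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of the three update policies discussed in this thread. */
final class ModifiedTimePolicySketch {

    enum Policy { ONLY_IF_UNSET, ALWAYS, SKIP_ON_NOT_MODIFIED }

    /** Mimics a server answering a conditional request. */
    static int respond(long pageChangedAt, long ifModifiedSince) {
        return (ifModifiedSince > 0 && pageChangedAt <= ifModifiedSince) ? 304 : 200;
    }

    /** Returns the modifiedTime to store after a fetch, depending on the policy. */
    static long update(Policy policy, long modifiedTime, long fetchTime, int status) {
        switch (policy) {
            case ONLY_IF_UNSET:
                return modifiedTime <= 0 ? fetchTime : modifiedTime;
            case ALWAYS:
                return fetchTime;
            default: // SKIP_ON_NOT_MODIFIED
                return status == 304 ? modifiedTime : fetchTime;
        }
    }

    /** changedAt[i] is the page's last change time as seen at fetchTimes[i]. */
    static List<Integer> simulate(Policy policy, long[] fetchTimes, long[] changedAt) {
        List<Integer> statuses = new ArrayList<>();
        long modifiedTime = 0; // unset before the first fetch
        for (int i = 0; i < fetchTimes.length; i++) {
            int status = respond(changedAt[i], modifiedTime);
            statuses.add(status);
            modifiedTime = update(policy, modifiedTime, fetchTimes[i], status);
        }
        return statuses;
    }

    public static void main(String[] args) {
        long[] fetchTimes = {100, 200, 300, 400};
        long[] changedAt = {50, 50, 250, 250}; // the page changes once, at t=250
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + simulate(p, fetchTimes, changedAt));
        }
    }
}
```

In this toy model, ONLY_IF_UNSET keeps returning 200 after the page has changed once, while the other two policies get a 304 again on the fourth fetch.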

> > I don't know if this is correct (probably not) but at least 304 seems to be
> > handled. In particular, in the protocol-file (File.getProtocolOutput) I've
> > added a special case for 304:
> >
> > if (code == 304) { // got a not modified response
> >     return new ProtocolOutput(response.toContent(),
> >         ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
> > }
> >
> > I suppose this is NOT the right solution :-)
> At first glance, it's not bad. Protocol-file obviously needs a revision:
> the 304 is set properly in FileResponse.java, but in File.java it is
> treated as a redirect:
>    else if (code >= 300 && code < 400) { // handle redirect
> So, thanks. Good catch!
>
> Would be great if you could open Jira issues for
> - setting modified time in DefaultSchedule
> - 304 handling in protocol-file
> If you can provide patches, even better. Thanks!
>

I want to be sure about the right solution for setting modifiedTime
properly.

> About your problem with removal / re-adding files:
> - a file system is crawled as if linked web pages:
>   a directory is just an HTML page with all files and sub-directories
>   as links.
>

That is clear. Let's consider a page A that links to a page B:

A -> B

A is the seed. I use the following command:

./nutch crawl urls -depth 2 -topN 5

we crawl it. Ok.
Now let's remove page B.

./nutch crawl urls -depth 2 -topN 5

B gets a 404. Fine.

now let's restore B and crawl again.

This works as expected if A and B are HTML pages (B is fetched by "./nutch
crawl"). If A is a directory and B is a file, B will never be fetched
again. Moreover, in this case A gets a 200 because a new file was added, so
the parsing/generate phases should force a refetch of B, shouldn't they?

Reproducing it is easy:

mkdir /tmp/files/
echo "AAA" >/tmp/files/aa.txt

the only seed is file://localhost/tmp/files/

./nutch crawl urls -depth 2 -topN 5    // both /tmp/files/ and /tmp/files/aa.txt are fetched
rm /tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5    // /tmp/files/aa.txt gets a 404
echo "AAA" >/tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5    // /tmp/files/ has changed and is fetched (200), while for aa.txt:

ParserJob: parsing all
Parsing file://localhost/tmp/files/
Skipping file://localhost/tmp/files/aa.txt; different batch id (null)

and it is never fetched again, even though the page that links to it (the
directory) has changed.

Is this the expected behavior?

thanks a lot

-- 
Cesare

Re: setting modifiedTime in DefaultFetchSchedule

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Cesare,

> modifiedTime = fetchTime;
> instead of:
> if (modifiedTime <= 0) modifiedTime = fetchTime;
This will always overwrite the modified time with the time the fetch took place.
I would prefer the way it's done in AdaptiveFetchSchedule:
only set modifiedTime if it's unset (= 0).

After a closer look at 1.x regarding this point I can confirm:
- with DefaultFetchSchedule the modifiedTime is never set / always 0

> I don't know if this is correct (probably not) but at least 304 seems to be
> handled. In particular, in the protocol-file (File.getProtocolOutput) I've
> added a special case for 304:
>
> if (code == 304) { // got a not modified response
>     return new ProtocolOutput(response.toContent(),
>         ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
> }
>
> I suppose this is NOT the right solution :-)
At first glance, it's not bad. Protocol-file obviously needs a revision:
the 304 is set properly in FileResponse.java, but in File.java it is treated as
a redirect:
   else if (code >= 300 && code < 400) { // handle redirect
So, thanks. Good catch!
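
A minimal sketch of the ordering problem, assuming a plain if/else-if chain over the numeric code: the 304 branch only takes effect if it is tested before the generic 3xx range. The Outcome enum and the class name are hypothetical stand-ins, and this does not reproduce the actual File.java.

```java
/** Hypothetical sketch of status dispatch; not the code in protocol-file. */
final class FileStatusDispatchSketch {

    enum Outcome { SUCCESS, NOT_MODIFIED, REDIRECT, NOT_FOUND, ERROR }

    static Outcome classify(int code) {
        if (code == 200) {
            return Outcome.SUCCESS;
        } else if (code == 304) {
            // Must come before the range check below, otherwise 304
            // falls into the redirect branch.
            return Outcome.NOT_MODIFIED;
        } else if (code >= 300 && code < 400) {
            return Outcome.REDIRECT;
        } else if (code == 404) {
            return Outcome.NOT_FOUND;
        }
        return Outcome.ERROR;
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 301, 304, 404, 500}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```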

It would be great if you could open Jira issues for
- setting modified time in DefaultFetchSchedule
- 304 handling in protocol-file
If you can provide patches, even better. Thanks!

About your problem with removal / re-adding files:
- a file system is crawled as if it were a set of linked web pages:
  a directory is just an HTML page with all files and sub-directories
  as links.
- re-crawling does not necessarily remove deleted files from the index.
  The URL/path to a deleted file is kept forever
  until it's removed explicitly.
- You have to force a re-fetch of the URL/file to be sure it is still
  present or has been removed. If 304 handling is working, this should
  be quite cheap for file system crawls because a re-parse is not necessary.
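
To illustrate why such a re-fetch can be cheap on a file system, here is a rough sketch of a not-modified check based on the file's mtime. It is an assumption about how such a check could look, not the code in FileResponse.java; the class and method names are hypothetical, and the stored time is assumed to be epoch milliseconds with 0 meaning "unset".

```java
import java.io.File;

/** Hypothetical sketch; not the code in Nutch's FileResponse.java. */
final class FileNotModifiedSketch {

    /** 404 if missing, 304 if unchanged since the stored time, else 200. */
    static int statusFor(boolean exists, long fileLastModified, long storedModifiedTime) {
        if (!exists) return 404;
        if (storedModifiedTime > 0 && fileLastModified <= storedModifiedTime) return 304;
        return 200;
    }

    /** Applies the decision to a real file or directory using its mtime. */
    static int statusFor(File file, long storedModifiedTime) {
        return statusFor(file.exists(), file.lastModified(), storedModifiedTime);
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        // No stored time yet: the entry has to be read.
        System.out.println(statusFor(dir, 0L));
        // Stored time later than any possible mtime: nothing to re-read or re-parse.
        System.out.println(statusFor(dir, Long.MAX_VALUE));
        // A path that does not exist (assumed absent on this machine).
        System.out.println(statusFor(new File(dir, "no-such-file-for-sketch.txt"), 0L));
    }
}
```

The check only needs the file's metadata, so neither the content nor a parser is involved when the answer is 304.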

Ciao,
Sebastian



Re: setting modifiedTime in DefaultFetchSchedule

Posted by Cesare Zavattari <ce...@ctrl-z-bg.org>.

Ciao,
in the meantime I've run some more tests using Nutch 2.1 with
DefaultFetchSchedule, where I've put:

modifiedTime = fetchTime;

instead of:

if (modifiedTime <= 0) modifiedTime = fetchTime;

I don't know if this is correct (probably not) but at least 304 seems to be
handled. In particular, in the protocol-file (File.getProtocolOutput) I've
added a special case for 304:

if (code == 304) { // got a not modified response
    return new ProtocolOutput(response.toContent(),
        ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
}

I suppose this is NOT the right solution :-)
Anyway, this is another problem I have with protocol-file. I have the seed:

file://localhost/tmp/files/

This directory contains a couple of files, aa.txt and bbbbb.txt.
If a file is deleted, the directory recrawled, and the file re-added, it is
ignored. I mean:

./nutch crawl urls -depth 2 -topN 5
rm /tmp/files/bbbbb.txt
./nutch crawl urls -depth 2 -topN 5
echo "saaaszzz" >/tmp/files/bbbbb.txt
./nutch crawl urls -depth 2 -topN 5

...
Skipping file://localhost/tmp/files/bbbbb.txt; different batch id (null)
...

and the dump sticks with

...
baseUrl:        file://localhost/tmp/files/bbbbb.txt
status: 1 (status_unfetched)
...
protocolStatus: EXCEPTION, args=[org.apache.nutch.protocol.file.FileError: File Error: 404]



what am I doing wrong?

Thanks a lot!






-- 
Cesare

Re: setting modifiedTime in DefaultFetchSchedule

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi Cesare,

hmhh... Good catch!

The modifiedTime is also set in CrawlDbReducer.reduce,
right after FetchSchedule.setFetchSchedule is called, if the signature
hasn't changed compared to the previous fetch, cf. NUTCH-1341.

At first glance, it looks like the modifiedTime is indeed never set
with DefaultFetchSchedule.
I'll have a more detailed look at this and come back soon.

Thanks,
Sebastian
