Posted to dev@nutch.apache.org by Cesare Zavattari <ce...@ctrl-z-bg.org> on 2012/11/15 12:33:27 UTC
setting modifiedTime in DefaultFetchSchedule
Hi all,
the AdaptiveFetchSchedule has the following line:
if (modifiedTime <= 0) modifiedTime = fetchTime;
which DefaultFetchSchedule lacks. This seems to
prevent DefaultFetchSchedule from correctly handling possible 304 responses
(modifiedTime always seems to be zero and HttpRequest.java doesn't
set the If-Modified-Since request header).
This is true for both nutch 1.x and 2.x.
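To make the concern concrete, here is a minimal sketch, not the Nutch code. The class and method names are hypothetical, and the rule "send the header only for a positive modifiedTime" is an assumption about how a conditional request would be built. It only shows that, without the guard, an unset modifiedTime never produces an If-Modified-Since header.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Optional;

/** Hypothetical illustration; none of these names exist in Nutch. */
class ModifiedTimeSketch {

  /** The guard from AdaptiveFetchSchedule: only fill modifiedTime when it is unset. */
  static long withGuard(long modifiedTime, long fetchTime) {
    if (modifiedTime <= 0) modifiedTime = fetchTime;
    return modifiedTime;
  }

  /** Without the guard, an unset modifiedTime simply stays at 0. */
  static long withoutGuard(long modifiedTime, long fetchTime) {
    return modifiedTime;
  }

  /** Assumed request logic: emit the header only for a positive time (ms since epoch). */
  static Optional<String> ifModifiedSince(long modifiedTime) {
    if (modifiedTime <= 0) return Optional.empty();
    return Optional.of(DateTimeFormatter.RFC_1123_DATE_TIME
        .format(Instant.ofEpochMilli(modifiedTime).atOffset(ZoneOffset.UTC)));
  }

  public static void main(String[] args) {
    long fetchTime = 1352979207000L; // an arbitrary fetch time in milliseconds
    // No header can be built while modifiedTime stays at 0.
    System.out.println(ifModifiedSince(withoutGuard(0L, fetchTime)).isPresent());
    // With the guard, the next request can carry the header.
    System.out.println(ifModifiedSince(withGuard(0L, fetchTime)).isPresent());
  }
}
```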
Is this the expected behaviour?
Thanks
Bye
--
Cesare
Re: setting modifiedTime in DefaultFetchSchedule
Posted by Cesare Zavattari <ce...@ctrl-z-bg.org>.
On Tue, Nov 20, 2012 at 12:12 AM, Sebastian Nagel <
wastl.nagel@googlemail.com> wrote:
> Hi Cesare,
>
Ciao Sebastian and thanks for your email.
> > modifiedTime = fetchTime;
> > instead of:
> > if (modifiedTime <= 0) modifiedTime = fetchTime;
> This will always overwrite modified time with the time the fetch took
> place.
> I would prefer the way as it's done in AdaptiveFetchSchedule:
> only set modifiedTime if it's unset (=0).
>
here's my problem:
- you fetch a page XXX for the first time
- modifiedTime is 0, so it's set to fetchTime
- from now on I'll get 304...
- ... until XXX changes
- after that, modifiedTime is never updated and I'll never get a 304 again: the page
will always be fetched (200) because the If-Modified-Since condition will always be true
This is why I always set modifiedTime. We could skip it if the status is
NOTMODIFIED.
The same issue seems to affect AdaptiveFetchSchedule.
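A minimal simulation of this scenario, under stated assumptions: the server model (304 only when the page is not newer than the If-Modified-Since value), the timestamps, and all names are hypothetical and are not taken from Nutch. It compares the guard-only policy with the "always set, except on NOTMODIFIED" policy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation; not Nutch code. */
class ConditionalFetchSim {

  /** Assumed server behaviour for a request carrying If-Modified-Since (0 = no header). */
  static int respond(long pageLastModified, long ifModifiedSince) {
    return (ifModifiedSince <= 0 || pageLastModified > ifModifiedSince) ? 200 : 304;
  }

  /** Runs five fetches; the page changes once, before the third fetch. */
  static List<Integer> run(boolean alwaysUpdate) {
    long modifiedTime = 0;       // unset before the first fetch
    long pageLastModified = 50;  // page last changed at t=50
    List<Integer> statuses = new ArrayList<>();
    for (long fetchTime = 100; fetchTime <= 500; fetchTime += 100) {
      if (fetchTime == 300) pageLastModified = 250; // the page changes
      int status = respond(pageLastModified, modifiedTime);
      statuses.add(status);
      if (alwaysUpdate) {
        // Proposed policy: set modifiedTime on every fetch except NOTMODIFIED.
        if (status != 304) modifiedTime = fetchTime;
      } else if (modifiedTime <= 0) {
        // Guard-only policy: set modifiedTime once and never again.
        modifiedTime = fetchTime;
      }
    }
    return statuses;
  }

  public static void main(String[] args) {
    System.out.println("guard only:    " + run(false));
    System.out.println("always update: " + run(true));
  }
}
```

With the guard-only policy every fetch after the change comes back as 200, while the other policy returns to 304 once the changed page has been fetched.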
> > I don't know if this is correct (probably not) but at least 304 seems to be
> > handled. In particular, in the protocol-file (File.getProtocolOutput) I've
> > added a special case for 304:
> >
> > if (code == 304) { // got a not modified response
> > return new ProtocolOutput(response.toContent(),
> > ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
> > }
> >
> > I suppose this is NOT the right solution :-)
> At a first glance, it's not bad. Protocol-file needs obviously a revision:
> the 304 is set properly in FileResponse.java but in File.java it is treated as
> redirect:
> else if (code >= 300 && code < 400) { // handle redirect
> So, thanks. Good catch!
>
> Would be great if you could open Jira issues for
> - setting modified time in DefaultSchedule
> - 304 handling in protocol-file
> If you can provide patches, even better. Thanks!
>
I want to be sure about the right solution for setting modifiedTime
properly.
> About your problem with removal / re-adding files:
> - a file system is crawled as if linked web pages:
> a directory is just an HTML page with all files and sub-directories
> as links.
>
This is clear. Let's consider a page A that links to a page B:
A -> B
A is the seed. I use the following command:
./nutch crawl urls -depth 2 -topN 5
we crawl it. Ok.
Now let's remove page B.
./nutch crawl urls -depth 2 -topN 5
B gets a 404. Fine.
now let's restore B and crawl again.
This works as expected if A and B are HTML pages (B is fetched by "./nutch
crawl"). If A is a directory and B is a file, B will never be fetched
again. Moreover, in this case A gets a 200 because a new file has been added, so
the parsing/generate phases should force the refetch of B, shouldn't they?
Reproducing it is easy:
mkdir /tmp/files/
echo "AAA" >/tmp/files/aa.txt
the only seed is file://localhost/tmp/files/
./nutch crawl urls -depth 2 -topN 5 // both /tmp/files/ and /tmp/files/aa.txt are fetched
rm /tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5 // /tmp/files/aa.txt gets a 404
echo "AAA" >/tmp/files/aa.txt
./nutch crawl urls -depth 2 -topN 5 // /tmp/files/ has changed and is fetched (200), while for aa.txt:
ParserJob: parsing all
Parsing file://localhost/tmp/files/
Skipping file://localhost/tmp/files/aa.txt; different batch id (null)
and it is never fetched again, even though the page that links to it (the directory)
has changed.
Is this the expected behavior?
thanks a lot
--
Cesare
Re: setting modifiedTime in DefaultFetchSchedule
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Cesare,
> modifiedTime = fetchTime;
> instead of:
> if (modifiedTime <= 0) modifiedTime = fetchTime;
This will always overwrite modified time with the time the fetch took place.
I would prefer the way as it's done in AdaptiveFetchSchedule:
only set modifiedTime if it's unset (=0).
After a closer look at 1.x regarding this point I can confirm:
- with DefaultFetchSchedule the modifiedTime is never set / always 0
> I don't know if this is correct (probably not) but at least 304 seems to be
> handled. In particular, in the protocol-file (File.getProtocolOutput) I've
> added a special case for 304:
>
> if (code == 304) { // got a not modified response
> return new ProtocolOutput(response.toContent(),
> ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
> }
>
> I suppose this is NOT the right solution :-)
At a first glance, it's not bad. Protocol-file needs obviously a revision:
the 304 is set properly in FileResponse.java but in File.java it is treated as
redirect:
else if (code >= 300 && code < 400) { // handle redirect
So, thanks. Good catch!
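The ordering problem can be sketched as follows. This is a hypothetical status dispatch, not the actual File.java logic; the enum and method names are invented for illustration. The point is only that the 304 check has to come before the generic 3xx range check.

```java
/** Hypothetical status dispatch; not the actual File.java logic. */
class FileStatusSketch {

  enum Outcome { SUCCESS, NOT_MODIFIED, REDIRECT, NOT_FOUND, OTHER }

  static Outcome classify(int code) {
    if (code == 200) return Outcome.SUCCESS;
    // 304 lies inside the 300..399 range, so it must be tested first;
    // otherwise the range check below would report it as a redirect.
    if (code == 304) return Outcome.NOT_MODIFIED;
    if (code >= 300 && code < 400) return Outcome.REDIRECT;
    if (code == 404) return Outcome.NOT_FOUND;
    return Outcome.OTHER;
  }

  public static void main(String[] args) {
    System.out.println(classify(304));
    System.out.println(classify(301));
  }
}
```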
Would be great if you could open Jira issues for
- setting modified time in DefaultSchedule
- 304 handling in protocol-file
If you can provide patches, even better. Thanks!
About your problem with removal / re-adding files:
- a file system is crawled as if linked web pages:
a directory is just an HTML page with all files and sub-directories
as links.
- re-crawling does not necessarily remove deleted files from the index.
The URL/path to a deleted file is kept forever
until it's removed explicitly.
- You have to force a re-fetch of the URL/file to be sure it is still
present or has been removed. If 304 handling is working, this should
be quite cheap for file system crawls because a re-parse is not necessary.
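To illustrate the first point, here is a minimal sketch of turning a directory into a page of links. It is not the protocol-file implementation: the class, the method, and the markup are hypothetical, the directory URL is assumed to end with "/", and the names are not escaped.

```java
import java.util.List;

/** Hypothetical illustration of "a directory is just an HTML page of links". */
class DirectoryPageSketch {

  /** Builds one link per entry; dirUrl is assumed to end with "/". */
  static String toHtml(String dirUrl, List<String> entries) {
    StringBuilder sb = new StringBuilder("<html><body>\n");
    for (String name : entries) {
      // No escaping is done here; a real listing would have to encode names.
      sb.append("<a href=\"").append(dirUrl).append(name).append("\">")
        .append(name).append("</a>\n");
    }
    return sb.append("</body></html>\n").toString();
  }

  public static void main(String[] args) {
    System.out.print(toHtml("file://localhost/tmp/files/", List.of("aa.txt", "bbbbb.txt")));
  }
}
```

A file that has been deleted simply disappears from such a page, which is why its own URL has to be re-fetched to learn that it is gone.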
Ciao,
Sebastian
Re: setting modifiedTime in DefaultFetchSchedule
Posted by Cesare Zavattari <ce...@ctrl-z-bg.org>.
Ciao,
in the meantime I've done some more tests using Nutch 2.1 with
DefaultFetchSchedule where I've put:
modifiedTime = fetchTime;
instead of:
if (modifiedTime <= 0) modifiedTime = fetchTime;
I don't know if this is correct (probably not) but at least 304 seems to be
handled. In particular, in the protocol-file (File.getProtocolOutput) I've
added a special case for 304:
  if (code == 304) { // got a not modified response
    return new ProtocolOutput(response.toContent(),
        ProtocolStatusUtils.makeStatus(ProtocolStatusCodes.NOTMODIFIED));
  }
I suppose this is NOT the right solution :-)
Anyway, this is another problem I have with protocol-file. I have the seed:
file://localhost/tmp/files/
This directory contains a couple of files, aa.txt and bbbbb.txt.
If a file is deleted, the directory is recrawled, and the file is then re-added, it is ignored. I mean:
./nutch crawl urls -depth 2 -topN 5
rm /tmp/files/bbbbb.txt
./nutch crawl urls -depth 2 -topN 5
echo "saaaszzz" >/tmp/files/bbbbb.txt
./nutch crawl urls -depth 2 -topN 5
...
Skipping file://localhost/tmp/files/bbbbb.txt; different batch id (null)
...
and the dump sticks with
...
baseUrl: file://localhost/tmp/files/bbbbb.txt
status: 1 (status_unfetched)
...
protocolStatus: EXCEPTION, args=[org.apache.nutch.protocol.file.FileError:
File Error: 404]
what am I doing wrong?
Thanks a lot!
--
Cesare
Re: setting modifiedTime in DefaultFetchSchedule
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Cesare,
hmhh... Good catch!
The modifiedTime is also set in CrawlDbReducer.reduce
right after FetchSchedule.setFetchSchedule is called and the signature
hasn't changed compared to the previous fetch, cf. NUTCH-1341.
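A minimal sketch of the general idea of comparing signatures, not the CrawlDbReducer code. The helper names are hypothetical, and both the MD5 digest and the rule "keep the old modifiedTime when the signature is unchanged, otherwise take the fetch time" are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical illustration; not the CrawlDbReducer logic. */
class SignatureSketch {

  /** Content digest used as a stand-in for a page signature. */
  static byte[] signature(String content) {
    try {
      return MessageDigest.getInstance("MD5")
          .digest(content.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  /** Assumed rule: an unchanged signature keeps the previously known modifiedTime. */
  static long nextModifiedTime(byte[] oldSig, byte[] newSig,
                               long oldModifiedTime, long fetchTime) {
    boolean unchanged = oldSig != null && Arrays.equals(oldSig, newSig);
    return (unchanged && oldModifiedTime > 0) ? oldModifiedTime : fetchTime;
  }

  public static void main(String[] args) {
    byte[] before = signature("AAA");
    System.out.println(nextModifiedTime(before, signature("AAA"), 100L, 200L));
    System.out.println(nextModifiedTime(before, signature("BBB"), 100L, 200L));
  }
}
```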
At a first glance, it looks like the modifiedTime is indeed never set
with DefaultFetchSchedule.
I'll have a more detailed look at this and come back soon.
Thanks,
Sebastian