Posted to user@nutch.apache.org by Christian Weiske <ch...@netresearch.de> on 2011/08/03 10:02:10 UTC

Fetching ever-changing URLs

Hi,


I'd like to crawl pages of chat logs that change whenever someone sends
a message in our chat rooms, which happens every couple of seconds.
The HTML log pages are updated instantly by the Prosody Jabber server
and thus always have current timestamps.

Nutch seems to reject them now because they are too new:

> -shouldFetch rejected
>  'http://conference.nr:5290/muc_log/',
>  fetchTime=1314950217363, curTime=1312358255779


I have two questions:

1. Which timestamp format is that? They don't seem to be unix
timestamps, because 
> $ php -r 'echo date("Y-m-d H:i:s", 1312358255779);'
> 43556-12-23 16:56:19
is the wrong year :)

2. What can I do to not get those URLs rejected? I already tried to set
   > db.fetch.schedule.adaptive.sync_delta
   to false and 
   > db.fetch.schedule.adaptive.inc_rate
   > db.fetch.schedule.adaptive.dec_rate
   to 0, but that does not help.

-- 
Viele Grüße
Christian Weiske

Re: Fetching ever-changing URLs

Posted by Dinçer Kavraal <dk...@gmail.com>.
Hi again,

Maybe you could try getting differential logs from the chat server, if
possible. If you administer the chat server, you could rotate the logs every
10 minutes, for instance, and then add each rotated file as if it were a
separate web page.

Otherwise, check the db.fetch.interval.* values. The real solution is probably
to write custom classes and configure them as db.fetch.schedule.class and
db.signature.class. After all, you need to decide how each page should be
scheduled, and to detect which pages have actually been modified.
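
On the first question: the two numbers look like milliseconds since the Unix
epoch (what Java's System.currentTimeMillis() returns) rather than seconds,
which would explain the wrong year from PHP's date(). A minimal sketch in
Python, assuming that interpretation, converts both values from the log line
and shows how far apart they are. The helper name from_millis is only
illustrative.

```python
from datetime import datetime, timezone

# Values copied from the "-shouldFetch rejected" log line in the question.
FETCH_TIME_MS = 1314950217363
CUR_TIME_MS = 1312358255779


def from_millis(ms):
    """Interpret ms as milliseconds since the Unix epoch (an assumption)."""
    # Integer division drops the millisecond part and avoids float rounding.
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


cur_time = from_millis(CUR_TIME_MS)
fetch_time = from_millis(FETCH_TIME_MS)

# Distance between the scheduled fetch time and "now", in days.
gap_days = (FETCH_TIME_MS - CUR_TIME_MS) / 1000 / 86400

print("curTime  :", cur_time.strftime("%Y-%m-%d %H:%M:%S"), "UTC")
print("fetchTime:", fetch_time.strftime("%Y-%m-%d %H:%M:%S"), "UTC")
print("gap in days:", round(gap_days, 2))
```

Under that assumption the gap comes out at roughly 30 days, which would match
a db.fetch.interval.default of 2592000 seconds (the shipped default, as far as
I know). If so, the URL is rejected because its next fetch is scheduled about
a month ahead, not because the page itself is too new.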

Best
