Posted to droids-dev@incubator.apache.org by Mingfai <mi...@gmail.com> on 2009/04/04 12:18:46 UTC

Re-crawling scenario and HTTP Headers

hi,

I think I have a better picture of Droids now and have learnt things beyond
the Simple Runtime, including the more advanced GaussianRandomDelayTimer and
SimpleTaskQueueWithHistory. It seems to me that SimpleTaskQueue is not useful
for most web crawling scenarios, since pages usually link to each other,
whereas SimpleTaskQueueWithHistory is very useful.

AFAIK, there is no mechanism that caters for the re-crawling scenario. I wonder
if anyone has ideas on:

   - How to determine whether a page/URL has changed?
      - Follow the cache and expiry dates in the HTTP headers
      - Size, plus or minus 5-15%
      - A text change detection algorithm, such as Myers' diff algorithm (I
      only know the name :-) and I'm not sure it is really meaningful to do
      detection this way)
      http://code.google.com/p/google-diff-match-patch/

   - When (at which stage) should the detection logic run in Droids?
      - We could have a Task Validator that checks the fetch history and
      maybe rejects the task if the expiry time has not passed yet. This is
      the first level of change detection.
      - At parse time, when the content is first accessed, one could
      implement a parser that does change detection.

In both of the above cases, there is a problem: the ContentEntity doesn't
contain the full set of HTTP headers (or at least those that are relevant
to change detection). Should all HTTP headers be stored in the
ContentEntity?

Regards,
mingfai

Re: Re-crawling scenario and HTTP Headers

Posted by Mingfai <mi...@gmail.com>.

> > In both of the above cases, there is a problem: the ContentEntity doesn't
> > contain the full set of HTTP headers (or at least those that are relevant
> > to change detection). Should all HTTP headers be stored in the
> > ContentEntity?
>
> Yes, that makes sense. However, we need a hybrid implementation, since we
> have FileContentEntity and HttpContentEntity. I mean, ALL headers only
> make sense for HttpContentEntity, right?
>


How about assuming that parsing and extraction always run in the same
thread? Then we can keep the HttpClient request and response (and preferably
the HttpEntity as well) in a ThreadLocal variable. This way we don't need
to change the interface. The variable must be cleared in a "finally" block,
for sure.
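
A minimal sketch of the ThreadLocal idea, assuming fetch, parse and
extraction all run on one worker thread. FetchContext and WorkerSketch are
hypothetical names, not Droids classes, and the headers are held in a plain
Map instead of the HttpClient types to keep the example self-contained:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Droids: holds the response headers of the
// fetch that the current worker thread is processing.
final class FetchContext {
    private static final ThreadLocal<Map<String, String>> HEADERS = new ThreadLocal<>();

    private FetchContext() {}

    static void set(Map<String, String> headers) {
        HEADERS.set(Collections.unmodifiableMap(new HashMap<>(headers)));
    }

    // Returns the headers bound to the current thread, or an empty map if none.
    static Map<String, String> get() {
        Map<String, String> h = HEADERS.get();
        return h == null ? Collections.<String, String>emptyMap() : h;
    }

    static void clear() {
        HEADERS.remove();
    }
}

// Hypothetical worker showing the set / use / clear-in-finally pattern.
final class WorkerSketch {
    static String process(Map<String, String> responseHeaders) {
        FetchContext.set(responseHeaders);
        try {
            // A parser or handler further down the call stack can read the
            // headers without any change to the ContentEntity interface.
            return FetchContext.get().get("Last-Modified");
        } finally {
            // Worker threads are usually pooled, so always remove the value.
            FetchContext.clear();
        }
    }
}
```

The catch is that this only works as long as nothing hands the parsing off
to another thread.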

regards,
mingfai

Re: Re-crawling scenario and HTTP Headers

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.

On Sat, 2009-04-04 at 20:18 +0800, Mingfai wrote:
> hi,
> 
> I think I have a better picture of Droids now and have learnt things beyond
> the Simple Runtime, including the more advanced GaussianRandomDelayTimer and
> SimpleTaskQueueWithHistory. It seems to me that SimpleTaskQueue is not useful
> for most web crawling scenarios, since pages usually link to each other,
> whereas SimpleTaskQueueWithHistory is very useful.

Yeah, I agree:
http://markmail.org/thread/5t2dyozc2d3l2no2

> 
> AFAIK, there is no mechanism that caters for the re-crawling scenario. I wonder
> if anyone has ideas on:
> 
>    - How to determine whether a page/URL has changed?
>       - Follow the cache and expiry dates in the HTTP headers

That should be the starting point. Requesting the header is normally
fast and reliable. 
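
A rough sketch of such a header-only check, using java.net.HttpURLConnection
from the JDK. HeaderProbe is a hypothetical name, not a Droids class, and
the comparison rule (ETag first, then Last-Modified) is just one possible
assumption:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not part of Droids.
final class HeaderProbe {
    final String etag;        // null if the server sent no ETag
    final long lastModified;  // epoch millis, 0 if the server sent no Last-Modified

    HeaderProbe(String etag, long lastModified) {
        this.etag = etag;
        this.lastModified = lastModified;
    }

    // Issues a HEAD request, so only the headers are transferred.
    static HeaderProbe head(String url) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            con.setRequestMethod("HEAD");
            return new HeaderProbe(con.getHeaderField("ETag"), con.getLastModified());
        } finally {
            con.disconnect();
        }
    }

    // True only if the headers give positive evidence that nothing changed.
    boolean sameAs(HeaderProbe previous) {
        if (etag != null && previous.etag != null) {
            return etag.equals(previous.etag);
        }
        return lastModified != 0 && lastModified == previous.lastModified;
    }
}
```

If neither header is present the check reports "changed", so the page is
simply fetched again.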

>       - Size, plus or minus 5-15%

I'm not sure about that. It seems pretty hacky, since the text could have
changed entirely while the overall size stays the same.
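
To illustrate the objection, here is a small sketch of such a size check.
SizeHeuristic and looksUnchanged are hypothetical names, and the tolerance
is passed as a fraction (0.05 for 5%):

```java
// Hypothetical helper, not part of Droids.
final class SizeHeuristic {
    private SizeHeuristic() {}

    // True if newSize is within +/- tolerance of oldSize,
    // e.g. tolerance = 0.05 accepts a deviation of up to 5%.
    static boolean looksUnchanged(long oldSize, long newSize, double tolerance) {
        return Math.abs(newSize - oldSize) <= oldSize * tolerance;
    }
}
```

Two bodies like "price: 10 EUR" and "price: 99 EUR" have the same length,
so looksUnchanged reports them as unchanged although the content differs.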

>       - A text change detection algorithm, such as Myers' diff algorithm (I
>       only know the name :-) and I'm not sure it is really meaningful to do
>       detection this way)
>       http://code.google.com/p/google-diff-match-patch/

The problem with this is that you actually have to request the response
body to compare it with the page on your system. That only makes sense
when invoking the handler stage consumes a lot of time/resources.

E.g. the helloCrawler requests a page and saves it to disk. If we now
compare the HTTP response body with the saved page, it may be faster to
just save it again.

> 
>    - When (at which stage) should the detection logic run in Droids?
>       - We could have a Task Validator that checks the fetch history and
>       maybe rejects the task if the expiry time has not passed yet. This is
>       the first level of change detection.

Agree, there should be an expires/changed validator like you describe.
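
Something along these lines, as a sketch only. ExpiryValidator is a
hypothetical class that does not implement the actual Droids validator
interface, and it assumes the expiry time of each URL was recorded (as
epoch millis) when the URL was last fetched:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical class, not wired into the Droids API.
final class ExpiryValidator {
    // URL -> expiry time (epoch millis) taken from the last fetch.
    private final Map<String, Long> expiresAt = new ConcurrentHashMap<>();

    // Record the expiry time reported by the server for this URL.
    void recordFetch(String url, long expiresAtMillis) {
        expiresAt.put(url, expiresAtMillis);
    }

    // Accept the task if the URL was never fetched or its expiry time has passed.
    boolean validate(String url, long nowMillis) {
        Long expiry = expiresAt.get(url);
        return expiry == null || nowMillis >= expiry;
    }
}
```

Passing the current time as a parameter keeps the check easy to test.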

>       - At parse time, when the content is first accessed, one could
>       implement a parser that does change detection.

That could serve as second-level change detection but, as said above,
in some cases the benefit does not justify the extra work.

> 
> In both of the above cases, there is a problem: the ContentEntity doesn't
> contain the full set of HTTP headers (or at least those that are relevant
> to change detection). Should all HTTP headers be stored in the
> ContentEntity?

Yes, that makes sense. However, we need a hybrid implementation, since we
have FileContentEntity and HttpContentEntity. I mean, ALL headers only
make sense for HttpContentEntity, right?
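
One way the hybrid could look, as a sketch under assumptions: the common
type exposes headers with an empty default, and only the HTTP variant
overrides it. EntitySketch, FileEntitySketch and HttpEntitySketch are
hypothetical stand-ins, not the real ContentEntity classes:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the common entity type.
interface EntitySketch {
    // Entities without HTTP semantics simply report no headers.
    default Map<String, String> headers() {
        return Collections.emptyMap();
    }
}

// Hypothetical file-backed entity: inherits the empty default.
final class FileEntitySketch implements EntitySketch {
}

// Hypothetical HTTP-backed entity: keeps a read-only copy of ALL response headers.
final class HttpEntitySketch implements EntitySketch {
    private final Map<String, String> headers;

    HttpEntitySketch(Map<String, String> headers) {
        this.headers = Collections.unmodifiableMap(new HashMap<>(headers));
    }

    @Override
    public Map<String, String> headers() {
        return headers;
    }
}
```

Callers such as a change-detecting parser can then ask any entity for its
headers without checking the concrete type.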

salu2

> 
> Regards,
> mingfai
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)