Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/02/18 10:50:52 UTC

Content-based Crawl vs Link-based Crawl?

Nutch crawls web pages from link to link by extracting outlinks from each
page, so we could decide which outlinks to follow based on what they point to.
For example, we could check whether the link text contains keywords from a
dictionary to decide whether or not to crawl it. Moreover, we could check
whether the content of a page fetched through an outlink contains keywords
from the dictionary.
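For instance, the keyword test itself could be as simple as the following
sketch (plain Java; the class and method names are made up for this example
and are not Nutch APIs — the same test could be applied to anchor text or to
the text of a fetched page):

import java.util.Set;

// Illustrative only: decide whether a link is worth following by checking
// its anchor text against a keyword dictionary.
public class KeywordLinkTest {

    private final Set<String> dictionary;

    public KeywordLinkTest(Set<String> keywords) {
        this.dictionary = keywords;
    }

    // True if the text contains at least one dictionary keyword.
    public boolean shouldCrawl(String anchorText) {
        if (anchorText == null) {
            return false;
        }
        for (String keyword : dictionary) {
            if (anchorText.indexOf(keyword) >= 0) {
                return true;
            }
        }
        return false;
    }
}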

I think this can be done with a plug-in like a URL filter, but it seems it
would cause a performance problem for the crawling process. So I'd like to
hear your opinions. Is it possible or meaningful to crawl not just by links
but by contents or terms?

Re: Content-based Crawl vs Link-based Crawl?

Posted by Elwin <ma...@gmail.com>.
Hi Howie,

  Thank you for the valuable suggestion. I will consider it carefully.
  As I'm going to parse non-English (actually Chinese) pages, I think
regular expressions may not be very useful to me. I've decided to integrate
some simple data mining techniques to achieve this.
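To make "simple data mining" concrete, a rough sketch of the kind of thing I
have in mind (the class name, method names and threshold are invented; plain
substring counting avoids both regexes and word segmentation for Chinese
text):

import java.util.Set;

// Sketch: score a page by counting dictionary keyword occurrences in its
// text and keep it only if the count reaches a threshold.
public class KeywordScorer {

    private final Set<String> dictionary;
    private final int threshold;

    public KeywordScorer(Set<String> dictionary, int threshold) {
        this.dictionary = dictionary;
        this.threshold = threshold;
    }

    public boolean isRelevant(String pageText) {
        int hits = 0;
        for (String keyword : dictionary) {
            int from = 0;
            int idx;
            while ((idx = pageText.indexOf(keyword, from)) >= 0) {
                hits++;
                from = idx + keyword.length();
            }
            if (hits >= threshold) {
                return true;   // stop early once the page is clearly relevant
            }
        }
        return hits >= threshold;
    }
}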


2006/2/19, Howie Wang <ho...@hotmail.com>:
>
> I think doing this sort of thing works out very well for niche search
> engines. Analyzing the contents of the page takes up some time, but it's
> just milliseconds per page. If you contrast this with actually fetching a
> page that you don't want (several seconds * num pages), you can see that
> the time savings are very much in your favor.
>
> I'm not sure if you'd create a URLFilter since I don't think that gives
> you easy access to the page contents. You could do it in an
> HtmlParseFilter. Just copy the parse-html plugin, look for the bit of code
> where the Outlinks array is set. Then filter that Outlinks array as you
> see fit.
>
> One thing to be careful about is using regular expressions in Java to
> analyze the page contents. I've had lots of problems with hanging using
> java.util.regex. I get this with perfectly legal regexes, and it's only on
> certain pages that I get problems. It's not as big a problem for me since
> most of my regex stuff is during the indexing phase, and it's easy to
> re-index. If it happens during the fetch, it's a bigger pain, since you
> have to recover from an aborted fetch. So you might want to do lots of
> small crawls instead of big full crawls.
>
> Howie
>
>
> > I think this can be done with a plug-in like a URL filter, but it seems
> > it would cause a performance problem for the crawling process. So I'd
> > like to hear your opinions. Is it possible or meaningful to crawl not
> > just by links but by contents or terms?
>


--
《盖世豪侠》 drew rave reviews and kept 无线's ratings riding high, yet even
so 无线 still would not give him major roles. 周星驰 was never going to stay
a small fish in a pond: once his comedic talent had shown itself, he would
not settle for being left out in the cold, so he moved into the film
industry and showed his flair on the big screen. 无线 had found a
thoroughbred only to lose it, and was naturally left with nothing but regret.

RE: Content-based Crawl vs Link-based Crawl?

Posted by Howie Wang <ho...@hotmail.com>.
I think doing this sort of thing works out very well for niche search
engines. Analyzing the contents of the page takes up some time, but it's
just milliseconds per page. If you contrast this with actually fetching a
page that you don't want (several seconds * num pages), you can see that
the time savings are very much in your favor.

I'm not sure if you'd create a URLFilter since I don't think that gives you
easy access to the page contents. You could do it in an HtmlParseFilter.
Just copy the parse-html plugin, look for the bit of code where the
Outlinks array is set. Then filter that Outlinks array as you see fit.
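As a rough sketch of that filtering step (getAnchor comes from the Nutch
org.apache.nutch.parse.Outlink class, but the class below and how you'd wire
it into the copied plugin are assumptions for illustration, not a drop-in
patch):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.nutch.parse.Outlink;

// Illustrative only: drop outlinks whose anchor text contains none of the
// dictionary keywords, at the point where parse-html has just built its
// Outlink[] array. Plugin wiring (plugin.xml, extension point registration)
// is omitted.
public class OutlinkKeywordFilter {

    private final Set<String> dictionary;

    public OutlinkKeywordFilter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    public Outlink[] filter(Outlink[] outlinks) {
        if (outlinks == null) {
            return new Outlink[0];
        }
        List<Outlink> kept = new ArrayList<Outlink>();
        for (Outlink link : outlinks) {
            String anchor = link.getAnchor();
            if (anchor != null && containsKeyword(anchor)) {
                kept.add(link);
            }
        }
        return kept.toArray(new Outlink[kept.size()]);
    }

    private boolean containsKeyword(String text) {
        for (String keyword : dictionary) {
            if (text.indexOf(keyword) >= 0) {
                return true;
            }
        }
        return false;
    }
}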

One thing to be careful about is using regular expressions in Java to
analyze the page contents. I've had lots of problems with hanging using
java.util.regex. I get this with perfectly legal regexes, and it's only on
certain pages that I get problems. It's not as big a problem for me since
most of my regex stuff is during the indexing phase, and it's easy to
re-index. If it happens during the fetch, it's a bigger pain, since you
have to recover from an aborted fetch. So you might want to do lots of
small crawls instead of big full crawls.
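If you do need the regex, one defensive option (just a sketch, not anything
Nutch ships with) is to run the match in a worker thread and give up after a
timeout, so one pathological page can't hang the whole run:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Sketch: bound the time a java.util.regex match may take; a match that
// doesn't finish in time is treated as "no match". Note that the runaway
// matcher thread may keep running in the background, since cancel(true) is
// only a best effort and Matcher never checks for interruption.
public class TimedRegexMatcher {

    private final ExecutorService pool = Executors.newCachedThreadPool();

    public boolean find(final Pattern pattern, final String text, long timeoutMs) {
        Future<Boolean> result = pool.submit(new Callable<Boolean>() {
            public Boolean call() {
                return pattern.matcher(text).find();
            }
        });
        try {
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);
            return false;
        } catch (Exception e) {
            return false;
        }
    }
}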

Howie


>I think this can be done with a plug-in like a URL filter, but it seems
>it would cause a performance problem for the crawling process. So I'd
>like to hear your opinions. Is it possible or meaningful to crawl not
>just by links but by contents or terms?