Posted to agent@nutch.apache.org by Viksit Gaur <vi...@gmail.com> on 2008/01/07 04:52:58 UTC

Crawling techniques?

Hello all,

I was trying to figure out the best method to crawl a site without 
picking up any of the irrelevant bits such as Flash widgets, JavaScript, 
links to ad networks, and so on. The objective is to index all of the 
relevant textual data. (This could be extrapolated to other forms of 
data, of course.)

My main question is: should this sort of elimination be done during the 
crawl, which would mean modifying the crawler, or should everything be 
crawled and indexed first, with a text-parsing system applying some 
logic afterwards to extract the relevant bits?

Using the crawl-urlfilter seems like the first option, but I believe it 
has its drawbacks. Firstly, it needs regexps that match URLs, and these 
would have to be handwritten (even automated scripts would need human 
intervention at some point). For instance, the scripts or images may be 
hosted at scripts.foo.com or at foo.com/bar/foobar/scripts - the two 
entries are different enough to make automation tough. And any such 
customizations would need to be tailor-made for each site crawled, which 
is a tall order. Is there a way to extend the crawler itself to do this? 
I remember seeing something in the list archives about extending the 
crawler, but I can't find it anymore. Any pointers?
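
To make the first option concrete, here is roughly the kind of thing I 
mean - a sketch of entries for conf/crawl-urlfilter.txt, where foo.com, 
the scripts host and path, and the ad network domain are just made-up 
placeholders for whatever a given site actually uses:

# skip image, stylesheet, script and Flash resources by extension
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|js|swf)$

# skip anything served from a dedicated scripts host or scripts path
-^http://scripts\.foo\.com/
-/scripts/

# skip a known ad network (a list like this has to be maintained by hand)
-^http://([a-z0-9]*\.)*doubleclick\.net/

# accept everything else on the site being crawled
+^http://([a-z0-9]*\.)*foo\.com/

# skip everything else
-.

Every one of those site-specific lines has to be written and kept up to 
date by hand, which is exactly the maintenance burden I'd like to avoid.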

The second option would be to write some sort of custom class for the 
indexer (a variant of the plugin example on the wiki, I guess).
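
Roughly what I have in mind is something like the sketch below, written 
from memory against the 0.9-era IndexingFilter extension point - the 
class name, the crude line-length heuristic, and the exact signatures 
are my own guesses and would need to be checked against the actual 
interface and the wiki example:

// Hypothetical indexing filter that strips obvious boilerplate from the
// parsed text before the "content" field is written to the index.
// Signatures are from memory (0.9-era API) and may need adjusting.
package org.example.nutch;  // made-up package name

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class BoilerplateStripFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // Crude heuristic: keep only lines long enough to look like prose,
    // dropping short navigation/ad/script fragments.
    StringBuilder kept = new StringBuilder();
    for (String line : parse.getText().split("\n")) {
      if (line.trim().length() > 40) {
        kept.append(line).append('\n');
      }
    }
    // Replace the default "content" field with the filtered text.
    doc.removeFields("content");
    doc.add(new Field("content", kept.toString(),
                      Field.Store.NO, Field.Index.TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}

Plus the plugin.xml and build.xml wiring the wiki example describes, and 
adding the plugin id to plugin.includes in nutch-site.xml.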

Either way, I'm not sure which is the better method. Any ideas would be 
appreciated!

Cheers,
Viksit

PS: Cross-posted to nutch-user and nutch-agent, since I wasn't sure 
which one was the better option.

Re: Crawling techniques?

Posted by Viksit Gaur <vi...@gmail.com>.
Hey Martin,

Thanks for the link - that's pretty close to what I was looking for; I'll
give it a shot! The discussion that led to the thread you pointed out was
even better!

Cheers,
Viksit

On Jan 7, 2008 3:28 AM, Martin Kuen <ma...@gmail.com> wrote:

> Hi Viksit,
>
> maybe you are looking for this thread:
> http://www.nabble.com/Re%3A-The-ranking-is-wrong-tf4360656.html#a12436465
>
> Cheers,
>
> Martin
>
>
> PS: nutch-user is the correct option. nutch-agent is primarily for
> site owners who want to report misbehaving Nutch bots.

Re: Crawling techniques?

Posted by Martin Kuen <ma...@gmail.com>.
Hi Viksit,

maybe you are looking for this thread:
http://www.nabble.com/Re%3A-The-ranking-is-wrong-tf4360656.html#a12436465

Cheers,

Martin


PS: nutch-user is the correct option. nutch-agent is primarily for
site owners who want to report misbehaving Nutch bots.
