Posted to user@nutch.apache.org by Daniel Fai <em...@gmail.com> on 2008/08/20 20:30:58 UTC

Newbie: How to exclude domains from crawling websites?

Hello,
I successfully got Nutch 0.9 running and I am really satisfied with it.
But now I am unable to find the specific information I need (I also googled
and searched this mail archive, but no answer really satisfied me).

First, let me explain what I have done:
I crawled around 400 websites which contain the
specific content/topic I am searching for.
I have a text file "urls" which lists all 400 sites:

http://www.domain1.com
http://www.domain2.com
http://www.domain3.com
http://www.domain4.com
and so on....

The crawl result is as expected, but it also followed links to other domains
which I don't want in my search results.
For example, one domain includes a link to www.paypal.com, and I don't want
that domain to be part of my Nutch results.

http://www.domain2.com has a link to www.paypal.com. Domain2 should be
indexed, but the linked www.paypal.com should not.

How and where do I exclude this domain so that it is neither fetched nor indexed?
I have some more domains which I don't want indexed.

The only possibility I know of is to enter each domain below this line:
# accept hosts in MY.DOMAIN.NAME

But do I have to add all 400 domains here?
Is there no setting to exclude or skip named domains?
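
To make the question concrete, here is a rough Python sketch of what listing
every domain would amount to. It rests on assumptions not confirmed in this
post: that the line above comes from conf/crawl-urlfilter.txt, that each rule
there is a "+" or "-" followed by a regular expression, that rules are tried
from top to bottom with the first match deciding, and that the file ends with
a catch-all "-." line. The helper names (host_of, accept_rule, exclude_rule,
build_rules) are made up for illustration and are not part of Nutch.

```python
import re
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase host of a seed URL, without a leading 'www.'."""
    host = urlparse(url.strip()).hostname or ""
    return host[4:] if host.startswith("www.") else host


def accept_rule(host):
    """Build one '+' line in the style of the MY.DOMAIN.NAME template."""
    return "+^http://([a-z0-9]*\\.)*" + re.escape(host) + "/"


def exclude_rule(host):
    """Build one '-' line for a host that should never be fetched."""
    return "-^http://([a-z0-9]*\\.)*" + re.escape(host) + "/"


def build_rules(seed_urls, excluded_hosts=()):
    """Exclusions first, then one accept line per distinct seed host,
    then the assumed catch-all that rejects everything else."""
    rules = [exclude_rule(h) for h in excluded_hosts]
    seen = set()
    for url in seed_urls:
        host = host_of(url)
        if host and host not in seen:
            seen.add(host)
            rules.append(accept_rule(host))
    rules.append("-.")
    return rules


if __name__ == "__main__":
    # Seed entries as they would appear in the "urls" file.
    seeds = ["http://www.domain1.com", "http://www.domain2.com"]
    for line in build_rules(seeds, excluded_hosts=["paypal.com"]):
        print(line)
```

If those assumptions hold, the printed lines could be pasted below the
"accept hosts" comment. With a final catch-all in place, hosts that are not
listed would be rejected anyway, so the explicit exclusion would only matter
if a broader accept rule were also present.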

I would be happy to get replies from you experts. An example would be
great.

Daniel