Posted to user@nutch.apache.org by Daniel Fai <em...@gmail.com> on 2008/08/20 20:30:58 UTC
Newbie: How to exclude domains from crawling websites?
Hello,
I successfully got Nutch 0.9 running and I am really satisfied with it.
But now I cannot find the specific information I need (I also googled
and searched this mail archive, but no answer really satisfied me).
First, let me explain what I have done:
I crawled around 400 websites that cover the
specific topic I am interested in.
I have a text file "urls" that lists all 400 sites:
http://www.domain1.com
http://www.domain2.com
http://www.domain3.com
http://www.domain4.com
and so on....
The crawl result is as expected, but it also picked up links to other domains
that I don't want in my search results.
For example, one domain includes a link to www.paypal.com, and I don't want
that domain to be part of my Nutch results.
http://www.domain2.com has a link to www.paypal.com. Domain2 should be
indexed, but the linked www.paypal.com should not.
How and where do I exclude this domain so that it is neither fetched nor indexed?
There are several more domains that I don't want indexed.
The only possibility I know of is to enter each domain below this line:
# accept hosts in MY.DOMAIN.NAME
But do I have to add all 400 domains here?
Is there no way to exclude or skip named domains?
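To make clear what behaviour I am after, here is a rough Python sketch. It is
not Nutch code. It assumes that filter rules are checked top to bottom and that
the first matching +/- rule decides, which I have not verified for Nutch 0.9.
The rule list and the accepts() function are made up for illustration only.

```python
import re

# Hypothetical rule list in a "+regex / -regex" style.
# Assumption: rules are tried top to bottom and the first match decides.
RULES = [
    ("-", r"^https?://([a-z0-9-]+\.)*paypal\.com(/|$)"),  # exclude this domain
    ("+", r"."),                                          # accept everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False


if __name__ == "__main__":
    for url in ("http://www.domain2.com/", "http://www.paypal.com"):
        print(url, accepts(url))
```

With rules like these I would only have to list the few domains I want to keep
out, instead of listing all 400 domains I want to keep in.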
I would be happy to get replies from you experts. An example would be
great.
Daniel