Posted to user@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2012/08/01 18:06:18 UTC

Re: keyword crawling

On Jul 19, 2012, at 2:46am, albsmith wrote:

> I don't think it is possible to instruct a search engine to focus mainly on
> a particular keyword. Search engines commonly follow the robots exclusion
> protocol (REP): robots.txt is a text file that webmasters create to
> instruct robots (typically search engine robots) on how to crawl & index
> pages on their website. From this I gather that it only controls which
> pages get indexed, not a particular keyword. If you have any ideas about
> this, kindly share them with me.
> http://www.prodigyapex.com/

I _think_ what you're asking about is how to do a focused crawl, where you want the crawler to (mostly) fetch pages that contain target keywords.

If so, then see http://www.scaleunlimited.com/about/focused-crawler/ for some ideas on how to do this.

It's possible to do the same thing in Nutch, using plug-in page scorers, but I haven't looked at that code in a while.
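
To make the idea concrete, here is a minimal sketch of a best-first (focused) crawl loop. It is not Nutch's scoring plug-in API and not the Scale Unlimited implementation; the names keyword_score, focused_crawl and TOY_WEB are made up for illustration, the score is a plain keyword-frequency ratio, and "fetching" is simulated with an in-memory dictionary so that the example runs without network access. Outlinks simply inherit the score of the page that linked to them, which is one common heuristic among several.

```python
import heapq
import re


def keyword_score(text, keywords):
    """Return the fraction of words in text that are target keywords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    targets = {k.lower() for k in keywords}
    hits = sum(1 for w in words if w in targets)
    return hits / len(words)


def focused_crawl(seeds, fetch, keywords, max_pages=10):
    """Fetch pages best-first, preferring links found on high-scoring pages.

    fetch(url) must return (text, outlinks), or None if the page is missing.
    Returns a list of (url, score) tuples in the order pages were fetched.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        text, outlinks = page
        score = keyword_score(text, keywords)
        fetched.append((url, score))
        # Assumption: a link is as promising as the page it was found on.
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return fetched


# Hypothetical five-page "web": url -> (page text, outlinks).
TOY_WEB = {
    "a": ("hadoop crawler news", ["b", "c"]),
    "b": ("cooking recipes and more recipes", ["d"]),
    "c": ("nutch crawler crawler tips", ["e"]),
    "d": ("more cooking", []),
    "e": ("crawler scoring plugins", []),
}

result = focused_crawl(["a"], TOY_WEB.get, ["crawler"], max_pages=4)
```

With a budget of four pages, the crawl spends its last fetch on "e" (linked from the keyword-rich page "c") and never fetches "d", whose only inlink comes from an off-topic page. A real crawler would also need politeness delays, robots.txt handling and a persistent URL database.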

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

Re: keyword crawling

Posted by albsmith <al...@gmail.com>.

The URL State database, page score, link score, and Fetched Pages database
are good. But it doesn't say how we can instruct the search engine to focus
on a particular keyword alone in our page.
http://www.reputationrhino.com/ Reputation Management  



--
View this message in context: http://lucene.472066.n3.nabble.com/keyword-crawling-tp616806p3999323.html
Sent from the Nutch - User mailing list archive at Nabble.com.