Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/01/31 08:08:29 UTC

Negative keywords and a few minor restrictions

Hi,

 I am looking for ways to stop Nutch from crawling pages that contain
certain negative keywords, or from showing them in the search results.
What is the best way of doing this? Should I be using any plugins?

 Apart from this, I am also looking for ways to ignore or prioritize
certain URL patterns that Nutch is crawling.

 Any help would be really appreciated.

Thanks,
Abhi

Re: Negative keywords and a few minor restrictions

Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi Tiger,

 But the negative keywords are usually regex patterns rather than fixed
strings.

 And, if I am not wrong, HtmlParseFilter is a plugin interface, right?
How do I enable plugins in Nutch, or write my own?
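
(In brief, and only as a sketch: a Nutch plugin is a folder under
plugins/ containing a jar and a plugin.xml descriptor that declares
which extension point it implements, and it is enabled by adding its id
to the plugin.includes regex in conf/nutch-site.xml. The parse-negative
id and the class name below are hypothetical, matching the
HtmlParseFilter sketch in the next message; the WritingPluginExample
page on the Nutch wiki walks through a full example.)

  <!-- plugins/parse-negative/plugin.xml (sketch; all ids and class
       names are hypothetical) -->
  <plugin id="parse-negative" name="Negative keyword filter"
          version="1.0.0" provider-name="example.org">
    <runtime>
      <library name="parse-negative.jar">
        <export name="*"/>
      </library>
    </runtime>
    <extension id="org.example.nutch.negativekeywordfilter"
               name="Negative keyword parse filter"
               point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="NegativeKeywordFilter"
                      class="org.example.nutch.NegativeKeywordFilter"/>
    </extension>
  </plugin>

  <!-- conf/nutch-site.xml: extend your existing plugin.includes value
       with the new plugin id -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|scoring-opic|parse-negative</value>
  </property>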

 In other words, I understood what you said, but I am unsure where this
logic should go :( Sorry for the trouble.

Thanks,
Abhishek


On Mon, Jan 31, 2011 at 3:47 PM, 黄淑明 <sh...@gmail.com> wrote:

> For the first feature, you can actually do a simple string search
> (such as indexOf) in a class that implements HtmlParseFilter: when a
> page contains words that you want to ignore, tag it by adding a
> metadata entry to the content (say crawl_me=0); for pages that
> contain words you like, set crawl_me=0.9.
>
> When Nutch generates the fetch list, put the decision logic in your
> URLFilter: if there is a "crawl_me" entry whose value is zero, return
> null. Likewise, in a ScoringFilter, give a higher score to the URLs
> that carry a higher crawl_me value.
>
>
> tiger
> 2011/1/31

Re: Negative keywords and a few minor restrictions

Posted by 黄淑明 <sh...@gmail.com>.
For the first feature, you can actually do a simple string search (such
as indexOf) in a class that implements HtmlParseFilter: when a page
contains words that you want to ignore, tag it by adding a metadata
entry to the content (say crawl_me=0); for pages that contain words you
like, set crawl_me=0.9.
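
As a rough illustration (a minimal sketch, assuming Nutch 1.x; the
class name, the negative.keywords property, and the crawl_me key are
made up for this thread, and Pattern is used instead of indexOf since
the keywords are regex patterns):

  package org.example.nutch;

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class NegativeKeywordFilter implements HtmlParseFilter {

    private Configuration conf;
    private Pattern negative;

    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      Parse parse = parseResult.get(content.getUrl());
      if (parse != null) {
        // Tag the page instead of dropping it here; the URL filter and
        // the scoring filter act on the tag later.
        boolean bad = negative.matcher(parse.getText()).find();
        parse.getData().getParseMeta().set("crawl_me", bad ? "0" : "0.9");
      }
      return parseResult;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
      // Regex of negative keywords, read from a custom property.
      negative = Pattern.compile(conf.get("negative.keywords", "spamword"));
    }

    public Configuration getConf() {
      return conf;
    }
  }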

When Nutch generates the fetch list, put the decision logic in your
URLFilter: if there is a "crawl_me" entry whose value is zero, return
null. Likewise, in a ScoringFilter, give a higher score to the URLs
that carry a higher crawl_me value.
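
One caveat: a URLFilter only receives the URL string, not page
metadata, so by itself it can enforce URL patterns (the second question
in this thread) but cannot see crawl_me; acting on crawl_me belongs in
the ScoringFilter methods, which do receive the CrawlDatum and parse
metadata. For plain include/exclude URL patterns, the stock
urlfilter-regex plugin already does this via conf/regex-urlfilter.txt
(+pattern keeps a URL, -pattern drops it). A custom URLFilter is only a
few lines; a minimal sketch, assuming Nutch 1.x, with an illustrative
property name:

  package org.example.nutch;

  import java.util.regex.Pattern;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class NegativeUrlFilter implements URLFilter {

    private Configuration conf;
    private Pattern blocked;

    // Return the URL to keep it, or null to drop it from the crawl.
    public String filter(String urlString) {
      return blocked.matcher(urlString).find() ? null : urlString;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
      blocked = Pattern.compile(conf.get("urlfilter.negative.regex",
          "/do-not-crawl/"));
    }

    public Configuration getConf() {
      return conf;
    }
  }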


tiger
2011/1/31
