Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/01/31 08:08:29 UTC
Negative keywords and few minor restrictions
Hi,
I am looking for ways to stop Nutch from crawling pages that contain certain
negative keywords, or from showing them in search results. What is the best way
of doing this? Should I be using any plugins?
Apart from this, I am also looking for ways to ignore or prioritize certain
patterns of URLs that Nutch is crawling.
Some help would be really appreciated.
Thanks,
Abhi
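[For the URL part of the question, Nutch can at least exclude URL patterns via
conf/regex-urlfilter.txt; prioritizing URLs needs a scoring plugin instead. A
sketch, where example.com and the /archive/ pattern are placeholders, not
anything from the thread:]

```
# regex-urlfilter.txt: rules are applied top-down; the first match wins.
# Skip any URL containing /archive/ (hypothetical pattern):
-/archive/
# Accept everything under example.com (placeholder host):
+^https?://([a-z0-9-]+\.)*example\.com/
# Reject everything else:
-.
```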
Re: Negative keywords and few minor restrictions
Posted by ".: Abhishek :." <ab...@gmail.com>.
Hi Tiger,
But the negative keywords are usually regex patterns, not plain strings.
Also, if I am not wrong, HtmlParseFilter is a plug-in, right? How do I enable
plug-ins in Nutch, or write my own?
In other words, I understood what you said, but I am unsure where this logic
has to go :( Sorry for the trouble.
Thanks,
Abhishek
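[Plugins are enabled through the plugin.includes property, usually overridden
in conf/nutch-site.xml; a custom plugin lives in its own directory under
plugins/ with a plugin.xml descriptor. A sketch, where "myfilter" is a
hypothetical plugin id and the other entries are a typical 1.x selection:]

```
<!-- conf/nutch-site.xml: plugin.includes is a regex over plugin ids.
     Append your own plugin id ("myfilter" here is made up). -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|myfilter</value>
</property>
```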
On Mon, Jan 31, 2011 at 3:47 PM, 黄淑明 <sh...@gmail.com> wrote:
> For the first feature, you can actually do a simple string search (such
> as with indexOf) in a class that implements HtmlParseFilter.
> When a page contains words you want to ignore, return
> null and add a metadata entry to the content (say: crawl_me=0);
> and for pages that contain words you like, set crawl_me=0.9.
>
> When Nutch generates URLs, put the check in your URLFilter: if there
> is a "crawl_me" entry whose value is zero, return null.
> Also check it in your ScoringFilter, and assign a higher score to URLs
> that have a higher crawl_me value.
>
>
> tiger
>
> 2011/1/31 .: Abhishek :. <ab...@gmail.com>:
> > Hi,
> >
> > I am looking for ways to stop Nutch from crawling pages that contain
> > certain negative keywords, or from showing them in search results. What
> > is the best way of doing this? Should I be using any plugins?
> >
> > Apart from this, I am also looking for ways to ignore or prioritize
> > certain patterns of URLs that Nutch is crawling.
> >
> > Some help would be really appreciated.
> >
> > Thanks,
> > Abhi
> >
>
Re: Negative keywords and few minor restrictions
Posted by 黄淑明 <sh...@gmail.com>.
For the first feature, you can actually do a simple string search (such
as with indexOf) in a class that implements HtmlParseFilter.
When a page contains words you want to ignore, return
null and add a metadata entry to the content (say: crawl_me=0);
and for pages that contain words you like, set crawl_me=0.9.
When Nutch generates URLs, put the check in your URLFilter: if there
is a "crawl_me" entry whose value is zero, return null.
Also check it in your ScoringFilter, and assign a higher score to URLs
that have a higher crawl_me value.
tiger
2011/1/31 .: Abhishek :. <ab...@gmail.com>:
> Hi,
>
> I am looking for ways to stop nutch from crawling or showing the negative
> keywords in the search. What is the best way of doing it? Should I be using
> any plugins?
>
> Apart from this, I am also looking out ways to ignore or prioritize some
> pattern of URL's that nutch is crawling.
>
> Some help would be really appreciated.
>
> Thanks,
> Abhi
>
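[The keyword check described above could be sketched in isolation like this.
This is plain Java with no Nutch dependencies: the class name, the word lists,
and the 0.5 default are assumptions for illustration, not the real
HtmlParseFilter interface.]

```java
// Sketch of the "crawl_me" scoring logic: 0 when a blocked word is
// present, 0.9 when a preferred word is present, 0.5 otherwise
// (the 0.5 neutral default is assumed, not from the thread).
public class KeywordScorer {
    public static double crawlMe(String text, String[] blocked, String[] preferred) {
        for (String w : blocked) {
            if (text.indexOf(w) >= 0) {
                return 0.0; // page contains a negative keyword: drop it
            }
        }
        for (String w : preferred) {
            if (text.indexOf(w) >= 0) {
                return 0.9; // page contains a preferred keyword: boost it
            }
        }
        return 0.5;
    }
}
```

[In a real HtmlParseFilter you would attach this value to the parse metadata
rather than return it directly.]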