Posted to user@nutch.apache.org by Kumar Limbu <ku...@gmail.com> on 2005/12/28 03:51:25 UTC

Can we search based on two fields?

Hi everyone,

I am currently indexing a single website, say www.somesite.com, but I do not
want to crawl URLs that match a certain pattern, say "nocrawl", e.g.
www.somesite.com/nocrawl.html or www.somesite.com/apage.php?nocrawl. I want
to discard any URL that contains the pattern 'nocrawl'. How do I do it? I
am using Nutch version 0.7.1. I also want to use the 'crawl' command for
crawling these pages.

Thank you for your support.

--
Keep on smiling
:) Kumar

Re: Can we search based on two fields?

Posted by Chih How Bong <ch...@gmail.com>.
Is it possible to get it done by modifying the regular expression in the
config file?

Bong
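
For reference, here is a rough sketch of what such a rule could look like.
It assumes the 'crawl' command in the 0.7.x line reads its URL filter rules
from conf/crawl-urlfilter.txt, that each rule line starts with '+' (accept)
or '-' (reject) followed by a regular expression, and that the first
matching line decides. The Python below is not Nutch code: it only emulates
that first-match behaviour so the patterns can be sanity-checked before a
crawl. The RULES text and the helper names parse_rules and accepts are made
up for illustration.

```python
import re

# Hypothetical rules in the style of a Nutch URL filter file (assumed format):
# '-' rejects, '+' accepts, and the first matching line decides.
RULES = r"""
# skip any URL that contains 'nocrawl'
-nocrawl
# accept everything else on the site
+^http://([a-z0-9]*\.)*somesite\.com/
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first rule whose regex is found in the URL.

    A URL that matches no rule is rejected.
    """
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False


rules = parse_rules(RULES)
for url in (
    "http://www.somesite.com/index.html",
    "http://www.somesite.com/nocrawl.html",
    "http://www.somesite.com/apage.php?nocrawl",
):
    print(url, accepts(url, rules))
```

The '-nocrawl' line has to come before the '+' line, otherwise the accept
rule would match first. Also worth checking: the stock filter file may
already contain a rule that skips URLs with characters such as '?', in which
case apage.php?nocrawl would be dropped by that rule before this one is
reached.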


On 1/3/06, Nguyen Ngoc Giang <gi...@gmail.com> wrote:
>
> Maybe you could try writing a query-parser plugin that excludes all the
> patterns you want to avoid. A heavy penalty on the URL will do the job,
> IMHO.

Re: Can we search based on two fields?

Posted by Nguyen Ngoc Giang <gi...@gmail.com>.
Maybe you could try writing a query-parser plugin that excludes all the
patterns you want to avoid. A heavy penalty on the URL will do the job, IMHO.

