Posted to user@nutch.apache.org by simon_ece <si...@yahoo.com> on 2007/05/04 09:20:20 UTC

Nutch - Filtering (REGEX)

Hi all,
I am new to Nutch. I would like to crawl a particular site and get only
the results whose URLs match the following pattern. I don't want to list
other URLs from the crawled site.

Site to be crawled, e.g.: www.example.com
^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$

I can crawl the site and get all the matching URLs, but I don't know how
to filter out the other URLs and keep only these particular ones.
Kindly post your suggestions.
Thanks & Regards
Simon

-- 
View this message in context: http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3690583.html#a10318059
Sent from the Nutch - User mailing list archive at Nabble.com.
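
A quick way to see what the pattern in the message above accepts is to run
it against a few URLs. The sketch below uses Python's re module rather than
Nutch itself, and the sample URLs are invented purely for illustration. Two
things are worth noticing: "\(" and "\)" match literal parentheses instead
of forming groups, and the host group has no trailing "*", so a bare
example.com host does not match.

```python
import re

# Pattern copied from the message above. Here "\(" and "\)" match literal
# parentheses; they do not create capture groups.
PATTERN = re.compile(
    r"^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-"
    r"\([0-9]*-[A-Za-z0-9]*\)\.html$"
)

# Hypothetical URLs, invented only to exercise the pattern.
SAMPLES = [
    "http://www.example.com/shirts-(red)-cotton-(42-XL).html",  # has parentheses
    "http://www.example.com/shirts-red-cotton-42-XL.html",      # no parentheses
    "http://example.com/shirts-(red)-cotton-(42-XL).html",      # no subdomain
]


def matches(url: str) -> bool:
    """Return True when the URL matches the pattern from the message."""
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in SAMPLES:
        print(matches(url), url)
```

If the real URLs do not contain parentheses, the backslashes before "(" and
")" probably need to be removed so that they act as plain groups.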


Re: Nutch - Filtering (REGEX)

Posted by Marcin Okraszewski <ok...@gmail.com>.
In other words, you want to crawl the whole site but index only some pages?

To be honest, this is something I would like to do as well. I haven't
finished checking it yet, but it seems you can write an IndexingFilter
that throws an exception if the page shouldn't be indexed. Unfortunately
you cannot return null, because that causes a null pointer exception.
Throwing the exception produces a warn log message, which may overload
the log if you have a large site.

I hope it helps,
Marcin Okraszewski
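
The idea in the reply above (crawl everything, then drop unwanted pages at
indexing time by raising an exception) can be sketched in a few lines. This
is a standalone Python model, not Nutch's IndexingFilter interface:
SkipDocument, indexing_filter and build_index are hypothetical names, and
documents are plain dictionaries. The pattern is the one from the original
question.

```python
import logging
import re
from typing import Dict, Iterable, List

log = logging.getLogger("indexer-sketch")

# Pattern from the original question: only matching URLs should be indexed.
INDEX_PATTERN = re.compile(
    r"^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-"
    r"\([0-9]*-[A-Za-z0-9]*\)\.html$"
)


class SkipDocument(Exception):
    """Hypothetical signal that a page should be left out of the index."""


def indexing_filter(doc: Dict[str, str]) -> Dict[str, str]:
    """Return the document unchanged, or raise SkipDocument to drop it."""
    if not INDEX_PATTERN.search(doc["url"]):
        raise SkipDocument(doc["url"])
    return doc


def build_index(docs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Run every crawled document through the filter; keep what survives."""
    indexed = []
    for doc in docs:
        try:
            indexed.append(indexing_filter(doc))
        except SkipDocument as skipped:
            # Logged at debug level so a large site does not flood the log.
            log.debug("not indexing %s", skipped)
    return indexed


if __name__ == "__main__":
    crawled = [
        {"url": "http://www.example.com/index.html"},
        {"url": "http://www.example.com/shirts-(red)-cotton-(42-XL).html"},
    ]
    for doc in build_index(crawled):
        print(doc["url"])
```

Every page is still fetched, so links on non-matching pages can be
followed; only the final index is restricted.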



Re: Nutch - Filtering (REGEX)

Posted by simon_ece <si...@yahoo.com>.
Hi, thanks for the reply.

This is the content of my conf/crawl-urlfilter.txt file:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME

+^http://([a-z0-9]*\.)*example.com/



# skip everything else
-.

It crawls the whole site, and I can see all the related matches when
searching, but I need to filter out some of the pages.
For example, if I search for some category (red), the search lists all
the links, but I want to show only the links that match this regular
expression:

^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$

Kindly post your suggestions.
Regards,
Simon
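
To make it easier to reason about the filter file quoted above, here is a
small Python model of how such a rule list is evaluated. It assumes the
behaviour as I understand it for Nutch's regex URL filter: rules are tried
top to bottom, the first pattern found anywhere in the URL decides ("+"
accepts, "-" rejects), and a URL that matches nothing is rejected.
parse_rules and accept are hypothetical helper names, and the test URLs are
invented.

```python
import re
from typing import List, Pattern, Tuple

# Rules taken from the file above, with the wrapped comment joined.
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*example.com/
# skip everything else
-.
"""

Rule = Tuple[bool, Pattern[str]]


def parse_rules(text: str) -> List[Rule]:
    """Turn '+regex' / '-regex' lines into (accept, compiled pattern) pairs."""
    rules: List[Rule] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url: str, rules: List[Rule]) -> bool:
    """First matching rule wins; no match at all means the URL is rejected."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/logo.gif",
        "http://www.example.com/search?q=red",
        "http://other.org/page.html",
    ):
        print(accept(url, rules), url)
```

Under this model, replacing the "+" line with the narrower pattern would
also stop the crawler from fetching the pages that link to the wanted
URLs, which is why restricting at indexing time was suggested instead.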


Re: Nutch - Filtering (REGEX)

Posted by Marcin Okraszewski <ok...@gmail.com>.
How about conf/crawl-urlfilter.txt?

Marcin
