Posted to user@nutch.apache.org by cha <ch...@metrixline.com> on 2007/03/21 16:37:53 UTC

help needed : filters in regex-urlfilter.txt


Hi,

I want to ignore the following URLs from crawling, e.g.:

http://www.example.com/stores/abcd/merch-cats-pg/abcd.*
http://www.example.com/stores/abcd/merch-cats/abcd.*
http://www.example.com/stores/abcd/merch/abd.*


I have used the regex-urlfilter.txt file and negated the following URLs:


# skip URLs containing certain characters as probable queries, etc.
#-[*!@?]
-http://([a-z0-9]*\.)*example.com/stores/.*/merch-cats-pg\.*
-http://([a-z0-9]*\.)*example.com/stores/.*/merch-cats\.*
-http://([a-z0-9]*\.)*example.com/stores/.*/merch\.*

The above filters still don't filter all the URLs.

Is there any way to solve this? Any alternatives?

Awaiting,

Cha



-- 
View this message in context: http://www.nabble.com/help-needed-%3A-filters-in-regex-urlfilter.txt-tf3441531.html#a9596460
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: help needed : filters in regex-urlfilter.txt

Posted by Enis Soztutar <en...@gmail.com>.
Did you enable urlfilter-regex plugin in your configuration?
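
If not, one way to check is to make sure urlfilter-regex appears in the plugin.includes property of conf/nutch-site.xml. A sketch (the surrounding plugin names in the value are only illustrative; keep your own list):

```xml
<!-- conf/nutch-site.xml: urlfilter-regex must be listed in plugin.includes.
     The other plugin names here are illustrative, not a recommendation. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```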




Re: help needed : filters in regex-urlfilter.txt

Posted by Ravi Chintakunta <ra...@gmail.com>.
Just including this regex

-^http://www.example.com/stores/abc.*

should help.

If you just want to skip URLs that include the word merch, then add this:

-merch

Also ensure that you have

-.

at the end of the file, to skip all other URLs.
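
Note that rule order matters: the rules are tried top-down and the first match decides, so -merch must appear before any broad accept rule. A sketch of that evaluation in Python (this mimics the regex URL filter's first-match-wins behavior but is not Nutch's actual code; the '+' rule is a hypothetical accept rule for the rest of the site):

```python
import re

# Rules are tried top-down; the first match decides ('+' accept, '-' reject).
RULES = [
    ("-", re.compile(r"merch")),                       # skip anything containing "merch"
    ("+", re.compile(r"^http://www\.example\.com/")),  # hypothetical: accept the rest of the site
    ("-", re.compile(r".")),                           # final catch-all: skip everything else
]

def passes(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched

print(passes("http://www.example.com/stores/abcd/merch-cats/x"))  # rejected by -merch
print(passes("http://www.example.com/stores/abcd/books/x"))       # accepted
print(passes("http://other.com/x"))                               # rejected by the catch-all
```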


- Ravi Chintakunta



Re: help needed : filters in regex-urlfilter.txt

Posted by Jason Culverhouse <ja...@mischievous.org>.
Cha,
Are you updating regex-urlfilter.txt or crawl-urlfilter.txt?
If you use the one-step "nutch crawl" command, you need to update crawl-urlfilter.txt.
If you use whole-web crawling, you need to update regex-urlfilter.txt.
Jason



Re: help needed : filters in regex-urlfilter.txt

Posted by cha <ch...@metrixline.com>.
Hi,

I have tried the filters you provided, but it is still not working.

I have enabled the urlfilter-regex plugin in my configuration as well.

I can't figure out what the problem is.

Cheers,
cha





Re: help needed : filters in regex-urlfilter.txt

Posted by Jason Culverhouse <ja...@mischievous.org>.
Cha,
You want something like this

-^http://([a-z0-9]*\.)*example.com/stores/[^/]+/(merch-cats-pg|merch-cats|merch)/

Your rules don't express what you intend because of the trailing '\.*': '\.' is an escaped literal dot, so '\.*' means "zero or more literal dots", not "a dot followed by anything". The rule can stop matching right after 'merch-cats-pg' and never constrains the rest of the path.

You could also just change that trailing part to '/merch-cats-pg.*'
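
To see the corrected rule work concretely, here is a quick check with Python's re module (Java's regex engine behaves the same for these patterns; the URLs are the sample ones from the thread):

```python
import re

# Sample URLs from the thread (example.com stand-ins).
urls = [
    "http://www.example.com/stores/abcd/merch-cats-pg/abcd",
    "http://www.example.com/stores/abcd/merch-cats/abcd",
    "http://www.example.com/stores/abcd/merch/abd",
    "http://www.example.com/stores/abcd/other/page",
]

# The corrected rule: anchored at the start, one directory segment
# ([^/]+) for the store name, then the three directory names spelled out.
block = re.compile(
    r"^http://([a-z0-9]*\.)*example\.com/stores/[^/]+/(merch-cats-pg|merch-cats|merch)/"
)

blocked = [u for u in urls if block.search(u)]
print(blocked)  # the three merch* URLs; the /other/ URL survives
```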

Get a copy of Mastering Regular Expressions By Jeffrey E. F. Friedl
http://www.oreilly.com/catalog/regex3/index.html

Jason
