Posted to user@nutch.apache.org by ajaxtrend <te...@yahoo.com> on 2008/10/20 07:54:55 UTC

problem with RegExURLFilter class

Hi,
   I am facing a strange problem with the URL regexes in crawl-urlfilter.txt. Before using any regex for URLs, I test it in a standalone class, where it works correctly, i.e. pattern.matcher(url).find() returns true.
But when the same URL and regex are used during crawling, it returns false. I am not sure why it behaves differently.
Let me give an example.

RegEx in crawl-urlfilter.txt :

^http://bangalore.locanto.in/(used-cars|ID_\\d+)/((\\d*/(\\d+/)*)|(.*.html))

URL: http://bangalore.locanto.in/used-cars/902/

During standalone testing (not in the Nutch environment), pattern.matcher(url).find() returns true. However, in the Nutch environment it returns false.

Appreciate your help on this.

- RB
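
One possible explanation for the discrepancy described above, offered as a sketch rather than a confirmed diagnosis: it assumes the standalone test put the regex in a Java string literal, and that the filter file is read as raw text with no unescaping. Under those assumptions the doubled backslashes mean different things in the two settings. The class name RegexEscapeDemo and the helper found() are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    // Hypothetical helper mirroring the pattern.matcher(url).find() check from the post.
    static boolean found(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://bangalore.locanto.in/used-cars/902/";

        // Assumption: the standalone test used a Java string literal.
        // The compiler turns "\\d" into the two characters \d, a digit class.
        String fromLiteral =
            "^http://bangalore.locanto.in/(used-cars|ID_\\d+)/((\\d*/(\\d+/)*)|(.*.html))";

        // Assumption: the filter file is read as raw text, so \\d stays as two
        // backslashes plus 'd'. As a regex that means a literal backslash
        // followed by the letter d, which this URL does not contain.
        String fromFile =
            "^http://bangalore.locanto.in/(used-cars|ID_\\\\d+)/((\\\\d*/(\\\\d+/)*)|(.*.html))";

        System.out.println(found(fromLiteral, url)); // true
        System.out.println(found(fromFile, url));    // false
    }
}
```

If this is indeed the cause, writing the rule in the file with single backslashes (\d instead of \\d) would give the same pattern that the string literal produces.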


Filter Adult Content

Posted by Webmaster <we...@axismedia.ca>.
Hi,

My linkdb is growing substantially now and I'm crawling some 12 million
URLs a day.  I plan on generating my linkdb in 10 portions of 100 million
each to place on my sandbox servers for a distributed search cluster.
Before I move this out of Hadoop and place it on local file systems I want
to filter my linkdb for any adult content.

Does anyone have any pointers or ready made filters for this?

I'm sure I can create some filters to do this to a degree; however, a
tried-and-true system would be ideal.

Thanks..

Axel..