Posted to dev@nutch.apache.org by simon_ece <si...@yahoo.com> on 2007/05/03 09:36:11 UTC

Nutch - Filtering (REGEX)

Hi all,
I am new to Nutch. I would like to crawl a particular site and keep only the
URLs that match the following pattern. I don't want to list any other URLs
from the crawled site.

Site to be crawled, e.g.: www.example.com
^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*-\([0-9]*-[A-Za-z0-9]*\)\.html$
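
To make the intent concrete, here is a minimal sketch (plain Python, not Nutch code) that tests the pattern above against a few made-up URLs. The rule list, the accept() helper and the sample URLs are all hypothetical; the "+"/"-" prefixes with first-match-wins ordering are an assumption meant to resemble how a regex URL filter file is usually read. Note that in Python and Java regex syntax, \( and \) match literal parentheses, so the pattern as written only accepts URLs that really contain "(" and ")".

```python
import re

# Pattern copied verbatim from the post. Here \( and \) are literal
# parentheses, not grouping operators.
PATTERN = (
    r"^http://([a-z0-9]*\.)example.com/([a-zA-Z]*)-\([a-z0-9]*\)-.*"
    r"-\([0-9]*-[A-Za-z0-9]*\)\.html$"
)

# Hypothetical rule list: "+" accepts, "-" rejects, first matching rule wins.
RULES = [
    ("+", re.compile(PATTERN)),
    ("-", re.compile(r".")),  # reject every other URL
]


def accept(url):
    """Return True if the first rule that matches the URL is a "+" rule."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: drop the URL


# Made-up sample URLs, for illustration only.
urls = [
    "http://www.example.com/news-(abc123)-some-title-(42-xyz).html",
    "http://www.example.com/news-abc123-some-title-42-xyz.html",
    "http://www.example.com/about.html",
]

kept = [u for u in urls if accept(u)]
print(kept)
```

Only the first sample URL is kept. The second one has the same shape but no literal parentheses, so it is rejected; if the parentheses were meant as groups, the backslashes in front of them would have to be removed.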

I can crawl the site and get all the matching URLs, but I don't know how to
filter out the other URLs and keep only these particular ones.
Kindly post your suggestions.
Thanks & Regards
Simon
-- 
View this message in context: http://www.nabble.com/Nutch---Filtering-%28REGEX%29-tf3685035.html#a10300328
Sent from the Nutch - Dev mailing list archive at Nabble.com.