Posted to user@nutch.apache.org by Kirk Gillock <pk...@isara.org> on 2009/12/07 12:47:49 UTC

Fetched links contain html

Hello fellow Nutch users,

In a few days we'll start crawling a long list of Thai websites. In 
previous crawls we noticed a lot of poorly formatted HTML pages, and the 
crawler would sometimes fetch links that contain HTML code 
(e.g. http://www.website.com/news/index.php</ul> ). How can we use a regex 
to normalize those URLs so that the HTML code (</ul>) is removed? Would we 
use the regex-normalize.xml file? If so, what would the rule look like?
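
To make the intent concrete, here is a rough sketch in Python of the substitution we have in mind. It is not Nutch configuration; the names HTML_TAG and strip_html_tags are made up for illustration, and we are assuming the stray markup always appears as literal <...> tags rather than percent-encoded ones.

```python
import re

# Assumption: stray markup appears as literal tags such as </ul> or <br/>.
# Matches an opening or closing tag: "<", optional "/", a letter, then
# anything up to the next ">".
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_html_tags(url: str) -> str:
    """Return the URL with every literal HTML tag removed (hypothetical helper)."""
    return HTML_TAG.sub("", url)


if __name__ == "__main__":
    print(strip_html_tags("http://www.website.com/news/index.php</ul>"))
    # prints "http://www.website.com/news/index.php"
```

Presumably the same pattern could be carried over to a normalizer rule with an empty substitution, with < and > escaped for XML, but that part is a guess on our side.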

Thanks in advance,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org