Posted to user@nutch.apache.org by Kirk Gillock <pk...@isara.org> on 2009/12/07 12:47:49 UTC
Fetched links contain html
Hello fellow Nutch users,
In a few days we'll start crawling a long list of Thai websites. In
previous crawls we noticed a lot of poorly formatted HTML pages, and the
crawler would sometimes fetch links that contain HTML code
(e.g. http://www.website.com/news/index.php</ul> ). How can we use a regex
to normalize those URLs so that the HTML code (</ul>) is removed? Would we
use the regex-normalize.xml file? If so, what would the rule look like?
Thanks in advance,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org
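
As a rough illustration of the kind of rule being asked about, here is a minimal Java sketch of a regex that strips a stray HTML tag from a URL. The class name UrlTagStripper and the method stripTags are hypothetical and are not part of Nutch. The pattern itself is an assumption about what the bad links look like: it only covers simple tags without attributes, in literal form (</ul>) or percent-encoded form (%3C/ul%3E). If it holds up against real crawl data, the same expression could probably be used as the pattern of a rule in regex-normalize.xml with an empty substitution, with < and > XML-escaped, though that should be tested on a small crawl first.

```java
import java.util.regex.Pattern;

public class UrlTagStripper {
    // Hypothetical pattern: a simple HTML tag with no attributes, such as
    // </ul>, <li> or <br/>, either literal or percent-encoded (%3C ... %3E).
    private static final Pattern HTML_TAG = Pattern.compile(
        "(?:<|%3[Cc])/?[A-Za-z][A-Za-z0-9]*\\s*/?(?:>|%3[Ee])");

    // Returns the URL with every match of HTML_TAG removed.
    public static String stripTags(String url) {
        return HTML_TAG.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Example URL taken from the message above.
        System.out.println(stripTags("http://www.website.com/news/index.php</ul>"));
    }
}
```

Running main on the example URL from the message prints http://www.website.com/news/index.php. URLs that contain no tag-like text pass through unchanged.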