Posted to dev@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/17 17:09:19 UTC
Tika 0.8-SNAPSHOT and HTML torture testing
I just committed some changes to Tika that (in theory) should ensure
all URLs get extracted from HTML documents.
See https://issues.apache.org/jira/browse/TIKA-463 for details.
It would be great if somebody active in Nutch could try this out with
the current suite of Nutch tests for HTML processing.
Thanks!
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g