Posted to dev@nutch.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/17 17:09:19 UTC

Tika 0.8-SNAPSHOT and HTML torture testing

I just committed some changes to Tika that (in theory) should ensure
all URLs get extracted from HTML documents.

See https://issues.apache.org/jira/browse/TIKA-463 for details.

It would be great if somebody active in Nutch could try this out with
the current suite of Nutch tests for HTML processing.

Thanks!

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g