Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/13 16:51:21 UTC

Frameset handling

I've run into an issue with extracting links from <frame src="xxx">  
elements inside of a <frameset>. There are two problems:

1. Currently <frameset> and <frame> elements are discarded.

2. If I fix #1, XHTMLContentHandler still assumes a <body> wrapper, so
you get invalid XHTML (a <frameset> nested inside <body>) that looks like:

<html>
	<body>
		<frameset>

I can tweak XHTMLContentHandler to do the right thing, but first I
wanted to see if anybody has an objection to emitting

<html>
	<frameset>
		...

...for these cases.
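
To make the proposal concrete, here is a rough sketch of the behavior I
have in mind. It is not Tika's XHTMLContentHandler code: the class name
FramesetAwareHandler and its output()/frameLinks() accessors are made up
for illustration, it only uses the JDK's SAX classes, it ignores text
content, and it assumes the incoming events are content elements only
(no <html>, <head> or <body> events from the source document).

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: defer the <body> decision until the first element is seen. */
class FramesetAwareHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final List<String> frameLinks = new ArrayList<>();
    private boolean sawFirstElement = false;
    private boolean bodyOpen = false;

    @Override
    public void startDocument() {
        out.append("<html>");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!sawFirstElement) {
            sawFirstElement = true;
            // Only wrap in <body> when the document is not frameset-based.
            if (!"frameset".equalsIgnoreCase(qName)) {
                out.append("<body>");
                bodyOpen = true;
            }
        }
        out.append('<').append(qName);
        String src = atts.getValue("src");
        if ("frame".equalsIgnoreCase(qName) && src != null) {
            frameLinks.add(src); // remember the link for later extraction
            out.append(" src=\"").append(escape(src)).append('"');
        }
        out.append('>');
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    @Override
    public void endDocument() {
        if (!sawFirstElement) {
            out.append("<body>"); // empty document: fall back to an empty <body>
            bodyOpen = true;
        }
        if (bodyOpen) {
            out.append("</body>");
        }
        out.append("</html>");
    }

    /** Minimal attribute-value escaping for the serialized output. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    String output() { return out.toString(); }

    List<String> frameLinks() { return frameLinks; }

    public static void main(String[] args) {
        FramesetAwareHandler h = new FramesetAwareHandler();
        AttributesImpl src = new AttributesImpl();
        src.addAttribute("", "src", "src", "CDATA", "nav.html");
        h.startDocument();
        h.startElement("", "frameset", "frameset", new AttributesImpl());
        h.startElement("", "frame", "frame", src);
        h.endElement("", "frame", "frame");
        h.endElement("", "frameset", "frameset");
        h.endDocument();
        // prints <html><frameset><frame src="nav.html"></frame></frameset></html>
        System.out.println(h.output());
    }
}
```

The only real design choice here is deferring the <body> start tag
until the first content element shows up, so a frameset document never
gets a <body> at all. This sketch does nothing for a <frameset> that
appears after other content.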

This probably still won't do the right thing for busted HTML where the
original source has a <frameset> inside of a <body>, as previously
discussed on the list. With a bit more work I could handle that too,
but probably not today.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g