Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/13 16:51:21 UTC
Frameset handling
I've run into an issue with extracting links from <frame src="xxx">
elements inside of a <frameset>. There are two problems:
1. Currently <frameset> and <frame> elements are discarded.
2. If I fix #1, XHTMLContentHandler still assumes the document has a
<body>, so you get invalid XHTML that looks like:
<html>
<body>
<frameset>
I can tweak XHTMLContentHandler to do the right thing, but first I
wanted to see if anybody objects to emitting
<html>
<frameset>
...
...for these cases.
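
To make the proposal concrete, here is a minimal sketch of the idea: hold off on choosing the top-level wrapper until the first content element arrives, then emit <frameset> instead of <body> if that element is a frameset. This is not the actual XHTMLContentHandler code. The class LazyWrapperSketch and its methods are hypothetical names made up for illustration, and the sketch ignores <head>, attributes, and character data.

```java
/**
 * Hypothetical sketch, NOT Tika's XHTMLContentHandler: delay the choice of
 * top-level wrapper (<body> or <frameset>) until the first element is seen.
 * Attributes, <head>, and text content are left out to keep it short.
 */
class LazyWrapperSketch {
    private final StringBuilder out = new StringBuilder("<html>");
    private String wrapper = null; // "body" or "frameset", decided lazily
    private int depth = 0;         // open elements inside the wrapper

    void startElement(String name) {
        if (wrapper == null) {
            // The first content element decides which wrapper is emitted.
            wrapper = "frameset".equals(name) ? "frameset" : "body";
            out.append('<').append(wrapper).append('>');
            if (wrapper.equals("frameset")) {
                return; // the incoming <frameset> is the wrapper itself
            }
        }
        depth++;
        out.append('<').append(name).append('>');
    }

    void endElement(String name) {
        if (depth == 0) {
            return; // end of the top-level <frameset>; endDocument() closes it
        }
        depth--;
        out.append("</").append(name).append('>');
    }

    String endDocument() {
        if (wrapper == null) {
            // Empty document: fall back to the usual <body>.
            wrapper = "body";
            out.append("<body>");
        }
        out.append("</").append(wrapper).append("></html>");
        return out.toString();
    }

    public static void main(String[] args) {
        LazyWrapperSketch h = new LazyWrapperSketch();
        h.startElement("frameset");
        h.startElement("frame");
        h.endElement("frame");
        h.endElement("frameset");
        System.out.println(h.endDocument());
    }
}
```

Documents without a frameset still get the usual <html><body> structure, so only frameset pages would see different output.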
This probably still won't do the right thing for broken HTML where the
original source has a <frameset> inside of a <body>, as previously
discussed on the list. With a bit more work I could handle that too,
but probably not today.
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g