Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/07/12 22:47:53 UTC

[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887515#action_12887515 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

After looking at this a bit more, it seems like the main issue is whether the default behavior should be to return all valid XHTML 1.0 strict elements, or only those that can have text inside.

The latter behavior is what's currently implemented.

I can see arguments both ways. Returning (for example) an <img> element would spare Tika users such as Nutch, which assume they're getting back all of the "important" HTML elements, some unpleasant surprises. But for the typical Tika user who only cares about text, it would mean extra elements being generated that contain no text.

I'm leaning towards (a) creating a LinksHtmlMapper that handles all of the potentially link-containing elements, and (b) modifying the HtmlMapper interface to support proper resolution of relative URLs. This would move code from HtmlHandler.startElement (which currently handles the "a" element) into HtmlMapper. I'm not sure whether I can then also change HtmlMapper into an abstract base class that supports relative link handling and has a constructor that takes in enough context to provide for this.
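
To make (a) and (b) a bit more concrete, here is a minimal Java sketch of the two pieces involved: a lookup from element name to the attribute that carries its URL, and resolution of a relative link against a base URL. This is not Tika code. The class name LinkAttributeSketch, the methods linkAttribute and resolve, the contents of the table, and the example.com base URL are all assumptions made for illustration; how this would actually fit into HtmlMapper is still open.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch only; none of these names exist in Tika. */
final class LinkAttributeSketch {

    // Element name -> attribute assumed to hold the URL (per HTML 4.01 / XHTML 1.0).
    // <map> is omitted on purpose: it has no URL attribute of its own,
    // its links live on the nested <area> elements.
    private static final Map<String, String> LINK_ATTRIBUTES = new LinkedHashMap<>();
    static {
        LINK_ATTRIBUTES.put("a", "href");
        LINK_ATTRIBUTES.put("area", "href");
        LINK_ATTRIBUTES.put("link", "href");
        LINK_ATTRIBUTES.put("img", "src");
        LINK_ATTRIBUTES.put("frame", "src");
        LINK_ATTRIBUTES.put("iframe", "src");
        LINK_ATTRIBUTES.put("object", "data");
    }

    /** Returns the URL-carrying attribute for an element, if the sketch knows one. */
    static Optional<String> linkAttribute(String elementName) {
        return Optional.ofNullable(
                LINK_ATTRIBUTES.get(elementName.toLowerCase(Locale.ROOT)));
    }

    /** Resolves a possibly relative link against the document's base URL. */
    static String resolve(URI base, String link) {
        // URI.resolve returns absolute links unchanged and merges relative ones.
        return base.resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        // Made-up base URL, standing in for the location of the parsed document.
        URI base = URI.create("http://example.com/docs/index.html");
        System.out.println(linkAttribute("IMG").orElse("(none)"));
        System.out.println(resolve(base, "images/logo.png"));
        System.out.println(resolve(base, "/top.html"));
    }
}
```

Keeping the table separate from the resolution step means a text-only mapper could skip both, while a link-oriented mapper would only need the base URL as extra context.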

I'm going to wait a bit for comments.

> HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
> ---------------------------------------------------------------------------------
>
>                 Key: TIKA-463
>                 URL: https://issues.apache.org/jira/browse/TIKA-463
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>
> All of the listed HTML elements can have URLs as attributes, and thus we'd want to extract those links, if possible.
> For elements that aren't valid as XHTML 1.0, it may be challenging to find the right way to handle this.
> But if XHTML 1.0 means the union of the "transitional" and "frameset" variants, then all of the above are valid, and thus should be emitted by the parser.
