
Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/07/12 22:15:49 UTC

[jira] Created: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
---------------------------------------------------------------------------------

                 Key: TIKA-463
                 URL: https://issues.apache.org/jira/browse/TIKA-463
             Project: Tika
          Issue Type: Bug
            Reporter: Ken Krugler
            Assignee: Ken Krugler


All of the listed HTML elements can have URLs as attributes, and thus we'd want to extract those links, if possible.

For elements that aren't valid in XHTML 1.0, it may be hard to decide on the right way to handle them.

But if XHTML 1.0 means the union of the "transitional" and "frameset" variants, then all of the above are valid, and thus should be emitted by the parser.
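
To make the request concrete, here is a minimal sketch of the element-to-attribute lookup such extraction would need. It is not Tika code: the class UrlAttributeTable and its methods are hypothetical names used only for illustration, and only one primary URL attribute per element is listed. "map" is left out on the assumption that its links come from the nested "area" elements rather than from an attribute of its own.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps an element name to the
// attribute that usually carries its URL in HTML 4 / XHTML 1.0.
final class UrlAttributeTable {
    private static final Map<String, String> URL_ATTRIBUTE = new LinkedHashMap<>();

    static {
        URL_ATTRIBUTE.put("img", "src");
        URL_ATTRIBUTE.put("object", "data");
        URL_ATTRIBUTE.put("frame", "src");
        URL_ATTRIBUTE.put("iframe", "src");
        URL_ATTRIBUTE.put("area", "href");
        URL_ATTRIBUTE.put("link", "href");
    }

    private UrlAttributeTable() {
    }

    // Returns the URL-carrying attribute for the element, or null if the
    // element is not in the table. Element names are matched case-insensitively.
    static String urlAttributeFor(String elementName) {
        return URL_ATTRIBUTE.get(elementName.toLowerCase(Locale.ROOT));
    }

    // Returns the raw (unresolved) link found on the element, or null if the
    // element is not in the table or the attribute is absent.
    static String extractLink(String elementName, Map<String, String> attributes) {
        String attribute = urlAttributeFor(elementName);
        return attribute == null ? null : attributes.get(attribute);
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new HashMap<>();
        attributes.put("src", "images/logo.png");
        System.out.println(extractLink("IMG", attributes));
    }
}
```

The values returned here are still relative; resolving them against the document's base URL would be a separate step.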

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-463:
-------------------------------

    Attachment: TIKA-463.patch

Patch which implements some of the ideas described in this issue. 

- HtmlMapper is an abstract class with a constructor HtmlMapper(Metadata metadata, ParseContext context)
- all extensions of HtmlMapper can access the metadata and context
- HtmlMapper implements the method resolve(String url)
- Created a LinksHtmlMapper which extends DefaultHtmlMapper
- HtmlHandler.bodyLevel is used to restrict the propagation of characters() but not of the elements
- HtmlHandler has a variable inHead to separate the treatment of elements in the head from the rest (don't know if this is really needed but that's how it is done now)

Note that:
- HtmlMapper.resolve() is currently called from the HtmlHandler
- the signatures of the mapper methods have not been changed
- custom processing of some elements (A, BASE, LINK, ...) is still done in the HtmlHandler and not in the mapper
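
For readers without the attachment, the shape described above might look roughly like the following. This is only a sketch under stated assumptions, not the contents of TIKA-463.patch: plain maps stand in for Tika's Metadata and ParseContext, the "base-url" key and the class names SketchHtmlMapper and SketchLinksMapper are hypothetical, and resolve() simply delegates to java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the abstract mapper; not Tika's actual class.
abstract class SketchHtmlMapper {
    protected final Map<String, String> metadata; // stand-in for Metadata
    protected final Map<String, Object> context;  // stand-in for ParseContext

    protected SketchHtmlMapper(Map<String, String> metadata, Map<String, Object> context) {
        this.metadata = metadata;
        this.context = context;
    }

    // Returns the name to emit for an element, or null to drop it.
    abstract String mapSafeElement(String name);

    // Resolves a possibly relative URL against the base URL stored under the
    // assumed "base-url" key; returns the input unchanged if that fails.
    String resolve(String url) {
        String base = metadata.get("base-url");
        if (url == null || base == null) {
            return url;
        }
        try {
            return new URI(base).resolve(url.trim()).toString();
        } catch (URISyntaxException | IllegalArgumentException e) {
            return url;
        }
    }
}

// Hypothetical subclass that lets a few link-bearing elements through.
class SketchLinksMapper extends SketchHtmlMapper {
    SketchLinksMapper(Map<String, String> metadata, Map<String, Object> context) {
        super(metadata, context);
    }

    @Override
    String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        switch (lower) {
            case "img":
            case "iframe":
            case "area":
            case "link":
                return lower;
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("base-url", "http://example.com/docs/index.html");
        SketchHtmlMapper mapper = new SketchLinksMapper(metadata, new HashMap<>());
        System.out.println(mapper.resolve("img/logo.png"));
    }
}
```

Keeping resolve() on the base class means every subclass sees the same base URL, which matches the point that all extensions can access the metadata and context.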

This patch passes the tests. 



