Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2010/07/12 22:15:49 UTC

[jira] Created: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link
---------------------------------------------------------------------------------

                 Key: TIKA-463
                 URL: https://issues.apache.org/jira/browse/TIKA-463
             Project: Tika
          Issue Type: Bug
            Reporter: Ken Krugler
            Assignee: Ken Krugler


All of the listed HTML elements can have URLs as attributes, and thus we'd want to extract those links, if possible.

For elements that aren't valid XHTML 1.0, it may be unclear how best to handle them.

But if XHTML 1.0 means the union of the "transitional" and "frameset" variants, then all of the above are valid, and thus should be emitted by the parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated TIKA-463:
-------------------------------

    Attachment: TIKA-463.patch

Patch which implements some of the ideas described in this issue. 

- HtmlMapper is an abstract class with a constructor HtmlMapper(Metadata metadata, ParseContext context)
- all extensions of HtmlMapper can access the metadata and context
- HtmlMapper implements the method resolve(String url)
- Created a LinksHtmlMapper which extends DefaultHtmlMapper
- HtmlHandler.bodyLevel is used to restrict the propagation of characters() but not the elements
- HtmlHandler has a variable inHead to separate the treatment of elements in the header from the rest (don't know if this is really needed but that's how it is done now)

Note that:
- HtmlMapper.resolve() is currently called from the HtmlHandler
- the signatures of the mapper methods have not been changed
- custom processing of some elements (A, BASE, LINK, ...) is still done in the HtmlHandler and not in the mapper

This patch passes the tests. 





[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887716#action_12887716 ] 

Julien Nioche commented on TIKA-463:
------------------------------------

creating a LinksHtmlMapper : +1, that would be a nice intermediate between the default mapper and the identity mapper 

handling of links in mapper : mapSafeAttribute() returns a normalised representation of the attribute names that are allowed but does not affect the value of the attributes. Maybe we should change the method so that, given a name/value pair, it returns BOTH the normalised name (or null if the attribute must be skipped) and the corresponding normalised value (e.g. the resolved URL). The mapper implementation could then manage the resolution of the URLs internally. This would also be useful for normalising the names and values of elements in the header such as http-equiv.
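A minimal sketch of that idea, using only the JDK. The class name, the three-argument signature, and the choice of Map.Entry as the return type are assumptions made for illustration; none of this is the HtmlMapper interface as it exists in Tika.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Locale;
import java.util.Map;

// Hypothetical variant of mapSafeAttribute(); names and signature are
// illustrative and do not match the HtmlMapper interface in Tika.
final class NameValueMapper {
    private final URI base;

    NameValueMapper(String baseUrl) {
        this.base = URI.create(baseUrl);
    }

    /** Returns the normalised (name, value) pair, or null to skip the attribute. */
    Map.Entry<String, String> mapSafeAttribute(String element, String name, String value) {
        String el = element.toLowerCase(Locale.ROOT);
        String attr = name.toLowerCase(Locale.ROOT);
        if (el.equals("a") && attr.equals("href")) {
            return new SimpleImmutableEntry<>(attr, resolve(value));
        }
        if (el.equals("meta") && attr.equals("http-equiv")) {
            return new SimpleImmutableEntry<>(attr, canonicalHeader(value));
        }
        return null; // everything else is dropped in this sketch
    }

    // Resolves a possibly relative URL against the base URL.
    private String resolve(String url) {
        try {
            return base.resolve(url.trim()).toString();
        } catch (IllegalArgumentException e) {
            return url; // leave unparseable URLs untouched
        }
    }

    // Turns e.g. "content-type" into "Content-Type".
    private static String canonicalHeader(String header) {
        StringBuilder sb = new StringBuilder();
        for (String part : header.trim().toLowerCase(Locale.ROOT).split("-")) {
            if (sb.length() > 0) {
                sb.append('-');
            }
            if (!part.isEmpty()) {
                sb.append(Character.toUpperCase(part.charAt(0))).append(part.substring(1));
            }
        }
        return sb.toString();
    }
}
```

Returning null for the whole pair keeps the existing "skip this attribute" convention while letting the mapper own value normalisation.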

HtmlParser as an abstract class : what about following Jukka's suggestion for Handlers in https://issues.apache.org/jira/browse/TIKA-458 and have a Factory?

As for frames, it raises another issue (see https://issues.apache.org/jira/browse/TIKA-457) which is that anything outside <body> and <head> is currently discarded by the HtmlMapper. This is why I considered doing TIKA-458 but maybe we could make the HtmlHandler more generic and delegate the decisions to the mappers, e.g. by adding a method isBody().

The body level is currently used to:
a) distinguish the elements in the header
b) determine where characters should be added to the text of the document

Do we really need (a)? Are elements such as LINK, BASE or META found anywhere outside the HEAD? Should mapSafeElement() take into account the path of an element as well e.g. to allow a <link> only if it has <head> for parent?
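To make use (b) concrete, here is a small stand-alone SAX handler that counts nesting depth inside <body> and only collects characters while that depth is positive, yet records every element regardless. It is a sketch under assumptions: the class and field names are invented, the input is assumed to be well-formed XML, and it is not the HtmlHandler code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: text is kept only inside <body>, elements always.
final class BodyLevelHandler extends DefaultHandler {
    private int bodyLevel = 0;
    final StringBuilder text = new StringBuilder();
    final List<String> elements = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, then count every nested element.
        if (bodyLevel > 0 || "body".equalsIgnoreCase(qName)) {
            bodyLevel++;
        }
        elements.add(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyLevel > 0) {
            bodyLevel--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyLevel > 0) {
            text.append(ch, start, length);
        }
    }

    // Parses a well-formed XHTML string with the JDK's default SAX parser.
    static BodyLevelHandler parse(String xhtml) {
        BodyLevelHandler handler = new BodyLevelHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```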






[jira] Resolved: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-463.
------------------------------

    Fix Version/s: 0.8
       Resolution: Fixed

SVN 986348. With this commit, I'm going to resolve this issue. It's not perfect yet, but feels close enough for now. Issues I ran into and comments in general:

* The <applet> element isn't supported, and can contain URLs
* The <object> element has a codebase="xxx" attribute that defines the base URL for the "classid" URL, but that isn't getting special handling.
* The <object> element has a urllist="url, url" attribute that can contain one or more space-separated URLs, but I'm ignoring it.
* The DefaultHtmlMapper doesn't pass through all valid XHTML 1.0 elements or their attributes, but that's a topic for another issue.
* No checks are done for required attributes, or restrictions on values of attributes. For example, the <img> element must have an alt="xxx" attribute.
* TagSoup adds some of the required attributes, so you can now get output that has attributes (with default values) that didn't exist in the source HTML.
* The HtmlParserTest code should be using XPath expressions to validate output, versus string patterns.
* It would be good to use a validating parser (for XHTML 1.0) to double-check the output from all tests.
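As an illustration of the XPath point in the list above, the following sketch checks parser output with an XPath expression instead of a string pattern. It uses only the JDK's javax.xml APIs. The helper name and the sample markup are assumptions, and the sample deliberately has no DOCTYPE or namespace declaration so that the default, non-namespace-aware DOM parser can be used as-is.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical test helper: evaluates a numeric XPath expression
// (for example a count(...)) against an XML string.
final class XPathOutputCheck {
    static double number(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Double) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NUMBER);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A test could then assert that number(output, "count(//img[@src='http://example.com/a.png'])") equals 1, which does not depend on attribute order or quoting the way a string match does.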



[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899528#action_12899528 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

Hi Julien,

As per Jukka's suggestion, the XHTMLContentHandler puts everything it finds in metadata into the <head> block as <meta> elements. This way it also helps out non-HTML parsers.

One issue this causes, though, is that it exposes an existing issue with how HtmlHandler treats the http-equiv meta tag. This gets mapped to a <meta name="Content-Type" xxx> element, but you can also have an existing <meta name="content-type" xxx> element. Makes me think we should treat metadata keys as case-insensitive, to avoid this issue. And/or remap "well-known" keys to their correct capitalization.
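One way to picture the case-insensitive idea is a map ordered by String.CASE_INSENSITIVE_ORDER plus a small table of canonical spellings. This is only a sketch: the class is hypothetical, the list of "well-known" keys is an illustrative subset, and it does not reflect how Tika's Metadata class is implemented.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical metadata store with case-insensitive keys.
final class CaseInsensitiveMetadata {
    // Canonical capitalization for a few well-known keys (illustrative subset).
    private static final Map<String, String> WELL_KNOWN =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    static {
        for (String key : new String[] {"Content-Type", "Content-Language", "Content-Location"}) {
            WELL_KNOWN.put(key, key);
        }
    }

    private final Map<String, String> values = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void set(String key, String value) {
        // Well-known keys are stored under their canonical spelling;
        // for other keys the first spelling seen is the one that is kept.
        values.put(WELL_KNOWN.getOrDefault(key, key), value);
    }

    String get(String key) {
        return values.get(key);
    }

    Set<String> names() {
        return values.keySet();
    }
}
```

With this, setting "content-type" and then "Content-Type" leaves a single entry named "Content-Type" holding the later value.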

-- Ken



[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-463:
-----------------------------

    Attachment: TIKA-463-2.patch

Fixed a problem with passing through null values from Metadata entries as <meta> elements, as this causes some SAX processing code to throw an NPE.

Also improved the test for broken HTML with a <frameset> element inside of a <body> element.



[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-463:
-----------------------------

    Attachment: TIKA-463-3.patch



[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899465#action_12899465 ] 

Julien Nioche commented on TIKA-463:
------------------------------------

Looks good. I must be missing something obvious but I can't work out where an element like META is sent to the XHTML output. That wasn't the case before as far as I can remember, and this can't be in HtmlHandler, as it imposes the constraints I described earlier, i.e. it used to simply put the info in the metadata. Ken, would you mind giving me a hint?




[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898321#action_12898321 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

Added support for <frame> elements in SVN 985288, which is the patch for [TIKA-457].



[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892958#action_12892958 ] 

Julien Nioche commented on TIKA-463:
------------------------------------

Am very tempted to push things one step further and delegate the startElement() and endElement() to the mappers so that users can do whatever they fancy in their custom mapper implementations. In that case we'd probably not need mapSafeElement and mapSafeAttribute any longer. The patch above gives the mappers access to the metadata.

For example, <a> elements get special treatment in the HtmlHandler and we currently can't get the rel attribute from <a href="http://www.nutch.org" rel="nofollow">, which for a crawler is quite an embarrassment. By delegating the logic to the mappers instead, we get total control over what can be done while remaining able to keep the existing behaviour by default.

Any reason not to delegate start/endElement to the mappers? It would be good to get some feedback on this, as I really need to improve the handling of HTML for Nutch :-)
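A rough sketch of what such delegation could look like, using plain SAX from the JDK. Everything here is hypothetical: the ElementMapper interface, the class names, and the follow/nofollow lists are invented for illustration, and the input is assumed to be well-formed XML (in Tika, TagSoup would normally have cleaned up the HTML first).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical callback that the handler delegates to.
interface ElementMapper {
    void startElement(String name, Attributes atts);
}

// A custom mapper that sees all attributes of <a>, including rel.
final class NoFollowAwareMapper implements ElementMapper {
    final List<String> followable = new ArrayList<>();
    final List<String> noFollow = new ArrayList<>();

    @Override
    public void startElement(String name, Attributes atts) {
        if (!"a".equalsIgnoreCase(name)) {
            return;
        }
        String href = atts.getValue("href");
        if (href == null) {
            return;
        }
        // rel is a whitespace-separated list of link types.
        String rel = atts.getValue("rel");
        boolean isNoFollow = rel != null
                && Arrays.asList(rel.trim().toLowerCase(Locale.ROOT).split("\\s+"))
                        .contains("nofollow");
        (isNoFollow ? noFollow : followable).add(href);
    }
}

// The handler keeps no element-specific logic; it only forwards.
final class DelegatingHandler extends DefaultHandler {
    private final ElementMapper mapper;

    DelegatingHandler(ElementMapper mapper) {
        this.mapper = mapper;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        mapper.startElement(qName, atts);
    }

    static NoFollowAwareMapper collectLinks(String xhtml) {
        NoFollowAwareMapper mapper = new NoFollowAwareMapper();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new DelegatingHandler(mapper));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return mapper;
    }
}
```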



[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892996#action_12892996 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

I like the idea of being able to encapsulate all special processing into one easily extensible class.

I'm trying to come to grips with what things should be done in HtmlParser vs. HtmlHandler vs. HtmlMapper.

Since most of what we're talking about is moving code from HtmlHandler to HtmlMapper, I agree that trying to provide as much control as possible to HtmlMapper (which can be overridden) makes sense. But when I look at what would be left in HtmlHandler, it's not clear to me that we'd even need that class anymore. But I'd need to spend more time thinking about things like why HtmlHandler is subclassing TextContentHandler vs. DefaultHandler.

In summary, it feels like we're heading down a path where HtmlHandler is the extension point (there is no HtmlMapper), and it should have some methods (beyond the std ContentHandler methods) that can be overridden to adjust behavior. Otherwise it would be this very thin shim, without much value, that just adds complexity to the calling chain.





[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890001#action_12890001 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

Hi Julien,

Thanks for the patch! I'm on vacation, but will review it when I'm back.

-- Ken




[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887523#action_12887523 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

The other issue I've run into with HtmlMapper is that it seems impossible currently to have it do the right thing for remapping URLs, even if I create my own custom implementation of that interface.

The problem is that you specify the mapper via ParseContext(HtmlMapper.class, my-custom-code.class). So this means my-custom-code gets instantiated via a no-args constructor, and it doesn't have access to the metadata, so it doesn't know the base URL to use for normalizing URLs.

If I could, I'd change HtmlParser to be an abstract class, and have a constructor that takes Metadata and ParseContext arguments. And give it a "resolveUrl()" method that the mapSafeAttribute() method could use, versus baking that into HtmlHandler.
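The shape being described might look roughly like the sketch below: an abstract base class that receives the base URL when it is constructed and offers a resolveUrl() helper to subclasses. The class names are invented, and passing the base URL as a plain string is an assumption; the comment above proposes passing Metadata and ParseContext, from which the base URL would be read.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical base class, not the Tika API: it is handed the base URL
// up front so that subclasses can resolve relative links.
abstract class ContextAwareMapper {
    private final URI base; // null when no usable base URL was supplied

    protected ContextAwareMapper(String baseUrl) {
        this.base = toUri(baseUrl);
    }

    private static URI toUri(String url) {
        if (url == null) {
            return null;
        }
        try {
            return new URI(url.trim());
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** Resolves url against the base URL; returns it unchanged if that is not possible. */
    String resolveUrl(String url) {
        URI reference = toUri(url);
        if (base == null || reference == null) {
            return url;
        }
        return base.resolve(reference).toString();
    }

    abstract String mapSafeElement(String name);
}

// Trivial subclass, present only so the sketch can be instantiated.
final class LowerCasingMapper extends ContextAwareMapper {
    LowerCasingMapper(String baseUrl) {
        super(baseUrl);
    }

    @Override
    String mapSafeElement(String name) {
        return name.toLowerCase(Locale.ROOT);
    }
}
```

Because the mapper would then need constructor arguments, it could no longer be created through a no-args constructor, which is the constraint the comment above runs into.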



[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898021#action_12898021 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

I committed a change (985052) that will emit <img> elements with resolved src=<url> attributes. This gets me past a roadblock, so I'm going to hold off a bit on any additional changes. I'd like to do something more along the lines of what Julien is proposing, but it feels like too big of a bite for me right now.



[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887515#action_12887515 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

After looking at this a bit more, it seems like the main issue is whether the default behavior should be to return all valid XHTML 1.0 strict elements, or only those that can have text inside.

The latter behavior is what's currently implemented.

I can see arguments both ways. Returning (for example) an <img> element would remove some unpleasant surprises from users of Tika (like Nutch) that assume they're getting back all of the "important" HTML elements, but it would mean extra elements being generated that have no text, for the typical Tika user who only cares about text.

I'm leaning towards (a) creating a LinksHtmlMapper that handles all of the potentially link-containing elements, and (b) modifying the HtmlMapper interface to support proper resolution of relative URLs. This would move code from HtmlHandler.startElement (what currently handles the "a" element) into HtmlMapper. Not sure if I can also then change HtmlMapper into an abstract base class that supports relative link handling, and has a constructor that takes in enough context to provide for this.
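For option (a), the relationship between a text-oriented default mapper and a links-aware mapper could be sketched as follows. The interface is a simplified stand-in with a single method, the class names are invented, and the set of "text" elements is an arbitrary illustrative subset; only the list of link-bearing elements comes from this issue's title.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified stand-in for a mapper interface: returns the normalised
// element name, or null when the element should not be emitted.
interface SimpleElementMapper {
    String mapSafeElement(String name);
}

// Passes through only elements that typically carry text.
class TextOnlyMapper implements SimpleElementMapper {
    private static final Set<String> TEXT_ELEMENTS = new HashSet<>(
            Arrays.asList("p", "h1", "h2", "a", "ul", "li", "table", "tr", "td"));

    @Override
    public String mapSafeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return TEXT_ELEMENTS.contains(lower) ? lower : null;
    }
}

// Additionally passes through the link-bearing elements named in this issue.
class LinksMapper extends TextOnlyMapper {
    private static final Set<String> LINK_ELEMENTS = new HashSet<>(
            Arrays.asList("img", "map", "object", "frame", "iframe", "area", "link"));

    @Override
    public String mapSafeElement(String name) {
        String mapped = super.mapSafeElement(name);
        if (mapped != null) {
            return mapped;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        return LINK_ELEMENTS.contains(lower) ? lower : null;
    }
}
```

Users who only care about text would keep the default, while a crawler could opt in to the links-aware variant.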

I'm going to wait a bit for comments.



[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-463:
-----------------------------

    Attachment: TIKA-463-1.patch

Simple patch that does a few things...

- Clean up mapping of tag names & attributes
- Support for <img> element




[jira] Updated: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-463:
-----------------------------------

    Component/s: parser

- classify the component



[jira] Commented: (TIKA-463) HtmlParser doesn't extract links from img, map, object, frame, iframe, area, link

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892845#action_12892845 ] 

Ken Krugler commented on TIKA-463:
----------------------------------

Logging some notes I made on different attributes used for URLs, based on element name:

attribute - elements

href - base, link, a, area
src - script, img, input
cite - blockquote, q, ins, del
data - object
longdesc - img
usemap - img, input
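The notes above can be turned into a lookup keyed by element name, which is the direction a mapper would need at parse time. This is a sketch only: the class and method names are hypothetical, and the table contains exactly the attribute/element pairs listed above, nothing more.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika: inverts the "attribute - elements"
// notes into an element -> URL attributes lookup.
final class UrlAttributes {
    private static final Map<String, Set<String>> BY_ELEMENT = new HashMap<>();

    private static void register(String attribute, String... elements) {
        for (String element : elements) {
            BY_ELEMENT.computeIfAbsent(element, k -> new LinkedHashSet<>()).add(attribute);
        }
    }

    static {
        register("href", "base", "link", "a", "area");
        register("src", "script", "img", "input");
        register("cite", "blockquote", "q", "ins", "del");
        register("data", "object");
        register("longdesc", "img");
        register("usemap", "img", "input");
    }

    /** Returns the attributes of the given element that may hold a URL. */
    static Set<String> forElement(String element) {
        Set<String> attributes = BY_ELEMENT.get(element.toLowerCase(Locale.ROOT));
        return attributes == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(attributes);
    }

    /** Case-insensitive check for whether an attribute of an element may hold a URL. */
    static boolean isUrlAttribute(String element, String attribute) {
        return forElement(element).contains(attribute.toLowerCase(Locale.ROOT));
    }
}
```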

