Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/10/16 16:53:31 UTC
[jira] Resolved: (TIKA-304) HtmlParser could be easier to subclass

     [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-304.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

I solved this a bit differently in revision 825915, where I added the following protected methods to HtmlParser. These should give subclasses a way to customize the key HTML-to-XHTML mappings currently implemented in the HTML parser. Note that a future version may deprecate and remove these methods if the internals of the HTML parser change dramatically (e.g. if we implement the discussed CSS/JavaScript processing features).

    /**
     * Maps "safe" HTML element names to semantic XHTML equivalents. If the
     * given element is unknown or deemed unsafe for inclusion in the parse
     * output, then this method returns <code>null</code> and the element
     * will be ignored but the content inside it is still processed. See
     * the {@link #isDiscardElement(String)} method for a way to discard
     * the entire contents of an element.
     * <p>
     * Subclasses can override this method to customize the default mapping.
     *
     * @since Apache Tika 0.5
     * @param name HTML element name (upper case)
     * @return XHTML element name (lower case), or
     *         <code>null</code> if the element is unsafe 
     */
    protected String mapSafeElement(String name)

    /**
     * Checks whether all content within the given HTML element should be
     * discarded instead of including it in the parse output. Subclasses
     * can override this method to customize the set of discarded elements.
     *
     * @since Apache Tika 0.5
     * @param name HTML element name (upper case)
     * @return <code>true</code> if content inside the named element
     *         should be ignored, <code>false</code> otherwise
     */
    protected boolean isDiscardElement(String name)
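
As a rough illustration of how a subclass might use these two hooks, here is a minimal, self-contained sketch. It does not use Tika itself: SketchHtmlParser is a hypothetical stand-in that only mirrors the two method signatures above, and its default mappings (P, H1, SCRIPT, STYLE) are assumptions made for the example, not the parser's real defaults. The describe helper and the DivKeepingParser name are likewise invented for illustration.

```java
// Sketch only: SketchHtmlParser is a hypothetical stand-in, not Tika's HtmlParser.
class HtmlParserSubclassSketch {

    /** Stand-in base class exposing the two hooks with assumed default behavior. */
    static class SketchHtmlParser {
        protected String mapSafeElement(String name) {
            // Assumed defaults for illustration; the real defaults may differ.
            if ("P".equals(name)) return "p";
            if ("H1".equals(name)) return "h1";
            return null; // unknown element: the tag is ignored, its content is kept
        }

        protected boolean isDiscardElement(String name) {
            // Assumed defaults for illustration.
            return "SCRIPT".equals(name) || "STYLE".equals(name);
        }
    }

    /** Subclass that also keeps DIV elements and discards NOSCRIPT content. */
    static class DivKeepingParser extends SketchHtmlParser {
        @Override
        protected String mapSafeElement(String name) {
            if ("DIV".equals(name)) return "div";
            return super.mapSafeElement(name); // fall back to the default mapping
        }

        @Override
        protected boolean isDiscardElement(String name) {
            return "NOSCRIPT".equals(name) || super.isDiscardElement(name);
        }
    }

    /** Describes how a parser would treat an upper-case HTML element name. */
    static String describe(SketchHtmlParser parser, String name) {
        if (parser.isDiscardElement(name)) return "discard element and content";
        String mapped = parser.mapSafeElement(name);
        return mapped != null ? "emit <" + mapped + ">" : "drop tag, keep content";
    }

    public static void main(String[] args) {
        SketchHtmlParser base = new SketchHtmlParser();
        SketchHtmlParser custom = new DivKeepingParser();
        for (String name : new String[] {"P", "DIV", "SCRIPT", "NOSCRIPT"}) {
            System.out.println(name + ": base=" + describe(base, name)
                    + ", custom=" + describe(custom, name));
        }
    }
}
```

Against the real class, the subclass would presumably extend HtmlParser directly and override the same two methods; delegating to super keeps the default behavior for every element the subclass does not care about.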

> HtmlParser could be easier to subclass
> --------------------------------------
>
>                 Key: TIKA-304
>                 URL: https://issues.apache.org/jira/browse/TIKA-304
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.4, 0.5
>            Reporter: Benson Margulies
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>         Attachments: html-parser-subclass.diff
>
>
> It would be nice if one could subclass HtmlParser to change what it passes along, instead of having to copy it. I'll attach a first effort.
> It would also be good if attributes could be preserved (particularly id attributes) but let's see how you like my first patch.
