Posted to dev@tika.apache.org by "Benson Margulies (JIRA)" <ji...@apache.org> on 2009/10/09 14:27:31 UTC

[jira] Created: (TIKA-304) HtmlParser could be easier to subclass

HtmlParser could be easier to subclass
--------------------------------------

                 Key: TIKA-304
                 URL: https://issues.apache.org/jira/browse/TIKA-304
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.4, 0.5
            Reporter: Benson Margulies
         Attachments: html-parser-subclass.diff

It would be nice if one could subclass HtmlParser to change what it passes along, instead of having to copy it. I'll attach a first effort.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-304) HtmlParser could be easier to subclass

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764087#action_12764087 ] 

Ken Krugler commented on TIKA-304:
----------------------------------

A few comments on this:

1. I think it's an improvement, not a bug :)

2. I agree that it would be great to be able to alter the behavior of HtmlParser. Making subclassing easier is one approach; another might be the ability (IoC model) to specify a different content handler.

3. Preserving attributes is very important - I had a todo on my list to file an issue about this. E.g. with links, there can be attributes like the target content language that you want to preserve.

4. I have some mods for HtmlParser that I need to turn into issues/patches, e.g. link extraction from <img>, <link>, etc. tags. But I'd hate to put Jukka into n-way merge hell, so I might wait for this patch to get rolled in first.




[jira] Updated: (TIKA-304) HtmlParser could be easier to subclass

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated TIKA-304:
----------------------------------

    Description: 
It would be nice if one could subclass HtmlParser to change what it passes along, instead of having to copy it. I'll attach a first effort.

It would also be good if attributes could be preserved (particularly id attributes) but let's see how you like my first patch.



  was:
It would be nice if one could subclass HtmlParser to change what it passes along, instead of having to copy it. I'll attach a first effort.






[jira] Updated: (TIKA-304) HtmlParser could be easier to subclass

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated TIKA-304:
----------------------------------

    Attachment: html-parser-subclass.diff




[jira] Resolved: (TIKA-304) HtmlParser could be easier to subclass

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-304.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

I solved this a bit differently in revision 825915, where I added the following protected methods to HtmlParser. These should give subclasses a way to customize the key HTML-to-XHTML mappings currently implemented in the HTML parser. Note that in some future version we may want to deprecate and remove these methods if the internals of the HTML parser change dramatically (e.g. if we implement the discussed CSS/JavaScript processing features).

    /**
     * Maps "safe" HTML element names to semantic XHTML equivalents. If the
     * given element is unknown or deemed unsafe for inclusion in the parse
     * output, then this method returns <code>null</code> and the element
     * will be ignored but the content inside it is still processed. See
     * the {@link #isDiscardElement(String)} method for a way to discard
     * the entire contents of an element.
     * <p>
     * Subclasses can override this method to customize the default mapping.
     *
     * @since Apache Tika 0.5
     * @param name HTML element name (upper case)
     * @return XHTML element name (lower case), or
     *         <code>null</code> if the element is unsafe 
     */
    protected String mapSafeElement(String name)

    /**
     * Checks whether all content within the given HTML element should be
     * discarded instead of including it in the parse output. Subclasses
     * can override this method to customize the set of discarded elements.
     *
     * @since Apache Tika 0.5
     * @param name HTML element name (upper case)
     * @return <code>true</code> if content inside the named element
     *         should be ignored, <code>false</code> otherwise
     */
    protected boolean isDiscardElement(String name)
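
As a rough sketch of how a subclass might use the two hooks above: the SketchHtmlParser base class below is only a stand-in so that the example compiles without Tika on the classpath. Its default mappings, the decide() helper, and the DivKeepingParser name are all assumptions made for illustration, not code from revision 825915. Against the real parser, the subclass would extend HtmlParser instead.

```java
// Stand-in for the real parser. Only the two protected hooks are modelled,
// and the default mappings here are assumed, not Tika's actual tables.
class SketchHtmlParser {

    // Returns the XHTML name for a "safe" element, or null if the tag
    // should be dropped while its content is still processed.
    protected String mapSafeElement(String name) {
        if ("P".equals(name)) return "p";
        if ("H1".equals(name)) return "h1";
        return null;
    }

    // Returns true if everything inside the element should be discarded.
    protected boolean isDiscardElement(String name) {
        return "STYLE".equals(name) || "SCRIPT".equals(name);
    }

    // Hypothetical helper showing how a parser could consult the hooks
    // when it sees a start tag (names arrive in upper case).
    public String decide(String name) {
        if (isDiscardElement(name)) return "discard";
        String mapped = mapSafeElement(name);
        return mapped == null ? "skip tag, keep content" : "emit <" + mapped + ">";
    }
}

// Hypothetical subclass: passes DIV through and also discards NOSCRIPT,
// deferring to the defaults for everything else.
class DivKeepingParser extends SketchHtmlParser {

    @Override
    protected String mapSafeElement(String name) {
        if ("DIV".equals(name)) return "div";
        return super.mapSafeElement(name);
    }

    @Override
    protected boolean isDiscardElement(String name) {
        return "NOSCRIPT".equals(name) || super.isDiscardElement(name);
    }
}
```

Calling super for the names the subclass does not handle keeps the default behaviour intact, so the override only has to list the differences.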




[jira] Commented: (TIKA-304) HtmlParser could be easier to subclass

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764088#action_12764088 ] 

Benson Margulies commented on TIKA-304:
---------------------------------------

Oops. I didn't notice that this project has a full set of issue types. I'm used to CXF, which just has bug and story.



