Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2007/10/10 21:16:50 UTC

[jira] Created: (TIKA-53) XHTML SAX events from parsers

XHTML SAX events from parsers
-----------------------------

                 Key: TIKA-53
                 URL: https://issues.apache.org/jira/browse/TIKA-53
             Project: Tika
          Issue Type: Improvement
          Components: general
            Reporter: Jukka Zitting
            Assignee: Jukka Zitting
             Fix For: 0.1-incubator


Tika parsers should produce a sequence of XHTML SAX events instead of a single unstructured String as the parsed document content.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (TIKA-53) XHTML SAX events from parsers

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-53.
-------------------------------

    Resolution: Fixed

Committed the proposed patch with slight modifications in revision 584092.

> XHTML SAX events from parsers
> -----------------------------
>
>                 Key: TIKA-53
>                 URL: https://issues.apache.org/jira/browse/TIKA-53
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 0.1-incubator
>
>         Attachments: TIKA-53.patch
>
>
> Tika parsers should produce a sequence of XHTML SAX events instead of a single unstructured String as the parsed document content.



[jira] Updated: (TIKA-53) XHTML SAX events from parsers

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-53?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-53:
------------------------------

    Attachment: TIKA-53.patch

The attached patch (TIKA-53.patch) is my first shot at this.

Most of the parsers simply take the String that they used to produce before and wrap it in the following sequence of SAX events:

    <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
            <title>...</title>
        </head>
        <body>
            <p>...</p>
        </body>
    </html>

The only exception for now is the HTMLParser (surprise!), which uses the XHTML output from Tidy.
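
As a rough illustration of the wrapping described above (this is not the code from TIKA-53.patch), the event sequence could be emitted along these lines. The class name XhtmlWrapSketch and the methods emit() and trace() are hypothetical names made up for this sketch; only the standard org.xml.sax API from the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the actual Tika implementation.
class XhtmlWrapSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Emits the html/head/title/body/p skeleton around already extracted text. */
    static void emit(ContentHandler handler, String title, String text) throws SAXException {
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        start(handler, "html");
        start(handler, "head");
        start(handler, "title");
        chars(handler, title);
        end(handler, "title");
        end(handler, "head");
        start(handler, "body");
        start(handler, "p");
        chars(handler, text);
        end(handler, "p");
        end(handler, "body");
        end(handler, "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    private static void start(ContentHandler handler, String name) throws SAXException {
        // No attributes are needed for this skeleton.
        handler.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler handler, String name) throws SAXException {
        handler.endElement(XHTML, name, name);
    }

    private static void chars(ContentHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    /** Records element and text events in order so the sequence can be inspected. */
    static List<String> trace(String title, String text) {
        List<String> events = new ArrayList<>();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("<" + local + ">");
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("</" + local + ">");
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add(new String(ch, start, length));
            }
        };
        try {
            emit(recorder, title, text);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(String.join("", trace("Example", "Hello")));
    }
}
```

Any ContentHandler can be passed to emit(), so the same events could go to a serializer or an indexer instead of the recording handler used here.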

The TXTParser class is also slightly more advanced, as it avoids reading the full document into memory (assuming ICU4J doesn't do that itself). Instead, it reads the character stream in small batches and uses the characters() SAX event to feed that stream to the given ContentHandler.
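
The batching idea can be sketched as below, assuming the input has already been decoded into a java.io.Reader (the ICU4J charset detection step is left out, as are the surrounding element events). StreamingTextSketch, stream() and roundTrip() are hypothetical names for this sketch and do not come from the patch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the actual TXTParser code.
class StreamingTextSketch {
    /**
     * Copies the reader to the handler in batches of at most batchSize
     * characters, so only one small buffer is held in memory at a time.
     */
    static void stream(Reader reader, ContentHandler handler, int batchSize)
            throws IOException, SAXException {
        char[] buffer = new char[batchSize];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            if (n > 0) {
                handler.characters(buffer, 0, n);
            }
        }
    }

    /** Streams a string through a collecting handler and returns what arrived. */
    static String roundTrip(String input, int batchSize) {
        StringBuilder collected = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                collected.append(ch, start, length);
            }
        };
        try {
            stream(new StringReader(input), collector, batchSize);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collected.toString();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("some plain text content", 4));
    }
}
```

The handler must consume or copy each batch before characters() returns, because the buffer is reused for the next read.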

