Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2008/11/15 18:09:49 UTC

[jira] Updated: (TIKA-171) New ContentHandler for plain text output that has no problem with missing white space after XHTML block tags

     [ https://issues.apache.org/jira/browse/TIKA-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-171:
-------------------------------

    Attachment: TIKA-171.patch

Patch with the new ContentHandler and modified tests.

> New ContentHandler for plain text output that has no problem with missing white space after XHTML block tags
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-171
>                 URL: https://issues.apache.org/jira/browse/TIKA-171
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.2-incubating
>            Reporter: Uwe Schindler
>         Attachments: TIKA-171.patch
>
>
> One problem with mapping document content to plain text is incorrect whitespace handling:
> The normal way to parse documents to plain text is to instantiate a parser and pass its SAX events to a BodyContentHandler(TextContentHandler(Writer)). This appends all output to a Writer (see the example on the web site).
> This works well for simple parsers that just create a single <p> tag in the XHTML output with all content of the document in it (including newlines).
> As soon as a more intelligent parser is used (e.g. the HTML parser) that creates multiple nodes and a feature-rich XHTML document, the problems begin. The TextContentHandler just strips all tags and forwards only characters() events to the Writer. This fails when the original document (e.g. an HTML document) does not contain additional whitespace and linefeeds: it is valid and possible to create an XHTML document with all content on one line of text but consisting of several paragraphs. In this case the </p><p> events between paragraphs are stripped and there is no whitespace left between the two paragraphs.
> My patch contains a new XHTMLToTextContentHandler that checks the elements and inserts whitespace into the output depending on the XHTML tag type. HTML block tags like <p/> get a newline at the end, but HTML inline tags do not add whitespace. This mapping is done with a simple Set<String> of tag names extracted from the XHTML 1.0 spec. To make it even better, tables are printed out with whitespace and tabs between cells.
> With this patch, I am able to correctly index a lot of documents with Lucene.
> The patch also changes some tests to correctly check for the '\n' at the end of plain text streams (which are included because of the single <p>-paragraph around plain text).
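
The whitespace problem and the block-tag approach described above can be illustrated with plain SAX from the JDK. This is only a minimal sketch and not the code in TIKA-171.patch: the class name BlockAwareTextSketch, both handler classes, the toText helper, and the short BLOCK_TAGS list are assumptions made for illustration. The tag list is a small hand-picked subset, not the full set from the XHTML 1.0 spec, and the table handling (a tab after each cell) is a simplification.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BlockAwareTextSketch {

    // Assumption: a small illustrative subset of XHTML block-level tags.
    static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "li", "pre", "blockquote", "table", "tr"));

    /** Forwards only characters() events, so all tags vanish without a trace. */
    static class StrippingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Same as above, but emits whitespace when a block tag or table cell ends. */
    static class BlockAwareHandler extends StrippingHandler {
        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("td".equals(qName) || "th".equals(qName)) {
                out.append('\t');   // separate table cells with a tab
            } else if (BLOCK_TAGS.contains(qName)) {
                out.append('\n');   // block tags end with a newline
            }
            // inline tags such as <b> or <em> add nothing
        }
    }

    /** Parses well-formed XHTML with the JDK SAX parser and returns the collected text. */
    static String toText(String xhtml, StrippingHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        // All content on one line, two paragraphs, no whitespace between them.
        String doc = "<html><body><p>first <b>bold</b> word</p><p>second</p></body></html>";
        System.out.println(toText(doc, new StrippingHandler()));  // first bold wordsecond
        System.out.print(toText(doc, new BlockAwareHandler()));   // two lines of text
    }
}
```

With the stripping handler, "word" and "second" run together, which is the indexing problem described in the issue. The block-aware variant keeps them on separate lines, while the inline <b> element leaves the surrounding text untouched.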

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.