Posted to dev@tika.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2008/11/15 18:07:47 UTC

[jira] Created: (TIKA-171) New ContentHandler for plain text output that has no problem with missing white space after XHTML block tags

New ContentHandler for plain text output that has no problem with missing white space after XHTML block tags
------------------------------------------------------------------------------------------------------------

Key: TIKA-171
URL: https://issues.apache.org/jira/browse/TIKA-171
Project: Tika
Issue Type: Improvement
Components: general
Affects Versions: 0.2-incubating
Reporter: Uwe Schindler

One problem with mapping document content to plain text is incorrect whitespace handling:
The usual way to parse a document to plain text is to instantiate a parser and pass its SAX events to a BodyContentHandler(TextContentHandler(Writer)), which appends all output to the Writer (see the example on the web site).

This works well for dumb parsers that just create a single <p> tag in the XHTML output with all content of the document in it (including newlines).

As soon as a more intelligent parser is used (e.g. the HTML parser) that creates multiple nodes and a feature-rich XHTML document, the problems begin. The TextContentHandler just strips all tags away, and only characters() events are forwarded to the Writer. The original document (e.g. an HTML document) may not contain any additional whitespace or linefeeds: it is correct and possible to create an XHTML document with all content on one text line but consisting of several paragraphs. In this case the </p><p> events between paragraphs are stripped and there is no whitespace left between the two paragraphs.
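
The effect described above can be reproduced without Tika. The following is a minimal sketch using only the JDK SAX parser; the class name TextOnlyDemo and the method extract are made up for illustration and are not part of Tika or the patch. The handler forwards only characters() events, which is the behaviour attributed to TextContentHandler above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Tika.
class TextOnlyDemo {

    /** Parses XHTML and keeps only characters() events; all element events are ignored. */
    static String extract(String xhtml) throws Exception {
        StringWriter out = new StringWriter();
        DefaultHandler textOnly = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.write(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), textOnly);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two paragraphs on a single line, with no whitespace between the tags.
        String xhtml = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>";
        // Prints "First paragraph.Second paragraph." with the sentences run together.
        System.out.println(extract(xhtml));
    }
}
```

The two paragraphs end up glued together because nothing in the character stream marks the paragraph boundary.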

My patch contains a new XHTMLToTextContentHandler that checks the elements and inserts whitespace into the output depending on the XHTML tag type. HTML block tags like <p/> get a newline at the end, but HTML inline tags do not add whitespace. This mapping is done by a simple Set<String> of tag names extracted from the XHTML 1.0 spec. To make it even better, tables are printed with whitespace and tabs between cells.
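
The idea can be sketched roughly as follows. This is not the code from the attached patch: the class name BlockAwareTextDemo, the method extract, and the short list of block tags are assumptions chosen for illustration, and only the JDK SAX parser is used. A newline is written when a block element ends, and a tab is written after each table cell.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the real XHTMLToTextContentHandler may differ.
class BlockAwareTextDemo {

    // Illustrative subset of XHTML 1.0 block-level tags (an assumption, not the patch's list).
    static final Set<String> BLOCK_TAGS = new HashSet<>(Arrays.asList(
            "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "li", "pre", "blockquote", "tr"));

    /** Extracts text, adding '\n' after block elements and '\t' after table cells. */
    static String extract(String xhtml) throws Exception {
        StringWriter out = new StringWriter();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.write(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // The default JDK parser is not namespace-aware, so qName carries the tag name.
                if (BLOCK_TAGS.contains(qName)) {
                    out.write('\n');
                } else if ("td".equals(qName) || "th".equals(qName)) {
                    out.write('\t');
                }
                // Inline tags such as <b> or <span> add no whitespace.
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extract("<html><body><p>One <b>bold</b></p><p>Two</p></body></html>"));
    }
}
```

With this sketch the paragraphs "One bold" and "Two" land on separate lines, while the inline <b> element leaves the text untouched. A lookup in a Set keeps the per-element cost constant and makes the list of block tags easy to adjust.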

With this patch, I am able to correctly index a lot of documents with Lucene.

The patch also changes some tests to correctly check for the '\n' at the end of plain text streams (which is included because of the single <p>-paragraph around plain text).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TIKA-171) New ContentHandler for plain text output that has no problem with missing white space after XHTML block tags

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated TIKA-171:
-------------------------------

    Attachment: TIKA-171.patch

Patch with the new ContentHandler and modified tests.

