Posted to dev@tika.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2013/06/11 21:00:22 UTC

[jira] [Comment Edited] (TIKA-1134) ContentHandler gets ignorable whitespace for <br>
tags when parsing HTML

    [ https://issues.apache.org/jira/browse/TIKA-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13680544#comment-13680544 ] 

Hoss Man edited comment on TIKA-1134 at 6/11/13 7:00 PM:
---------------------------------------------------------

-patch includes a test demonstrating the problem in Solr, and an example of how we could work around this in SolrContentHandler -- but I don't think the workaround is a good idea ... not w/o a lot more careful thought about how all that extra ignorable whitespace might affect people (not just from HTML docs, but from any other types of docs where Tika produces ignorable whitespace SAX events)-

Sorry .. accidentally attached a file here that was meant for SOLR-4679
                
      was (Author: hossman):
    patch includes a test demonstrating the problem in Solr, and an example of how we could work around this in SolrContentHandler -- but I don't think the workaround is a good idea ... not w/o a lot more careful thought about how all that extra ignorable whitespace might affect people (not just from HTML docs, but from any other types of docs where Tika produces ignorable whitespace SAX events)
                  
> ContentHandler gets ignorable whitespace for <br> tags when parsing HTML
> ------------------------------------------------------------------------
>
>                 Key: TIKA-1134
>                 URL: https://issues.apache.org/jira/browse/TIKA-1134
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: TIKA-1134.patch
>
>
> I'm not very knowledgeable about Tika, so it's possible I'm misunderstanding something here, but it appears that the way Tika parses HTML to produce XHTML SAX events is misinterpreting "<br>" tags as equivalent to ignorable whitespace containing a newline.  This means that clients who ask Tika to parse files and specify their own ContentHandler to capture the character data can get sequences of run-on text w/o knowing that the "<br>" tag was present -- _unless_ they explicitly handle ignorableWhitespace and treat it as "real" whitespace -- but this creates a catch-22 if you really do want to ignore the ignorable whitespace in the HTML markup.
> The crux of the problem seems to be:
>  * instead of generating a startElement event for "br", the HtmlParser treats it as an xhtml.newline() call.
>  * xhtml.newline() generates an ignorableWhitespace SAX event instead of a characters SAX event.
> ...either one of these by itself might be fine, but in combination they don't really make any sense.  If, for example, an actual newline exists in the HTML, it comes across as part of a characters SAX event, not as ignorable whitespace.
> Changing the newline() function to delegate to characters(...) seems to solve the problem for <br> tags in HTML, but breaks several tests -- probably because the newline() function is also used to intentionally add (synthetic) ignorableWhitespace events after elements.
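
The run-on text described in the issue can be illustrated without Tika at all. Below is a minimal sketch using only the JDK's org.xml.sax classes. The names BrWhitespaceDemo and TextCollector are hypothetical, and the three events are hand-written stand-ins for what the description says is emitted for "line one<br>line two"; they are not output captured from Tika's HtmlParser.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BrWhitespaceDemo {

    /** Collects character data; optionally treats ignorable whitespace as real text. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean keepIgnorable;

        TextCollector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Dropped by default, which is what a typical text-capturing client does.
            if (keepIgnorable) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /**
     * Simulates the event sequence the issue describes for
     * "line one<br>line two" (assumption: hand-written, not produced by Tika).
     */
    static String collect(boolean keepIgnorable) throws SAXException {
        TextCollector handler = new TextCollector(keepIgnorable);
        char[] first = "line one".toCharArray();
        char[] newline = "\n".toCharArray();
        char[] second = "line two".toCharArray();

        handler.startDocument();
        handler.characters(first, 0, first.length);
        // The <br> arrives as ignorable whitespace, not as an element or characters.
        handler.ignorableWhitespace(newline, 0, newline.length);
        handler.characters(second, 0, second.length);
        handler.endDocument();
        return handler.text();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println("[" + collect(false) + "]");
        System.out.println("[" + collect(true) + "]");
    }
}
```

With keepIgnorable set to false the two lines run together; with it set to true the line break survives, but so would every other ignorableWhitespace event the handler receives, which is the catch-22 described above.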
