Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2016/10/05 12:49:21 UTC

[jira] [Commented] (NUTCH-2318) Text extraction in HtmlParser adds too much whitespace.

    [ https://issues.apache.org/jira/browse/NUTCH-2318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15548610#comment-15548610 ] 

Markus Jelsma commented on NUTCH-2318:
--------------------------------------

This is a known problem. It also affects the TikaParser, and our custom parser suffers from it as well. Some websites use spans to give the first character of an article greater height, and this indeed leads to bad output.

I have investigated this problem in our parser and could not come to a real resolution, although one possible solution might be to distinguish between inline and block elements when adding the space.
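
A rough sketch of that idea, to make it concrete. This is not the Nutch code: the class name BlockAwareText, the BLOCK set, and the helper methods are all hypothetical, the list of block elements is deliberately incomplete, and the JDK XML parser stands in for the HTML parser, so the input has to be well-formed. A separator is added only at block element boundaries, never around inline elements such as em or span.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class BlockAwareText {

  // Assumption: a small, incomplete set of elements treated as block-level.
  private static final Set<String> BLOCK = Set.of(
      "p", "div", "br", "li", "ul", "ol", "table", "tr", "td", "th",
      "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre");

  /** Extracts text from a well-formed XHTML fragment with a single root element. */
  public static String extract(String xhtml) {
    try {
      Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)))
          .getDocumentElement();
      StringBuilder sb = new StringBuilder();
      walk(root, sb);
      return sb.toString().trim();
    } catch (Exception e) {
      throw new IllegalArgumentException("Input is not well-formed", e);
    }
  }

  private static void walk(Node node, StringBuilder sb) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      appendCollapsed(sb, node.getNodeValue());
      return;
    }
    boolean block = node.getNodeType() == Node.ELEMENT_NODE
        && BLOCK.contains(node.getNodeName().toLowerCase(Locale.ROOT));
    // Separate block elements from their surroundings; inline elements get nothing.
    if (block) {
      separator(sb);
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      walk(c, sb);
    }
    if (block) {
      separator(sb);
    }
  }

  // Appends text, collapsing each whitespace run (including any already
  // pending at the end of the buffer) into a single space.
  private static void appendCollapsed(StringBuilder sb, String text) {
    for (char c : text.toCharArray()) {
      if (Character.isWhitespace(c)) {
        separator(sb);
      } else {
        sb.append(c);
      }
    }
  }

  // Appends one space unless the buffer is empty or already ends with one.
  private static void separator(StringBuilder sb) {
    if (sb.length() > 0 && sb.charAt(sb.length() - 1) != ' ') {
      sb.append(' ');
    }
  }
}
```

With this, BlockAwareText.extract("<div><p>behavi<em>ou</em>r</p><p>foo</p><p>bar</p></div>") gives "behaviour foo bar": the em is joined to its neighbours while the paragraphs stay separated. It does not handle everything (CSS can turn a span into a block, for example), which is part of why I did not reach a real resolution.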

> Text extraction in HtmlParser adds too much whitespace.
> -------------------------------------------------------
>
>                 Key: NUTCH-2318
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2318
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.1
>            Reporter: Felix Zett
>
> In parse-html, org.apache.nutch.parse.html.HtmlParser will call DOMContentUtils.getText() to extract the text content. For every text node encountered in the document, the getTextHelper() function will first add a space character to the already extracted text and then the text content itself (stripped of excess whitespace). This means that parsing HTML such as
> {{<p>behavi<em>ou</em>r</p>}}
> will lead to this extracted text:
> {{behavi ou r}}
> I would have expected a parser not to add whitespace to content that visually (and actually) does not contain any in the first place. This applies to all similar semantic tags as well as {{<span>}}.
> My naive approach would be to remove the lines {{text = text.trim()}} and {{sb.append(' ')}}, but I'm aware that this will lead to bad parsing of stuff like {{<p>foo</p><p>bar</p>}}.
> This is not an issue in parse-tika, since tika removes all "unimportant" tags beforehand. However, I'd like to keep using parse-html because I need to keep the document reasonably intact for parse filters applied later.
> I know I could write a parse filter that will re-extract the text content, but this feels like a bug (or at least a shortcoming) in the ParseHtml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)