Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2014/12/11 22:40:16 UTC
[jira] [Commented] (NUTCH-1898) Add -dumpRawHTML parameter to parsechecker tool
[ https://issues.apache.org/jira/browse/NUTCH-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243194#comment-14243194 ]
Sebastian Nagel commented on NUTCH-1898:
----------------------------------------
The raw document can also be viewed by calling the protocol plugin directly (other protocol implementations work similarly):
{noformat}
bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http <url>
{noformat}
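To keep that output around for later inspection, the command could be wrapped in a small shell function. This is only a sketch: {{dump_raw}} and {{NUTCH_BIN}} are made-up names for illustration, not part of Nutch, and it assumes the command is run from the Nutch runtime directory so that {{bin/nutch}} resolves.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: run the protocol plugin for a URL
# and save whatever it prints to a file.
# NUTCH_BIN is an assumed variable; it defaults to the bin/nutch script.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"

dump_raw() {
    url="$1"   # URL to fetch
    out="$2"   # file that receives the plugin's standard output
    # Same command as above, with stdout redirected to the given file.
    "$NUTCH_BIN" plugin protocol-http org.apache.nutch.protocol.http.Http "$url" > "$out"
}

# Example call (commented out, needs a working Nutch installation):
# dump_raw "http://example.com/" raw.out
```

For other protocols, the plugin id and class name in the function body would have to be replaced accordingly.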
But agreed: to find problems it's always useful to have a look at the raw HTML, and it's easier to have a few but powerful debugging tools.
Or does "raw" mean the serialized DOM, which (1) is also available for binary, non-HTML document formats, (2) may look slightly different, and (3) isn't easily viewed with any other tool?
> Add -dumpRawHTML parameter to parsechecker tool
> -----------------------------------------------
>
> Key: NUTCH-1898
> URL: https://issues.apache.org/jira/browse/NUTCH-1898
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9, 2.2.1
> Reporter: Lewis John McGibbney
> Priority: Minor
> Fix For: 2.4, 1.10
>
>
> The ability to obtain raw HTML alongside all of the other parse data we get from the existing parsechecker would complement the tool.
> This issue should merely append the raw HTML markup to the existing output. It should be an optional parameter, the same as -dumpText.