Posted to dev@nutch.apache.org by "Federico Bonelli (JIRA)" <ji...@apache.org> on 2016/04/20 10:09:25 UTC

[jira] [Commented] (NUTCH-1785) Ability to index raw content

    [ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249465#comment-15249465 ] 

Federico Bonelli commented on NUTCH-1785:
-----------------------------------------

I'm experiencing charset issues with this patch, probably due to Sebastian Nagel's remark:
bq. conversion via {code} new String(content.getContent()) {code} is needless if base64 is true

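I will now try to base64-encode the content.getContent() byte array directly, but I was wondering about the initial intent behind converting from byte[] to String and back to byte[] before base64 encoding. Note that new String(byte[]) decodes with the platform default charset, so any byte sequence that is invalid in that charset is replaced and cannot be recovered afterwards.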

{code:java}
String binary = new String(content.getContent());

// optionally encode as base64
if (base64) {
        binary = Base64.encodeBase64String(StringUtils.getBytesUtf8(binary));
}
{code}

What was the initial intent behind this?
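
To illustrate why the round trip through String can corrupt the content, here is a minimal, self-contained sketch. It is not the patch code: it uses java.util.Base64 instead of commons-codec, it decodes as UTF-8 explicitly so the result does not depend on the platform default charset, and the class and method names (RawContentBase64, viaString, direct) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class RawContentBase64 {

    // Mirrors the snippet above: byte[] -> String -> byte[] -> base64.
    // Assumption: UTF-8 is used for decoding; the patch uses the platform default.
    static String viaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes the raw bytes directly, without any charset conversion.
    static String direct(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the trailing 0xE9 byte is not valid UTF-8.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        byte[] afterRoundTrip = Base64.getDecoder().decode(viaString(raw));
        byte[] afterDirect = Base64.getDecoder().decode(direct(raw));

        System.out.println("via String preserves bytes: " + Arrays.equals(raw, afterRoundTrip));
        System.out.println("direct preserves bytes:     " + Arrays.equals(raw, afterDirect));
    }
}
```

In this sketch the invalid byte is replaced by U+FFFD during decoding, so the base64 output of the String path no longer represents the fetched bytes, while the direct path decodes back to the original array.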

> Ability to index raw content
> ----------------------------
>
>                 Key: NUTCH-1785
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1785
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured indexing back-end. Since Content is never read, a plugin is out of the question and therefore we need to force IndexJob to process Content as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)