Posted to dev@nutch.apache.org by "Federico Bonelli (JIRA)" <ji...@apache.org> on 2016/04/20 10:09:25 UTC
[jira] [Commented] (NUTCH-1785) Ability to index raw content
[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249465#comment-15249465 ]
Federico Bonelli commented on NUTCH-1785:
-----------------------------------------
I'm experiencing charset issues with this patch, probably due to Sebastian Nagel's remark:
bq. conversion via {code} new String(content.getContent()) {code} is needless if base64 is true
I will now try to base64-encode the content.getContent() byte array directly, but I am wondering about the initial intent behind converting from byte[] to String and back to byte[] before base64 encoding.
{code:java}
String binary = new String(content.getContent());
// optionally encode as base64
if (base64) {
  binary = Base64.encodeBase64String(StringUtils.getBytesUtf8(binary));
}
{code}
What was the initial intent behind this?
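To illustrate the suspected problem, here is a minimal sketch of how the String round trip can alter the raw bytes, compared with encoding the byte array directly. It uses java.util.Base64 from the JDK instead of the Commons Codec class in the patch, only so that it runs standalone. The class and method names are hypothetical, and the decoding charset is passed as a parameter, whereas the patch relies on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper, not part of Nutch or the patch.
class RawContentEncoding {

    // Mirrors the round trip in the patch: bytes -> String -> UTF-8 bytes -> base64.
    // The patch decodes with the platform default charset; here the charset is
    // explicit so the effect is reproducible.
    static String encodeViaString(byte[] raw, Charset decodeCharset) {
        String text = new String(raw, decodeCharset);
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes the raw bytes directly, with no charset involved.
    static String encodeDirect(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
        byte[] raw = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};

        byte[] viaString = Base64.getDecoder().decode(encodeViaString(raw, StandardCharsets.UTF_8));
        byte[] direct = Base64.getDecoder().decode(encodeDirect(raw));

        System.out.println("round trip preserved bytes: " + Arrays.equals(raw, viaString));
        System.out.println("direct preserved bytes:     " + Arrays.equals(raw, direct));
    }
}
```

In this sketch the round trip changes the bytes whichever charset is used for decoding: with UTF-8 the invalid byte becomes the replacement character, and with ISO-8859-1 the character is re-encoded as two UTF-8 bytes. Only the direct encoding returns the original array after decoding.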
> Ability to index raw content
> ----------------------------
>
> Key: NUTCH-1785
> URL: https://issues.apache.org/jira/browse/NUTCH-1785
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured indexing back-end. Since Content is never read, a plugin is out of the question and therefore we need to force IndexJob to process Content as well.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)