Posted to common-user@hadoop.apache.org by NOMURA Yoshihide <y....@jp.fujitsu.com> on 2008/06/02 08:19:52 UTC

Text file character encoding

Hello,
I'm using Hadoop 0.17.0 to analyze a large number of CSV files.

I need to read files in a character encoding other than UTF-8,
but I think TextInputFormat doesn't support other encodings.

I think the LineRecordReader class or the Text class should support an
encoding setting like this:
 conf.set("io.file.defaultEncoding", "MS932");
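
To illustrate what such a setting would have to drive, here is a minimal
sketch in plain Java, with no Hadoop dependency. It assumes the record reader
hands over the raw bytes of a line plus a valid length, and that those bytes
are decoded with a configured charset instead of UTF-8. The names
EncodingSketch and decodeLine are hypothetical, and "io.file.defaultEncoding"
is only the key proposed above, not an existing Hadoop property.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Hypothetical helper: decode the first `length` bytes of `buffer`
    // with the named charset rather than assuming UTF-8.
    public static String decodeLine(byte[] buffer, int length, String charsetName) {
        return new String(buffer, 0, length, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // A CSV line whose accented character is a single byte in ISO-8859-1.
        // ISO-8859-1 is used here only because every JVM must support it;
        // "MS932" would be passed the same way where the JDK provides it.
        byte[] raw = "caf\u00e9,1".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the charset the file was written in recovers the text.
        System.out.println(decodeLine(raw, raw.length, "ISO-8859-1"));

        // Decoding the same bytes as UTF-8 does not: the lone 0xE9 byte is
        // malformed UTF-8, so the result differs from the original line.
        System.out.println(decodeLine(raw, raw.length, "UTF-8"));
    }
}
```

In a mapper, the same decoding could probably be applied to the Text value
through value.getBytes() and value.getLength(), since Text keeps the bytes as
read. Whether "MS932" is available depends on the charsets shipped with the
JDK in use.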

Is there any plan to support different character encodings in
TextInputFormat?

Regards,
-- 
NOMURA Yoshihide:
    Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
    Tel: 044-754-2675 (Ext: 7112-6358)
    Fax: 044-754-2570 (Ext: 7112-3834)
    E-Mail: [y.nomura@jp.fujitsu.com]

Re: Text file character encoding

Posted by Ted Dunning <te...@gmail.com>.

You should file a Jira, make the change and submit a patch!


-- 
ted