Posted to user@hadoop.apache.org by Khaleel Khalid <kh...@suntecgroup.com> on 2014/01/24 05:24:15 UTC

Localization feature

Hi All,
 
Does Hadoop/MapReduce have a localization feature?

There is a scenario wherein we have to process files containing Dutch and German characters.

When we process files containing a character like 'Ç', the character gets replaced by '�' in the output.

Is there any possible workaround for this?
 
 
Thanks in advance,
 
Khaleel

RE: Localization feature

Posted by java8964 <ja...@hotmail.com>.
You need to be clearer about how you process the files.
I think the important question is what InputFormat and OutputFormat you are using in your case.
If you are using the defaults, I believe TextInputFormat and TextOutputFormat both convert byte arrays to text using UTF-8 encoding. So if your source data is UTF-8, your output should be fine.
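As an aside, the '�' you are seeing is the Unicode replacement character (U+FFFD), which a UTF-8 decoder emits when it hits bytes that are not valid UTF-8. A minimal sketch of that mechanism in plain Java, assuming for illustration that the source files are ISO-8859-1 (Latin-1), where 'Ç' is a single byte:

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // In ISO-8859-1 (Latin-1), 'Ç' is the single byte 0xC7.
        byte[] latin1Bytes = "Ç".getBytes(StandardCharsets.ISO_8859_1);

        // Decoded as UTF-8, 0xC7 looks like the lead byte of a
        // two-byte sequence with no valid continuation, so the
        // decoder substitutes U+FFFD ('�').
        String decodedAsUtf8 = new String(latin1Bytes, StandardCharsets.UTF_8);
        System.out.println(decodedAsUtf8); // prints '�'
    }
}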
To help you in this case, you need to figure out the following:
1) What InputFormat/OutputFormat are you using?
2) How do you write the output data? Using Reducer Context.write, or writing to HDFS directly in your code?
3) What encoding is your source data? (If it turns out to be Latin-1 rather than UTF-8, see the sketch below.)
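If the source data does turn out to be Latin-1, one common workaround is to re-decode the raw line bytes that Text carries using the real encoding, since TextInputFormat stores the bytes of each line unmodified. A sketch under those assumptions (the class name and the ISO-8859-1 charset are illustrative, not necessarily what your job uses):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: re-decodes each line with the file's real
// encoding instead of the UTF-8 that Text assumes.
public class Latin1LineMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Text exposes its raw backing byte array; only the first
        // getLength() bytes are valid for this line.
        String line = new String(value.getBytes(), 0, value.getLength(),
                StandardCharsets.ISO_8859_1);

        // ... process `line` here; wrapping it in a new Text
        // re-encodes it as valid UTF-8 for the output.
        context.write(key, new Text(line));
    }
}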
Yong

Subject: Localization feature
Date: Fri, 24 Jan 2014 09:54:15 +0530
From: khaleelk@suntecgroup.com
To: user@hadoop.apache.org

Hi All,

Does Hadoop/MapReduce have a localization feature?

There is a scenario wherein we have to process files containing Dutch and German characters.

When we process files containing a character like 'Ç', the character gets replaced by '�' in the output.

Is there any possible workaround for this?

Thanks in advance,

Khaleel
