Posted to common-user@hadoop.apache.org by Ajay Srivastava <Aj...@guavus.com> on 2012/09/11 07:54:06 UTC

Non utf-8 chars in input

Hi,

I am using the default InputFormat class to read input from text files, but the input files contain some non-UTF-8 characters.
I guess that TextInputFormat is the default InputFormat class and that it replaces these non-UTF-8 characters with "\uFFFD". If I do not want this behavior and need the actual characters in my mapper, which InputFormat class should I use?
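
The replacement described above can be reproduced with the JDK alone. The sketch below is not Hadoop code; the class name ReplacementDemo and the sample bytes are made up for illustration. It decodes a Latin-1 encoded word as UTF-8, which, as far as I know, is effectively what happens when the line bytes are turned into a String via Text.toString().

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // "naïve" encoded as ISO-8859-1: the byte 0xEF is not valid UTF-8 in this position.
    static final byte[] LATIN1_BYTES = {0x6E, 0x61, (byte) 0xEF, 0x76, 0x65};

    // Decoding with the wrong charset: the JDK substitutes U+FFFD for the bad byte.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decoding with the charset the file was actually written in keeps the character.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Position of the replacement character in the wrongly decoded string.
        System.out.println(decodeAsUtf8(LATIN1_BYTES).indexOf('\uFFFD'));
        // Whether the correctly decoded string still contains U+00EF (ï).
        System.out.println(decodeAsLatin1(LATIN1_BYTES).charAt(2) == '\u00EF');
    }
}
```

The bytes themselves are unchanged in both cases; only the decoding step loses information.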



Regards,
Ajay Srivastava

Re: Non utf-8 chars in input

Posted by Ajay Srivastava <Aj...@guavus.com>.
Rekha,

I guess the problem is that the Text class uses UTF-8 encoding and one cannot set another encoding for this class.
I have not seen any other Text-like class that supports other encodings; otherwise, I have written my custom input format class.

Thanks for your inputs.


Regards,
Ajay Srivastava


On 11-Sep-2012, at 1:31 PM, Joshi, Rekha wrote:

> Actually even if that works, it does not seem an ideal solution.
> 
> I think format and encoding are distinct, and enforcing format must not
> enforce an encoding. So that means there must be a possibility to pass
> encoding as a user choice on construction,
> e.g.: TextInputFormat("your-encoding").
> But I do not see that in the API, so even if I extend
> InputFormat/RecordReader, I will not be able to have a feature of
> setEncoding() on my file format. Having that would be a good solution.
> 
> Thanks
> Rekha
> 
> On 11/09/12 12:37 PM, "Joshi, Rekha" <Re...@intuit.com> wrote:
> 
>> Hi Ajay,
>> 
>> Try SequenceFileAsBinaryInputFormat ?
>> 
>> 
>> Thanks
>> Rekha
>> 
>> On 11/09/12 11:24 AM, "Ajay Srivastava" <Aj...@guavus.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> I am using default inputFormat class for reading input from text files
>>> but the input file has some non utf-8 characters.
>>> I guess that TextInputFormat class is default inputFormat class and it
>>> replaces these non utf-8 chars by "\uFFFD". If I do not want this
>>> behavior and need actual char in my mapper what should be the correct
>>> inputFormat class ?
>>> 
>>> 
>>> 
>>> Regards,
>>> Ajay Srivastava
>> 
> 


Re: Non utf-8 chars in input

Posted by "Joshi, Rekha" <Re...@intuit.com>.
Actually even if that works, it does not seem an ideal solution.

I think format and encoding are distinct, and enforcing format must not
enforce an encoding. So that means there must be a possibility to pass
encoding as a user choice on construction,
e.g.: TextInputFormat("your-encoding").
But I do not see that in the API, so even if I extend
InputFormat/RecordReader, I will not be able to have a feature of
setEncoding() on my file format. Having that would be a good solution.
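
One way to approximate a user-chosen encoding without a new API, as a sketch only: decode the record's raw bytes yourself. To my knowledge, Text keeps the line bytes unchanged and exposes them through getBytes() and getLength(); the replacement only happens when those bytes are decoded as UTF-8. The helper below is JDK-only and hypothetical: RawLineDecoder, decode and the charset name passed in are illustrative names, not Hadoop API. In a mapper the call would look like decode(value.getBytes(), value.getLength(), "ISO-8859-1").

```java
import java.nio.charset.Charset;

public class RawLineDecoder {
    // Hypothetical helper: decode the first `length` bytes of `buffer`
    // using the charset named by the caller.
    // `length` matters because a reused backing array (such as the one
    // returned by Text.getBytes()) can be longer than the valid data.
    static String decode(byte[] buffer, int length, String charsetName) {
        Charset charset = Charset.forName(charsetName); // throws if the name is unknown
        return new String(buffer, 0, length, charset);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1, followed by two stale bytes from an earlier record.
        byte[] buffer = {0x63, 0x61, 0x66, (byte) 0xE9, 0x58, 0x58};
        System.out.println(decode(buffer, 4, "ISO-8859-1"));
    }
}
```

The charset name could be read from the job Configuration in the mapper, under a property name of your own choosing, which gives roughly the "encoding as a user choice" behavior without changing the input format.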

Thanks
Rekha

On 11/09/12 12:37 PM, "Joshi, Rekha" <Re...@intuit.com> wrote:

>Hi Ajay,
>
>Try SequenceFileAsBinaryInputFormat ?
>
>
>Thanks
>Rekha
>
>On 11/09/12 11:24 AM, "Ajay Srivastava" <Aj...@guavus.com>
>wrote:
>
>>Hi,
>>
>>I am using default inputFormat class for reading input from text files
>>but the input file has some non utf-8 characters.
>>I guess that TextInputFormat class is default inputFormat class and it
>>replaces these non utf-8 chars by "\uFFFD". If I do not want this
>>behavior and need actual char in my mapper what should be the correct
>>inputFormat class ?
>>
>>
>>
>>Regards,
>>Ajay Srivastava
>


Re: Non utf-8 chars in input

Posted by "Joshi, Rekha" <Re...@intuit.com>.
Hi Ajay,

Try SequenceFileAsBinaryInputFormat?


Thanks
Rekha

On 11/09/12 11:24 AM, "Ajay Srivastava" <Aj...@guavus.com> wrote:

>Hi,
>
>I am using default inputFormat class for reading input from text files
>but the input file has some non utf-8 characters.
>I guess that TextInputFormat class is default inputFormat class and it
>replaces these non utf-8 chars by "\uFFFD". If I do not want this
>behavior and need actual char in my mapper what should be the correct
>inputFormat class ?
>
>
>
>Regards,
>Ajay Srivastava

