Posted to user@pig.apache.org by paradisehit <pa...@163.com> on 2008/08/26 10:51:54 UTC

Why do the default LOAD and STORE use UTF-8? Why not use bytes?

Hello!
    I have run into a charset problem. I use Hadoop to store log data, and my log data is not encoded in UTF-8; for example, it is in GBK, which is common in China. If I use PigStorage() to process my data, the data is treated as UTF-8. My program can still run on the resulting data, but the output is not correct.
    Can the Pig LOAD and STORE work like Hadoop and store the data as it was, without changing the original charset? Can anyone help me, or tell me why UTF-8 is the default?
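The corruption described above can be reproduced with plain JDK classes. This small sketch (class name and sample text are illustrative, not from the thread) shows that GBK bytes do not survive being decoded as UTF-8, while passing the raw bytes through untouched is lossless:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Sketch: what happens when GBK-encoded bytes are misread as UTF-8.
public class GbkAsUtf8Demo {
    public static void main(String[] args) {
        // "中文" encoded in GBK: four bytes, none of which form valid UTF-8 sequences
        byte[] gbkBytes = "中文".getBytes(Charset.forName("GBK"));

        // Decoding as UTF-8 turns the bytes into U+FFFD replacement characters,
        // so re-encoding does not give back the original bytes: data is lost.
        String misread = new String(gbkBytes, Charset.forName("UTF-8"));
        byte[] reEncoded = misread.getBytes(Charset.forName("UTF-8"));
        System.out.println(Arrays.equals(gbkBytes, reEncoded)); // false

        // Copying the raw bytes, as the poster asks LOAD/STORE to do, is lossless.
        byte[] passedThrough = Arrays.copyOf(gbkBytes, gbkBytes.length);
        System.out.println(Arrays.equals(gbkBytes, passedThrough)); // true
    }
}
```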
 

Re: Why do the default LOAD and STORE use UTF-8? Why not use bytes?

Posted by Alan Gates <ga...@yahoo-inc.com>.
As part of the rework on the types branch, we're changing this so that 
data is treated as raw bytes and not converted to strings unless the 
user explicitly converts it or performs a string operation on it. You 
might take a look at that branch and see if it meets your needs. One 
caveat: there is no regular-expression operator defined on byte arrays 
yet. Also, streaming is not quite integrated into this branch, but 
should be in a week or so.

Alan.

paradisehit wrote:
>  
> Yes! I wrote my own load/store function: I read bytes into a DataAtom and build the tuple via new Tuple(ArrayList<DataAtom>). But when I use MATCHES or comparison operators such as "==" and ">", an error occurs, because those operations use the DataAtom's stringVal, not its binaryVal, and the stringVal is empty.
>
>
>
> The fact is that I will process files that may contain UTF-8, GBK, and other charsets. When I process the data, I use a single program, like:
>
>
>  DEFINE WORDSEG `./run.sh` ship('run.sh','dict.tar.gz', 'word2term'); 
>
> QUERY2TERM = STREAM ALLQUERY THROUGH WORDSEG;
>
> and run.sh can only handle the GBK charset. So I just want to process the data (ALLQUERY) as it originally was, which is why I chose byte[] rather than String to create the Tuple. But MATCHES can't handle a byte-based DataAtom in a Tuple.
>
> I also want to know why String, and not bytes, was chosen as the underlying basic data type.
>
>
>
> On 2008-08-26, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:
>   
>> PigStorage is written to work with UTF-8 data. You will need to write
>> your own load/store function to get different semantics.
>>
>> Olga 
>>
>>     
>>> -----Original Message-----
>>> From: paradisehit [mailto:paradisehit@163.com] 
>>> Sent: Tuesday, August 26, 2008 1:52 AM
>>> To: pig-user@incubator.apache.org; pig-dev@incubator.apache.org
>>> Subject: Why the default LOAD and STORE use UTF-8? Why not use byte?
>>>
>>> Hello!
>>>     I have run into a charset problem. I use Hadoop to store log 
>>> data, and my log data is not encoded in UTF-8; for example, it is 
>>> in GBK, which is common in China. If I use PigStorage() to process 
>>> my data, the data is treated as UTF-8. My program can still run on 
>>> the resulting data, but the output is not correct.
>>>     Can the Pig LOAD and STORE work like Hadoop and store the data 
>>> as it was, without changing the original charset? Can anyone help 
>>> me, or tell me why UTF-8 is the default?
>>>  
>>>
>>>       
>
>   

Re: RE: Why do the default LOAD and STORE use UTF-8? Why not use bytes?

Posted by paradisehit <pa...@163.com>.
 
Yes! I wrote my own load/store function: I read bytes into a DataAtom and build the tuple via new Tuple(ArrayList<DataAtom>). But when I use MATCHES or comparison operators such as "==" and ">", an error occurs, because those operations use the DataAtom's stringVal, not its binaryVal, and the stringVal is empty.



The fact is that I will process files that may contain UTF-8, GBK, and other charsets. When I process the data, I use a single program, like:


 DEFINE WORDSEG `./run.sh` ship('run.sh','dict.tar.gz', 'word2term'); 

QUERY2TERM = STREAM ALLQUERY THROUGH WORDSEG;

and run.sh can only handle the GBK charset. So I just want to process the data (ALLQUERY) as it originally was, which is why I chose byte[] rather than String to create the Tuple. But MATCHES can't handle a byte-based DataAtom in a Tuple.

I also want to know why String, and not bytes, was chosen as the underlying basic data type.



On 2008-08-26, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:
>PigStorage is written to work with UTF-8 data. You will need to write
>your own load/store function to get different semantics.
>
>Olga 
>
>> -----Original Message-----
>> From: paradisehit [mailto:paradisehit@163.com] 
>> Sent: Tuesday, August 26, 2008 1:52 AM
>> To: pig-user@incubator.apache.org; pig-dev@incubator.apache.org
>> Subject: Why the default LOAD and STORE use UTF-8? Why not use byte?
>> 
>> Hello!
>>     I have run into a charset problem. I use Hadoop to store log 
>> data, and my log data is not encoded in UTF-8; for example, it is 
>> in GBK, which is common in China. If I use PigStorage() to process 
>> my data, the data is treated as UTF-8. My program can still run on 
>> the resulting data, but the output is not correct.
>>     Can the Pig LOAD and STORE work like Hadoop and store the data 
>> as it was, without changing the original charset? Can anyone help 
>> me, or tell me why UTF-8 is the default?
>>  
>> 

RE: Why do the default LOAD and STORE use UTF-8? Why not use bytes?

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
PigStorage is written to work with UTF-8 data. You will need to write
your own load/store function to get different semantics.

Olga 

> -----Original Message-----
> From: paradisehit [mailto:paradisehit@163.com] 
> Sent: Tuesday, August 26, 2008 1:52 AM
> To: pig-user@incubator.apache.org; pig-dev@incubator.apache.org
> Subject: Why the default LOAD and STORE use UTF-8? Why not use byte?
> 
> Hello!
>     I have run into a charset problem. I use Hadoop to store log 
> data, and my log data is not encoded in UTF-8; for example, it is 
> in GBK, which is common in China. If I use PigStorage() to process 
> my data, the data is treated as UTF-8. My program can still run on 
> the resulting data, but the output is not correct.
>     Can the Pig LOAD and STORE work like Hadoop and store the data 
> as it was, without changing the original charset? Can anyone help 
> me, or tell me why UTF-8 is the default?
>  
> 
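A loader of the kind Olga suggests has to avoid decoding altogether. The Pig-facing LoadFunc interface is omitted here (its exact signatures vary by release and should be checked against the version you use); this sketch shows only the charset-agnostic core: splitting a record on a delimiter byte so that GBK, or any other encoding, passes through untouched. The class and method names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Core of a charset-agnostic loader: split a record on a delimiter byte
// without ever decoding the data, so the original bytes are preserved.
public class ByteFieldSplitter {
    public static List<byte[]> split(byte[] line, byte delim) {
        List<byte[]> fields = new ArrayList<byte[]>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        for (byte b : line) {
            if (b == delim) {
                // Field boundary: emit the accumulated bytes as-is.
                fields.add(current.toByteArray());
                current.reset();
            } else {
                current.write(b);
            }
        }
        fields.add(current.toByteArray()); // last field
        return fields;
    }
}
```

Each returned byte[] would then be wrapped in whatever tuple element type the Pig release expects, with no String conversion on the way in or out.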
