
Posted to dev@hive.apache.org by "bc Wong (JIRA)" <ji...@apache.org> on 2010/08/02 18:01:24 UTC

[jira] Created: (HIVE-1505) Support non-UTF8 data

Support non-UTF8 data
---------------------

                 Key: HIVE-1505
                 URL: https://issues.apache.org/jira/browse/HIVE-1505
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Serializers/Deserializers
    Affects Versions: 0.5.0
            Reporter: bc Wong


I'd like to work with non-UTF8 data easily.

Suppose I have data in latin1. Currently, doing a "select *" returns the upper ASCII (high-bit) characters as '\xef\xbf\xbd', which is the replacement character '\ufffd' encoded in UTF-8. It would be nice for Hive to understand different encodings, or to have a concept of a byte string.
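
The symptom described above can be reproduced outside Hive. The following is a minimal sketch in plain Python, not Hive code; the sample string "café" is an assumption chosen for illustration. It shows how a Latin-1 byte that is not valid UTF-8 becomes U+FFFD when decoded as UTF-8, and how that character then serializes to the bytes quoted in the report.

```python
# Toy reproduction of the symptom: Latin-1 bytes read as if they were UTF-8.
# The sample text is an illustrative assumption, not data from the issue.
latin1_bytes = "café".encode("latin-1")  # the e-acute is the single byte 0xE9

# A UTF-8 decoder cannot interpret the lone 0xE9 byte, so with the "replace"
# error handler it substitutes the replacement character U+FFFD.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Writing the substituted text back out as UTF-8 produces EF BF BD.
round_tripped = misread.encode("utf-8")

# Decoding with the charset the data was written in recovers the text.
correct = latin1_bytes.decode("latin-1")

print(repr(misread))
print(round_tripped)
print(correct)
```

Once the replacement has happened the original byte is lost, which is why the encoding has to be known at read time rather than repaired afterwards.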

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1505) Support non-UTF8 data

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900697#action_12900697 ] 

Edward Capriolo commented on HIVE-1505:
---------------------------------------

 Maybe you should fork hive and call it chive. 

On a serious note: great job. Would you consider editing the cli.xml in the xdocs to explain this feature? I think it would be very helpful; look in docs/xdocs/.


[jira] Commented: (HIVE-1505) Support non-UTF8 data

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901753#action_12901753 ] 

Ted Xu commented on HIVE-1505:
------------------------------

Thanks Edward.

I dug into the problem and found that the patch does not work when the query has subqueries; it is very hard to retain encoding information in those queries.

Table properties may be missing in queries. The problem is the same as a missing field delimiter setting: whenever Hive can't get the table properties in a subquery (e.g., a join operation), the default value is used (^A for the field delimiter; that's why the deserializer fails most of the time when the data contains a ^A character, even if ^A is not set as the field delimiter).
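
To make the fallback behaviour concrete, here is a toy model in Python. It is not Hive's deserializer: the `split_row` helper and the sample row are hypothetical, and the property key `field.delim` is used here only for illustration. It shows how a row splits correctly when the table's delimiter is known, and incorrectly when the properties are lost and ^A is assumed.

```python
DEFAULT_FIELD_DELIM = "\x01"  # ^A, the default mentioned in the comment above


def split_row(line, table_props=None):
    """Split a text row, falling back to ^A when no table properties are given.

    Hypothetical helper for illustration; it only models the fallback.
    """
    props = table_props or {}
    delim = props.get("field.delim", DEFAULT_FIELD_DELIM)
    return line.split(delim)


# A row from a comma-delimited table whose first field happens to contain ^A.
row = "a\x01b,c"

with_props = split_row(row, {"field.delim": ","})  # properties available
without_props = split_row(row)                     # properties lost, ^A assumed

print(with_props)
print(without_props)
```

The same reasoning applies to an encoding property: if it is dropped along the way, the reader silently falls back to the default.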

 


[jira] Updated: (HIVE-1505) Support non-UTF8 data

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1505:
-----------------------------

    Status: Open  (was: Patch Available)


[jira] Work started: (HIVE-1505) Support non-UTF8 data

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-1505 started by Ted Xu.


[jira] Updated: (HIVE-1505) Support non-UTF8 data

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Xu updated HIVE-1505:
-------------------------

    Status: Patch Available  (was: In Progress)

Please review trunk-encoding.patch, thanks.


[jira] Updated: (HIVE-1505) Support non-UTF8 data

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Xu updated HIVE-1505:
-------------------------

    Attachment: trunk-encoding.patch

We implemented an encoding configuration feature for tables.
Set the table encoding through a serde parameter, for example:
{code}
alter table src set serdeproperties ('serialization.encoding'='GBK');
{code}
This makes table src use the GBK encoding (a Chinese character encoding). Furthermore, when using the command line interface, the parameter 'hive.cli.encoding' must also be set. 'hive.cli.encoding' must be set before the hive prompt starts, so set it in hive-site.xml or pass -hiveconf hive.cli.encoding=GBK on the command line, instead of running 'set hive.cli.encoding=GBK' in Hive QL.
Because of this, I can't find a way to add a unit test.
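
As a rough illustration of what a per-table encoding property buys, the following Python sketch models a reader that looks up 'serialization.encoding' and falls back to UTF-8 when it is absent. This is not the code in trunk-encoding.patch; the `decode_field` helper and the sample text are assumptions made for the example.

```python
DEFAULT_ENCODING = "UTF-8"  # assumed fallback when no encoding is configured


def decode_field(raw, serde_props=None):
    """Decode raw field bytes using the table's configured encoding.

    Hypothetical helper; only the property name comes from the comment above.
    """
    props = serde_props or {}
    encoding = props.get("serialization.encoding", DEFAULT_ENCODING)
    return raw.decode(encoding, errors="replace")


# Sample bytes as they might be stored in a GBK-encoded file.
raw = "中文".encode("gbk")

decoded = decode_field(raw, {"serialization.encoding": "GBK"})
fallback = decode_field(raw)  # property missing, bytes misread as UTF-8

print(decoded)
print(repr(fallback))
```

With the property present the text is recovered; without it the same bytes decode to replacement characters.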





[jira] Assigned: (HIVE-1505) Support non-UTF8 data

Posted by "Ted Xu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Xu reassigned HIVE-1505:
----------------------------

    Assignee: Ted Xu
