Posted to dev@hive.apache.org by "Jonathan Chang (JIRA)" <ji...@apache.org> on 2011/08/02 21:30:27 UTC

[jira] [Created] (HIVE-2333) LazySimpleSerDe does not properly handle arrays / escape control characters

LazySimpleSerDe does not properly handle arrays / escape control characters
---------------------------------------------------------------------------

                 Key: HIVE-2333
                 URL: https://issues.apache.org/jira/browse/HIVE-2333
             Project: Hive
          Issue Type: Bug
            Reporter: Jonathan Chang


LazySimpleSerDe, the default SerDe for Hive, is severely broken:

* Empty arrays are serialized as an empty string. Hence array(array()) is indistinguishable from array(array(array())) and from array().
* Similarly, empty strings are serialized as an empty string. Hence array('') is also indistinguishable from an empty array.
* If the serialized string equals the null sequence, it is ambiguous whether it represents a null array or an array with a single null element.
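
The collisions listed above can be reproduced with a toy model. This is only a sketch of a delimiter-joined text encoding, not Hive's actual code: the function name toy_serialize is hypothetical, and the separators (\x02 for the outer array, \x03 for the next nesting level) and the null sequence \N are assumed defaults.

```python
# Toy model of a delimiter-joined text encoding (NOT Hive's implementation).
# Assumptions: level-1 arrays are joined with \x02, level-2 with \x03, and so on;
# a null value is written as the null sequence "\N".
NULL_SEQ = "\\N"


def toy_serialize(value, level=1):
    """Serialize None, a string, or a (nested) list of those to text."""
    if value is None:
        return NULL_SEQ
    if isinstance(value, list):
        sep = chr(level + 1)  # level 1 -> \x02, level 2 -> \x03
        return sep.join(toy_serialize(v, level + 1) for v in value)
    return str(value)


if __name__ == "__main__":
    # Every one of these distinct values produces the same empty string.
    for v in ([], [[]], [[[]]], [""]):
        print(repr(v), "->", repr(toy_serialize(v)))
    # A null array and an array holding one null produce the same text.
    print(toy_serialize(None) == toy_serialize([None]))
```

Because joining zero elements and joining one empty element both yield an empty string, no reader of this format can tell the cases apart without extra information such as an explicit length or an empty-collection marker.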

It also mishandles control characters embedded in string values:

> select array('foo\002bar') from tmp;
...
["foo","bar"]

> select array('foo\001bar') from tmp;
...
["foo"]
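
One plausible reading of the two results above, assuming the default delimiters (\x01 between fields, \x02 between array items) and no escaping on write: the embedded control character is later taken for a delimiter. The sketch below is a toy model with hypothetical function names, not the SerDe's code.

```python
# Toy model of unescaped writing and delimiter-based reading (NOT Hive's code).
# Assumed defaults: \x01 separates fields in a row, \x02 separates array items.
FIELD_SEP = "\x01"
ITEM_SEP = "\x02"


def toy_write_row(columns):
    """columns: list of arrays of strings; nothing is escaped."""
    return FIELD_SEP.join(ITEM_SEP.join(items) for items in columns)


def toy_read_first_column(line):
    """Split the row into fields, then split the first field into items."""
    first_field = line.split(FIELD_SEP)[0]
    return first_field.split(ITEM_SEP)


if __name__ == "__main__":
    # The item separator inside the string splits one element into two.
    print(toy_read_first_column(toy_write_row([["foo\x02bar"]])))
    # The field separator inside the string cuts the column short.
    print(toy_read_first_column(toy_write_row([["foo\x01bar"]])))
```

Under these assumptions the first call yields two elements and the second yields only the text before the \x01, which matches the output reported in the queries.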

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2333) LazySimpleSerDe does not properly handle arrays / escape control characters

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078642#comment-13078642 ] 

Amareshwari Sriramadasu commented on HIVE-2333:
-----------------------------------------------

This seems related to HIVE-2303. I propose that we always escape the delimiters, regardless of whether escaping is configured.
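
A minimal sketch of what always escaping the delimiters could look like, assuming a backslash escape character and the default \x01 and \x02 delimiters. All names here (escape, split_unescaped, write_array, read_array) are hypothetical and are not part of Hive.

```python
# Sketch of unconditional delimiter escaping (hypothetical helpers, NOT Hive's API).
# Assumptions: backslash is the escape character; \x01 and \x02 are the delimiters.
ESC = "\\"
FIELD_SEP = "\x01"
ITEM_SEP = "\x02"
SPECIAL = {ESC, FIELD_SEP, ITEM_SEP}


def escape(text):
    """Prefix every delimiter and every escape character with ESC."""
    return "".join(ESC + ch if ch in SPECIAL else ch for ch in text)


def split_unescaped(data, sep):
    """Split on sep, skipping escaped characters and unescaping as we go."""
    parts, current, i = [], [], 0
    while i < len(data):
        ch = data[i]
        if ch == ESC and i + 1 < len(data):
            current.append(data[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def write_array(items):
    return ITEM_SEP.join(escape(s) for s in items)


def read_array(data):
    return split_unescaped(data, ITEM_SEP)


if __name__ == "__main__":
    for s in ("foo\x02bar", "foo\x01bar", "a\\b"):
        print(read_array(write_array([s])) == [s])
```

Escaping addresses only the control-character problem. In this sketch write_array([]) and write_array(['']) still both produce an empty string, so the empty-array ambiguity from the report would need a separate fix. A row-level reader would also have to split fields with the same escape-aware logic.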



        

[jira] [Updated] (HIVE-2333) LazySimpleSerDe does not properly handle arrays / escape control characters

Posted by "Adam Kramer (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kramer updated HIVE-2333:
------------------------------

    Priority: Critical  (was: Major)

