You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2008/12/08 22:36:44 UTC

[jira] Created: (HIVE-136) SerDe should escape some special characters

SerDe should escape some special characters
-------------------------------------------

                 Key: HIVE-136
                 URL: https://issues.apache.org/jira/browse/HIVE-136
             Project: Hadoop Hive
          Issue Type: Bug
            Reporter: Zheng Shao


MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
Otherwise the data will look corrupted.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Hammerbacher updated HIVE-136:
-----------------------------------

    Component/s: Serializers/Deserializers

Adding to "Serializers/Deserializers" component.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

    Attachment: HIVE-136.3.patch

Added some comments.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch, HIVE-136.3.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

    Attachment: HIVE-136.5.patch

Removed LazyString from GroupByOperator.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch, HIVE-136.3.patch, HIVE-136.4.patch, HIVE-136.5.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

    Attachment: HIVE-136.4.patch

Added the missing functionality. Also added some comments in GroupByOperator.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch, HIVE-136.3.patch, HIVE-136.4.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726284#action_12726284 ] 

Namit Jain commented on HIVE-136:
---------------------------------

Other than that, it looks good.

+1

I will commit if after that change if tests pass.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch, HIVE-136.3.patch, HIVE-136.4.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666830#action_12666830 ] 

Joydeep Sen Sarma commented on HIVE-136:
----------------------------------------

ignore the comment on '\n' -> '\\n' - looks like it's already covered. the jira is not showing the backslashes - but i happened to look at an unread email and it much more readable.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

    Description: 
MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
Otherwise the data will look corrupted.

We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
So LazySimpleSerDe needs to have the capability of escaping and unescaping.


  was:
MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
Otherwise the data will look corrupted.


       Priority: Blocker  (was: Critical)
       Assignee: Zheng Shao

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670887#action_12670887 ] 

Joydeep Sen Sarma commented on HIVE-136:
----------------------------------------

sorry for dropping the ball on this:

the case that concerned me was the case of transform. say we have an input file with "\005" that was produced outside hive. based on what i understand - the proposal is to convert it to code point 5. i feel uncomfortable with this - i would rather that we pass thru this data and let the user (either in the transform script or via an explicit UDF) deal with it.  same concern with \0. (let's say i dump out a directory listing from a windows machine - "\0" fragment would be for a filename beginning with char '0' - instead if we unescape it to character 0 and pass it to the script - it would make it difficult to analyze this kind of data. I am also not sure how different languages would deal with a null character - some (like C) might drop the part beyond the null character altogether).

so my vote would be to keep the unescaping to the bare minimum required and provide other functions to provide any enhanced semantics.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-136:
-------------------------------

    Priority: Critical  (was: Major)

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-136:
----------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed. Thanks Zheng

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch, HIVE-136.3.patch, HIVE-136.4.patch, HIVE-136.5.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

    Attachment: HIVE-136.1.patch

This patch adds the capability of escaping, following the MySQL way.

The command to create an escaped table is:
{code}
CREATE TABLE table1 (a STRING, b STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\';
{code}


> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>         Attachments: HIVE-136.1.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666348#action_12666348 ] 

Zheng Shao commented on HIVE-136:
---------------------------------

1. The code does allow ROW SEPARATOR to be different from '\n';

2. For deserialization, we might deserialize from user-generated data (data generated outside of hive). If that data do escape a single back-slash to be double back-slashes, then 

Basically the principle of escaping in the serialization phase is to escape as less as possible, so in most cases, the user does not need to care about escaping when reading the data (if the data do not contain these special characters). And if the user does do unescaping, he just needs to follow the steps in the deserialization mentioned above - without considering what is the actual column separator.

The principle of unescaping in the deserialization phase is to unescape as much as possible (well, if the user writes single backslash to be double backslash, then the only reason that there is still single backslash in the file is because user escaped something.) This also helps the user in that his escaping code does not need to be changed for different column separators.



> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726280#action_12726280 ] 

Namit Jain commented on HIVE-136:
---------------------------------

LazyFactory does not support boolean

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch, HIVE-136.3.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726283#action_12726283 ] 

Namit Jain commented on HIVE-136:
---------------------------------

Also, remove the reference to LazyString from GroupByOperator.


All primitives have 3 representations: Integer, IntWritable, LazyInteger
Same goes for String.

We know in groupby, since we have converted to Standard Writable, it cannot be LazyString.

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch, HIVE-136.3.patch, HIVE-136.4.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654585#action_12654585 ] 

Zheng Shao commented on HIVE-136:
---------------------------------

The problem breaks down to 3 items:

1. When loading the data, the user would be responsible for providing data in the expected escaped format. For example, in order to represent in the text file one field containing a carriage return, it should be written as "\n". Sequence file does not have to escape "\n" but that will be recommended. Both file formats would need to escape column/item/key separator.

2. When the serde is reading the data, it should unescape the data. This means string.split need to be replaced by a scan, which will have a little performance loss.

3. When the serde is writing the data, it should escape the data. This also means a scan of the string and conversion to a new string, which also have a little performance loss.


> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Zheng Shao
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667347#action_12667347 ] 

Zheng Shao commented on HIVE-136:
---------------------------------

> if more aggressive unescaping is required - we can always provide unescape UDFs. That would be much better since we could have some standard semantics for the unescaping (json/html/xml etc)

Let's say the content of the file contains something like "\ \ 0 0 5", and also "\ 0 0 5". After the unescaping, both of them will be "\ 0 0 5" so UDF won't be able to unescape it further.


I think it's OK that the escaping and unescaping are not exactly the same, given that a string escaped using our logic will be unescaped back to the original value.

The extra logic in unescaping is only used to deal with "impossible" cases - the cases that will never happen using our own escaping logic. In that case we can either throw an error or "guess" what the user means.  My guess is that most other escaping logic will escape "\" to "\ \" (if they escape something using "\", they have to escape "\" too). And they don't do special handling according to the actual separator (most of them escape all special characters). So if there is actually a "\ 0 0 5", it is most likely be the ascii code 5.


I am also OK if we just escape (and unescape reflectively) all characters outside the range of 32-127, plus "\" and all separators.  That makes the logic simpler. But it is still a question what if we see an "impossible" sequence of characters in the escaped stream, e.g. "\ 0 4 0 " (0x20, 32, space).


> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

    Priority: Critical  (was: Blocker)

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666627#action_12666627 ] 

Joydeep Sen Sarma commented on HIVE-136:
----------------------------------------

for #1 - so if the row separator can be other than newline - i am guessing u want to generalize to escaping to whatever the row separator is (as opposed to specifically escaping newlines) - right?

also - if there is a literal '\n' sequence in the original string - are we going to convert into '\\n' during serialization and then match '\\' back in deserialization (so as to leave '\n' back in deserialized string). generalize to other separators as well of course.

#2 - I am uncomfortable unescaping more than what we escape. for example - let's say there's a \XXX symbol in some text file. Now we need to feed it into a transform script. we unescape when reading the text file - but we do not escape it back when sending to script. So the user script does not see the raw data in the file. 

Presumably - if the user had put \XXX in the file - then they knew how to handle it in their scripts and we are needlessly tampering with this data.

if more aggressive unescaping is required  - we can always provide unescape UDFs. That would be much better since we could have some standard semantics for the unescaping (json/html/xml etc)



> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666346#action_12666346 ] 

Joydeep Sen Sarma commented on HIVE-136:
----------------------------------------

looks like in right direction. questions:

- i like the logic of escaping row/column/key/item separators. do we allow anything other than newline to be row separator?

- for deserialization - shouldn't we only unescape what we might have escaped during serialization? (so should not be unescaping octals and nulls for example)




> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670894#action_12670894 ] 

Zheng Shao commented on HIVE-136:
---------------------------------

In HIVE-270 I put a link to how MySQL does escaping. That seems to be a clean solution although it won't support string.split or even splitting lines using "\n".

If we change "\005" with "\\" in your case, it won't work with the symmetric escaping/unescaping logic as well.

What about this? We explicitly allow users to enable/disable escaping. If the user disable it, then we just do nothing (and the data may look corrupt to them - but that's the only thing we can do). If the user enable it, then we do the escaping/unescaping logic as I mentioned (not symmetric, but try to guess what the user really means).


> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665737#action_12665737 ] 

Zheng Shao commented on HIVE-136:
---------------------------------

Proposal:

1. For serialization:
* \ -> \\
* newline -> \n
* carriage return -> \r

The following characters are escaped only if they are column/item/key separators or quotations:
* null character -> \000 (octal number)
* ^A -> \001
* ^B -> \002
...
* tab -> \t
...


2. For deserialization:
* \\ -> \
* \n -> newline
* \r -> carriage return
* \xxx (where xxx are octal number from 000 to 177 (127 in decimal) )
* \0 -> null character
* \ (without a match above) -> \


In this proposal, we don't support quotation (" and '). Quotation will be required to read mysql/oracle dumped data, but I hope to address it in a different serde since we need to distinguish the 


> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Priority: Critical
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

        Fix Version/s: 0.4.0
    Affects Version/s: 0.3.1
                       0.3.0
               Status: Patch Available  (was: Open)

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.3.0, 0.3.1
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-136) SerDe should escape some special characters

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-136:
----------------------------

    Attachment: HIVE-136.2.patch

This patch fixed all test failures (except the existing failure in trunk).

> SerDe should escape some special characters
> -------------------------------------------
>
>                 Key: HIVE-136
>                 URL: https://issues.apache.org/jira/browse/HIVE-136
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Critical
>         Attachments: HIVE-136.1.patch, HIVE-136.2.patch
>
>
> MetadataTypedColumnsetSerDe and DynamicSerDe should escape some special characters like '\n' or the column/item/key separator.
> Otherwise the data will look corrupted.
> We plan to deprecate MetadataTypedColumnsetSerDe and DynamicSerDe for the simple delimited format, and use LazySimpleSerDe instead.
> So LazySimpleSerDe needs to have the capability of escaping and unescaping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.