Posted to dev@hive.apache.org by "Luke Forehand (Commented) (JIRA)" <ji...@apache.org> on 2011/10/03 22:43:35 UTC

[jira] [Commented] (HIVE-1898) The ESCAPED BY clause does not seem to pick up newlines in colums and the line terminator cannot be changed

    [ https://issues.apache.org/jira/browse/HIVE-1898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119581#comment-13119581 ] 

Luke Forehand commented on HIVE-1898:
-------------------------------------

I don't think 'ESCAPED BY' will solve your problem.  From what I have gathered, 'ESCAPED BY' translates into a SerDe property that sets the escape character, which defaults to '\'.  It does not let you specify which characters get escaped.

Looking at the source code that initializes LazySimpleSerDe.SerDeParameters, only the default separator chars (char codes 1, 2, 3) are flagged as needing escaping.
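
As a rough illustration of that behaviour (a Python sketch, not Hive's actual Java code; the names ESCAPE_CHAR, NEEDS_ESCAPE and escape_field are made up for this example, and the real SerDe may flag other characters as well), only the flagged separator characters get the escape character in front of them, so a newline passes through untouched:

```python
ESCAPE_CHAR = "\\"                       # the default escape character
NEEDS_ESCAPE = {"\x01", "\x02", "\x03"}  # default separators (char codes 1, 2, 3)

def escape_field(value):
    """Prefix every flagged character with the escape character."""
    out = []
    for ch in value:
        if ch in NEEDS_ESCAPE:
            out.append(ESCAPE_CHAR)
        out.append(ch)
    return "".join(out)

# The separator (code 1) is escaped; the newline is emitted as-is.
escaped = escape_field("a\x01b\nc")
print(repr(escaped))
```

Under these assumptions there is no setting of the escape character that would make the newline come out escaped.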

I have data that contains newlines as well.  As a workaround, I pass the column through the regexp_replace function, like this:

regexp_replace(<my column>, "\n", "")

which eliminates the newline chars during deserialization.  However, I don't like doing this and would appreciate alternative solutions if there are any.  Anyone?
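
To show why the newline has to go, here is a minimal Python sketch (not Hive code; write_rows, read_rows and the sample rows are hypothetical, and it assumes the replacement is applied before the rows are written out as text). A line-based reader splits the input into records on "\n" before any field is parsed, so an embedded newline turns one row into two, while stripping it first keeps the row count intact:

```python
FIELD_SEP = "\x01"  # Hive's default field separator (char code 1)

def write_rows(rows, strip_newlines=False):
    """Serialize rows as newline-terminated, FIELD_SEP-delimited text."""
    lines = []
    for row in rows:
        if strip_newlines:
            # Same effect as regexp_replace(<my column>, "\n", "")
            row = [col.replace("\n", "") for col in row]
        lines.append(FIELD_SEP.join(row))
    return "\n".join(lines) + "\n"

def read_rows(text):
    """Split into records on newline first, then split each record into fields."""
    return [line.split(FIELD_SEP) for line in text.split("\n") if line]

rows = [["1", "first line\nsecond line"], ["2", "plain"]]

broken = read_rows(write_rows(rows))
fixed = read_rows(write_rows(rows, strip_newlines=True))

print(len(broken))  # 3 records: the first row was split in two
print(len(fixed))   # 2 records, but the newline in the data is lost
```

The second result is why I consider this only a workaround: the rows survive, but the original newlines do not.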
                
> The ESCAPED BY clause does not seem to pick up newlines in colums and the line terminator cannot be changed
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1898
>                 URL: https://issues.apache.org/jira/browse/HIVE-1898
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.5.0
>            Reporter: Josh Patterson
>            Priority: Minor
>
> If I want to preserve data in columns which contain a newline (webcrawling for instance) I cannot set the ESCAPED BY clause to escape these out (other characters such as commas escape fine, however). This may be because the line terminators, which are locked to be newlines, are picked up first, and only then are the fields processed.
> This seems to be related to:
> "SerDe should escape some special characters"
> https://issues.apache.org/jira/browse/HIVE-136
> and
> "Implement "LINES TERMINATED BY""
> https://issues.apache.org/jira/browse/HIVE-302
> where at comment: https://issues.apache.org/jira/browse/HIVE-302?focusedCommentId=12793435&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12793435
> "This is not fixable currently because the line terminator is determined by LineRecordReader.LineReader which is in the Hadoop land."
