Posted to dev@hive.apache.org by "ronan stokes (JIRA)" <ji...@apache.org> on 2014/11/09 17:23:33 UTC

[jira] [Commented] (HIVE-8763) Support for use of enclosed quotes in LazySimpleSerde

    [ https://issues.apache.org/jira/browse/HIVE-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203979#comment-14203979 ] 

ronan stokes commented on HIVE-8763:
------------------------------------

To avoid any performance issues, the SerDe modifications will not support embedded record delimiters in quoted strings. For example, if the source data uses newline (UTF-8 0x0a) as the record delimiter, the modifications will do nothing specific to handle a newline inside a quoted string, nor will they disallow it.

As handling of embedded record delimiters requires changes to the underlying input format, I am not proposing to handle embedded record delimiters with these modifications. 
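
To illustrate why this is an input format concern rather than a SerDe concern, here is a minimal Python sketch (not Hive code). It assumes the record reader splits the raw text on the record delimiter before any per-record parsing happens; the sample data and variable names are made up for the example.

```python
# Hypothetical sample: two logical rows, the first with a newline
# embedded inside its quoted description field.
raw = '100,"3.5 inch hard drive,\nquantity 10",2650.30\n101,"cable",4.99\n'

record_delimiter = "\n"

# Model of a line-oriented reader: it knows nothing about quotes,
# so it cuts the input at every record delimiter it sees.
records = [r for r in raw.split(record_delimiter) if r]

for r in records:
    print(repr(r))

# Two logical rows arrive as three physical records, so the
# per-record parser never sees the first row in one piece.
print(len(records))  # 3
```

By the time a record reaches the deserializer the quoted field has already been cut in two, which is why quote handling in the SerDe alone cannot recover it.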

> Support for use of enclosed quotes in LazySimpleSerde
> -----------------------------------------------------
>
>                 Key: HIVE-8763
>                 URL: https://issues.apache.org/jira/browse/HIVE-8763
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.11.0, 0.12.0, 0.13.0, 0.13.1
>         Environment: many - verified on Centos / Redhat with CDH
>            Reporter: ronan stokes
>
> Currently the LazySimpleSerde does not support the use of quotes for delimited fields to allow use of separators within a quoted field - this means having to use alternatives for many common use cases for CSV style data. 
> Key scenarios that do not work include:
> (3 column row for int, string, float delimited by ',')
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 100,  "3.5 "" hard drive, quantity 10",  2650.30
> 100,"3.5 "" hard drive, quantity 10",2650.30
> To address this, I have implemented a number of fixes in the deserialization stage of a copy of the LazySimpleSerde; they are listed below.
> For serialization, the code is unchanged: the relevant embedded characters are escaped.
> Assuming a row with 3 fields - SKU ID, description, price, delimited by ','
> 1) allow use of enclosed quotes around a string field 
> For example 
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 2) support escaping of quotes within field to allow use of embedded quote
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 3) support for old style CSV embedded quotes 
> for example 
> 100,"3.5 "" hard drive, quantity 10",2650.30
> 4) support for skipping of leading spaces in field
> For example (note space between first ',' and opening quote)
> 100,  "3.5 "" hard drive, quantity 10",  2650.30
> In each case, with the changes, these rows are evaluated as though the delimiters and embedded quotes had been escaped:
> e.g.
> 100, 3.5 \" hard drive\, quantity 10,  2650.30
> All of these behaviours are enabled or disabled using serde properties: the quote character (quotechar), whether enclosed quotes are supported, and whether doubled embedded quotes are treated as a single quote (of the same character type).
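
The four cases in the description can be sketched as a small field splitter. This is an illustrative Python sketch, not the patch itself and not the LazySimpleSerde code. The function name `split_fields` and its parameters are hypothetical stand-ins for the serde properties mentioned above. It also assumes that spaces are skipped only when they lead up to an opening quote, and that anything between a closing quote and the next separator is ignored.

```python
def split_fields(line, sep=",", quote='"', escape="\\",
                 doubled_quotes=True, skip_leading_spaces=True):
    """Split one record into field strings, honouring enclosing quotes.

    Hypothetical helper for illustration only.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        # Look past leading spaces to see whether the field is quoted.
        j = i
        if skip_leading_spaces:
            while j < n and line[j] == " ":
                j += 1
        buf = []
        if j < n and line[j] == quote:
            i = j + 1  # step over the opening quote
            while i < n:
                c = line[i]
                if c == escape and i + 1 < n:
                    # Escaped character (e.g. \"): keep it literally.
                    buf.append(line[i + 1])
                    i += 2
                elif c == quote:
                    if doubled_quotes and i + 1 < n and line[i + 1] == quote:
                        # Old-style CSV "" stands for one quote character.
                        buf.append(quote)
                        i += 2
                    else:
                        i += 1  # closing quote
                        break
                else:
                    buf.append(c)
                    i += 1
            # Assumption: ignore anything between the closing quote
            # and the next separator.
            while i < n and line[i] != sep:
                i += 1
        else:
            # Unquoted field: read up to the next unescaped separator,
            # keeping any leading spaces as they are.
            while i < n and line[i] != sep:
                if line[i] == escape and i + 1 < n:
                    buf.append(line[i + 1])
                    i += 2
                else:
                    buf.append(line[i])
                    i += 1
        fields.append("".join(buf))
        if i >= n:
            break
        i += 1  # step over the separator
    return fields


rows = [
    '100,"3.5 inch hard drive, quantity 10",2650.30',
    r'100,"3.5 \" hard drive, quantity 10",2650.30',
    '100,"3.5 "" hard drive, quantity 10",2650.30',
    '100,  "3.5 "" hard drive, quantity 10",  2650.30',
]
for row in rows:
    print(split_fields(row))
```

Each sample row yields three fields, with the separator inside the quoted description preserved and the embedded quote reduced to a single `"`. Passing `doubled_quotes=False` or `skip_leading_spaces=False` switches off the corresponding behaviour, in the spirit of the per-feature properties described above.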



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)