Posted to dev@hive.apache.org by "ronan stokes (JIRA)" <ji...@apache.org> on 2014/11/09 17:23:33 UTC
[jira] [Commented] (HIVE-8763) Support for use of enclosed quotes in LazySimpleSerde
[ https://issues.apache.org/jira/browse/HIVE-8763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203979#comment-14203979 ]
ronan stokes commented on HIVE-8763:
------------------------------------
To avoid any performance issues, the SerDe modifications will not support embedded record delimiters in quoted strings. For example, if the source data uses newline (UTF-8 0x0a) as the record delimiter, the modifications will neither handle a newline inside a quoted string specially nor disallow it.
Because handling embedded record delimiters requires changes to the underlying input format, I am not proposing to address them with these modifications.
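To illustrate the limitation, here is a minimal Python sketch (not Hive code; the sample row is made up for illustration). It assumes a line-oriented reader that splits the input on the record delimiter before any field-level quote handling runs, which is why the fix would belong in the input format rather than in the SerDe:

```python
# Hypothetical illustration: a line-oriented reader splits records on the
# record delimiter before any field-level quote handling can run.
data = '100,"3.5 inch hard drive,\nquantity 10",2650.30\n'

# Split on newline and drop the empty string left after the final newline.
records = data.split("\n")[:-1]

for record in records:
    print(repr(record))
```

The single logical row arrives as two separate records, each with an unbalanced quote, so quote handling at the field level never sees the whole row.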
> Support for use of enclosed quotes in LazySimpleSerde
> -----------------------------------------------------
>
> Key: HIVE-8763
> URL: https://issues.apache.org/jira/browse/HIVE-8763
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Affects Versions: 0.11.0, 0.12.0, 0.13.0, 0.13.1
> Environment: many - verified on CentOS / Red Hat with CDH
> Reporter: ronan stokes
>
> Currently the LazySimpleSerde does not support the use of quotes for delimited fields to allow use of separators within a quoted field - this means having to use alternatives for many common use cases for CSV style data.
> Key scenarios that do not work include:
> (3 column row for int, string, float delimited by ',')
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 100, "3.5 "" hard drive, quantity 10", 2650.30
> 100,"3.5 "" hard drive, quantity 10",2650.30
> To address this, I have implemented a number of fixes in the deserialization stage of a copy of the LazySimpleSerde:
> For serialization, the code is unchanged with the relevant embedded characters being escaped.
> Assuming a row with 3 fields - SKU ID, description, price, delimited by ','
> 1) allow use of enclosed quotes around a string field
> For example
> 100,"3.5 inch hard drive, quantity 10",2650.30
> 2) support escaping of quotes within field to allow use of embedded quote
> 100,"3.5 \" hard drive, quantity 10",2650.30
> 3) support for old style CSV embedded quotes
> For example
> 100,"3.5 "" hard drive, quantity 10",2650.30
> 4) support for skipping of leading spaces in field
> For example (note space between first ',' and opening quote)
> 100, "3.5 "" hard drive, quantity 10", 2650.30
> In each case, with these changes the row is evaluated as though the delimiters and embedded quotes had been escaped, e.g.
> 100, 3.5 \" hard drive\, quantity 10, 2650.30
> All of these behaviours are enabled or disabled using serde properties: one for the quote character, one for whether enclosed quotes are supported, and one for whether doubled embedded quotes are treated as a single quote (of the same character type).
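The four cases in the description can be sketched as a small quote-aware splitter. This is a Python illustration only, not the proposed patch. The function name split_quoted and its parameters are hypothetical stand-ins for the serde properties, whose actual names the description does not give. The sketch also returns unescaped field values instead of the escaped form shown above.

```python
def split_quoted(line, sep=",", quote='"', escape="\\",
                 enclosed_quotes=True, doubled_quotes=True):
    """Split one record into fields, honouring quotes (illustrative only)."""
    fields = []
    buf = []
    in_quotes = False
    at_field_start = True
    i = 0
    n = len(line)
    while i < n:
        ch = line[i]
        if at_field_start and not in_quotes and ch == " ":
            # Case 4: skip leading spaces before the field content.
            i += 1
            continue
        if ch == escape and i + 1 < n:
            # Case 2: an escaped character is taken literally.
            buf.append(line[i + 1])
            at_field_start = False
            i += 2
            continue
        if enclosed_quotes and ch == quote:
            if not in_quotes and at_field_start:
                # Case 1: an opening quote starts an enclosed field.
                in_quotes = True
                at_field_start = False
                i += 1
                continue
            if in_quotes:
                if doubled_quotes and i + 1 < n and line[i + 1] == quote:
                    # Case 3: a doubled quote stands for one literal quote.
                    buf.append(quote)
                    i += 2
                    continue
                # Otherwise this quote closes the enclosed field.
                in_quotes = False
                i += 1
                continue
        if ch == sep and not in_quotes:
            # An unquoted separator ends the current field.
            fields.append("".join(buf))
            buf = []
            at_field_start = True
            i += 1
            continue
        buf.append(ch)
        at_field_start = False
        i += 1
    fields.append("".join(buf))
    return fields


rows = [
    '100,"3.5 inch hard drive, quantity 10",2650.30',
    r'100,"3.5 \" hard drive, quantity 10",2650.30',
    '100,"3.5 "" hard drive, quantity 10",2650.30',
    '100, "3.5 "" hard drive, quantity 10", 2650.30',
]
for row in rows:
    print(split_quoted(row))
```

Each sample row yields three fields, and the last three rows all produce the same description value containing a single embedded quote.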
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)