Posted to dev@hive.apache.org by "Alon Goldshuv (JIRA)" <ji...@apache.org> on 2014/11/06 11:24:34 UTC

[jira] [Commented] (HIVE-7777) Add CSV Serde based on OpenCSV

    [ https://issues.apache.org/jira/browse/HIVE-7777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200053#comment-14200053 ] 

Alon Goldshuv commented on HIVE-7777:
-------------------------------------

The serde works, but it has an issue that is quite serious IMO: it forces all column types to string. This means that a query over data that isn't all string-typed can return wrong results. The unit tests contain only a single example, a table with all string columns. The tests linked here include many tables with non-string types, but all the queries seem to be simple COUNT(*), which won't catch the problem.

Consider the following example:

{noformat}
CREATE EXTERNAL TABLE test (totalprice DECIMAL(38,10)) 
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde' with 
serdeproperties ("separatorChar" = ",","quoteChar"= "'","escapeChar"= "\\") 
STORED AS TEXTFILE 
LOCATION '<some location>' 
tblproperties ("skip.header.line.count"="1");
{noformat}

Now consider this query:

hive> select min(totalprice) from test;

In this case, given my data, the result should have been 874.89, but the actual result was 100001.57, because that value comes first under the byte ordering of a string type. This is a wrong result.
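
As a minimal sketch of the effect and of a possible query-side workaround (assuming the serde exposes totalprice as a string, and reusing the two values mentioned above), the comparison and an explicit cast could look like this:

{noformat}
-- String comparison is lexicographic: '1' sorts before '8', so this is true.
select '100001.57' < '874.89';

-- Possible workaround (an assumption, not a fix in the serde itself):
-- cast to the intended type so that min() compares numerically.
select min(cast(totalprice as decimal(38,10))) from test;
{noformat}

The cast only works around the problem per query; the serde would still report the column as string.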

hive> desc extended test;
OK
o_totalprice        	string              	from deserializer
...

I apologize if this is a false alarm and I'm misusing the DDL somehow. Otherwise, this is a concern, as wrong query results are a bad thing...


> Add CSV Serde based on OpenCSV
> ------------------------------
>
>                 Key: HIVE-7777
>                 URL: https://issues.apache.org/jira/browse/HIVE-7777
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Ferdinand Xu
>            Assignee: Ferdinand Xu
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7777.1.patch, HIVE-7777.2.patch, HIVE-7777.3.patch, HIVE-7777.patch, csv-serde-master.zip
>
>
> There is no official CSV SerDe support in Hive, although there is an open source project on GitHub (https://github.com/ogrodnek/csv-serde). CSV is a very commonly used data format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)