Posted to user@hive.apache.org by Sreeman <sr...@gmail.com> on 2015/02/13 05:19:50 UTC

CSV file reading in hive

Hi All,

How are all of you creating Hive/Impala tables when the CSV file has some
values with a comma inside them? For example:

sree,12345,"payment made,but it is not successful"

I know the OpenCSV SerDe exists, but it is not available in Hive versions
lower than 0.14.0.
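
To make the problem concrete, here is a minimal Python sketch (not part of the original post) of what happens when the sample line is split on every comma, which is roughly how a plain comma-delimited table would treat it. The variable names are illustrative only.

```python
# Sample row from the question: three logical fields, the last one quoted.
line = 'sree,12345,"payment made,but it is not successful"'

# Splitting on every comma ignores the double quotes entirely.
naive_fields = line.split(",")

print(len(naive_fields))  # 4 fields instead of the intended 3
print(naive_fields)
```

The quoted value is cut in two at its embedded comma, so every column after it would shift by one.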

 


Re: CSV file reading in hive

Posted by "sreebalineni ." <sr...@gmail.com>.
Hi Furcy,
That's a lot of information. Thanks a lot.

Re: CSV file reading in hive

Posted by Furcy Pin <fu...@flaminem.com>.
Hi Sreeman,

Unfortunately, I don't think Hive's built-in format can currently read
CSV files with fields enclosed in double quotes.
More generally, having ingested quite a lot of messy CSV files myself,
I would recommend writing a MapReduce (or Spark) job
to clean your CSV before giving it to Hive. This is what I did.
The other kinds of issues I have run into include:

   - Files not encoded in UTF-8, making special characters unreadable for
   Hive
   - Some lines with missing or extra columns, which could shift your
   columns and ruin your stats
   - Some lines with unreadable characters (probably data corruption)
   - I even got some lines with Java stack traces in them

I hope your CSV is cleaner than that. If you have control over how it is
generated, I would recommend replacing your current separator with a tab
(and replacing inline tabs with \t), or something like that.
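
As a rough illustration of that kind of cleaning step, here is a minimal single-machine Python sketch rather than a MapReduce or Spark job. It is not the tool mentioned in this message. The function name clean_csv and the expected_columns parameter are hypothetical, and escaping embedded newlines as \n is an extra assumption beyond the tab replacement suggested above.

```python
import csv
import io


def clean_csv(src, dst, expected_columns):
    """Rewrite quoted CSV as tab-separated text and return the rejected rows."""
    rejected = []
    for row in csv.reader(src):
        # Rows with missing or extra columns are set aside instead of loaded.
        if len(row) != expected_columns:
            rejected.append(row)
            continue
        # Replace real tabs (and, as an assumption, newlines) inside fields
        # with the two-character sequences \t and \n.
        cleaned = [f.replace("\t", "\\t").replace("\n", "\\n") for f in row]
        dst.write("\t".join(cleaned) + "\n")
    return rejected


raw = (
    'sree,12345,"payment made,but it is not successful"\n'
    "bad,row\n"
)
out = io.StringIO()
rejected = clean_csv(io.StringIO(raw), out, expected_columns=3)
print(out.getvalue())
print(rejected)
```

The cleaned output can then be loaded into a table declared with tab as the field terminator. For real files, opening the input with an explicit encoding is one way to surface the non-UTF-8 problem early.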

There might be some open source tools for data cleaning already out there.
I plan to release mine one day, once I've migrated it to Spark maybe, and
if my company agrees.

If you're lazy, I heard that Dataiku Studio (which has a free version) can
do this kind of thing, though I have never used it myself.

Hope this helps,

Furcy




Re: CSV file reading in hive

Posted by Slava Markeyev <sl...@upsight.com>.
You can use LazySimpleSerDe with ROW FORMAT DELIMITED FIELDS TERMINATED
BY ',' ESCAPED BY '\\'. Check the DDL documentation for details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
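
Note that ESCAPED BY relies on embedded commas being preceded by the escape character rather than wrapped in double quotes, so a file like the sample would need rewriting first. Here is a minimal Python sketch of that conversion, assuming backslash as the escape character; the function name to_escaped is hypothetical.

```python
import csv
import io


def to_escaped(src, dst):
    """Re-emit quoted CSV without quotes, backslash-escaping embedded commas."""
    writer = csv.writer(
        dst,
        quoting=csv.QUOTE_NONE,  # never wrap fields in quotes
        escapechar="\\",         # prefix embedded delimiters with a backslash
        lineterminator="\n",
    )
    for row in csv.reader(src):
        writer.writerow(row)


out = io.StringIO()
to_escaped(io.StringIO('sree,12345,"payment made,but it is not successful"\n'), out)
print(out.getvalue())  # sree,12345,payment made\,but it is not successful
```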






-- 

Slava Markeyev | Engineering | Upsight
Find me on LinkedIn <http://www.linkedin.com/in/slavamarkeyev>

Re: CSV file reading in hive

Posted by Alexander Pivovarov <ap...@gmail.com>.
A Hive CSV SerDe is available for all Hive versions:

https://github.com/ogrodnek/csv-serde


DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER  "
DEFAULT_SEPARATOR        ,


-- alternatively, put the jar on the hive/hadoop/mr classpath
-- on all boxes in the cluster
add jar path/to/csv-serde.jar;

-- you can use custom separator/quote/escape

create table my_table(a string, b string, ...)
 row format serde 'com.bizo.hive.serde.csv.CSVSerde'
 with serdeproperties (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
  )
 stored as textfile
;
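
As a quick sanity check of what these three settings mean, the sample row from the question can be parsed with Python's csv module configured with the same separator, quote and escape characters as the defaults listed above. This only illustrates the parsing rules using Python's own parser. It does not run the SerDe itself, whose edge-case behaviour may differ.

```python
import csv

line = 'sree,12345,"payment made,but it is not successful"'

# Same three characters as the defaults listed above.
reader = csv.reader([line], delimiter=",", quotechar='"', escapechar="\\")
fields = next(reader)

print(fields)  # ['sree', '12345', 'payment made,but it is not successful']
```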


