You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Ayon Sinha <ay...@yahoo.com> on 2011/07/26 23:13:39 UTC

URGENT: Hive not respecting escaped delimiter characters

We have database dumps with TAB delimiters. The fields with TAB have them escaped in the dumped text file. But HIVE does not respect escaped delimiters so 
create external table scratch.delete_me (a int, b int, c bigint, d string, e int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/tmp/users';


creates rows with value of e as NULL for some rows.

Hive also does not allow multi-character delimiters for ROW FORMAT DELIMITED spec. 

What is the cleanest way to get past this problem? Options are:
1. Write custom SerDe class
2. Use RegexSerde
3. Remove escaped delimiter chars from data

I need to know the roadblocks before I invest time on any one of them.
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.

Re: URGENT: Hive not respecting escaped delimiter characters

Posted by Ayon Sinha <ay...@yahoo.com>.
I'm confused by this:
https://issues.cloudera.org/browse/SQOOP-111?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel


So does Hive actually support the escape characters?
I am testing it as we speak.
 
-Ayon
See My Photos on Flickr
Also check out my Blog for answers to commonly asked questions.



________________________________
From: Edward Capriolo <ed...@gmail.com>
To: user@hive.apache.org; Ayon Sinha <ay...@yahoo.com>
Sent: Tuesday, July 26, 2011 2:38 PM
Subject: Re: URGENT: Hive not respecting escaped delimiter characters





On Tue, Jul 26, 2011 at 5:13 PM, Ayon Sinha <ay...@yahoo.com> wrote:

We have database dumps with TAB delimiters. The fields with TAB have them escaped in the dumped text file. But HIVE does not respect escaped delimiters so 
>create external table scratch.delete_me (a int, b int, c bigint, d string, e int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/tmp/users';
>
>
>
>creates rows with value of e as NULL for some rows.
>
>
>Hive also does not allow multi-character delimiters for ROW FORMAT DELIMITED spec. 
>
>
>What is the cleanest way to get past this problem? Options are:
>1. Write custom SerDe class
>2. Use RegexSerde
>3. Remove escaped delimiter chars from data
>
>
>I need to know the roadblocks before I invest time on any one of them.
> 
>-Ayon
>See My Photos on Flickr
>Also check out my Blog for answers to commonly asked questions.
>
Yes.
1) Not a bad solution but you when you start having to write custom InputFormats and Serdes per each table you get annoyed
2) Regex serde has poor performance vs delimited because of the complexity of regex. 
3) This is how I would do it. Sure it means changes upstream but hey that is not your problem. Your the hive guy :)

Re: URGENT: Hive not respecting escaped delimiter characters

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Jul 26, 2011 at 5:13 PM, Ayon Sinha <ay...@yahoo.com> wrote:

> We have database dumps with TAB delimiters. The fields with TAB have them
> escaped in the dumped text file. But HIVE does not respect escaped
> delimiters so
> create external table scratch.delete_me (a int, b int, c bigint, d string,
> e int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
> LOCATION '/tmp/users';
>
> creates rows with value of e as NULL for some rows.
>
> Hive also does not allow multi-character delimiters for ROW FORMAT
> DELIMITED spec.
>
> What is the cleanest way to get past this problem? Options are:
> 1. Write custom SerDe class
> 2. Use RegexSerde
> 3. Remove escaped delimiter chars from data
>
> I need to know the roadblocks before I invest time on any one of them.
>
> -Ayon
> See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
> Also check out my Blog for answers to commonly asked questions.<http://dailyadvisor.blogspot.com>
>

Yes.
1) Not a bad solution but you when you start having to write custom
InputFormats and Serdes per each table you get annoyed
2) Regex serde has poor performance vs delimited because of the complexity
of regex.
3) This is how I would do it. Sure it means changes upstream but hey that is
not your problem. Your the hive guy :)