Posted to user@hive.apache.org by Eric Gottschalk <eg...@gmail.com> on 2009/01/14 22:38:29 UTC

Problem with delimiter

I am trying to load a text file that uses ascii(254) as the delimiter.
Column #1 is time
Column #2 is ID

Contents of this file (test_data.txt):
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
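
(A quick way to confirm which byte actually separates the fields — a minimal editorial sketch, assuming a local copy of test_data.txt; InputStream.read() returns each byte as an unsigned 0-255 value:)

import java.io.FileInputStream;
import java.io.IOException;

public class DelimiterCheck {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("test_data.txt")) {
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                if (b > 127) {
                    // any non-ASCII byte on the first line is the suspected delimiter
                    System.out.println("delimiter byte: " + b);  // prints 254 for 0xFE
                }
            }
        }
    }
}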

I create a table in Hive with the following syntax:

hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '254' STORED AS TEXTFILE;
OK
Time taken: 8.891 seconds

After this, I load the file

hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data.txt'  
OVERWRITE INTO TABLE test;
Loading data to table test
OK
Time taken: 0.623 seconds

To make sure the data is loaded correctly, I select only the first
column (time):

hive> select time from test;
Total MapReduce jobs = 1
Starting Job = job_200901141536_0006, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200901141536_0006
Kill Command = /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_200901141536_0006
map = 0%,  reduce =0%
map = 50%,  reduce =0%
map = 100%,  reduce =0%
Ended Job = job_200901141536_0006
OK
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
11-25-2008-14:37:44þ546460819054541434
11-25-2008-15:05:34þ546460554662946399
11-25-2008-15:06:21þ556460862552067034
11-25-2008-15:17:51þ576460863857903582
Time taken: 12.25 seconds

Clearly, the columns were not parsed correctly; the delimiter byte is still embedded in the output.

To narrow the problem down, I created a version of this file where the
delimiter is a tab.

Contents of this file (test_data_tab.txt):
11-25-2008-15:05:34	546460554662946399
11-25-2008-15:06:21	556460862552067034
11-25-2008-15:17:51	576460863857903582
11-25-2008-14:37:44	546460819054541434
11-25-2008-15:05:34	546460554662946399
11-25-2008-15:06:21	556460862552067034
11-25-2008-15:17:51	576460863857903582
11-25-2008-14:37:44	546460819054541434
11-25-2008-15:05:34	546460554662946399
11-25-2008-15:06:21	556460862552067034
11-25-2008-15:17:51	576460863857903582
11-25-2008-14:37:44	546460819054541434
11-25-2008-15:05:34	546460554662946399
11-25-2008-15:06:21	556460862552067034
11-25-2008-15:17:51	576460863857903582

I use nearly identical syntax to create the new table. The only change
is the field delimiter: '9' (tab).

hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '9' STORED AS TEXTFILE;
OK
Time taken: 0.105 seconds

Next I load the file

hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data_tab.txt'  
OVERWRITE INTO TABLE test;
Loading data to table test
OK
Time taken: 0.179 seconds

I check whether the data has been loaded, using the same query:

hive> select time from test;
Total MapReduce jobs = 1
Starting Job = job_200901141536_0007, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200901141536_0007
Kill Command = /home/hadoop/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_200901141536_0007
map = 0%,  reduce =0%
map = 50%,  reduce =0%
map = 100%,  reduce =0%
Ended Job = job_200901141536_0007
OK
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-14:37:44
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-14:37:44
11-25-2008-15:05:34
11-25-2008-15:06:21
11-25-2008-15:17:51
11-25-2008-14:37:44
Time taken: 10.553 seconds

Success. The import with ascii(9) worked, but ascii(254) failed. Is
there any way to load the original data file with the ascii(254) delimiter?



RE: Problem with delimiter

Posted by Ashish Thusoo <at...@facebook.com>.
The problem here is that bytes are signed in Java; as a result, I think we are limited to ASCII values 0 to 127 for delimiters. Please file a JIRA for this and I can get a fix out to you to handle this case.
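
(To illustrate the signed-byte behavior Ashish describes — an editorial sketch, not code from the thread; the Byte.parseByte calls are an assumption standing in for however the delimiter string is actually interpreted:)

public class SignedByteDemo {
    public static void main(String[] args) {
        byte b = (byte) 254;                 // two's complement: 254 - 256 = -2
        System.out.println(b);               // prints -2
        System.out.println(b & 0xFF);        // masking with 0xFF recovers 254
        System.out.println(Byte.parseByte("-2"));   // -2, the same 0xFE byte
        System.out.println(Byte.parseByte("254"));  // throws NumberFormatException: out of byte range
    }
}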

Ashish

RE: Problem with delimiter

Posted by Ashish Thusoo <at...@facebook.com>.
I did not see this.

I tried '-2' and it does get 254 as the delimiter. I do think we should fix this to accept proper ascii values, though.
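
(Based on that, a hedged sketch of the workaround in the same session style as above — exact behavior may depend on the Hive build, but '-2' is the signed-byte spelling of 254:)

hive> CREATE TABLE test(time STRING, ID STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2' STORED AS TEXTFILE;
hive> LOAD DATA INPATH 'hdfs://localhost:54310/test_data.txt' OVERWRITE INTO TABLE test;
hive> select time from test;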

Ashish

Re: Problem with delimiter

Posted by Zheng Shao <zs...@gmail.com>.
Hi Eric,

Characters above 127 are probably NOT supported, since we assume the text is in UTF-8.
You might still want to try "-2" in case that works.

I don't know what encoding will support characters from 0 to 255.
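
(For what it's worth, ISO-8859-1, a.k.a. Latin-1, maps every byte 0-255 one-to-one to the corresponding Unicode code point, so it round-trips arbitrary byte values — a minimal editorial sketch:)

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        byte[] original = { '1', (byte) 0xFE, '2' };  // a digit, the 254 delimiter, a digit
        String s = new String(original, StandardCharsets.ISO_8859_1);
        byte[] back = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(original, back));  // true: every byte survives
    }
}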

Zheng

Re: Problem with delimiter

Posted by Eric Gottschalk <eg...@gmail.com>.
The extended table description for table #1:

hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '254' STORED AS TEXTFILE;
OK
Time taken: 0.089 seconds
hive> describe extended test;
OK
time	string
id	string
Detailed Table Information:
Table(tableName:test,dbName:default,owner:hadoop,createTime:1231969614,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:time,type:string,comment:null), FieldSchema(name:id,type:string,comment:null)],location:hdfs://localhost:54310/test,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=254,field.delim=254}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
Time taken: 0.105 seconds

For good measure, the table with the tab:

hive> CREATE TABLE test(time STRING, ID string) ROW FORMAT DELIMITED  
FIELDS TERMINATED BY '9' STORED AS TEXTFILE;
OK
Time taken: 0.124 seconds
hive> describe extended test;
OK
time	string
id	string
Detailed Table Information:
Table(tableName:test,dbName:default,owner:hadoop,createTime:1231969673,lastAccessTime:0,retention:0,sd:StorageDescriptor(cols:[FieldSchema(name:time,type:string,comment:null), FieldSchema(name:id,type:string,comment:null)],location:hdfs://localhost:54310/test,inputFormat:org.apache.hadoop.mapred.TextInputFormat,outputFormat:org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat,compressed:false,numBuckets:-1,serdeInfo:SerDeInfo(name:null,serializationLib:org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe,parameters:{serialization.format=9,field.delim=9}),bucketCols:[],sortCols:[],parameters:{}),partitionKeys:[],parameters:{})
Time taken: 0.087 seconds



RE: Problem with delimiter

Posted by Ashish Thusoo <at...@facebook.com>.
Can you do a

describe extended test

after you create it with the first CREATE TABLE statement, and send out the output?

Ashish