Posted to user@hive.apache.org by Bruce Bian <we...@gmail.com> on 2012/05/31 12:17:29 UTC

confused on different behavior of Bucketized tables do not support INSERT INTO

Hi,
I've got a table vt_new_data which is defined as follows:
CREATE TABLE VT_NEW_DATA
(
     V_ACCOUNT_NUM string
    ,V_ACCOUNT_MODIFIER_NUM string
    ,V_DEPOSIT_TYPE_CD string
    ,V_DEPOSIT_TERM int
    ,V_LEDGER_SUBJECT_ID string
    ,V_ACCOUNTING_ORG_CD string
    ,V_OPEN_DT string
    ,V_CLOSE_DT string
    ,V_CURRENCY_CD string
    ,V_ACCOUNT_BAL float
    ,V_INNER_MONTH_DELAY_ACCUM float
) CLUSTERED BY (V_ACCOUNT_NUM,V_ACCOUNT_MODIFIER_NUM) SORTED BY
(V_ACCOUNT_NUM,V_ACCOUNT_MODIFIER_NUM ASC) INTO 256 BUCKETS STORED AS
RCFile;

when I execute the following query (this is just a test):
explain insert into table vt_new_data select * from vt_new_data limit 1;
the following error occurs:
FAILED: Error in semantic analysis: Bucketized tables do not support
INSERT INTO: Table: vt_new_data

but when I execute the query:
explain insert into table vt_new_data
select /*+ MAPJOIN(T4) */
     t1.account_num as v_account_num
    ,t1.account_modifier_num as v_account_modifier_num
    ,'3006' as v_deposit_type_cd
    ,0 as  v_deposit_term
    ,'23201000' v_ledger_subject_id
    ,coalesce(t2.party_id,'')  as v_accounting_org_cd
    ,coalesce(t3.card_begin_dt,'19000101') as v_open_dt
    ,coalesce(t3.card_live_dt,'19000101') as v_close_dt
    ,coalesce(t4.currency_cd,substr(t1.account_modifier_num,3,3)) as v_currency_cd
    ,coalesce(t4.agt_amt,0) as v_account_bal
    ,0 as v_inner_month_delay_accum
from t03_e_cash_bucket t1
left outer join t03_agt_amount_h_bucket t4
    on t1.account_num=t4.account_num
    and t1.account_modifier_num=t4.account_modifier_num
    and t4.agt_amt_type_cd = '001'
    and t4.start_date<='$TXNDATE'
    and t4.end_date>'$TXNDATE'
left outer join t01_party_card_rela_h_bucket t2
    on  t1.card_no = t2.card_no
    and t2.party_card_rela_type_cd = '01'
    and t2.start_date<='$TXNDATE'
    and t2.end_date>'$TXNDATE'
left outer join t03_card_bucket t3
    on t1.card_no = t3.card_no;

the execution plan is generated successfully and an SMB Map Join is triggered,
which is great.

But I don't see the difference here, as both queries insert into a bucketized
and sorted table?

Re: confused on different behavior of Bucketized tables do not support INSERT INTO

Posted by Mark Grover <mg...@oanda.com>.
Hi Bruce,
It seems to me that your issue may be two-fold.

1) The "insert into <table>" support introduced in Hive 0.8 (https://issues.apache.org/jira/browse/HIVE-306) only works for non-bucketed partitions or tables. This is because appending to a table (or to its partition, if it's partitioned) works by adding a new file within the folder on HDFS. Since bucketing in a partition is implemented as files within a folder, this would break bucketing. To allow insert into support for bucketed tables, we would probably have to regenerate the entire bucket(s) within the partition. Not sure if this is being done or has already been done in a newer release.
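
A rough, hedged sketch of that "regenerate the buckets" idea (just an illustration, not something Hive does for you; vt_new_data_stage is a hypothetical table with the same schema holding the rows to be appended):

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
-- rebuild the whole table so all 256 buckets stay correctly hashed and sorted
insert overwrite table vt_new_data
select * from (
    select * from vt_new_data
    union all
    select * from vt_new_data_stage
) u;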

2) It seems like there was a minor bug in the implementation. As mentioned in this ticket (https://issues.apache.org/jira/browse/HIVE-3064), "insert into <table>" works as "insert overwrite <table>" if the table name in the insert statement has upper case characters (black magic, eh?). Regardless of whether your table was created using upper or lower case letters, this bug manifests itself if you use upper case characters for the table name in your insert into table query.
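
To make that concrete (purely illustrative; dummy and src are hypothetical non-bucketed tables, not ones from this thread):

-- because of HIVE-3064, the upper-case table name below makes the statement
-- behave like insert overwrite, replacing existing rows instead of appending:
insert into table DUMMY select * from src;
-- the all-lower-case form appends as expected:
insert into table dummy select * from src;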

To summarize, you shouldn't be calling insert into on bucketed tables for the reasons stated above. Also, if you are using the insert into command, try to restrict your table names to lower case letters.

I have also verified that "insert into table dummy select * from dummy" works when dummy is a non-bucketed table. As a matter of fact, "insert overwrite table dummy select * from dummy" works too.

I would have expected your query "insert into table test2 select * from test;" to fail since test2 is bucketed as well. However, it doesn't. This seems to be a bug and I have created a JIRA for this (https://issues.apache.org/jira/browse/HIVE-3077).

Hope that helps,
Mark

Re: confused on different behavior of Bucketized tables do not support INSERT INTO

Posted by Bruce Bian <we...@gmail.com>.
So I did another test on this.
hive> create table test(foo int,bar string) clustered by(foo) sorted by (foo asc) into 2 buckets;
OK
Time taken: 0.097 seconds
hive> create table test2 (foo int,bar string) clustered by(foo) sorted by (foo asc) into 2 buckets;
OK
hive> LOAD DATA LOCAL INPATH 'hive/examples/files/kv1.txt' OVERWRITE INTO TABLE test;
hive> set hive.enforce.bucketing=true;
hive> set hive.enforce.sorting=true;
hive> insert into table test2 select * from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
…………………………………………………………
hive> insert into table test2 select * from test2;
FAILED: Error in semantic analysis: Bucketized tables do not support INSERT INTO: Table: test2

Seems like the error "FAILED: Error in semantic analysis: Bucketized tables do
not support INSERT INTO: Table: vt_new_data" is only thrown when inserting into
a bucketized table from the same table? And when inserting into a bucketized
table multiple times, it creates an original_file_copy_n file under the same
bucket.

-rw-r--r--   3 wbian supergroup       2856 2012-05-31 22:03 /user/hive/warehouse/test2/000000_0
-rw-r--r--   3 wbian supergroup       2856 2012-05-31 22:04 /user/hive/warehouse/test2/000000_0_copy_1
-rw-r--r--   3 wbian supergroup       2956 2012-05-31 22:03 /user/hive/warehouse/test2/000001_0
-rw-r--r--   3 wbian supergroup       2956 2012-05-31 22:04 /user/hive/warehouse/test2/000001_0_copy_1

And since what I want to do is an SMB Map Join, the following triggered the
SMB Map Join successfully:
set hive.optimize.bucketmapjoin= true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select /*+mapjoin(test)*/ * from pokes join test on pokes.foo=test.foo;
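
As a side note, and only as a hedged sketch (not something from this thread): a sorted-merge bucket map join generally needs both sides bucketed and sorted on the join key, with compatible bucket counts. If pokes were not already bucketed, a hypothetical bucketed copy could be prepared like this:

create table pokes_bucket (foo int, bar string)
clustered by (foo) sorted by (foo asc) into 2 buckets;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
insert overwrite table pokes_bucket select * from pokes;
select /*+ mapjoin(test) */ * from pokes_bucket join test on pokes_bucket.foo=test.foo;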

So what's the reason for throwing that error (I mean, why not support inserting
into a bucketized table from the same table)? And isn't that error message kind
of misleading?

Re: confused on different behavior of Bucketized tables do not support INSERT INTO

Posted by Bruce Bian <we...@gmail.com>.
I'm using hive 0.9.0
