You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "John P. Petrakis (JIRA)" <ji...@apache.org> on 2015/11/13 23:09:11 UTC
[jira] [Created] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.

John P. Petrakis created HIVE-12412:
---------------------------------------

             Summary: Multi insert queries fail to run properly in hive 1.1.x or later.
                 Key: HIVE-12412
                 URL: https://issues.apache.org/jira/browse/HIVE-12412
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.1.0, 1.2.0
            Reporter: John P. Petrakis


We use multi insert queries to take data in one table and manipulate it by inserting it into a results table.  Queries are of this form:

from (select * from data_table lateral view explode(data_table.f2) f2 as explode_f2) as explode_data_table  
       insert overwrite table results_table partition (q_id='C.P1',rl='1') 
       select 
       array(cast(if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as String),cast(explode_f2.s1 as String)) as dimensions, 
       ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics, 
       null as rownm 
       where (explode_data_table.date_id between 20151016 and 20151016)
       group by 
       if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1),
       explode_f2.s1         
       INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0') 
       SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS dimensions, 
       ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
       null AS rownm 
       WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) 
       INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0') 
       SELECT 
       ARRAY(CAST('Total' as String)) AS dimensions, 
       ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
       null AS rownm 
       WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)

This query is meant to total a given field of a struct that is potentially a list of structs.  For our test data set, which consists of a single row, the summation yields "Null",  with messages in the hive log of the nature:

Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
or "Extra fields detected..."

For significantly more data, this query will eventually cause a run time error while processing a column (caused by array index out of bounds exception in one of the lazy binary classes such as LazyBinaryString or LazyBinaryStruct).

Using the query above from the hive command line, the following data was used:
(note there are tabs in the data below)

string one	one:1.0:1.00:10.0,eon:1.0:1.00:100.0
string two	two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
string thr	three:3.0:3.00:30.0
string fou	four:4.0:4.00:40.0

There are two fields, a string, (eg. 'string one') and a list of structs.  The following is used to create the table:

create table if not exists t1 (
                         f1 string, 
                          f2 array<struct<s1:string,d1:double,d2:double,d3:double>>
                         )
                          partitioned by (clid string, date_id string) 
                          row format delimited fields 
                         terminated by '09' 
                         collection items terminated by ',' 
                         map keys terminated by ':'
                         lines terminated by '10' 
                         location '/user/hive/warehouse/t1';

And the following is used to load the data:

load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE  into table t1  partition(client_id='987654321',date_id='20151016');

The resulting table should yield the following:
["string fou","four"]	[4.0]	null	C.P1	1	
["string one","eon"]	[1.0]	null	C.P1	1	
["string one","one"]	[1.0]	null	C.P1	1	
["string thr","three"]	[3.0]	null	C.P1	1	
["string two","otw"]	[2.0]	null	C.P1	1	
["string two","two"]	[2.0]	null	C.P1	1	
["string two","wott"]	[2.0]	null	C.P1	1	
["Total","Total"]	[15.0]	null	C.P2	0	
["Total"]	[15.0]	null	C.P5	0	

However what we get is:
Hive Runtime Error while processing row {"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array index out of bounds exception)

If we reduce the above data to a SINGLE row, the we don't get an exception but the total fields come out as NULL.

The ONLY way this query would work is 
1) if I added a group by (date_id) or even group by ('') as the last line in the query... or removed the last where clause for the final insert.  (The reason why we have a where clause on the initial select is due to push down predicates not working... and causing a complete scan of the entire table).

or 
2) removing the group by clauses entirely 

or using
3) hive.multigroupby.singlereducer = false

Once 3) is used no modification of the query is needed... it all runs.  This is very similar in nature to the following:
https://issues.apache.org/jira/browse/HIVE-2750
which is supposedly fixed and closed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)