Posted to issues@hive.apache.org by "Niklaus Xiao (JIRA)" <ji...@apache.org> on 2017/06/05 11:51:05 UTC

[jira] [Updated] (HIVE-12412) Multi insert queries fail to run properly in hive 1.1.x or later.

     [ https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niklaus Xiao updated HIVE-12412:
--------------------------------
    Affects Version/s: 2.3.0

> Multi insert queries fail to run properly in hive 1.1.x or later.
> -----------------------------------------------------------------
>
>                 Key: HIVE-12412
>                 URL: https://issues.apache.org/jira/browse/HIVE-12412
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0, 1.1.0, 2.3.0
>            Reporter: John P. Petrakis
>              Labels: Correctness, CorrectnessBug
>
> We use multi insert queries to take data from one table, manipulate it, and insert it into a results table.  Queries are of this form:
> from (select * from data_table lateral view explode(data_table.f2) f2 as explode_f2) as explode_data_table  
>        insert overwrite table results_table partition (q_id='C.P1',rl='1') 
>        select 
>        array(cast(if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as String),cast(explode_f2.s1 as String)) as dimensions, 
>        ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics, 
>        null as rownm 
>        where (explode_data_table.date_id between 20151016 and 20151016)
>        group by 
>        if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1),
>        explode_f2.s1         
>        INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0') 
>        SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS dimensions, 
>        ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>        null AS rownm 
>        WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016) 
>        INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0') 
>        SELECT 
>        ARRAY(CAST('Total' as String)) AS dimensions, 
>        ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics, 
>        null AS rownm 
>        WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
> This query is meant to total a given field of each struct in a column that holds a list of structs.  For our test data set, which consists of a single row, the summation yields NULL, with messages in the Hive log of this nature:
> Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
> or "Extra fields detected..."
> For significantly more data, this query will eventually cause a runtime error while processing a column, ultimately an ArrayIndexOutOfBoundsException in one of the lazy binary classes such as LazyBinaryString or LazyBinaryStruct.
> Using the query above from the Hive command line, the following data was used (note that there are tabs in the data below):
> string one	one:1.0:1.00:10.0,eon:1.0:1.00:100.0
> string two	two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
> string thr	three:3.0:3.00:30.0
> string fou	four:4.0:4.00:40.0
> There are two fields: a string (e.g. 'string one') and a list of structs.  The following is used to create the table:
> create table if not exists t1 (
>     f1 string,
>     f2 array<struct<s1:string,d1:double,d2:double,d3:double>>
> )
> partitioned by (client_id string, date_id string)
> row format delimited
>     fields terminated by '09'
>     collection items terminated by ','
>     map keys terminated by ':'
>     lines terminated by '10'
> location '/user/hive/warehouse/t1';
> (The numeric delimiters '09' and '10' are interpreted by Hive as decimal byte values, i.e. tab and newline.)
> And the following is used to load the data:
> load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE  into table t1  partition(client_id='987654321',date_id='20151016');
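> As a quick sanity check (a hypothetical helper query, not part of the original report), the lateral view can be exercised on its own; it should return the seven exploded rows:
> select f1, explode_f2.s1, explode_f2.d1
> from t1 lateral view explode(t1.f2) f2 as explode_f2
> where date_id between 20151016 and 20151016;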
> The resulting table should yield the following:
> ["string fou","four"]	[4.0]	null	C.P1	1	
> ["string one","eon"]	[1.0]	null	C.P1	1	
> ["string one","one"]	[1.0]	null	C.P1	1	
> ["string thr","three"]	[3.0]	null	C.P1	1	
> ["string two","otw"]	[2.0]	null	C.P1	1	
> ["string two","two"]	[2.0]	null	C.P1	1	
> ["string two","wott"]	[2.0]	null	C.P1	1	
> ["Total","Total"]	[15.0]	null	C.P2	0	
> ["Total"]	[15.0]	null	C.P5	0	
> However, what we get is:
> Hive Runtime Error while processing row {"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array index out of bounds exception)
> If we reduce the above data to a SINGLE row, then we don't get an exception, but the total fields come out as NULL.
> The ONLY ways this query would work were:
> 1) adding a group by (date_id), or even a group by (''), as the last line of the query, or removing the last where clause from the final insert (the reason we have a where clause on the initial select at all is that push-down predicates were not working, causing a complete scan of the entire table);
> or
> 2) removing the group by clauses entirely;
> or
> 3) setting hive.multigroupby.singlereducer = false.
> Once 3) is used, no modification of the query is needed; it all runs (see the sketch below).  This is very similar in nature to the following:
> https://issues.apache.org/jira/browse/HIVE-2750
> which is supposedly fixed and closed.
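> For reference, a minimal sketch of applying workaround 3) from the Hive CLI, assuming the t1 table and data above (hive.multigroupby.singlereducer is an existing Hive setting; the comment reflects what the report describes, not a verified plan analysis):
> set hive.multigroupby.singlereducer=false;
> -- Re-run the multi insert query above, unchanged. With the multi-group-by
> -- single-reducer optimization disabled, the query runs to completion and
> -- the totals come out as 15.0 rather than NULL.
> And a sketch of the group by ('') variant from workaround 1): only the final insert branch of the same multi insert statement changes, gaining a constant group by so that every branch has one:
> INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0')
> SELECT
> ARRAY(CAST('Total' as String)) AS dimensions,
> ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
> null AS rownm
> WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
> GROUP BY ('')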



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)