You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Niklaus Xiao (JIRA)" <ji...@apache.org> on 2017/06/05 11:51:05 UTC
[jira] [Updated] (HIVE-12412) Multi insert queries fail to run
properly in hive 1.1.x or later.
[ https://issues.apache.org/jira/browse/HIVE-12412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niklaus Xiao updated HIVE-12412:
--------------------------------
Affects Version/s: 2.3.0
> Multi insert queries fail to run properly in hive 1.1.x or later.
> -----------------------------------------------------------------
>
> Key: HIVE-12412
> URL: https://issues.apache.org/jira/browse/HIVE-12412
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.2.0, 1.1.0, 2.3.0
> Reporter: John P. Petrakis
> Labels: Correctness, CorrectnessBug
>
> We use multi insert queries to take data in one table and manipulate it by inserting it into a results table. Queries are of this form:
> from (select * from data_table lateral view explode(data_table.f2) f2 as explode_f2) as explode_data_table
> insert overwrite table results_table partition (q_id='C.P1',rl='1')
> select
> array(cast(if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1) as String),cast(explode_f2.s1 as String)) as dimensions,
> ARRAY(CAST(sum(explode_f2.d1) as Double)) as metrics,
> null as rownm
> where (explode_data_table.date_id between 20151016 and 20151016)
> group by
> if(explode_data_table.f1 is null or explode_data_table.f1='', 'UNKNOWN',explode_data_table.f1),
> explode_f2.s1
> INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P2',rl='0')
> SELECT ARRAY(CAST('Total' as String),CAST('Total' as String)) AS dimensions,
> ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
> null AS rownm
> WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
> INSERT OVERWRITE TABLE results_table PARTITION (q_id='C.P5',rl='0')
> SELECT
> ARRAY(CAST('Total' as String)) AS dimensions,
> ARRAY(CAST(sum(explode_f2.d1) as Double)) AS metrics,
> null AS rownm
> WHERE (explode_data_table.date_id BETWEEN 20151016 AND 20151016)
> This query is meant to total a given field of a struct that is potentially a list of structs. For our test data set, which consists of a single row, the summation yields "Null", with messages in the hive log of the nature:
> Missing fields! Expected 2 fields but only got 1! Ignoring similar problems.
> or "Extra fields detected..."
> For significantly more data, this query will eventually cause a run time error while processing a column (caused by array index out of bounds exception in one of the lazy binary classes such as LazyBinaryString or LazyBinaryStruct).
> Using the query above from the hive command line, the following data was used:
> (note there are tabs in the data below)
> string one one:1.0:1.00:10.0,eon:1.0:1.00:100.0
> string two two:2.0:2.00:20.0,otw:2.0:2.00:20.0,wott:2.0:2.00:20.0
> string thr three:3.0:3.00:30.0
> string fou four:4.0:4.00:40.0
> There are two fields, a string, (eg. 'string one') and a list of structs. The following is used to create the table:
> create table if not exists t1 (
> f1 string,
> f2 array<struct<s1:string,d1:double,d2:double,d3:double>>
> )
> partitioned by (clid string, date_id string)
> row format delimited fields
> terminated by '09'
> collection items terminated by ','
> map keys terminated by ':'
> lines terminated by '10'
> location '/user/hive/warehouse/t1';
> And the following is used to load the data:
> load data local inpath '/path/to/data/file/cplx_test.data2' OVERWRITE into table t1 partition(client_id='987654321',date_id='20151016');
> The resulting table should yield the following:
> ["string fou","four"] [4.0] null C.P1 1
> ["string one","eon"] [1.0] null C.P1 1
> ["string one","one"] [1.0] null C.P1 1
> ["string thr","three"] [3.0] null C.P1 1
> ["string two","otw"] [2.0] null C.P1 1
> ["string two","two"] [2.0] null C.P1 1
> ["string two","wott"] [2.0] null C.P1 1
> ["Total","Total"] [15.0] null C.P2 0
> ["Total"] [15.0] null C.P5 0
> However what we get is:
> Hive Runtime Error while processing row {"_col2":2.5306499719322744E-258,"_col3":""} (ultimately due to an array index out of bounds exception)
> If we reduce the above data to a SINGLE row, the we don't get an exception but the total fields come out as NULL.
> The ONLY way this query would work is
> 1) if I added a group by (date_id) or even group by ('') as the last line in the query... or removed the last where clause for the final insert. (The reason why we have a where clause on the initial select is due to push down predicates not working... and causing a complete scan of the entire table).
> or
> 2) removing the group by clauses entirely
> or using
> 3) hive.multigroupby.singlereducer = false
> Once 3) is used no modification of the query is needed... it all runs. This is very similar in nature to the following:
> https://issues.apache.org/jira/browse/HIVE-2750
> which is supposedly fixed and closed.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)