You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Jim Krehl <ji...@rd.io> on 2012/10/05 04:21:38 UTC
Lateral Views and Multi Table Insert
Hi,
I have a question about using lateral views with multi table insert.
I have a table of data that represents raw log data, the structure of
which makes it onerous to query directly largely because it requires
UNIONTYPE columns. So, I transform that raw table into 3 new tables,
a primary table and 2 1-to-many tables.
The raw table is similar to this:
CREATE TABLE IF NOT EXISTS events_raw (
event_id STRING,
event_data_0 INT,
event_data_1 BIGINT,
packed_event_data_2 UNIONTYPE <
INT,
STRUCT <
event_data_2:INT,
event_data_2_sub_0:BOOLEAN,
event_data_2_sub_1:BOOLEAN>>,
packed_event_data_3 UNIONTYPE <
BIGINT,
ARRAY <
STRUCT <
event_data_3_metadata_key:STRING,
event_data_3_metadata_value:STRING>>>,
packed_event_data_4 UNIONTYPE <
BOOLEAN,
STRUCT <
event_data_4:BOOLEAN,
event_data_4_metadata:ARRAY <
STRUCT <
event_data_4_metadata_key:STRING,
event_data_4_metadata_value:STRING>>>>);
This is to be transformed into these tables:
CREATE TABLE IF NOT EXISTS events (
event_id STRING,
event_data_0 INT,
event_data_1 BIGINT,
event_data_2 INT,
event_data_2_sub_0 BOOLEAN,
event_data_2_sub_1 BOOLEAN,
event_data_3 BIGINT,
event_data_4 BOOLEAN);
CREATE TABLE IF NOT EXISTS event_data_3_metadata (
event_id STRING,
metadata_key STRING,
metadata_value STRING);
CREATE TABLE IF NOT EXISTS event_data_4_metadata (
event_id STRING,
metadata_key STRING,
metadata_value STRING);
The only way I know how to unpack and/or explode the UNIONTYPEs is to
create custom UDTFs for each UNIONTYPE column. For example, I created
an unpack_packed_event_data_2 function which maps an single
UnionObject to a STRUCT<event_data_2:INT, event_data_2_sub_0:BOOLEAN,
event_data_2_sub_1:BOOLEAN>. Similarly, I created UDTFs to explode
the ARRAY elements contained in the UNIONTYPE columns.
Using those UDTFs I devised these queries to build the transformed tables:
FROM
events_raw
LATERAL VIEW
unpack_event_data_2 (packed_event_data_2) event_data_2_struct AS
event_data_2,
event_data_2_sub_0,
event_data_2_sub_1
LATERAL VIEW
unpack_event_data_3 (packed_event_data_3) event_data_3_struct AS
event_data_3
LATERAL VIEW
unpack_event_data_4 (packed_event_data_4) event_data_4_struct AS
event_data_4
INSERT INTO TABLE events
SELECT
event_id,
event_data_0,
event_data_1,
event_data_2_struct.event_data_2,
event_data_2_struct.event_data_2_sub_0,
event_data_2_struct.event_data_2_sub_1,
event_data_3_struct.event_data_3,
event_data_4_struct.event_data_4);
FROM
events_raw
LATERAL VIEW
explode_event_data_3 (packed_event_data_3) event_data_3_array_element AS
metadata_key,
metadata_value
INSERT INTO TABLE event_data_3_metadata
SELECT
event_id,
event_data_3_array_element.metadata_key,
event_data_3_array_element.metadata_value);
FROM
events_raw
LATERAL VIEW
explode_event_data_4 (packed_event_data_4) event_data_4_array_element AS
metadata_key,
metadata_value
INSERT INTO TABLE event_data_3_metadata
SELECT
event_id,
event_data_4_array_element.metadata_key,
event_data_4_array_element.metadata_value);
This works correctly, the tables are filled with the appropriate
number of rows. However, the raw table is scanned 3 times to
accomplish this and that is very costly given the amount of data.
When I combine those 3 statements into one Multi Table Insert:
FROM
events_raw
LATERAL VIEW
unpack_event_data_2 (packed_event_data_2) event_data_2_struct AS
event_data_2,
event_data_2_sub_0,
event_data_2_sub_1
LATERAL VIEW
unpack_event_data_3 (packed_event_data_3) event_data_3_struct AS
event_data_3
LATERAL VIEW
unpack_event_data_4 (packed_event_data_4) event_data_4_struct AS
event_data_4
LATERAL VIEW
explode_event_data_3 (packed_event_data_3) event_data_3_array_element AS
metadata_key,
metadata_value
LATERAL VIEW
explode_event_data_4 (packed_event_data_4) event_data_4_array_element AS
metadata_key,
metadata_value
INSERT INTO TABLE events
SELECT
event_id,
event_data_0,
event_data_1,
event_data_2_struct.event_data_2,
event_data_2_struct.event_data_2_sub_0,
event_data_2_struct.event_data_2_sub_1,
event_data_3_struct.event_data_3,
event_data_4_struct.event_data_4
INSERT INTO TABLE event_data_3_metadata
SELECT
event_id,
event_data_3_array_element.metadata_key,
event_data_3_array_element.metadata_value
INSERT INTO TABLE event_data_4_metadata
SELECT
event_id,
event_data_4_array_element.metadata_key,
event_data_4_array_element.metadata_value;
The query fails with:
[Hive Error]: Query returned non-zero code: 10, cause: FAILED: Error
in semantic analysis: Column packed_event_data_3 Found in more than
One Tables/Subqueries.
I don't know how to get around having separate unpack_event_data_3 and
explode_event_data_3 functions. Combining them would seem to marry
the functions' output signatures and in instances when the BIGINT is
type of the UNIONTYPE there shouldn't be a row of NULL values in the
event_data_4_metadata table and vice versa.
Is there a better way to do this?
Thanks,
Jim Krehl