Posted to user@hive.apache.org by Tom Davis <ma...@gmail.com> on 2018/04/30 08:13:47 UTC

Exploding arrays of structs

Hi all,


I have a table with a lot of logging / metrics data that looks like this:


id: STRING
host_details: ARRAY<STRUCT<key: STRING, value: STRUCT>>
metrics: ARRAY<STRUCT<key: STRING, value: STRUCT>>


I would like to be able to extract known keys and their values from each of
the host_details and metrics arrays of structs. Yes, this format sucks and
could be a map, but technically the keys aren't unique, although I am happy
to make that compromise at this point. In reality the table has more fields
in each struct.
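
For illustration, a row might look like this (made-up values):

id           = "host-1234"
host_details = [{"key": "os", "value": "linux"}, {"key": "dc", "value": "eu-west"}]
metrics      = [{"key": "error_count", "value": "7"}, {"key": "cpu_load", "value": "0.42"}]

and I would like to project, say, error_count and os out as columns.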


In Spark or MR, I could extract all the fields I need and project them to
columns with only a single pass over the table. I’m struggling to achieve
the same in Hive, which may be either my SQL ability or that there isn’t a
UDF available to meet my needs.


The most efficient approach I have found so far uses the collect()
Brickhouse UDF and two lateral views generated with the inline() UDTF:


SELECT
  t.id,
  collect(t.key, t.value) AS map_h_m
FROM
  (SELECT id, hds.key AS key, hds.value AS value
   FROM metrics
   LATERAL VIEW inline(metrics.host_details) hds
   UNION ALL
   SELECT id, ms.key AS key, ms.value AS value
   FROM metrics
   LATERAL VIEW inline(metrics.metrics) ms) t
GROUP BY t.id


I can create this as a VIEW and then extract columns from the map using the
standard map_h_m['error_count'] notation.
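
For example, assuming the view is called metrics_kv (the view name and the
cpu_load key are made up):

SELECT
  id,
  map_h_m['error_count'] AS error_count,
  map_h_m['cpu_load']    AS cpu_load
FROM metrics_kv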


Using collect() / GROUP BY seems more efficient than doing a join by id (I
think it cuts out a shuffle), but it still has problems:

   - it still scans the metrics table twice, once per branch of the UNION ALL;
   - it assumes that keys are unique across both host_details and metrics
     (one possible workaround is sketched after this list).
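
One way I could work around the second problem (untested sketch) is to tag
each key with the array it came from, at the cost of uglier map keys:

SELECT
  t.id,
  collect(concat(t.src, ':', t.key), t.value) AS map_h_m
FROM
  (SELECT id, 'hd' AS src, hds.key AS key, hds.value AS value
   FROM metrics
   LATERAL VIEW inline(metrics.host_details) hds
   UNION ALL
   SELECT id, 'm' AS src, ms.key AS key, ms.value AS value
   FROM metrics
   LATERAL VIEW inline(metrics.metrics) ms) t
GROUP BY t.id

Lookups then become map_h_m['m:error_count'], but the double scan remains.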


Does anyone have any thoughts before I attempt to write my own UDF(s) to
assist?


Many thanks,

Tom

Re: Exploding arrays of structs

Posted by Furcy Pin <pi...@gmail.com>.
Hi Tom,

High-level manipulation of arrays of structs is generally where SQL reaches
its limits in terms of expressivity and where Spark starts to get my
preference. Unfortunately I don't know any public UDF that can do exactly
what you want, but coding one is not that hard.
If you use Scala, Spark UDFs are much easier to write (but not to share, I
guess).

Concerning the self-UNION ALL, a common trick to avoid it is to create an
array containing what you need and explode it immediately, like this:

explode(ARRAY(host_details, metrics))

Of course this can only work if host_details and metrics have the same type.
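
In your case they do (per your correction, both are
ARRAY<STRUCT<key: STRING, value: STRING>>), so something like this
(untested sketch, reusing your Brickhouse collect()) should get it down to
a single scan of the table:

SELECT
  id,
  collect(kv.key, kv.value) AS map_h_m
FROM metrics
LATERAL VIEW explode(ARRAY(host_details, metrics)) ex AS arr
LATERAL VIEW inline(ex.arr) kv
GROUP BY id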
Another trick I learned recently that could help you: I think you can call
"metrics.key", which will give you the array of all keys (like
metrics.map(_.key) in Scala Spark).
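
For example (untested; I alias the table since it shares its name with the
metrics column):

SELECT id, m.metrics.key AS metric_keys
FROM metrics m

metric_keys would then be an ARRAY<STRING> holding every key in the row's
metrics array.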

Hope this helps

Furcy.

On Mon, 30 Apr 2018, 11:51 Tom Davis, <ma...@gmail.com> wrote:

> Just spotted a typo in my earlier email; the table structure is:
>
> id : STRING
> host_details : ARRAY<STRUCT<key : STRING, value : STRING>>
> metrics : ARRAY<STRUCT<key : STRING, value : STRING>>
>
> I.e. value is a STRING, not a STRUCT.
>
> Thanks,
>
> Tom

Re: Exploding arrays of structs

Posted by Tom Davis <ma...@gmail.com>.
Just spotted a typo in my earlier email; the table structure is:

id : STRING
host_details : ARRAY<STRUCT<key : STRING, value : STRING>>
metrics : ARRAY<STRUCT<key : STRING, value : STRING>>

I.e. value is a STRING, not a STRUCT.

Thanks,

Tom
