You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Burak Gürsoy <bu...@gmx.net> on 2014/02/11 17:04:15 UTC

Views and partitions and performance

Hi,

Some background information:

We have a really large base table (~1P) — (lets call it raw_table)
partitioned like yyyy_mm_dd=2014-02-11/hh=12/flag=1 (or yyyy_mm_dd=2014-02-11/hh=12/flag=0)

I’m testing creating a view on that table to be able to have a consistent
user accessible interface instead of letting the users to use the actual 
table (we’ve changed the format of the table multiple times in the past
and we’ll possibly do so again).

I am creating the view like this[1] and I’ve also tested view partitions[2].

The version of Hive is 0.10.0

The issues I came across so far are:

1) I couldn’t figure out the actual purpose of the view partitions.
The document[3] lists several use cases and mentioned
“infer view partitions automatically based on the partitions of the underlying tables"
which leads to an open ticket[4] which in turn lists another one (HIVE-1941)
which is closed. However I fail to see how to create a derived view partition.
And with the definition I’ve used[2] they don’t seem to have the same effect.

2) You need to specify the partitions in the query to the view explicitly
to have the partition pruning come into effect. However, when I
say “select * from raw_table limit 10” it selects a subset somehow 
(explain and explain extended does not show any targeted partitions with
such queries — is this a bug?). But doing the same query on the view leads 
to a full table scan.

2.1) I see a similar behaviour with things like 
“select ... from myview tablesample (bucket 1 out of $bucketsize on $field)

I’m wondering if this is just the way the views work at the moment or is
there a workaround to improve, lets say: silly queries? As the interface 
seems inconsistent between the table and view access.

Thanks,
Burak



=========[1]=========

CREATE VIEW myview AS SELECT  `combined`.`column1`,
        `combined`.`column2`,
        `combined`.`yyyy_mm_dd`,
        `combined`.`hh`,
        `combined`.`column3`,
        `combined`.`column4`,
        `combined`.`column5`
FROM
(
    SELECT
        `raw`.`column1`,
        `raw`.`column2`,
        `raw`.`yyyy_mm_dd`,
        `raw`.`hh`,
        `raw`.`column3`,
        `raw`.`column4`,
        `raw`.`column5`
    FROM  `default`.`raw_table` `raw`
) `combined`

=========[2]=========

CREATE VIEW myview
PARTITIONED ON (
        yyyy_mm_dd,
        hh,
        type,
        persona,
        dc
)
AS ( ... )


=========[3]=========[

https://cwiki.apache.org/confluence/display/Hive/PartitionedViews

Re: Views and partitions and performance

Posted by Stephen Sprague <sp...@gmail.com>.

great questions, Burak.  Personally, I had not before seen the "create view
... partition on" construct. Not that that means anything but thanks for
bringing bringing out into the forefront!

So, yeah, do we have an SME out there that would like to elaborate on this
beyond the aforementioned url?


On Tue, Feb 11, 2014 at 8:04 AM, Burak Gürsoy <bu...@gmx.net> wrote:

> Hi,
>
> Some background information:
>
> We have a really large base table (~1P) -- (lets call it raw_table)
> partitioned like yyyy_mm_dd=2014-02-11/hh=12/flag=1 (or
> yyyy_mm_dd=2014-02-11/hh=12/flag=0)
>
> I'm testing creating a view on that table to be able to have a consistent
> user accessible interface instead of letting the users to use the actual
> table (we've changed the format of the table multiple times in the past
> and we'll possibly do so again).
>
> I am creating the view like this[1] and I've also tested view
> partitions[2].
>
> The version of Hive is 0.10.0
>
> The issues I came across so far are:
>
> 1) I couldn't figure out the actual purpose of the view partitions.
> The document[3] lists several use cases and mentioned
> "infer view partitions automatically based on the partitions of the
> underlying tables"
> which leads to an open ticket[4] which in turn lists another one
> (HIVE-1941)
> which is closed. However I fail to see how to create a derived view
> partition.
> And with the definition I've used[2] they don't seem to have the same
> effect.
>
> 2) You need to specify the partitions in the query to the view explicitly
> to have the partition pruning come into effect. However, when I
> say "select * from raw_table limit 10" it selects a subset somehow
> (explain and explain extended does not show any targeted partitions with
> such queries -- is this a bug?). But doing the same query on the view leads
> to a full table scan.
>
> 2.1) I see a similar behaviour with things like
> "select ... from myview tablesample (bucket 1 out of $bucketsize on $field)
>
> I'm wondering if this is just the way the views work at the moment or is
> there a workaround to improve, lets say: silly queries? As the interface
> seems inconsistent between the table and view access.
>
> Thanks,
> Burak
>
>
>
> =========[1]=========
>
> CREATE VIEW myview AS SELECT  `combined`.`column1`,
>         `combined`.`column2`,
>         `combined`.`yyyy_mm_dd`,
>         `combined`.`hh`,
>         `combined`.`column3`,
>         `combined`.`column4`,
>         `combined`.`column5`
> FROM
> (
>     SELECT
>         `raw`.`column1`,
>         `raw`.`column2`,
>         `raw`.`yyyy_mm_dd`,
>         `raw`.`hh`,
>         `raw`.`column3`,
>         `raw`.`column4`,
>         `raw`.`column5`
>     FROM  `default`.`raw_table` `raw`
> ) `combined`
>
> =========[2]=========
>
> CREATE VIEW myview
> PARTITIONED ON (
>         yyyy_mm_dd,
>         hh,
>         type,
>         persona,
>         dc
> )
> AS ( ... )
>
>
> =========[3]=========[
>
> https://cwiki.apache.org/confluence/display/Hive/PartitionedViews
>
>
>