You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Abhilash L L <ab...@infoworks.io> on 2016/08/19 05:36:53 UTC

Optimizing limit 0 / null scan cases (HIVE-7203)

Hello,

    We are on hive 1.2.x.

    We are generating queries of the form (example)

    CREATE TABLE T .....  AS
    WITH
     T1 AS (SELECT * FROM S1.A),
     T2 AS (SELECT * FROM S2.B WHERE x > 1000),
     T3 AS (SELECT * FROM T1 JOIN T2 ON T1.x=T2.y AND T1.z>T2.z)
     SELECT * from T3

     We are generating much larger queries than above and we also use UDFs
which are added on the fly and registered as temporary functions. However,
since the above syntax does not allow specifying partitions, we will need
to create the table before hand and then run the same query again to
insert. Also we need to extract the types of the output of the final query.
Lot of tools also are doing the same.

     To extract the types for the create table, we just added LIMIT 0
assuming it would skip running MR/Tez jobs and return fast.

     When on MR its launching all the jobs and finally throws away the
records, which is taking 10+ minutes for the query/cluster combination.
When on Tez its much much faster but still high.

     We tried putting LIMIT 0 and/or WHERE 1=2 always false in the query to
make it skip faster, but its not skipping launching mr jobs.

     We came across the ticket
https://issues.apache.org/jira/browse/HIVE-7203, which says the
optimisation applies only to the outer most query even when specified in
the inner queries.

     Is there any workaround which works when execution engine is either mr
or tez.

     Help / guidance would be very helpful. Please let me know if I have to
share any other information

Regards,
Abhilash

Re: Optimizing limit 0 / null scan cases (HIVE-7203)

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Based on the explanation in the ticket, I am expecting it to not skip
>running jobs for the inner queries which has limit / always false where
>conditions.

The ticket says that the optimization does not apply for limits within the
inner clauses, but should apply always if the outer query has a limit.

It should be removing the workload parts for Tez (replaces all inputs with
a 0 row table btw, does not skip the job).

This optimization does not extend very well into MR when there are joins.

I suspect the MR conditional tasks for joins which don't play nice with
the null-scan optimization. Tez only does noconditional.task.size map
joins, it never does guesswork in the middle of a query.

Cheers,
Gopal








Re: Optimizing limit 0 / null scan cases (HIVE-7203)

Posted by Abhilash L L <ab...@infoworks.io>.
Thanks a lot for replying, Gopal.

Based on the explanation in the ticket, I am expecting it to not skip
running jobs for the inner queries which has limit / always false where
conditions.
As suggested, I will try it out on a standard schema like tpc-ds so that
its easier to re-produce.

Yes we are using CTAS as of now. And in certain cases, union also is part
of the query.

The view approach seems promising and logical for the use case, will try
that and get back regarding that.

Thanks again.


Regards,
Abhilash

On Sat, Aug 20, 2016 at 3:38 AM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

>
> >     CREATE TABLE T .....  AS
>
> The limit clause in the outermost query should prevent the entire query
> from executing.
>
> However, the CREATE TABLE expression and the UNION ALL are rather
> challenging in this matter.
>
> If you have queries which don't hit the NULL-scan fully, a BUG report
> would be helpful, provided you have a chopped down scenario of the problem
> (ideally on TPC-DS or TPC-H schemas).
>
> If it's broken, it can be fixed, but not without a bug report.
>
> >     To extract the types for the create table, we just added LIMIT 0
> >assuming it would skip running MR/Tez jobs and return fast.
> ...
> >     Is there any workaround which works when execution engine is either
> >mr or tez.
>
>
> Before the limit-0 optimization, I used "CREATE VIEW" for this.
>
> VIEWS do everything but run the query - I'd really like a "CREATE
> TEMPORARY VIEW", but that's just to avoid leaving junk behind.
>
> Cheers,
> Gopal
>
>
>

Re: Optimizing limit 0 / null scan cases (HIVE-7203)

Posted by Gopal Vijayaraghavan <go...@apache.org>.
>     CREATE TABLE T .....  AS

The limit clause in the outermost query should prevent the entire query
from executing.

However, the CREATE TABLE expression and the UNION ALL are rather
challenging in this matter.

If you have queries which don't hit the NULL-scan fully, a BUG report
would be helpful, provided you have a chopped down scenario of the problem
(ideally on TPC-DS or TPC-H schemas).

If it's broken, it can be fixed, but not without a bug report.

>     To extract the types for the create table, we just added LIMIT 0
>assuming it would skip running MR/Tez jobs and return fast.
...
>     Is there any workaround which works when execution engine is either
>mr or tez.


Before the limit-0 optimization, I used "CREATE VIEW" for this.

VIEWS do everything but run the query - I'd really like a "CREATE
TEMPORARY VIEW", but that's just to avoid leaving junk behind.

Cheers,
Gopal