Posted to user@hive.apache.org by Gopal Vijayaraghavan <go...@apache.org> on 2018/08/03 18:40:43 UTC

Re: Clustering and Large-scale analysis of Hive Queries

> I am interested in working on a project that takes a large number of Hive queries (as well as their meta data like amount of resources used etc) and find out common sub queries and expensive query groups etc.

This was roughly the central research topic of one of the Hive CBO devs, except that it was implemented for Pig (not Hive).

https://hal.inria.fr/hal-01353891
+
https://github.com/jcamachor/pigreuse

I think there's a lot of interest in this topic for ETL workloads and the goal is to pick this up as ETL becomes the target problem.

There's a recent SIGMOD paper which talks about the same sort of reuse.

https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf

If you are interested in looking into this using existing infra in Hive, I recommend looking at Zoltan's recent work which tracks query plans + runtime statistics from the RUNTIME_STATS table in the metastore.

You can step through what this does by running

"explain reoptimization <query>;"

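A minimal sketch of such a session, assuming a Hive 3.x build with query re-execution available. The property names are the hive.query.reexecution.* settings as I understand them, so check them against the HiveConf of your version; the "sales" table and its columns are hypothetical and only there for illustration.

```sql
-- Assumed Hive 3.x settings; verify names and defaults in your build.
set hive.query.reexecution.enabled=true;
set hive.query.reexecution.strategies=overlay,reoptimize;
-- Assumption: this scope keeps runtime stats in the metastore
-- instead of discarding them when the query finishes.
set hive.query.reexecution.stats.persist.scope=metastore;

-- Hypothetical table and columns, used only for illustration.
explain reoptimization
select region, count(*)
from sales
where amount > 100
group by region;
```

With the persist scope set this way, the runtime statistics should be what ends up in RUNTIME_STATS, which is the raw material you would cluster on.
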
Cheers,
Gopal