You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Lars Francke <la...@gmail.com> on 2016/10/05 13:21:12 UTC

Question about Utilities#getInputPaths

Hi everyone,

I've encountered a performance issue at multiple customers now. The problem
is the processing of input paths when there are lots of partitions.

We check each directory if it's empty. This alone can take minutes.

There is a comment in Utilities:

"We need to add a empty file, it is not acceptable to change the operator
tree
Consider the query:
select * from (select count(1) from T union all select count(1) from T2) x;

If T is empty and T2 contains 100 rows, the user expects: 0, 100 (2 rows)"

I have to admit that I don't quite understand that. Would it mean that we'd
only get a single row if we left out this empty path?

I do not understand the internals of query planning and execution well
enough but if someone has time to explain it to me I'd be very grateful.
(If someone who understands all of this is based in Europe I'd be more than
happy to jump on a call as well or invite to a steak & beer ;-) )

If that is indeed the reason: This code was written 3 or 4 years ago. Maybe
internals have changed enough so that we can now deal with this? For simple
queries like SELECT * FROM T LIMIT 10 I'm seeing 5-10min runtimes just
because of this overhead.

Thanks!

Lars

Re: Question about Utilities#getInputPaths

Posted by Gopal Vijayaraghavan <go...@apache.org>.
(always helpful to call out a version, I'm going to assume 1.2)

>    select * from (select count(1) from T union all select count(1) from T2) x;

>    I have to admit that I don't quite understand that. Would it mean that we'd
>   only get a single row if we left out this empty path?

AFAIK, this is a bit of historical stuff from MR, where a 0 task job is not valid (in Tez, it is).

I know of at least one fix for metadata optimizations for partitioned table, which does this faster (but is not in Apache, AFAIK)

https://issues.apache.org/jira/browse/HIVE-10596

>    I do not understand the internals of query planning and execution well
>   enough but if someone has time to explain it to me I'd be very grateful.

There's a DEBUG level log named <PERFLOG> that would be useful in debugging why this is slow.

--hiveconf DEBUG,DRFA should get you the split of times within a query.

> For simple queries like SELECT * FROM T LIMIT 10 I'm seeing 5-10min runtimes just
> because of this overhead.

There are 2 optimizers you can disable and try this out.

set hive.optimize.metadataonly=false;
set hive.fetch.task.conversion=minimal; (or none)

The first one prevents creation of dummy files for a simple query like the count(1).

The second one prevents an optimizer check which will sum up the file sizes of all files till it reaches 1Gb before disabling the fetch codepath.

Cheers,
Gopal