You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Roger Marin <ro...@rogersmarin.com> on 2015/04/08 02:58:50 UTC

Analyze table compute statistics on wide table taking too long

Hi,

I have a hive table with 300 columns that are all strings with around 180k
rows, when I run analyze table compute statistics it seems to be taking
about 40 minutes to complete regardless of the execution engine.

The table also has a large Regex serde.

I an running hive 0.13.0.

Any help would be greatly appreciated.

Regards,
Roger

Re: Analyze table compute statistics on wide table taking too long

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> I'm happy to look into improving the Regex serde performance, any tips
>on where I should start looking?.

There are three things off the top of my head.

First up, the matcher needs to be reused within a single scan. You can
also check the groupCount exactly once for a given pattern.

matcher.reset() offers performance benefits in the inner loop.

Second, Text does not implement CharSequence, which would be ideal to run
regex (zero-copy) over ASCII text (tblproperties, I guess).

Converting byte sequence to unicode points is mostly wasted CPU, I would
guess - Text::toString() is actually expensive.

This is not something I¹m entirely certain of, since java Regex might have
fast-paths for String classes - to be experimented with before fixing it.

A ByteCharSequence could technically be implemented for utf-8 as well
(using ByteBuffer::getChar() instead) - but a really fast path for 7 bit
ASCII is mostly where RegexSerde needs help.

Finally, column projection and SerDe StatsProvidingRecordReader.

There is no reason to deserialize all columns that show up in the original
DDL - compute stats only cares about row-count, but which is effectively
skipping ALL of what a RegexSerde does.

You can find out which columns are being read and only extract those
groups.

That is a combination of ColumnProjectionUtils.isReadAllColumns(conf) and
ColumnProjectionUtils.getReadColumnIDs() from the operator conf.


And in case no columns are being read (like in ANALYZE or count(1)), skip
the regex checker entirely, generating merely how many Text instances were
encountered in total.

Does all of that make sense?

I haven¹t seen too much use of the RegexSerde btw, which is why these were
generally left on the backburner (the perf problem is limited to a single
³create table² off it into ORC and use the vectorized filters for
performance).

Cheers,
Gopal



Re: Analyze table compute statistics on wide table taking too long

Posted by Roger Marin <ro...@rogersmarin.com>.
Hi Gopal,

Thanks for that.

I'm happy to look into improving the Regex serde performance, any tips on
where I should start looking?.

Regards,
Roger
On 08/04/2015 11:44 AM, "Gopal Vijayaraghavan" <go...@apache.org> wrote:

>
> > The table also has a large Regex serde.
>
> There are no stats fast paths for Regex SerDe.
>
> The statistics computation is lifting each row into memory, parsing it and
> throwing it away.
>
> Most of your time would be spent in GC (check the GC time millis), due to
> the huge expense of the Regex Serde.
>
> For a direct comparison you could compute stats while turning it into
> another format
>
> set hive.stats.autogather=true;
> create table tmp1 stored as orc as select * from oldtable;
>
> Due to the nature of the columnar SerDes, that ETL would happen in
> parallel to the compute stats off the same stream (i.e autogather).
>
> That said, I have noticed performance issues with the RegexSerde, but
> haven¹t bothered to fix it yet - maybe you¹d want to take a shot at fixing
> it?
>
>
> Cheers,
> Gopal
>
>
>

Re: Analyze table compute statistics on wide table taking too long

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> The table also has a large Regex serde.

There are no stats fast paths for Regex SerDe.

The statistics computation is lifting each row into memory, parsing it and
throwing it away.

Most of your time would be spent in GC (check the GC time millis), due to
the huge expense of the Regex Serde.

For a direct comparison you could compute stats while turning it into
another format

set hive.stats.autogather=true;
create table tmp1 stored as orc as select * from oldtable;

Due to the nature of the columnar SerDes, that ETL would happen in
parallel to the compute stats off the same stream (i.e autogather).

That said, I have noticed performance issues with the RegexSerde, but
haven¹t bothered to fix it yet - maybe you¹d want to take a shot at fixing
it?


Cheers,
Gopal