Posted to user@drill.apache.org by James Turton <ja...@somecomputer.xyz> on 2020/05/31 05:59:54 UTC
analyze table columns none refresh metadata performance
Hi
I have a directory of 387 Parquet files that amount to a single data set
of 131 GB. Querying them with Drill works nicely. When I try to collect
metadata for this table with
|analyze table columns none refresh metadata|
that command uses a mind-boggling amount of CPU time: at least on the
order of 10 CPU-hours and probably on the order of 100 CPU-hours [1]. It
cannot require that much CPU time to collect metadata from a few hundred
Parquet files. Surely? I'd /like/ to collect statistics too for some
columns but I've had to forgo that so far because of how slow this
command is.
[1] This is on a VMware guest with 10 vCPUs that are reported as Intel
Xeon CPU E5-2690 v4 @ 2.60GHz