Posted to user@drill.apache.org by James Turton <ja...@somecomputer.xyz> on 2020/05/31 05:59:54 UTC

analyze table columns none refresh metadata performance

Hi

I have a directory of 387 Parquet files that amount to a single data set
of 131 GB.  Querying them with Drill works nicely.  When I try to collect
metadata for this table with

|analyze table columns none refresh metadata|

that command uses a mind-boggling amount of CPU time: at least on the
order of 10 CPU-hours, and probably on the order of 100 CPU-hours [1].  It
cannot require that much CPU time to collect metadata from a few hundred
Parquet files.  Surely?  I'd /like/ to collect statistics too for some
columns but I've had to forgo that so far because of how slow this
command is.

[1] This is on a VMware guest with 10 vCPUs that are reported as Intel
Xeon CPU E5-2690 v4 @ 2.60GHz
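
To put those estimates in perspective, here is a rough back-of-envelope
sketch in Python.  It uses only the figures quoted above and assumes,
purely for illustration, that the work spreads evenly over the files and
that all 10 vCPUs stay fully busy.  The function names are made up for
this sketch and are not part of Drill.

```python
# Back-of-envelope arithmetic using the figures quoted in this message.
TOTAL_GB = 131    # total size of the data set
NUM_FILES = 387   # number of Parquet files
VCPUS = 10        # vCPUs on the VMware guest


def per_file_cpu_minutes(cpu_hours: float, num_files: int = NUM_FILES) -> float:
    """CPU-minutes spent per file, assuming an even spread over the files."""
    return cpu_hours * 60 / num_files


def min_wall_clock_hours(cpu_hours: float, vcpus: int = VCPUS) -> float:
    """Lower bound on elapsed time, assuming every vCPU is kept fully busy."""
    return cpu_hours / vcpus


avg_file_gb = TOTAL_GB / NUM_FILES

if __name__ == "__main__":
    print(f"average file size: {avg_file_gb:.2f} GB")
    for cpu_hours in (10, 100):  # the two order-of-magnitude estimates
        print(
            f"{cpu_hours} CPU-hours: "
            f"{per_file_cpu_minutes(cpu_hours):.1f} CPU-minutes per file, "
            f">= {min_wall_clock_hours(cpu_hours):.0f} h wall clock"
        )
```

Even at the low estimate, that is more than a CPU-minute per file just to
collect metadata.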