Posted to issues@spark.apache.org by "Yana Kadiyska (JIRA)" <ji...@apache.org> on 2015/04/17 18:25:00 UTC

[jira] [Created] (SPARK-6984) Operations on tables with many partitions _very_ slow

Yana Kadiyska created SPARK-6984:
------------------------------------

             Summary: Operations on tables with many partitions _very_ slow
                 Key: SPARK-6984
                 URL: https://issues.apache.org/jira/browse/SPARK-6984
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.2.1
         Environment: External Hive metastore, table with 30K partitions
            Reporter: Yana Kadiyska


I have a table with _many_ partitions (30K). Users never query all of them at once, but all 30K are registered in the metastore. Querying this table is extremely slow even when asking for a single partition.
"describe table" also performs _very_ poorly.

Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236
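
For context, the timings above came from queries of roughly this shape (the table and partition column names below are made up for illustration, not taken from the report); the point is that only a single partition is requested:

    // Illustrative only -- hypothetical table/column names.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("partition-timing"))
    val hive = new HiveContext(sc)

    // Single-partition query: should only need metadata for one partition,
    // yet takes ~73 seconds against the 30K-partition table on 1.2.1.
    hive.sql("SELECT * FROM my_partitioned_table WHERE dt = '2015-04-01' LIMIT 50").collect()

    // Metadata-only operation: also very slow despite reading no data.
    hive.sql("DESCRIBE my_partitioned_table").collect()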

I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching a screenshot). Should this value be lazy? "describe table" should be a pure metastore operation IMO (i.e., query Postgres and return the column types).
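
To make the lazy idea concrete, here is a minimal sketch; the types and method names are stand-ins, not the actual HiveMetastoreCatalog internals:

    // Hypothetical sketch -- these types stand in for the real Hive metastore classes.
    case class HivePartition(values: Seq[String], location: String)

    trait MetastoreClient {
      def getAllPartitions(db: String, table: String): Seq[HivePartition]
    }

    // Strict val: building the relation for a 30K-partition table fetches and
    // materializes every partition object before any query runs.
    class EagerRelation(db: String, table: String, client: MetastoreClient) {
      val partitions: Seq[HivePartition] = client.getAllPartitions(db, table)
    }

    // Lazy val: "describe table" and other metadata-only operations never trigger
    // the per-partition fetch; only a query that consumes the partitions pays the cost.
    class LazyRelation(db: String, table: String, client: MetastoreClient) {
      lazy val partitions: Seq[HivePartition] = client.getAllPartitions(db, table)
    }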

The issue is a blocker for me, but I am leaving it at the default priority until someone can confirm it is a bug. "describe table" by itself is not that interesting, but I think this affects all query paths -- I sent an inquiry about it earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html
