Posted to dev@spark.apache.org by Tom Graves <tg...@yahoo.com.INVALID> on 2015/04/13 23:53:38 UTC

Spark SQL reading Hive partitioned tables?

Hey,
I was trying out Spark SQL using the HiveContext and doing a select on a
partitioned table with lots of partitions (16,000+). It took over 6 minutes
before it even started the job. It looks like it was querying the Hive
metastore and got a good chunk of data back, which I'm guessing is info on
the partitions. Running the same query using Hive takes 45 seconds for the
entire job.
I know Spark SQL doesn't support all the Hive optimizations. Is this a
known limitation currently?
Thanks,
Tom
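
A minimal sketch of the kind of query involved (Spark 1.x Scala API; the
table name "events" and partition column "ds" here are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("partitioned-read"))
    val hiveContext = new HiveContext(sc)

    // On a table with 16,000+ partitions, the metastore round trips happen
    // here, before any Spark job is launched.
    val df = hiveContext.sql("SELECT * FROM events WHERE ds = '2015-04-13'")
    df.show()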

Re: Spark SQL reading Hive partitioned tables?

Posted by Michael Armbrust <mi...@databricks.com>.
We can try to add this as part of some Hive refactoring we are doing for
1.4. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-6910


Re: Spark SQL reading Hive partitioned tables?

Posted by Cheolsoo Park <pi...@gmail.com>.
Is there a plan to fix this? I also ran into this issue with a "select *
from tbl where ... limit 10" query. Spark SQL is 100x slower than Presto
in the worst case (a table with 1.6M partitions). This is a serious
blocker for us since we have many tables with nearly (and over) 1M
partitions, and any query against these big tables wastes 5 minutes just
fetching the full partition info.

I briefly looked at the code, and it looks like resolving metastore
relations is the first thing the analyzer does, before any other
optimization rules such as partition pruning run. So in the Hive metastore
client, it ends up calling getAllPartitions() with no filter expression. I
am wondering how much work would be involved in fixing this. Can you
please advise on what you think should be done?
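
To make the difference concrete, a rough sketch against the raw Hive
metastore client (method names per the Hive 0.13-era thrift client, not
Spark's shim; the database, table, and filter values are hypothetical).
Today's code path behaves like the first call; the fix would move toward
something like the second:

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient

    val client = new HiveMetaStoreClient(new HiveConf())

    // Today: every partition object is fetched, and pruning happens
    // afterwards on the driver, so metastore traffic scales with the
    // total partition count of the table.
    val all = client.listPartitions("default", "events", Short.MaxValue)

    // With pushdown: the metastore evaluates the filter itself and
    // returns only the matching partitions.
    val pruned = client.listPartitionsByFilter(
      "default", "events", "ds = '2015-04-13'", Short.MaxValue)

    println(s"all=${all.size()}, pruned=${pruned.size()}")
    client.close()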



Re: Spark SQL reading Hive partitioned tables?

Posted by Michael Armbrust <mi...@databricks.com>.
Yeah, we don't currently push predicates down into the metastore. We do,
though, prune partitions based on predicates (so we don't read the data).
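
In other words, something like this hypothetical sketch (illustrative
only, not Spark's actual internals): the partition list is fetched in
full, and the predicate is applied on the driver to decide which files
get scanned.

    case class HivePartition(values: Map[String, String], path: String)

    // Client-side pruning: cheap to apply, but it runs only after the
    // full partition list has already crossed the metastore wire.
    def prune(parts: Seq[HivePartition],
              pred: Map[String, String] => Boolean): Seq[HivePartition] =
      parts.filter(p => pred(p.values))

    val parts = Seq(
      HivePartition(Map("ds" -> "2015-04-12"), "/warehouse/events/ds=2015-04-12"),
      HivePartition(Map("ds" -> "2015-04-13"), "/warehouse/events/ds=2015-04-13"))

    // Only the second partition's files would actually be read.
    val selected = prune(parts, _.get("ds") == Some("2015-04-13"))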
