You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Gordon Benjamin <go...@gmail.com> on 2014/11/06 23:01:49 UTC

Using partitioning to speed up queries in Shark

Hi All,

I'm using Spark/Shark as the foundation for some reporting that I'm doing
and have a customers table with approximately 3 million rows that I've
cached in memory.

I've also created a partitioned table that I've also cached in memory on a
per day basis

FROM
customers_cached
INSERT OVERWRITE TABLE
part_customers_cached
PARTITION(createday)
SELECT id,email,dt_cr, to_date(dt_cr) as createday where
dt_cr>unix_timestamp('2013-01-01 00:00:00') and
dt_cr<unix_timestamp('2013-12-31 23:59:59');
set exec.dynamic.partition=true;

set exec.dynamic.partition.mode=nonstrict;

however when I run the following basic tests I get this type of performance

[localhost:10000] shark> select count(*) from part_customers_cached where
 createday >= '2014-08-01' and createday <= '2014-12-06';
37204
Time taken (including network latency): 3.131 seconds

[localhost:10000] shark>  SELECT count(*) from customers_cached where
dt_cr>unix_timestamp('2013-08-01 00:00:00') and
dt_cr<unix_timestamp('2013-12-06 23:59:59');
37204
Time taken (including network latency): 1.538 seconds

I'm running this on a cluster with one master and two slaves and was hoping
that the partitioned table would be noticeably faster but it looks as
though the partitioning has slowed things down... Is this the case, or is
there some additional configuration that I need to do to speed things up?

Best Wishes,

Gordon

Re: Using partitioning to speed up queries in Shark

Posted by Nicholas Chammas <ni...@gmail.com>.

Did you mean to send this to the user list?

This is the dev list, where we discuss things related to development on
Spark itself.

On Thu, Nov 6, 2014 at 5:01 PM, Gordon Benjamin <gordon.benjamin65@gmail.com
> wrote:

> Hi All,
>
> I'm using Spark/Shark as the foundation for some reporting that I'm doing
> and have a customers table with approximately 3 million rows that I've
> cached in memory.
>
> I've also created a partitioned table that I've also cached in memory on a
> per day basis
>
> FROM
> customers_cached
> INSERT OVERWRITE TABLE
> part_customers_cached
> PARTITION(createday)
> SELECT id,email,dt_cr, to_date(dt_cr) as createday where
> dt_cr>unix_timestamp('2013-01-01 00:00:00') and
> dt_cr<unix_timestamp('2013-12-31 23:59:59');
> set exec.dynamic.partition=true;
>
> set exec.dynamic.partition.mode=nonstrict;
>
> however when I run the following basic tests I get this type of performance
>
> [localhost:10000] shark> select count(*) from part_customers_cached where
>  createday >= '2014-08-01' and createday <= '2014-12-06';
> 37204
> Time taken (including network latency): 3.131 seconds
>
> [localhost:10000] shark>  SELECT count(*) from customers_cached where
> dt_cr>unix_timestamp('2013-08-01 00:00:00') and
> dt_cr<unix_timestamp('2013-12-06 23:59:59');
> 37204
> Time taken (including network latency): 1.538 seconds
>
> I'm running this on a cluster with one master and two slaves and was hoping
> that the partitioned table would be noticeably faster but it looks as
> though the partitioning has slowed things down... Is this the case, or is
> there some additional configuration that I need to do to speed things up?
>
> Best Wishes,
>
> Gordon
>

Re: Using partitioning to speed up queries in Shark

Posted by Mayur Rustagi <ma...@gmail.com>.

- dev list & + user list
Shark is not officially supported anymore so you are better off moving to
Spark SQL.
Shark doesnt support Hive partitioning logic anyways, it has its version of
partitioning on in-memory blocks but is independent of whether you
partition your data in hive or not.



Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Fri, Nov 7, 2014 at 3:31 AM, Gordon Benjamin <gordon.benjamin65@gmail.com
> wrote:

> Hi All,
>
> I'm using Spark/Shark as the foundation for some reporting that I'm doing
> and have a customers table with approximately 3 million rows that I've
> cached in memory.
>
> I've also created a partitioned table that I've also cached in memory on a
> per day basis
>
> FROM
> customers_cached
> INSERT OVERWRITE TABLE
> part_customers_cached
> PARTITION(createday)
> SELECT id,email,dt_cr, to_date(dt_cr) as createday where
> dt_cr>unix_timestamp('2013-01-01 00:00:00') and
> dt_cr<unix_timestamp('2013-12-31 23:59:59');
> set exec.dynamic.partition=true;
>
> set exec.dynamic.partition.mode=nonstrict;
>
> however when I run the following basic tests I get this type of performance
>
> [localhost:10000] shark> select count(*) from part_customers_cached where
>  createday >= '2014-08-01' and createday <= '2014-12-06';
> 37204
> Time taken (including network latency): 3.131 seconds
>
> [localhost:10000] shark>  SELECT count(*) from customers_cached where
> dt_cr>unix_timestamp('2013-08-01 00:00:00') and
> dt_cr<unix_timestamp('2013-12-06 23:59:59');
> 37204
> Time taken (including network latency): 1.538 seconds
>
> I'm running this on a cluster with one master and two slaves and was hoping
> that the partitioned table would be noticeably faster but it looks as
> though the partitioning has slowed things down... Is this the case, or is
> there some additional configuration that I need to do to speed things up?
>
> Best Wishes,
>
> Gordon
>