Posted to user@hive.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2015/12/01 17:40:38 UTC

Using spark in tandem with Hive

What if we decide to use Spark with Hive? I would like to hear similar views.

 

My test bed comprised

 

1.    Spark version 1.5.2

2.    Hive version 1.2.1

3.    Hadoop version 2.6

 

 

I configured Spark to use the Hive metastore, so using spark-sql I can do pretty
much whatever one can do with HiveQL.
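
For reference, the wiring is roughly as follows (a sketch; the exact paths depend on the installation): spark-sql picks up the Hive metastore once Hive's hive-site.xml is visible in Spark's conf directory.

# sketch: let spark-sql reuse the existing Hive metastore
# (hive-site.xml carries the metastore connection details; paths are illustrative)
cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
$SPARK_HOME/bin/spark-sql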

 

I created and populated an ORC table in spark-sql. It took 90 seconds to
create and populate the table with 1.7 million rows.

 

spark-sql> select count(1) from tt;

1767886

Time taken: 5.169 seconds, Fetched 1 row(s)

 

Now let me try the same operation on the same table with HiveQL and MR.

 

hive> use asehadoop;

OK

Time taken: 0.639 seconds

hive> select count(1) from tt;

Query ID = hduser_20151201162717_e3102633-f501-413b-b9cb-384ac50880ac

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1448969636093_0001, Tracking URL = http://rhes564:8088/proxy/application_1448969636093_0001/

Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill job_1448969636093_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2015-12-01 16:27:27,154 Stage-1 map = 0%,  reduce = 0%

2015-12-01 16:27:35,427 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.65 sec

2015-12-01 16:27:41,611 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.71 sec

MapReduce Total cumulative CPU time: 3 seconds 710 msec

Ended Job = job_1448969636093_0001

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.71 sec   HDFS Read: 520151 HDFS Write: 8 SUCCESS

Total MapReduce CPU Time Spent: 3 seconds 710 msec

OK

1767886

Time taken: 25.635 seconds, Fetched: 1 row(s)

 

So 5 seconds in Spark versus 25 seconds in Hive

 

On a point query, however, Hive does not seem to return the correct timing:

 

hive> select * from tt where data_object_id = 10;

Time taken: 0.063 seconds, Fetched: 72 row(s)

 

Whereas in Spark I get

 

spark-sql>  select * from tt where data_object_id = 10;

Time taken: 9.002 seconds, Fetched 72 row(s)

 

9 seconds looks far more plausible to me than 0.063 seconds. Or, however
unlikely, does Spark return elapsed time whereas Hive returns execution time?
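
One way to check what Hive actually measured (just a guess on my part) is to look at the plan for the point query; if it contains only a fetch stage and no MapReduce stage, the query was answered by a simple fetch task (governed by hive.fetch.task.conversion) and a sub-second figure would at least be plausible:

hive> explain select * from tt where data_object_id = 10;
-- a plan showing only "Stage-0, Fetch Operator" means no MapReduce job was launched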

 

Thanks

 

Mich Talebzadeh

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

 
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 


RE: Using spark in tandem with Hive

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks.

 

My test bed has the following components.

 

 

1.    Spark version 1.5.2

2.    Hive version 1.2.1

3.    Hadoop version 2.6

 

I will try your suggestions; however, we have to bear in mind that the underlying table is a Hive table, so that the set-up stays the same for the comparison.

 

I increased the rows in table tt to 56 million; on a simple query the comparison is now 10 seconds in Spark against 143 seconds with Hive/MR.

 

In spark-sql

 

spark-sql> select count(1) from tt where object_id > 1000 and object_type = 'TABLE';

2251008

Time taken: 10.295 seconds, Fetched 1 row(s)

 

In Hive with standard MR (no Tez)

 

hive> select count(1) from tt where object_id > 1000 and object_type = 'TABLE';

Query ID = hduser_20151201202549_b698c002-6de0-4353-9a4a-3ba06e7c0428

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1448969636093_0004, Tracking URL = http://rhes564:8088/proxy/application_1448969636093_0004/

Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill job_1448969636093_0004

Hadoop job information for Stage-1: number of mappers: 11; number of reducers: 1

2015-12-01 20:25:58,765 Stage-1 map = 0%,  reduce = 0%

2015-12-01 20:26:08,140 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 6.38 sec

2015-12-01 20:26:10,201 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 7.5 sec

2015-12-01 20:26:19,539 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 13.92 sec

2015-12-01 20:26:20,570 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 15.07 sec

2015-12-01 20:26:30,870 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 21.55 sec

2015-12-01 20:26:31,897 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 22.7 sec

2015-12-01 20:26:42,191 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 29.45 sec

2015-12-01 20:26:43,217 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 30.41 sec

2015-12-01 20:26:52,465 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 36.83 sec

2015-12-01 20:26:53,493 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 37.78 sec

2015-12-01 20:27:04,778 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 44.12 sec

2015-12-01 20:27:05,806 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 45.1 sec

2015-12-01 20:27:17,126 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 51.67 sec

2015-12-01 20:27:18,150 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 52.75 sec

2015-12-01 20:27:28,424 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 59.24 sec

2015-12-01 20:27:29,453 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 60.2 sec

2015-12-01 20:27:40,805 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 66.63 sec

2015-12-01 20:27:42,855 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 68.53 sec

2015-12-01 20:27:54,156 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 74.99 sec

2015-12-01 20:27:55,180 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 76.15 sec

2015-12-01 20:28:05,483 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 82.64 sec

2015-12-01 20:28:06,509 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 83.4 sec

2015-12-01 20:28:10,622 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 84.72 sec

MapReduce Total cumulative CPU time: 1 minutes 24 seconds 720 msec

Ended Job = job_1448969636093_0004

MapReduce Jobs Launched:

Stage-Stage-1: Map: 11  Reduce: 1   Cumulative CPU: 84.72 sec   HDFS Read: 56718493 HDFS Write: 8 SUCCESS

Total MapReduce CPU Time Spent: 1 minutes 24 seconds 720 msec

OK

2251008

Time taken: 143.452 seconds, Fetched: 1 row(s)

 

 

 

 

Mich Talebzadeh

 


 



Re: Using spark in tandem with Hive

Posted by Jörn Franke <jo...@gmail.com>.
You should use Tez (preferably > 0.8, together with a release of Hive that supports it, because it has the Tez service which allows lower-latency queries) instead of MR to get the first query faster. The second query is probably faster in Hive because you use statistics, which to my knowledge are not leveraged by Spark (only for broadcast joins).
However, you can still create statistics for each column by executing the same command you use for the table, but adding FOR COLUMNS to it (probably also not supported by Spark). You may get faster access if you do not mark the table as transactional (I am not sure whether Spark can handle transactional tables properly). I do not know your data, but you should check whether you want to sort the data on certain columns. The bloom filter on object_id is probably not necessary because that is covered by the storage index. Bloom filters are only available in Hive >= 1.2. There might be further optimizations (e.g. partitioning, increasing replication, etc.), but this would require more knowledge of the data.
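
Concretely, the commands being described would look roughly like this (a sketch, using the table name from this thread):

-- switch the execution engine from MR to Tez (requires Tez to be installed and configured)
set hive.execution.engine=tez;
-- gather column-level statistics on top of the existing table-level statistics
analyze table tt compute statistics for columns;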


RE: Using spark in tandem with Hive

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
The table was created in spark-sql as an ORC table.

 

use asehadoop;

drop table if exists tt;

create table tt (

owner                   varchar(30)

,object_name             varchar(30)

,subobject_name          varchar(30)

,object_id               bigint

,data_object_id          bigint

,object_type             varchar(19)

,created                 timestamp

,last_ddl_time           timestamp

,timestamp               varchar(19)

,status                  varchar(7)

,temporary2              varchar(1)

,generated               varchar(1)

,secondary               varchar(1)

,namespace               bigint

,edition_name            varchar(30)

,padding1                varchar(4000)

,padding2                varchar(3500)

,attribute               varchar(32)

,op_type                 int

,op_time                 timestamp

)

CLUSTERED BY (object_id) INTO 256 BUCKETS

STORED AS ORC

TBLPROPERTIES ( "orc.compress"="SNAPPY",

"transactional"="true",

"orc.create.index"="true",

"orc.bloom.filter.columns"="object_id",

"orc.bloom.filter.fpp"="0.05",

"orc.stripe.size"="268435456",

"orc.row.index.stride"="10000" )

;

show create table tt;

INSERT INTO TABLE tt

SELECT

          owner

        , object_name

        , subobject_name

        , object_id

        , data_object_id

        , object_type

        , cast(created AS timestamp)

        , cast(last_ddl_time AS timestamp)

        , timestamp

        , status

        , temporary2

        , generated

        , secondary

        , namespace

        , edition_name

        , padding1

        , padding2

        , attribute

        , 1

        , cast(from_unixtime(unix_timestamp()) AS timestamp)

FROM t_staging

;

 

And it was analysed as below:

 

hive> analyze table tt compute statistics;

Table asehadoop.tt stats: [numFiles=30, numRows=1767886, totalSize=88388380, rawDataSize=5984968162]

OK

Time taken: 0.241 seconds
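
To double-check that the ORC table properties (SNAPPY compression, stripe size, bloom filter) actually took effect, one option is to dump the metadata of one of the table's ORC files; the HDFS path below is only a placeholder for wherever the files were written:

hive --orcfiledump /user/hive/warehouse/asehadoop.db/tt/<one_of_the_orc_files>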

 

HTH

 

Mich Talebzadeh

 


 


 


Re: Using spark in tandem with Hive

Posted by Jörn Franke <jo...@gmail.com>.
How did you create the tables? Do you have automated statistics activated in Hive?

Btw, MR is outdated as a Hive execution engine. Use Tez (maybe wait for 0.8 for sub-second queries) or use Spark as an execution engine in Hive.
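
For reference, the settings being referred to would look roughly like this (a sketch; which engines are usable depends on what is installed):

-- gather basic statistics automatically when data is inserted
set hive.stats.autogather=true;
-- run Hive on Tez (or on Spark) instead of classic MapReduce
set hive.execution.engine=tez;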
