Posted to user@hive.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/05/23 18:00:58 UTC

Re: Using Spark on Hive with Hive also using Spark as its execution engine

Have a look at this thread

Dr Mich Talebzadeh



LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 23 May 2016 at 09:10, Mich Talebzadeh <mi...@gmail.com> wrote:

> Hi Timur and everyone.
>
> I will answer your first question, as it is very relevant.
>
> 1) How to make 2 versions of Spark live together on the same cluster
> (libraries clash, paths, etc.)?
> Most Spark users perform ETL and ML operations on Spark as well, so
> we may have 3 Spark installations simultaneously.
>
> There are two distinct points here.
>
> Using Spark as a query engine. That is BAU and most forum members use it
> every day. You run Spark with Standalone, YARN or Mesos as the cluster
> manager. You start a master, which manages the resources, and you start
> slaves to create the workers.
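>
> For a standalone cluster, a minimal sketch of that start-up (the SPARK_HOME
> path here is illustrative; point it at your own install) uses the bundled
> scripts:
>
> export SPARK_HOME=/usr/lib/spark-1.6.1-bin-hadoop2.6
> $SPARK_HOME/sbin/start-master.sh   # starts the master, which manages resources
> $SPARK_HOME/sbin/start-slaves.sh   # starts a worker on each host listed in conf/slaves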
>
> You deploy Spark either through spark-shell, spark-sql, or by submitting
> jobs with spark-submit etc. You may or may not use Hive as your database;
> you may use HBase via Phoenix, etc.
> If you choose to use Hive as your database, you ensure that the Hive APIs
> are installed (meaning Hive is installed) on every host of the cluster,
> including your master host. In $SPARK_HOME/conf, you create a soft link to
> Hive's hive-site.xml:
>
> cd $SPARK_HOME/conf
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
> lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml ->
> /usr/lib/hive/conf/hive-site.xml
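>
> As a sketch, creating that soft link (assuming Hive's configuration lives
> under /usr/lib/hive/conf, as in the listing above) would be:
>
> cd $SPARK_HOME/conf
> ln -s /usr/lib/hive/conf/hive-site.xml hive-site.xml   # expose Hive's settings to Spark
>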
> Now in hive-site.xml you can define all the parameters needed for Spark
> connectivity. Remember, we are making Hive use the Spark 1.3.1 engine. WE ARE
> NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or
> workers for Spark 1.3.1! It is just an execution engine, like mr etc.
>
> Let us look at how we do that in hive-site.xml. Note the settings for
> hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2
> below. They tell Hive to use Spark 1.3.1 as the execution engine. You just
> install the Spark 1.3.1 binary download on the host; here it lives under
> /usr/lib/spark-1.3.1-bin-hadoop2.6
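>
> As a sketch, installing that binary-only copy (the mirror URL is
> illustrative; any Apache archive mirror carrying the 1.3.1 release works) is
> just a download and unpack:
>
> cd /usr/lib
> wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
> tar -xzf spark-1.3.1-bin-hadoop2.6.tgz   # unpacks to /usr/lib/spark-1.3.1-bin-hadoop2.6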
>
> In hive-site.xml, you set the properties.
>
>   <property>
>     <name>hive.execution.engine</name>
>     <value>spark</value>
>     <description>
>       Expects one of [mr, tez, spark].
>       Chooses execution engine. Options are: mr (Map reduce, default),
> tez, spark. While MR
>       remains the default engine for historical reasons, it is itself a
> historical engine
>       and is deprecated in Hive 2 line. It may be removed without further
> warning.
>     </description>
>   </property>
>
>   <property>
>     <name>spark.home</name>
>     <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
>     <description>something</description>
>   </property>
>
>  <property>
>     <name>hive.merge.sparkfiles</name>
>     <value>false</value>
>     <description>Merge small files at the end of a Spark DAG
> Transformation</description>
>   </property>
>
>  <property>
>     <name>hive.spark.client.future.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is sec if not specified.
>       Timeout for requests from Hive client to remote Spark driver.
>     </description>
>  </property>
>  <property>
>     <name>hive.spark.job.monitor.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is sec if not specified.
>       Timeout for job monitor to get Spark job state.
>     </description>
>  </property>
>
>   <property>
>     <name>hive.spark.client.connect.timeout</name>
>     <value>1000ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is msec if not specified.
>       Timeout for remote Spark driver in connecting back to Hive client.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.server.connect.timeout</name>
>     <value>90000ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec,
> ms/msec, us/usec, ns/nsec), which is msec if not specified.
>       Timeout for handshake between Hive client and remote Spark driver.
> Checked by both processes.
>     </description>
>   </property>
>   <property>
>     <name>hive.spark.client.secret.bits</name>
>     <value>256</value>
>     <description>Number of bits of randomness in the generated secret for
> communication between Hive client and remote Spark driver. Rounded down to
> the nearest multiple of 8.</description>
>   </property>
>   <property>
>     <name>hive.spark.client.rpc.threads</name>
>     <value>8</value>
>     <description>Maximum number of threads for remote Spark driver's RPC
> event loop.</description>
>   </property>
>
> And other settings as well
>
> That was the Hive configuration for your Spark BAU. So there are two distinct
> things. Now, turning to Hive itself, you will need to add the correct Spark
> assembly jar file built for your Hadoop version. These are named
>
> spark-assembly-x.y.z-hadoop2.4.0.jar
>
> where x.y.z in this case is 1.3.1.
>
> The assembly file is
>
> spark-assembly-1.3.1-hadoop2.4.0.jar
>
> So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib:
>
> ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
> /usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
>
> And you need to compile Spark from source, excluding the Hadoop dependencies:
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz
> "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
>
>
> So Hive now uses the Spark engine by default.
>
> If you want to use mr in Hive you just do:
>
> 0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
> Hive-on-MR is deprecated in Hive 2 and may
> not be available in the future versions. Consider using a different
> execution engine (i.e. spark, tez) or using Hive 1.X releases.
> No rows affected (0.007 seconds)
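>
> As a small sketch (connection URL as in the examples above), you can flip
> the engine per beeline session and confirm the current value:
>
> beeline -u jdbc:hive2://rhes564:10010/default \
>         -e "set hive.execution.engine=spark; set hive.execution.engine;"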
>
> With regard to the second question
>
> 2) How stable such construction is on INSERT / UPDATE / CTAS operations?
> Any problems with writing into specific tables / directories, ORC / Parquet
> peculiarities, memory / timeout parameters tuning?
> With this set-up, that is Hive using Spark as the execution engine, my tests
> look OK. Basically I can do whatever I do with Hive using the map-reduce engine.
> The caveat, as usual, is the amount of memory used by Spark for in-memory
> work. I am afraid that resource constraint will be there no matter how you
> deploy Spark.
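>
> As a hedged sketch (these are the standard Spark memory properties that
> Hive on Spark passes through to its Spark session; the values are made up
> and need sizing for your own cluster), you can cap that appetite per session:
>
> beeline -u jdbc:hive2://rhes564:10010/default -e "
>   set spark.executor.memory=2g;
>   set spark.executor.cores=2;
>   set spark.driver.memory=1g;
>   select count(1) from oraclehadoop.dummy;"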
>
> 3) How stable such construction is in a multi-user / multi-tenant
> production environment when several people make different queries
> simultaneously?
>
> This is subjective: it depends on how you are going to deploy it and how
> scalable it is. Your mileage varies and you really need to test it for
> yourself to find out.
> Also worth noting that with a Spark app using Hive ORC tables you may have
> issues with ORC tables defined as transactional. You do not have that
> issue with the Hive on Spark engine. There are certainly limitations in
> Spark's HiveQL support. For example, some clauses are not implemented. Case in
> point with spark-sql:
>
> spark-sql> CREATE TEMPORARY TABLE tmp as select * from oraclehadoop.sales
> limit 10;
> Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
> .
> You are likely trying to use an unsupported Hive feature.";
> However, there is no such issue with the Hive on Spark engine:
>
> set hive.execution.engine=spark;
> 0: jdbc:hive2://rhes564:10010/default> CREATE TEMPORARY TABLE tmp as
> select * from oraclehadoop.sales limit 10;
> Starting Spark Job = d87e6c68-03f1-4c37-a9d4-f77e117039a4
> Query Hive on Spark job[0] stages:
> INFO  : Completed executing
> command(queryId=hduser_20160523090757_a474efb8-cea8-473e-8899-60bc7934a887);
> Time taken: 43.894 seconds
> INFO  : OK
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 23 May 2016 at 05:57, Mohanraj Ragupathiraj <mo...@gmail.com>
> wrote:
>
>> Great Comparison !! thanks
>>
>> On Mon, May 23, 2016 at 7:42 AM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>> I have done a number of extensive tests using Spark-shell with Hive DB
>>> and ORC tables.
>>>
>>>
>>>
>>> Now, one issue that we typically face is, and I quote:
>>>
>>>
>>>
>>> Spark is fast as it uses Memory and DAG. Great but when we save data it
>>> is not fast enough
>>>
>>> OK, but there is a solution now. If you use Spark with Hive and you are
>>> on a decent version of Hive >= 0.14, then you can also deploy Spark as the
>>> execution engine for Hive. That will make your application run pretty fast,
>>> as you no longer rely on the old Map-Reduce engine for Hive. In a nutshell,
>>> you gain speed in both querying and storage.
>>>
>>>
>>>
>>> I have made some comparisons on this set-up and I am sure some of you
>>> will find it useful.
>>>
>>>
>>>
>>> The version of Spark I use for Spark queries (Spark as a query tool) is
>>> 1.6.
>>>
>>> The version of Hive I use is Hive 2.
>>>
>>> The version of Spark I use as the Hive execution engine is 1.3.1. It works,
>>> and frankly Spark 1.3.1 as an execution engine is adequate (until we sort
>>> out the Hadoop libraries mismatch).
>>>
>>>
>>>
>>> As an example, I am using Hive on the Spark engine to find the min and max of
>>> IDs for a table with 1 billion rows:
>>>
>>>
>>>
>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
>>> stddev(id) from oraclehadoop.dummy;
>>>
>>> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>
>>>
>>>
>>>
>>>
>>> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>>>
>>>
>>>
>>> INFO  : Completed compiling
>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>> Time taken: 1.911 seconds
>>>
>>> INFO  : Executing
>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>
>>> INFO  : Query ID =
>>> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>>>
>>> INFO  : Total jobs = 1
>>>
>>> INFO  : Launching Job 1 out of 1
>>>
>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>
>>>
>>>
>>> Query Hive on Spark job[0] stages:
>>>
>>> 0
>>>
>>> 1
>>>
>>> Status: Running (Hive on Spark job[0])
>>>
>>> Job Progress Format
>>>
>>> CurrentTime StageId_StageAttemptId:
>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>> [StageCost]
>>>
>>> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>
>>> INFO  :
>>>
>>> Query Hive on Spark job[0] stages:
>>>
>>> INFO  : 0
>>>
>>> INFO  : 1
>>>
>>> INFO  :
>>>
>>> Status: Running (Hive on Spark job[0])
>>>
>>> INFO  : Job Progress Format
>>>
>>> CurrentTime StageId_StageAttemptId:
>>> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
>>> [StageCost]
>>>
>>> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>>>
>>> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
>>>
>>> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
>>>
>>> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0:
>>> 0(+1)/1
>>>
>>> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1
>>> Finished
>>>
>>> Status: Finished successfully in 53.25 seconds
>>>
>>> OK
>>>
>>> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished
>>> Stage-1_0: 0(+1)/1
>>>
>>> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished
>>> Stage-1_0: 1/1 Finished
>>>
>>> INFO  : Status: Finished successfully in 53.25 seconds
>>>
>>> INFO  : Completed executing
>>> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
>>> Time taken: 56.337 seconds
>>>
>>> INFO  : OK
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | c0  |     c1     |      c2       |          c3           |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> 1 row selected (58.529 seconds)
>>>
>>>
>>>
>>> 58 seconds first run with cold cache is pretty good
>>>
>>>
>>>
>>> And let us compare it with running the same query on map-reduce engine
>>>
>>>
>>>
>>> : jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
>>>
>>> Hive-on-MR is deprecated in Hive 2 and may not be available in the
>>> future versions. Consider using a different execution engine (i.e. spark,
>>> tez) or using Hive 1.X releases.
>>>
>>> No rows affected (0.007 seconds)
>>>
>>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
>>> stddev(id) from oraclehadoop.dummy;
>>>
>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
>>> the future versions. Consider using a different execution engine (i.e.
>>> spark, tez) or using Hive 1.X releases.
>>>
>>> Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>
>>> Total jobs = 1
>>>
>>> Launching Job 1 out of 1
>>>
>>> Number of reduce tasks determined at compile time: 1
>>>
>>> In order to change the average load for a reducer (in bytes):
>>>
>>>   set hive.exec.reducers.bytes.per.reducer=<number>
>>>
>>> In order to limit the maximum number of reducers:
>>>
>>>   set hive.exec.reducers.max=<number>
>>>
>>> In order to set a constant number of reducers:
>>>
>>>   set mapreduce.job.reduces=<number>
>>>
>>> Starting Job = job_1463956731753_0005, Tracking URL =
>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>
>>> Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
>>> job_1463956731753_0005
>>>
>>> Hadoop job information for Stage-1: number of mappers: 22; number of
>>> reducers: 1
>>>
>>> 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>
>>> INFO  : Compiling
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>
>>> INFO  : Semantic Analysis Completed
>>>
>>> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0,
>>> type:int, comment:null), FieldSchema(name:c1, type:int, comment:null),
>>> FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3,
>>> type:double, comment:null)], properties:null)
>>>
>>> INFO  : Completed compiling
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>> Time taken: 0.144 seconds
>>>
>>> INFO  : Executing
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc):
>>> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>>>
>>> WARN  : Hive-on-MR is deprecated in Hive 2 and may not be available in
>>> the future versions. Consider using a different execution engine (i.e.
>>> spark, tez) or using Hive 1.X releases.
>>>
>>> INFO  : WARNING: Hive-on-MR is deprecated in Hive 2 and may not be
>>> available in the future versions. Consider using a different execution
>>> engine (i.e. spark, tez) or using Hive 1.X releases.
>>>
>>> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in
>>> the future versions. Consider using a different execution engine (i.e.
>>> spark, tez) or using Hive 1.X releases.
>>>
>>> INFO  : Query ID =
>>> hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
>>>
>>> INFO  : Total jobs = 1
>>>
>>> INFO  : Launching Job 1 out of 1
>>>
>>> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>>>
>>> INFO  : Number of reduce tasks determined at compile time: 1
>>>
>>> INFO  : In order to change the average load for a reducer (in bytes):
>>>
>>> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
>>>
>>> INFO  : In order to limit the maximum number of reducers:
>>>
>>> INFO  :   set hive.exec.reducers.max=<number>
>>>
>>> INFO  : In order to set a constant number of reducers:
>>>
>>> INFO  :   set mapreduce.job.reduces=<number>
>>>
>>> WARN  : Hadoop command-line option parsing not performed. Implement the
>>> Tool interface and execute your application with ToolRunner to remedy this.
>>>
>>> INFO  : number of splits:22
>>>
>>> INFO  : Submitting tokens for job: job_1463956731753_0005
>>>
>>> INFO  : The url to track the job:
>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>
>>> INFO  : Starting Job = job_1463956731753_0005, Tracking URL =
>>> http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
>>>
>>> INFO  : Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill
>>> job_1463956731753_0005
>>>
>>> INFO  : Hadoop job information for Stage-1: number of mappers: 22;
>>> number of reducers: 1
>>>
>>> INFO  : 2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
>>>
>>> 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU
>>> 4.56 sec
>>>
>>> INFO  : 2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%,
>>> Cumulative CPU 4.56 sec
>>>
>>> 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU
>>> 9.17 sec
>>>
>>> INFO  : 2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%,
>>> Cumulative CPU 9.17 sec
>>>
>>> 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU
>>> 14.04 sec
>>>
>>> INFO  : 2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%,
>>> Cumulative CPU 14.04 sec
>>>
>>> 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU
>>> 18.64 sec
>>>
>>> INFO  : 2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%,
>>> Cumulative CPU 18.64 sec
>>>
>>> 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU
>>> 23.25 sec
>>>
>>> INFO  : 2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%,
>>> Cumulative CPU 23.25 sec
>>>
>>> 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU
>>> 27.84 sec
>>>
>>> INFO  : 2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%,
>>> Cumulative CPU 27.84 sec
>>>
>>> 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU
>>> 32.56 sec
>>>
>>> INFO  : 2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%,
>>> Cumulative CPU 32.56 sec
>>>
>>> 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU
>>> 37.1 sec
>>>
>>> INFO  : 2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%,
>>> Cumulative CPU 37.1 sec
>>>
>>> 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU
>>> 41.74 sec
>>>
>>> INFO  : 2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%,
>>> Cumulative CPU 41.74 sec
>>>
>>> 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU
>>> 46.32 sec
>>>
>>> INFO  : 2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%,
>>> Cumulative CPU 46.32 sec
>>>
>>> 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU
>>> 50.93 sec
>>>
>>> 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU
>>> 55.55 sec
>>>
>>> INFO  : 2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%,
>>> Cumulative CPU 50.93 sec
>>>
>>> INFO  : 2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%,
>>> Cumulative CPU 55.55 sec
>>>
>>> 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU
>>> 60.25 sec
>>>
>>> INFO  : 2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%,
>>> Cumulative CPU 60.25 sec
>>>
>>> 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU
>>> 64.86 sec
>>>
>>> INFO  : 2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%,
>>> Cumulative CPU 64.86 sec
>>>
>>> 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU
>>> 69.41 sec
>>>
>>> INFO  : 2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%,
>>> Cumulative CPU 69.41 sec
>>>
>>> 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU
>>> 74.06 sec
>>>
>>> INFO  : 2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%,
>>> Cumulative CPU 74.06 sec
>>>
>>> 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU
>>> 78.72 sec
>>>
>>> INFO  : 2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%,
>>> Cumulative CPU 78.72 sec
>>>
>>> 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU
>>> 83.32 sec
>>>
>>> INFO  : 2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%,
>>> Cumulative CPU 83.32 sec
>>>
>>> 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU
>>> 87.9 sec
>>>
>>> INFO  : 2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%,
>>> Cumulative CPU 87.9 sec
>>>
>>> 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU
>>> 92.52 sec
>>>
>>> INFO  : 2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%,
>>> Cumulative CPU 92.52 sec
>>>
>>> 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU
>>> 97.35 sec
>>>
>>> INFO  : 2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%,
>>> Cumulative CPU 97.35 sec
>>>
>>> 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU
>>> 99.6 sec
>>>
>>> INFO  : 2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%,
>>> Cumulative CPU 99.6 sec
>>>
>>> 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative
>>> CPU 101.4 sec
>>>
>>> MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
>>>
>>> Ended Job = job_1463956731753_0005
>>>
>>> MapReduce Jobs Launched:
>>>
>>> Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS
>>> Read: 5318569 HDFS Write: 46 SUCCESS
>>>
>>> Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>
>>> OK
>>>
>>> INFO  : 2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%,
>>> Cumulative CPU 101.4 sec
>>>
>>> INFO  : MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400
>>> msec
>>>
>>> INFO  : Ended Job = job_1463956731753_0005
>>>
>>> INFO  : MapReduce Jobs Launched:
>>>
>>> INFO  : Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec
>>> HDFS Read: 5318569 HDFS Write: 46 SUCCESS
>>>
>>> INFO  : Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
>>>
>>> INFO  : Completed executing
>>> command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc);
>>> Time taken: 142.525 seconds
>>>
>>> INFO  : OK
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | c0  |     c1     |      c2       |          c3           |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> | 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
>>>
>>> +-----+------------+---------------+-----------------------+--+
>>>
>>> 1 row selected (142.744 seconds)
>>>
>>>
>>>
>>> OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds
>>> with Hive on Spark, so you clearly gain a fair amount by using Hive on
>>> Spark.
>>>
>>>
>>>
>>> Please also note that I did not use any vendor's build for this purpose.
>>> I compiled Spark 1.3.1 myself.
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com/
>>>
>>>
>>>
>>
>>
>>
>> --
>> Thanks and Regards
>> Mohan
>> VISA Pte Limited, Singapore.
>>
>
>

Re: Getting the IP address of Spark Driver in yarn-cluster mode

Posted by Masood Krohy <ma...@intact.net>.
Thanks Steve.

Here is the Python pseudo code that got it working for me:

  import time
  import urllib2

  # Map worker hostnames to their IPs (fill in for your cluster); assumes
  # all worker hostnames have the same length as 'worker1_hostname'
  nodes = {'worker1_hostname': 'worker1_ip', ... }
  YARN_app_queue = 'default'
  YARN_address = 'http://YARN_IP:8088'

  # We allow 3,600 sec from start of the app up to this point
  YARN_app_startedTimeBegin = str(int(time.time() - 3600))

  # Ask the YARN ResourceManager REST API for the single RUNNING Spark app
  # in our queue that started within that window
  requestedURL = (YARN_address +
                  '/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK&limit=1' +
                  '&queue=' + YARN_app_queue +
                  '&startedTimeBegin=' + YARN_app_startedTimeBegin)
  print 'Sent request to YARN: ' + requestedURL
  response = urllib2.urlopen(requestedURL)
  html = response.read()

  # Crude string extraction of the "amHostHttpAddress" field (the AM/driver
  # host) from the JSON response
  amHost_start = html.find('amHostHttpAddress') + len('amHostHttpAddress":"')
  amHost_length = len('worker1_hostname')
  amHost = html[amHost_start : amHost_start + amHost_length]
  print 'amHostHttpAddress is: ' + amHost

  try:
      self.websock = ...  # connect to the parameter server on that host
      print 'Connected to server running on %s' % nodes[amHost]
  except:
      print 'Could not connect to server on %s' % nodes[amHost]



------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation




From:   Steve Loughran <st...@hortonworks.com>
To:     Masood Krohy <ma...@intact.net>
Cc:     "user@spark.apache.org" <us...@spark.apache.org>
Date:   2016-10-24 17:09
Subject: Re: Getting the IP address of Spark Driver in yarn-cluster mode




On 24 Oct 2016, at 19:34, Masood Krohy <ma...@intact.net> wrote:

Hi everyone, 
Is there a way to set the IP address/hostname that the Spark Driver is 
going to be running on when launching a program through spark-submit in 
yarn-cluster mode (PySpark 1.6.0)? 
I do not see an option for this. If not, is there a way to get this IP 
address after the Spark app has started running? (through an API call at 
the beginning of the program to be used in the rest of the program). 
spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console 
(and changes on every run due to yarn cluster mode) and I am wondering if 
this can be accessed within the program. It does not seem to me that a 
YARN node label can be used to tie the Spark Driver/AM to a node, while 
allowing the Executors to run on all the nodes. 



you can grab it off the YARN API itself; there's a REST view as well as a 
fussier RPC level. That is, assuming you want the web view, which is what 
is registered. 

If you know the application ID, you can also construct a URL through the 
YARN proxy; any attempt to talk direct to the AM is going to get 302'd 
back there anyway so any kerberos credentials can be verified.




Re: Getting the IP address of Spark Driver in yarn-cluster mode

Posted by Steve Loughran <st...@hortonworks.com>.
On 24 Oct 2016, at 19:34, Masood Krohy <ma...@intact.net> wrote:

Hi everyone,

Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark 1.6.0)?

I do not see an option for this. If not, is there a way to get this IP address after the Spark app has started running? (through an API call at the beginning of the program to be used in the rest of the program). spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console (and changes on every run due to yarn cluster mode) and I am wondering if this can be accessed within the program. It does not seem to me that a YARN node label can be used to tie the Spark Driver/AM to a node, while allowing the Executors to run on all the nodes.



you can grab it off the YARN API itself; there's a REST view as well as a fussier RPC level. That is, assuming you want the web view, which is what is registered.

If you know the application ID, you can also construct a URL through the YARN proxy; any attempt to talk direct to the AM is going to get 302'd back there anyway so any kerberos credentials can be verified.
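
As a rough sketch of that REST route (the ResourceManager address is illustrative; the field to look for is amHostHttpAddress in the /ws/v1/cluster/apps response, parsed here with a throwaway Python one-liner):

curl -s "http://resourcemanager:8088/ws/v1/cluster/apps?states=RUNNING&applicationTypes=SPARK" \
  | python -c "import sys, json; apps = json.load(sys.stdin)['apps']['app']; print(apps[0]['amHostHttpAddress'])"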


Getting the IP address of Spark Driver in yarn-cluster mode

Posted by Masood Krohy <ma...@intact.net>.
Hi everyone,
Is there a way to set the IP address/hostname that the Spark Driver is 
going to be running on when launching a program through spark-submit in 
yarn-cluster mode (PySpark 1.6.0)?
I do not see an option for this. If not, is there a way to get this IP 
address after the Spark app has started running? (through an API call at 
the beginning of the program to be used in the rest of the program). 
spark-submit outputs “ApplicationMaster host: 10.0.0.9” in the console 
(and changes on every run due to yarn cluster mode) and I am wondering if 
this can be accessed within the program. It does not seem to me that a 
YARN node label can be used to tie the Spark Driver/AM to a node, while 
allowing the Executors to run on all the nodes.
I am running a parameter server along with the Spark Driver that needs to 
be contacted during the program execution; I need the Driver’s IP so that 
other executors can call back to this server. I need to stick to the 
yarn-cluster mode.
Thanks for any hints in advance.
Masood
PS: the closest code I was able to write is this, which is not outputting
what I need:

print sc.statusTracker().getJobInfo(sc.statusTracker().getActiveJobsIds()[0])
# output in YARN stdout log:
# SparkJobInfo(jobId=4, stageIds=JavaObject id=o101, status='SUCCEEDED')

------------------------------
Masood Krohy, Ph.D.
Data Scientist, Intact Lab-R&D
Intact Financial Corporation

Re: Using Spark on Hive with Hive also using Spark as its execution engine

Posted by Ashok Kumar <as...@yahoo.com.INVALID>.
Hi Dr Mich,
This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work? 
Thanking you 


Re: Using Spark on Hive with Hive also using Spark as its execution engine

Posted by Ashok Kumar <as...@yahoo.com>.
Hi Dr Mich,
This is very good news. I will be interested to know how Hive engages with Spark as an engine. What Spark processes are used to make this work? 
Thanking you 

    On Monday, 23 May 2016, 19:01, Mich Talebzadeh <mi...@gmail.com> wrote:
 

 Have a look at this thread
Dr Mich Talebzadeh LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com 
On 23 May 2016 at 09:10, Mich Talebzadeh <mi...@gmail.com> wrote:

Hi Timur and everyone.
I will answer your first question as it is very relevant
1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.) ? 
Most of the Spark users perform ETL, ML operations on Spark as well. So, we may have 3 Spark installations simultaneously

There are two distinct points here.
Using Spark as a  query engine. That is BAU and most forum members use it everyday. You run Spark with either Standalone, Yarn or Mesos as Cluster managers. You start master that does the management of resources and you start slaves to create workers. 
 You deploy Spark either by Spark-shell, Spark-sql or submit jobs through spark-submit etc. You may or may not use Hive as your database. You may use Hbase via Phoenix etcIf you choose to use Hive as your database, on every host of cluster including your master host, you ensure that Hive APIs are installed (meaning Hive installed). In $SPARK_HOME/conf, you create a soft link to cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml -> /usr/lib/hive/conf/hive-site.xml
Now in hive-site.xml you can define all the parameters needed for Spark connectivity. Remember we are making Hive use spark1.3.1  engine. WE ARE NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start master or workers for Spark 1.3.1! It is just an execution engine like mr etc.
Let us look at how we do that in hive-site,xml. Noting the settings for hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2 below. That tells Hive to use spark 1.3.1 as the execution engine. You just install spark 1.3.1 on the host just the binary download it is /usr/lib/spark-1.3.1-bin-hadoop2.6
In hive-site.xml, you set the properties.
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>
      Expects one of [mr, tez, spark].
      Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
      remains the default engine for historical reasons, it is itself a historical engine
      and is deprecated in Hive 2 line. It may be removed without further warning.
    </description>
  </property>  <property>
    <name>spark.home</name>
    <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
    <description>something</description>
  </property>

 <property>
    <name>hive.merge.sparkfiles</name>
    <value>false</value>
    <description>Merge small files at the end of a Spark DAG Transformation</description>
  </property>  <property>
    <name>hive.spark.client.future.timeout</name>
    <value>60s</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
      Timeout for requests from Hive client to remote Spark driver.
    </description>
 </property> <property>
    <name>hive.spark.job.monitor.timeout</name>
    <value>60s</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
      Timeout for job monitor to get Spark job state.
    </description>
 </property>
  <property>
    <name>hive.spark.client.connect.timeout</name>
    <value>1000ms</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
      Timeout for remote Spark driver in connecting back to Hive client.
    </description>
  </property>
  <property>
    <name>hive.spark.client.server.connect.timeout</name>
    <value>90000ms</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
      Timeout for handshake between Hive client and remote Spark driver.  Checked by both processes.
    </description>
  </property>
  <property>
    <name>hive.spark.client.secret.bits</name>
    <value>256</value>
    <description>Number of bits of randomness in the generated secret for communication between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
  </property>
  <property>
    <name>hive.spark.client.rpc.threads</name>
    <value>8</value>
    <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
  </property>
And other settings as well
That was the Hive stuff for your Spark BAU. So there are two distinct things. Now going to Hive itself, you will need to add the correct assembly jar file for Hadoop. These are called 
spark-assembly-x.y.z-hadoop2.4.0.jar 
Where x.y.z in this case is 1.3.1 
The assembly file is
spark-assembly-1.3.1-hadoop2.4.0.jar
So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/libs
ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
And you need to compile spark from source excluding Hadoop dependencies ./make-distribution.sh --name"hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

So Hive uses spark engine by default 
If you want to use mr in hive you just do
0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (0.007 seconds)

With regard to the second question: 2) How stable is such a construction on INSERT / UPDATE / CTAS operations? Any problems with writing into specific tables / directories, ORC / Parquet peculiarities, memory / timeout parameter tuning?
With this set-up, that is Hive using Spark as its execution engine, my tests look OK. Basically I can do whatever I do with Hive on the map-reduce engine. The caveat, as usual, is the amount of memory used by Spark for in-memory work. I am afraid that resource constraint will be there no matter how you deploy Spark.
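For instance, a simple CTAS smoke test under the Spark engine could look like the sketch below (the table name test_ctas_orc is made up for illustration, and oraclehadoop.sales is just the sample table used elsewhere in this thread; point it at any small table of your own):

cat > /tmp/ctas_smoke_test.sql <<'EOF'
set hive.execution.engine=spark;
-- create a small ORC copy, check it, then clean up
CREATE TABLE test_ctas_orc STORED AS ORC AS SELECT * FROM oraclehadoop.sales LIMIT 100;
SELECT COUNT(1) FROM test_ctas_orc;
DROP TABLE test_ctas_orc;
EOF
beeline -u jdbc:hive2://rhes564:10010/default -f /tmp/ctas_smoke_test.sql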

3) How stable is such a construction in a multi-user / multi-tenant production environment when several people run different queries simultaneously?

This depends on how you are going to deploy it and how scalable it is. Your mileage will vary and you really need to test it for yourself to find out. Also worth noting that with a Spark application using Hive ORC tables you may have issues with ORC tables defined as transactional (see the sketch after the example below). You do not have that issue with the Hive on Spark engine. There are certainly limitations in the HiveQL constructs that Spark SQL supports; for example, some clauses are not implemented. Case in point with spark-sql:
spark-sql> CREATE TEMPORARY TABLE tmp as select * from oraclehadoop.sales limit 10;
Error in query: Unhandled clauses: TEMPORARY 1, 2,2, 7
.
You are likely trying to use an unsupported Hive feature.

However, there is no issue with Hive on the Spark engine:
set hive.execution.engine=spark;
0: jdbc:hive2://rhes564:10010/default> CREATE TEMPORARY TABLE tmp as select * from oraclehadoop.sales limit 10;
Starting Spark Job = d87e6c68-03f1-4c37-a9d4-f77e117039a4
Query Hive on Spark job[0] stages:
INFO  : Completed executing command(queryId=hduser_20160523090757_a474efb8-cea8-473e-8899-60bc7934a887); Time taken: 43.894 seconds
INFO  : OK
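To make the transactional ORC point above concrete, this is the kind of table definition I mean (illustrative DDL only; the table and column names are made up, and ACID support, i.e. the transaction manager and related settings, must already be enabled on your metastore). A Spark application reading such a table directly is where you may hit trouble, whereas Hive on the Spark engine handles it:

cat > /tmp/acid_orc_example.sql <<'EOF'
-- ACID tables must be stored as ORC, bucketed, and flagged transactional
CREATE TABLE acid_orc_example (
  id INT,
  payload STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
EOF
beeline -u jdbc:hive2://rhes564:10010/default -f /tmp/acid_orc_example.sql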
 
HTH

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com
On 23 May 2016 at 05:57, Mohanraj Ragupathiraj <mo...@gmail.com> wrote:

Great Comparison !! thanks
On Mon, May 23, 2016 at 7:42 AM, Mich Talebzadeh <mi...@gmail.com> wrote:

Hi,

I have done a number of extensive tests using Spark-shell with Hive DB and ORC tables.

Now one issue that we typically face is, and I quote: Spark is fast as it uses Memory and DAG. Great, but when we save data it is not fast enough.

OK but there is a solution now. If you use Spark with Hive and you are on a decent version of Hive >= 0.14, then you can also deploy Spark as execution engine for Hive. That will make your application run pretty fast as you no longer rely on the old Map-Reduce for Hive engine. In a nutshell what you are gaining is speed in both querying and storage.

I have made some comparisons on this set-up and I am sure some of you will find it useful.

The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
The version of Hive I use is Hive 2.
The version of Spark I use as Hive execution engine is 1.3.1. It works, and frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the Hadoop libraries mismatch).

An example: I am using the Hive on Spark engine to find the min and max of IDs for a table with 1 billion rows:

0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006

Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151

INFO  : Completed compiling command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 1.911 seconds
INFO  : Executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
INFO  : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
INFO  : Total jobs = 1
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode

Query Hive on Spark job[0] stages:
0
1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2016-05-23 00:21:19,062 Stage-0_0: 0/22         Stage-1_0: 0/1
2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22    Stage-1_0: 0/1
2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22    Stage-1_0: 0/1
2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished       Stage-1_0: 0(+1)/1
2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished       Stage-1_0: 1/1 Finished
Status: Finished successfully in 53.25 seconds
OK
INFO  : Completed executing command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); Time taken: 56.337 seconds
INFO  : OK
+-----+------------+---------------+-----------------------+--+
| c0  |     c1     |      c2       |          c3           |
+-----+------------+---------------+-----------------------+--+
| 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
+-----+------------+---------------+-----------------------+--+
1 row selected (58.529 seconds)

58 seconds for the first run with a cold cache is pretty good.

And let us compare it with running the same query on the map-reduce engine:

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (0.007 seconds)
0: jdbc:hive2://rhes564:10010/default> select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
INFO  : Compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:c0, type:int, comment:null), FieldSchema(name:c1, type:int, comment:null), FieldSchema(name:c2, type:double, comment:null), FieldSchema(name:c3, type:double, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 0.144 seconds
INFO  : Executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc): select min(id), max(id), avg(id), stddev(id) from oraclehadoop.dummy
INFO  : Starting task [Stage-1:MAPRED] in serial mode
WARN  : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
INFO  : number of splits:22
INFO  : Submitting tokens for job: job_1463956731753_0005
INFO  : The url to track the job: http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
Starting Job = job_1463956731753_0005, Tracking URL = http://localhost.localdomain:8088/proxy/application_1463956731753_0005/
Kill Command = /home/hduser/hadoop-2.6.0/bin/hadoop job  -kill job_1463956731753_0005
Hadoop job information for Stage-1: number of mappers: 22; number of reducers: 1
2016-05-23 00:26:38,127 Stage-1 map = 0%,  reduce = 0%
2016-05-23 00:26:44,367 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 4.56 sec
2016-05-23 00:26:50,558 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 9.17 sec
2016-05-23 00:26:56,747 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 14.04 sec
2016-05-23 00:27:02,944 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 18.64 sec
2016-05-23 00:27:08,105 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 23.25 sec
2016-05-23 00:27:14,298 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 27.84 sec
2016-05-23 00:27:20,484 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 32.56 sec
2016-05-23 00:27:26,659 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 37.1 sec
2016-05-23 00:27:32,839 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 41.74 sec
2016-05-23 00:27:39,003 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 46.32 sec
2016-05-23 00:27:45,173 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 50.93 sec
2016-05-23 00:27:50,316 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 55.55 sec
2016-05-23 00:27:56,482 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 60.25 sec
2016-05-23 00:28:02,642 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 64.86 sec
2016-05-23 00:28:08,814 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 69.41 sec
2016-05-23 00:28:14,977 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 74.06 sec
2016-05-23 00:28:21,134 Stage-1 map = 77%,  reduce = 0%, Cumulative CPU 78.72 sec
2016-05-23 00:28:27,282 Stage-1 map = 82%,  reduce = 0%, Cumulative CPU 83.32 sec
2016-05-23 00:28:33,437 Stage-1 map = 86%,  reduce = 0%, Cumulative CPU 87.9 sec
2016-05-23 00:28:38,579 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 92.52 sec
2016-05-23 00:28:44,759 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 97.35 sec
2016-05-23 00:28:49,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 99.6 sec
2016-05-23 00:28:54,043 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 101.4 sec
MapReduce Total cumulative CPU time: 1 minutes 41 seconds 400 msec
Ended Job = job_1463956731753_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 22  Reduce: 1   Cumulative CPU: 101.4 sec   HDFS Read: 5318569 HDFS Write: 46 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 41 seconds 400 msec
OK
INFO  : Completed executing command(queryId=hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc); Time taken: 142.525 seconds
INFO  : OK
+-----+------------+---------------+-----------------------+--+
| c0  |     c1     |      c2       |          c3           |
+-----+------------+---------------+-----------------------+--+
| 1   | 100000000  | 5.00000005E7  | 2.8867513459481288E7  |
+-----+------------+---------------+-----------------------+--+
1 row selected (142.744 seconds)

OK, Hive on the map-reduce engine took 142 seconds compared to 58 seconds with Hive on Spark. So you can obviously gain pretty well by using Hive on Spark. Please also note that I did not use any vendor's build for this purpose. I compiled Spark 1.3.1 myself.

HTH

Dr Mich Talebzadeh

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com/



-- 
Thanks and Regards
Mohan
VISA Pte Limited, Singapore.