Posted to user@spark.apache.org by Michael Segel <ms...@hotmail.com> on 2016/06/21 23:58:58 UTC

Silly question about Yarn client vs Yarn cluster modes...

Ok, it’s the end of the day and I’m trying to make sure I understand where things actually run. 

I have an application where I have to query a bunch of sources, creating some RDDs, and then I need to join those RDDs against some other lookup tables. 


Yarn has two modes… client and cluster. 

I get it that in cluster mode… everything is running on the cluster. 
But in client mode, the driver runs on the edge node while the executors run on the cluster.

When I run a sparkSQL command that generates a new RDD, does the result set live on the cluster with the executors and get referenced by the driver, or does the result set get migrated to the driver running on the client? (I’m pretty sure I know the answer, but it’s never safe to assume anything…) 
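
To make the question concrete, here is a minimal sketch of what I mean (assuming a HiveContext named sqlContext and a table name I have made up):

val df = sqlContext.sql("SELECT key, value FROM some_hive_table")   // hypothetical table
df.cache()                           // my understanding: the cached partitions live on the executors
val sample = df.limit(10).collect()  // collect() is the point where rows come back to the driver

In other words, does anything beyond what I explicitly collect() ever land on the edge node?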

The follow up questions:

1) If I kill the app running the driver on the edge node… will that cause YARN to free up the cluster’s resources? (In cluster mode… killing the edge-node process doesn’t do that.) What happens, and how quickly? 

1a) If using client mode… can I spin the number of executors on the cluster up and down? (I’m assuming that when I kill an executor, any RDD partitions associated with that executor are gone, but the Spark context is still alive on the edge node [again assuming that the Spark context lives with the driver]. See the sketch after question 2.) 

2) Will any I/O between my Spark job and the outside world… (e.g. walking through the data set and writing it out to a file) occur on the edge node where the driver is located? (This may seem kinda silly, but what happens when you want to expose the result set to the world…?) 
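
For 1a and 2, this is roughly what I am picturing; a hedged sketch only, with invented paths and example values:

val conf = new org.apache.spark.SparkConf()
  .setAppName("client-mode-test")
  .set("spark.dynamicAllocation.enabled", "true")    // let YARN grow/shrink the executor count
  .set("spark.shuffle.service.enabled", "true")      // external shuffle service is required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
val sc = new org.apache.spark.SparkContext(conf)

val results = sc.textFile("hdfs:///data/in").map(_.toUpperCase)
results.saveAsTextFile("hdfs:///data/out")   // this write happens on the executors, inside the cluster
results.take(20).foreach(println)            // only this pulls a handful of rows back to the driver on the edge node

Is that the right mental model?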

Now for something slightly different… 

Suppose I have a data source… like a couple of Hive tables, and I access the tables via beeline (JDBC). In this case… Hive generates a map/reduce job and then streams the result set back to the client node, where the RDD would be built. I realize that I could run Hive on top of Spark, but that’s a separate issue. Here the RDD will reside on the client only. (That is, I could in theory run this as a single local Spark instance.) 
If I were to run this on the cluster… then the result set would stream through the beeline gateway and end up back on the cluster, sitting in RDD partitions within each executor? 
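
If it helps, this is the access path I am describing: Spark pulling over JDBC through the gateway rather than reading HDFS directly. A hedged sketch; the URL, table and driver class are placeholders, and the hive-jdbc jar is assumed to be on the classpath:

val viaGateway = sqlContext.read.format("jdbc")
  .option("url", "jdbc:hive2://gateway-host:10000/default")  // the beeline / HiveServer2 endpoint
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "some_hive_table")                      // placeholder
  .load()

Whether that path behaves well end to end is part of what I am trying to pin down.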

I realize that these are silly questions but I need to make sure that I know the flow of the data and where it ultimately resides.  There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-) 

TIA, 

-Mike


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Michael Segel <ms...@hotmail.com>.
The only documentation on this… in terms of direction… that I could find:

If your client is not close to the cluster (e.g. your PC) then you definitely want to go cluster to improve performance.
If your client is close to the cluster (e.g. an edge node) then you could go either client or cluster.  Note that by going client, more resources are going to be used on the edge node.
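
In case it helps anyone following along: the mode is chosen at launch time, not in code. A hedged example of the two launch commands (jar name and options invented), plus a trivial way to see where the driver actually landed:

// spark-submit --master yarn --deploy-mode client  --num-executors 4 my-app.jar   (driver on the edge node)
// spark-submit --master yarn --deploy-mode cluster --num-executors 4 my-app.jar   (driver inside a YARN container)
import org.apache.spark.{SparkConf, SparkContext}

object WhereIsMyDriver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("where-is-my-driver"))
    // Client mode prints the edge node's hostname; cluster mode prints the host running the ApplicationMaster container.
    println("driver runs on: " + java.net.InetAddress.getLocalHost.getHostName)
    sc.stop()
  }
}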

HTH

-Mike

> On Jun 22, 2016, at 1:51 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
> 
> On Wed, Jun 22, 2016 at 1:32 PM, Mich Talebzadeh
> <mi...@gmail.com> wrote:
>> Does it also depend on the number of Spark nodes involved in choosing which
>> way to go?
> 
> Not really.
> 
> -- 
> Marcelo
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Wed, Jun 22, 2016 at 1:32 PM, Mich Talebzadeh
<mi...@gmail.com> wrote:
> Does it also depend on the number of Spark nodes involved in choosing which
> way to go?

Not really.

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Marcelo,

Sounds like cluster mode is more resilient than client mode.

Does it also depend on the number of Spark nodes involved in choosing which
way to go?

Cheers


Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 21:27, Marcelo Vanzin <va...@cloudera.com> wrote:

> Trying to keep the answer short and simple...
>
> On Wed, Jun 22, 2016 at 1:19 PM, Michael Segel
> <ms...@hotmail.com> wrote:
> > But this gets to the question… what are the real differences between
> client
> > and cluster modes?
> > What are the pros/cons and use cases where one has advantages over the
> > other?
>
> - client mode requires the process that launched the app remain alive.
> Meaning the host where it lives has to stay alive, and it may not be
> super-friendly to ssh sessions dying, for example, unless you use
> nohup.
>
> - client mode driver logs are printed to stderr by default. yes you
> can change that, but in cluster mode, they're all collected by yarn
> without any user intervention.
>
> - if your edge node (from where the app is launched) isn't really part
> of the cluster (e.g., lives in an outside network with firewalls or
> higher latency), you may run into issues.
>
> - in cluster mode, your driver's cpu / memory usage is accounted for
> in YARN; this matters if your edge node is part of the cluster (and
> could be running yarn containers), since in client mode your driver
> will potentially use a lot of memory / cpu.
>
> - finally, in cluster mode YARN can restart your application without
> user interference. this is useful for things that need to stay up
> (think a long running streaming job, for example).
>
>
> --
> Marcelo
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Marcelo Vanzin <va...@cloudera.com>.
Trying to keep the answer short and simple...

On Wed, Jun 22, 2016 at 1:19 PM, Michael Segel
<ms...@hotmail.com> wrote:
> But this gets to the question… what are the real differences between client
> and cluster modes?
> What are the pros/cons and use cases where one has advantages over the
> other?

- client mode requires the process that launched the app remain alive.
Meaning the host where it lives has to stay alive, and it may not be
super-friendly to ssh sessions dying, for example, unless you use
nohup.

- client mode driver logs are printed to stderr by default. yes you
can change that, but in cluster mode, they're all collected by yarn
without any user intervention.

- if your edge node (from where the app is launched) isn't really part
of the cluster (e.g., lives in an outside network with firewalls or
higher latency), you may run into issues.

- in cluster mode, your driver's cpu / memory usage is accounted for
in YARN; this matters if your edge node is part of the cluster (and
could be running yarn containers), since in client mode your driver
will potentially use a lot of memory / cpu.

- finally, in cluster mode YARN can restart your application without
user interference. this is useful for things that need to stay up
(think a long running streaming job, for example).
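
To put the last point in configuration terms, a sketch rather than a recipe; the property name comes from the Spark-on-YARN configuration docs and the value is just an example:

val conf = new org.apache.spark.SparkConf()
  .setAppName("long-running-streaming-job")
  // In cluster mode the driver runs inside the YARN ApplicationMaster, so YARN can
  // relaunch the whole application up to this many attempts if it dies.
  .set("spark.yarn.maxAppAttempts", "4")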


-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Michael Segel <ms...@hotmail.com>.
LOL… I hate YARN, but unfortunately I don’t get to make the call on which tools we’re going to use; I just get paid to make stuff work on the tools provided. ;-) 

Testing is somewhat problematic.  You have to really test at some large enough fraction of scale. 
Fortunately this issue (YARN client vs. cluster) comes down to how you launch a job, so it’s really just a matter of benchmarking the time difference. 

But this gets to the question… what are the real differences between client and cluster modes? 

What are the pros/cons and use cases where one has advantages over the other?  
I couldn’t find anything on the web, so I’m here asking these ‘silly’ questions… ;-)

The issue of Mesos over YARN is that you now have to look at Myriad to help glue stuff together, or so I’m told. 
And you now may have to add an additional vendor support contract unless your Hadoop vendor is willing to support Mesos. 

But of course, YMMV, and maybe someone else can add something of more value than my $0.02 worth. ;-) 

-Mike

> On Jun 22, 2016, at 12:04 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> This is exactly the sort of topics that distinguish lab work from enterprise practice :)
> 
> The question on YARN client versus YARN cluster mode. I am not sure how much in real life it is going to make an impact if I choose one over the other?
> 
> These days I yell developers that it is perfectly valid to use Spark local mode to their dev/unit testing. at least they know how to look after it with the help of Spark WEB GUI.
> 
> Also anyone has tried using Mesos instead of Spark?. What does it offer above YARN.
> 
> Cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 22 June 2016 at 19:04, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
> JDBC reliability problem? 
> 
> Ok… a bit more explanation… 
> 
> Usually when you have to go back to a legacy system, its because the data set is usually metadata and is relatively small.  Its not the sort of data that gets ingested in to a data lake unless you’re also ingesting the metadata and are using HBase/MapRDB , Cassandra or something like that. 
> 
> On top of all of this… when dealing with PII, you have to consider data at rest. This is also why you will see some things that make parts of the design a bit counter intuitive and why you start to think of storage/compute models over traditional cluster design. 
> 
> A single threaded JDBC connection to build a local RDD should be fine and I would really be concerned if that was unreliable.  That would mean tools like Spark or Drill would have serious reliability issues when considering legacy system access. 
> 
> And of course I’m still trying to dig in to YARN’s client vs cluster option. 
> 
> 
> thx
> 
>> On Jun 22, 2016, at 9:36 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> Thanks Mike for clarification.
>> 
>> I think there is another option to get data out of RDBMS through some form of SELECT ALL COLUMNS TAB SEPARATED OR OTHER and put them in a flat file or files. scp that file from the RDBMS directory to a private directory on HDFS system  and push it into HDFS. That will by-pass the JDBC reliability problem and I guess in this case one is in more control.
>> 
>> I do concur that there are security issues with this. For example that local file system may have to have encryption etc  that will make it tedious. I believe Jorn mentioned this somewhere.
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 22 June 2016 at 15:59, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
>> Hi, 
>> 
>> Just to clear a few things up… 
>> 
>> First I know its hard to describe some problems because they deal with client confidential information. 
>> (Also some basic ‘dead hooker’ thought problems to work through before facing them at a client.) 
>> The questions I pose here are very general and deal with some basic design issues/consideration. 
>> 
>> 
>> W.R.T JDBC / Beeline:
>> There are many use cases where you don’t want to migrate some or all data to HDFS.  This is why tools like Apache Drill exist. At the same time… there are different cluster design patterns.  One such pattern is a storage/compute model where you have multiple clusters acting either as compute clusters which pull data from storage clusters. An example would be spinning up an EMR cluster and running a M/R job where you read from S3 and output to S3.  Or within your enterprise you have your Data Lake (Data Sewer) and then a compute cluster for analytics. 
>> 
>> In addition, you have some very nasty design issues to deal with like security. Yes, that’s the very dirty word nobody wants to deal with and in most of these tools, security is an afterthought.  
>> So you may not have direct access to the cluster or an edge node. You only have access to a single port on a single machine through the firewall, which is running beeline so you can pull data from your storage cluster. 
>> 
>> Its very possible that you have to pull data from the cluster thru beeline to store the data within a spark job running on the cluster. (Oh the irony! ;-) 
>> 
>> Its important to understand that due to design constraints, options like sqoop or running a query directly against Hive may not be possible and these use cases do exist when dealing with PII information.
>> 
>> Its also important to realize that you may have to pull data from multiple data sources for some not so obvious but important reasons… 
>> So your spark app has to be able to generate an RDD from data in an RDBS (Oracle, DB2, etc …) persist it for local lookups and then pull data from the cluster and then output it either back to the cluster, another cluster, or someplace else.  All of these design issues occur when you’re dealing with large enterprises. 
>> 
>> 
>> But I digress… 
>> 
>> The reason I started this thread was to get a better handle on where things run when we make decisions to run either in client or cluster mode in YARN. 
>> Issues like starting/stopping long running apps are an issue in production.  Tying up cluster resources while your application remains dormant waiting for the next batch of events to come through. 
>> Trying to understand the gotchas of each design consideration so that we can weigh the pros and cons of a decision when designing a solution…. 
>> 
>> 
>> Spark is relatively new and its a bit disruptive because it forces you to think in terms of storage/compute models or Jim Scott’s holistic ‘Zeta Architecture’.  In addition, it forces you to rethink your cluster hardware design too. 
>> 
>> HTH to clarify the questions and of course thanks for the replies. 
>> 
>> -Mike
>> 
>>> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> If you are going to get data out of an RDBMS like Oracle then the correct procedure is:
>>> 
>>> Use Hive on Spark execution engine. That improves Hive performance
>>> You can use JDBC through Spark itself. No issue there. It will use JDBC provided by HiveContext
>>> JDBC is fine. Every vendor has a utility to migrate a full database from one to another using JDBC. For example SAP relies on JDBC to migrate a whole Oracle schema to SAP ASE
>>> I have imported an Oracle table of 1 billion rows through Spark into Hive ORC table. It works fine. Actually I use Spark to do the job for these type of imports. Register a tempTable from DF and use it to put data into Hive table. You can create Hive table explicitly in Spark and do an INSERT/SELECT into it rather than save etc.  
>>> You can access Hive tables through HiveContext. Beeline is a client tool that connects to Hive thrift server. I don't think it comes into equation here
>>> Finally one experiment worth multiples of these speculation. Try for yourself and fine out.
>>> If you want to use JDBC for an RDBMS table then you will need to download the relevant JAR file. For example for Oracle it is ojdbc6.jar etc
>>> Like anything else your mileage varies and need to experiment with it. Otherwise these are all opinions.
>>> 
>>> HTH
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 22 June 2016 at 06:46, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
>>> I would import data via sqoop and put it on HDFS. It has some mechanisms to handle the lack of reliability by jdbc. 
>>> 
>>> Then you can process the data via Spark. You could also use jdbc rdd but I do not recommend to use it, because you do not want to pull data all the time out of the database when you execute your application. Furthermore, you have to handle connection interruptions, the multiple serialization/deserialization efforts, if one executor crashes you have to repull some or all of the data from the database etc
>>> 
>>> Within the cluster it does not make sense to me to pull data via jdbc from hive. All the benefits such as data locality, reliability etc would be gone.
>>> 
>>> Hive supports different execution engines (TEZ, Spark), formats (Orc, parquet) and further optimizations to make the analysis fast. It always depends on your use case.
>>> 
>>> On 22 Jun 2016, at 05:47, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
>>> 
>>>> 
>>>> Sorry, I think you misunderstood. 
>>>> Spark can read from JDBC sources so to say using beeline as a way to access data is not a spark application isn’t really true.  Would you say the same if you were pulling data in to spark from Oracle or DB2? 
>>>> There are a couple of different design patterns and use cases where data could be stored in Hive yet your only access method is via a JDBC or Thift/Rest service.  Think also of compute / storage cluster implementations. 
>>>> 
>>>> WRT to #2, not exactly what I meant, by exposing the data… and there are limitations to the thift service…
>>>> 
>>>>> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>>>>> 
>>>>> 1. Yes, in the sense you control number of executors from spark application config. 
>>>>> 2. Any IO will be done from executors (never ever on driver, unless you explicitly call collect()). For example, connection to a DB happens one for each worker (and used by local executors). Also, if you run a reduceByKey job and write to hdfs, you will find a bunch of files were written from various executors. What happens when you want to expose the data to world: Spark Thrift Server (STS), which is a long running spark application (ie spark context) which can serve data from RDDs. 
>>>>> 
>>>>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  -  
>>>>> This is NOT a spark application, and there is no RDD created. Beeline is just a jdbc client tool. You use beeline to connect to HS2 or STS. 
>>>>> 
>>>>> In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  -- 
>>>>> This is never true. When you connect Hive from spark, spark actually reads hive metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, making spark faster than hive. 
>>>>> 
>>>>> HTH....
>>>>> 
>>>>> Ayan
>>>>> 
>>>>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
>>>>> Ok, its at the end of the day and I’m trying to make sure I understand the locale of where things are running.
>>>>> 
>>>>> I have an application where I have to query a bunch of sources, creating some RDDs and then I need to join off the RDDs and some other lookup tables.
>>>>> 
>>>>> 
>>>>> Yarn has two modes… client and cluster.
>>>>> 
>>>>> I get it that in cluster mode… everything is running on the cluster.
>>>>> But in client mode, the driver is running on the edge node while the workers are running on the cluster.
>>>>> 
>>>>> When I run a sparkSQL command that generates a new RDD, does the result set live on the cluster with the workers, and gets referenced by the driver, or does the result set get migrated to the driver running on the client? (I’m pretty sure I know the answer, but its never safe to assume anything…)
>>>>> 
>>>>> The follow up questions:
>>>>> 
>>>>> 1) If I kill the  app running the driver on the edge node… will that cause YARN to free up the cluster’s resources? (In cluster mode… that doesn’t happen) What happens and how quickly?
>>>>> 
>>>>> 1a) If using the client mode… can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor any portion of the RDDs associated with that executor are gone, however the spark context is still alive on the edge node? [again assuming that the spark context lives with the driver.])
>>>>> 
>>>>> 2) Any I/O between my spark job and the outside world… (e.g. walking through the data set and writing out a data set to a file) will occur on the edge node where the driver is located?  (This may seem kinda silly, but what happens when you want to expose the result set to the world… ? )
>>>>> 
>>>>> Now for something slightly different…
>>>>> 
>>>>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  I realize that I could run Hive on top of spark, but that’s a separate issue. Here the RDD will reside on the client only.  (That is I could in theory run this as a single spark instance.)
>>>>> If I were to run this on the cluster… then the result set would stream thru the beeline gate way and would reside back on the cluster sitting in RDDs within each executor?
>>>>> 
>>>>> I realize that these are silly questions but I need to make sure that I know the flow of the data and where it ultimately resides.  There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-)
>>>>> 
>>>>> TIA,
>>>>> 
>>>>> -Mike
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>>> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Best Regards,
>>>>> Ayan Guha
>>>> 
>>> 
>> 
>> 
> 
> 


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Mich Talebzadeh <mi...@gmail.com>.
These are exactly the sort of topics that distinguish lab work from
enterprise practice :)

On the question of YARN client versus YARN cluster mode: I am not sure how much
of an impact it really makes in real life if I choose one over the other.

These days I tell developers that it is perfectly valid to use Spark local
mode for their dev/unit testing. At least they know how to look after it
with the help of the Spark web GUI.
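
By way of example, the sort of throwaway local-mode check I mean (names invented):

import org.apache.spark.{SparkConf, SparkContext}

object LocalModeSmokeTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("local-smoke-test"))
    val counts = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1)).reduceByKey(_ + _).collectAsMap()
    assert(counts("a") == 2)  // verify the logic locally; the Spark web GUI is at http://localhost:4040 while it runs
    sc.stop()
  }
}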

Also, has anyone tried using Mesos instead of YARN? What does it offer above
YARN?

Cheers

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 19:04, Michael Segel <ms...@hotmail.com> wrote:

> JDBC reliability problem?
>
> Ok… a bit more explanation…
>
> Usually when you have to go back to a legacy system, its because the data
> set is usually metadata and is relatively small.  Its not the sort of data
> that gets ingested in to a data lake unless you’re also ingesting the
> metadata and are using HBase/MapRDB , Cassandra or something like that.
>
> On top of all of this… when dealing with PII, you have to consider data at
> rest. This is also why you will see some things that make parts of the
> design a bit counter intuitive and why you start to think of
> storage/compute models over traditional cluster design.
>
> A single threaded JDBC connection to build a local RDD should be fine and
> I would really be concerned if that was unreliable.  That would mean tools
> like Spark or Drill would have serious reliability issues when considering
> legacy system access.
>
> And of course I’m still trying to dig in to YARN’s client vs cluster
> option.
>
>
> thx
>
> On Jun 22, 2016, at 9:36 AM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Thanks Mike for clarification.
>
> I think there is another option to get data out of RDBMS through some form
> of SELECT ALL COLUMNS TAB SEPARATED OR OTHER and put them in a flat file or
> files. scp that file from the RDBMS directory to a private directory on
> HDFS system  and push it into HDFS. That will by-pass the JDBC reliability
> problem and I guess in this case one is in more control.
>
> I do concur that there are security issues with this. For example that
> local file system may have to have encryption etc  that will make it
> tedious. I believe Jorn mentioned this somewhere.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 June 2016 at 15:59, Michael Segel <ms...@hotmail.com> wrote:
>
>> Hi,
>>
>> Just to clear a few things up…
>>
>> First I know its hard to describe some problems because they deal with
>> client confidential information.
>> (Also some basic ‘dead hooker’ thought problems to work through before
>> facing them at a client.)
>> The questions I pose here are very general and deal with some basic
>> design issues/consideration.
>>
>>
>> W.R.T JDBC / Beeline:
>>
>> There are many use cases where you don’t want to migrate some or all data
>> to HDFS.  This is why tools like Apache Drill exist. At the same time…
>> there are different cluster design patterns.  One such pattern is a
>> storage/compute model where you have multiple clusters acting either as
>> compute clusters which pull data from storage clusters. An example would be
>> spinning up an EMR cluster and running a M/R job where you read from S3 and
>> output to S3.  Or within your enterprise you have your Data Lake (Data
>> Sewer) and then a compute cluster for analytics.
>>
>> In addition, you have some very nasty design issues to deal with like
>> security. Yes, that’s the very dirty word nobody wants to deal with and in
>> most of these tools, security is an afterthought.
>> So you may not have direct access to the cluster or an edge node. You
>> only have access to a single port on a single machine through the firewall,
>> which is running beeline so you can pull data from your storage cluster.
>>
>> Its very possible that you have to pull data from the cluster thru
>> beeline to store the data within a spark job running on the cluster. (Oh
>> the irony! ;-)
>>
>> Its important to understand that due to design constraints, options like
>> sqoop or running a query directly against Hive may not be possible and
>> these use cases do exist when dealing with PII information.
>>
>> Its also important to realize that you may have to pull data from
>> multiple data sources for some not so obvious but important reasons…
>> So your spark app has to be able to generate an RDD from data in an RDBS
>> (Oracle, DB2, etc …) persist it for local lookups and then pull data from
>> the cluster and then output it either back to the cluster, another cluster,
>> or someplace else.  All of these design issues occur when you’re dealing
>> with large enterprises.
>>
>>
>> But I digress…
>>
>> The reason I started this thread was to get a better handle on where
>> things run when we make decisions to run either in client or cluster mode
>> in YARN.
>> Issues like starting/stopping long running apps are an issue in
>> production.  Tying up cluster resources while your application remains
>> dormant waiting for the next batch of events to come through.
>> Trying to understand the gotchas of each design consideration so that we
>> can weigh the pros and cons of a decision when designing a solution….
>>
>>
>> Spark is relatively new and its a bit disruptive because it forces you to
>> think in terms of storage/compute models or Jim Scott’s holistic ‘Zeta
>> Architecture’.  In addition, it forces you to rethink your cluster hardware
>> design too.
>>
>> HTH to clarify the questions and of course thanks for the replies.
>>
>> -Mike
>>
>> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> If you are going to get data out of an RDBMS like Oracle then the correct
>> procedure is:
>>
>>
>>    1. Use Hive on Spark execution engine. That improves Hive performance
>>    2. You can use JDBC through Spark itself. No issue there. It will use
>>    JDBC provided by HiveContext
>>    3. JDBC is fine. Every vendor has a utility to migrate a full
>>    database from one to another using JDBC. For example SAP relies on JDBC to
>>    migrate a whole Oracle schema to SAP ASE
>>    4. I have imported an Oracle table of 1 billion rows through Spark
>>    into Hive ORC table. It works fine. Actually I use Spark to do the job for
>>    these type of imports. Register a tempTable from DF and use it to put data
>>    into Hive table. You can create Hive table explicitly in Spark and do an
>>    INSERT/SELECT into it rather than save etc.
>>    5. You can access Hive tables through HiveContext. Beeline is a
>>    client tool that connects to Hive thrift server. I don't think it comes
>>    into equation here
>>    6. Finally one experiment worth multiples of these speculation. Try
>>    for yourself and fine out.
>>    7. If you want to use JDBC for an RDBMS table then you will need to
>>    download the relevant JAR file. For example for Oracle it is ojdbc6.jar etc
>>
>> Like anything else your mileage varies and need to experiment with it.
>> Otherwise these are all opinions.
>>
>> HTH
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 June 2016 at 06:46, Jörn Franke <jo...@gmail.com> wrote:
>>
>>> I would import data via sqoop and put it on HDFS. It has some mechanisms
>>> to handle the lack of reliability by jdbc.
>>>
>>> Then you can process the data via Spark. You could also use jdbc rdd but
>>> I do not recommend to use it, because you do not want to pull data all the
>>> time out of the database when you execute your application. Furthermore,
>>> you have to handle connection interruptions, the multiple
>>> serialization/deserialization efforts, if one executor crashes you have to
>>> repull some or all of the data from the database etc
>>>
>>> Within the cluster it does not make sense to me to pull data via jdbc
>>> from hive. All the benefits such as data locality, reliability etc would be
>>> gone.
>>>
>>> Hive supports different execution engines (TEZ, Spark), formats (Orc,
>>> parquet) and further optimizations to make the analysis fast. It always
>>> depends on your use case.
>>>
>>> On 22 Jun 2016, at 05:47, Michael Segel <ms...@hotmail.com>
>>> wrote:
>>>
>>>
>>> Sorry, I think you misunderstood.
>>> Spark can read from JDBC sources so to say using beeline as a way to
>>> access data is not a spark application isn’t really true.  Would you say
>>> the same if you were pulling data in to spark from Oracle or DB2?
>>> There are a couple of different design patterns and use cases where data
>>> could be stored in Hive yet your only access method is via a JDBC or
>>> Thift/Rest service.  Think also of compute / storage cluster
>>> implementations.
>>>
>>> WRT to #2, not exactly what I meant, by exposing the data… and there are
>>> limitations to the thift service…
>>>
>>> On Jun 21, 2016, at 5:44 PM, ayan guha <gu...@gmail.com> wrote:
>>>
>>> 1. Yes, in the sense you control number of executors from spark
>>> application config.
>>> 2. Any IO will be done from executors (never ever on driver, unless you
>>> explicitly call collect()). For example, connection to a DB happens one for
>>> each worker (and used by local executors). Also, if you run a reduceByKey
>>> job and write to hdfs, you will find a bunch of files were written from
>>> various executors. What happens when you want to expose the data to world:
>>> Spark Thrift Server (STS), which is a long running spark application (ie
>>> spark context) which can serve data from RDDs.
>>>
>>> Suppose I have a data source… like a couple of hive tables and I access
>>> the tables via beeline. (JDBC)  -
>>> This is NOT a spark application, and there is no RDD created. Beeline is
>>> just a jdbc client tool. You use beeline to connect to HS2 or STS.
>>>
>>> In this case… Hive generates a map/reduce job and then would stream the
>>> result set back to the client node where the RDD result set would be built.
>>>  --
>>> This is never true. When you connect Hive from spark, spark actually
>>> reads hive metastore and streams data directly from HDFS. Hive MR jobs do
>>> not play any role here, making spark faster than hive.
>>>
>>> HTH....
>>>
>>> Ayan
>>>
>>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <
>>> msegel_hadoop@hotmail.com> wrote:
>>>
>>>> Ok, its at the end of the day and I’m trying to make sure I understand
>>>> the locale of where things are running.
>>>>
>>>> I have an application where I have to query a bunch of sources,
>>>> creating some RDDs and then I need to join off the RDDs and some other
>>>> lookup tables.
>>>>
>>>>
>>>> Yarn has two modes… client and cluster.
>>>>
>>>> I get it that in cluster mode… everything is running on the cluster.
>>>> But in client mode, the driver is running on the edge node while the
>>>> workers are running on the cluster.
>>>>
>>>> When I run a sparkSQL command that generates a new RDD, does the result
>>>> set live on the cluster with the workers, and gets referenced by the
>>>> driver, or does the result set get migrated to the driver running on the
>>>> client? (I’m pretty sure I know the answer, but its never safe to assume
>>>> anything…)
>>>>
>>>> The follow up questions:
>>>>
>>>> 1) If I kill the  app running the driver on the edge node… will that
>>>> cause YARN to free up the cluster’s resources? (In cluster mode… that
>>>> doesn’t happen) What happens and how quickly?
>>>>
>>>> 1a) If using the client mode… can I spin up and spin down the number of
>>>> executors on the cluster? (Assuming that when I kill an executor any
>>>> portion of the RDDs associated with that executor are gone, however the
>>>> spark context is still alive on the edge node? [again assuming that the
>>>> spark context lives with the driver.])
>>>>
>>>> 2) Any I/O between my spark job and the outside world… (e.g. walking
>>>> through the data set and writing out a data set to a file) will occur on
>>>> the edge node where the driver is located?  (This may seem kinda silly, but
>>>> what happens when you want to expose the result set to the world… ? )
>>>>
>>>> Now for something slightly different…
>>>>
>>>> Suppose I have a data source… like a couple of hive tables and I access
>>>> the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce
>>>> job and then would stream the result set back to the client node where the
>>>> RDD result set would be built.  I realize that I could run Hive on top of
>>>> spark, but that’s a separate issue. Here the RDD will reside on the client
>>>> only.  (That is I could in theory run this as a single spark instance.)
>>>> If I were to run this on the cluster… then the result set would stream
>>>> thru the beeline gate way and would reside back on the cluster sitting in
>>>> RDDs within each executor?
>>>>
>>>> I realize that these are silly questions but I need to make sure that I
>>>> know the flow of the data and where it ultimately resides.  There really is
>>>> a method to my madness, and if I could explain it… these questions really
>>>> would make sense. ;-)
>>>>
>>>> TIA,
>>>>
>>>> -Mike
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>>
>>
>>
>
>

Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Michael Segel <ms...@hotmail.com>.
JDBC reliability problem? 

Ok… a bit more explanation… 

Usually when you have to go back to a legacy system, it’s because the data set is metadata and is relatively small.  It’s not the sort of data that gets ingested into a data lake unless you’re also ingesting the metadata and are using HBase/MapR-DB, Cassandra or something like that. 

On top of all of this… when dealing with PII, you have to consider data at rest. This is also why you will see some things that make parts of the design a bit counter intuitive and why you start to think of storage/compute models over traditional cluster design. 

A single-threaded JDBC connection to build a local RDD should be fine, and I would really be concerned if that were unreliable.  That would mean tools like Spark or Drill have serious reliability issues when it comes to legacy system access. 
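
In other words, something along these lines; a sketch only, where the connection details and the table are invented and the ojdbc jar is assumed to be on the classpath:

val meta = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//legacy-db-host:1521/SVC")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "REF_SCHEMA.LOOKUP_CODES")   // small metadata table
  .load()
meta.cache()                                      // keep the reference data around for local lookups
meta.registerTempTable("lookup_codes")            // join against it from Spark SQL

If a single small pull like that is flaky, that is a tool problem, not a design problem.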

And of course I’m still trying to dig in to YARN’s client vs cluster option. 


thx

> On Jun 22, 2016, at 9:36 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Thanks Mike for clarification.
> 
> I think there is another option to get data out of RDBMS through some form of SELECT ALL COLUMNS TAB SEPARATED OR OTHER and put them in a flat file or files. scp that file from the RDBMS directory to a private directory on HDFS system  and push it into HDFS. That will by-pass the JDBC reliability problem and I guess in this case one is in more control.
> 
> I do concur that there are security issues with this. For example that local file system may have to have encryption etc  that will make it tedious. I believe Jorn mentioned this somewhere.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 22 June 2016 at 15:59, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
> Hi, 
> 
> Just to clear a few things up… 
> 
> First I know its hard to describe some problems because they deal with client confidential information. 
> (Also some basic ‘dead hooker’ thought problems to work through before facing them at a client.) 
> The questions I pose here are very general and deal with some basic design issues/consideration. 
> 
> 
> W.R.T JDBC / Beeline:
> There are many use cases where you don’t want to migrate some or all data to HDFS.  This is why tools like Apache Drill exist. At the same time… there are different cluster design patterns.  One such pattern is a storage/compute model where you have multiple clusters acting either as compute clusters which pull data from storage clusters. An example would be spinning up an EMR cluster and running a M/R job where you read from S3 and output to S3.  Or within your enterprise you have your Data Lake (Data Sewer) and then a compute cluster for analytics. 
> 
> In addition, you have some very nasty design issues to deal with like security. Yes, that’s the very dirty word nobody wants to deal with and in most of these tools, security is an afterthought.  
> So you may not have direct access to the cluster or an edge node. You only have access to a single port on a single machine through the firewall, which is running beeline so you can pull data from your storage cluster. 
> 
> Its very possible that you have to pull data from the cluster thru beeline to store the data within a spark job running on the cluster. (Oh the irony! ;-) 
> 
> Its important to understand that due to design constraints, options like sqoop or running a query directly against Hive may not be possible and these use cases do exist when dealing with PII information.
> 
> Its also important to realize that you may have to pull data from multiple data sources for some not so obvious but important reasons… 
> So your spark app has to be able to generate an RDD from data in an RDBS (Oracle, DB2, etc …) persist it for local lookups and then pull data from the cluster and then output it either back to the cluster, another cluster, or someplace else.  All of these design issues occur when you’re dealing with large enterprises. 
> 
> 
> But I digress… 
> 
> The reason I started this thread was to get a better handle on where things run when we make decisions to run either in client or cluster mode in YARN. 
> Issues like starting/stopping long running apps are an issue in production.  Tying up cluster resources while your application remains dormant waiting for the next batch of events to come through. 
> Trying to understand the gotchas of each design consideration so that we can weigh the pros and cons of a decision when designing a solution…. 
> 
> 
> Spark is relatively new and its a bit disruptive because it forces you to think in terms of storage/compute models or Jim Scott’s holistic ‘Zeta Architecture’.  In addition, it forces you to rethink your cluster hardware design too. 
> 
> HTH to clarify the questions and of course thanks for the replies. 
> 
> -Mike
> 
>> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>> 
>> If you are going to get data out of an RDBMS like Oracle then the correct procedure is:
>> 
>> Use Hive on Spark execution engine. That improves Hive performance
>> You can use JDBC through Spark itself. No issue there. It will use JDBC provided by HiveContext
>> JDBC is fine. Every vendor has a utility to migrate a full database from one to another using JDBC. For example SAP relies on JDBC to migrate a whole Oracle schema to SAP ASE
>> I have imported an Oracle table of 1 billion rows through Spark into Hive ORC table. It works fine. Actually I use Spark to do the job for these type of imports. Register a tempTable from DF and use it to put data into Hive table. You can create Hive table explicitly in Spark and do an INSERT/SELECT into it rather than save etc.  
>> You can access Hive tables through HiveContext. Beeline is a client tool that connects to Hive thrift server. I don't think it comes into equation here
>> Finally one experiment worth multiples of these speculation. Try for yourself and fine out.
>> If you want to use JDBC for an RDBMS table then you will need to download the relevant JAR file. For example for Oracle it is ojdbc6.jar etc
>> Like anything else your mileage varies and need to experiment with it. Otherwise these are all opinions.
>> 
>> HTH
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 22 June 2016 at 06:46, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
>> I would import data via sqoop and put it on HDFS. It has some mechanisms to handle the lack of reliability by jdbc. 
>> 
>> Then you can process the data via Spark. You could also use jdbc rdd but I do not recommend to use it, because you do not want to pull data all the time out of the database when you execute your application. Furthermore, you have to handle connection interruptions, the multiple serialization/deserialization efforts, if one executor crashes you have to repull some or all of the data from the database etc
>> 
>> Within the cluster it does not make sense to me to pull data via jdbc from hive. All the benefits such as data locality, reliability etc would be gone.
>> 
>> Hive supports different execution engines (TEZ, Spark), formats (Orc, parquet) and further optimizations to make the analysis fast. It always depends on your use case.
>> 
>> On 22 Jun 2016, at 05:47, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
>> 
>>> 
>>> Sorry, I think you misunderstood. 
>>> Spark can read from JDBC sources so to say using beeline as a way to access data is not a spark application isn’t really true.  Would you say the same if you were pulling data in to spark from Oracle or DB2? 
>>> There are a couple of different design patterns and use cases where data could be stored in Hive yet your only access method is via a JDBC or Thift/Rest service.  Think also of compute / storage cluster implementations. 
>>> 
>>> WRT to #2, not exactly what I meant, by exposing the data… and there are limitations to the thift service…
>>> 
>>>> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> 1. Yes, in the sense you control number of executors from spark application config. 
>>>> 2. Any IO will be done from executors (never ever on driver, unless you explicitly call collect()). For example, connection to a DB happens one for each worker (and used by local executors). Also, if you run a reduceByKey job and write to hdfs, you will find a bunch of files were written from various executors. What happens when you want to expose the data to world: Spark Thrift Server (STS), which is a long running spark application (ie spark context) which can serve data from RDDs. 
>>>> 
>>>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  -  
>>>> This is NOT a spark application, and there is no RDD created. Beeline is just a jdbc client tool. You use beeline to connect to HS2 or STS. 
>>>> 
>>>> In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  -- 
>>>> This is never true. When you connect Hive from spark, spark actually reads hive metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, making spark faster than hive. 
>>>> 
>>>> HTH....
>>>> 
>>>> Ayan
>>>> 
>>>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
>>>> Ok, its at the end of the day and I’m trying to make sure I understand the locale of where things are running.
>>>> 
>>>> I have an application where I have to query a bunch of sources, creating some RDDs and then I need to join off the RDDs and some other lookup tables.
>>>> 
>>>> 
>>>> Yarn has two modes… client and cluster.
>>>> 
>>>> I get it that in cluster mode… everything is running on the cluster.
>>>> But in client mode, the driver is running on the edge node while the workers are running on the cluster.
>>>> 
>>>> When I run a sparkSQL command that generates a new RDD, does the result set live on the cluster with the workers, and gets referenced by the driver, or does the result set get migrated to the driver running on the client? (I’m pretty sure I know the answer, but its never safe to assume anything…)
>>>> 
>>>> The follow up questions:
>>>> 
>>>> 1) If I kill the  app running the driver on the edge node… will that cause YARN to free up the cluster’s resources? (In cluster mode… that doesn’t happen) What happens and how quickly?
>>>> 
>>>> 1a) If using the client mode… can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor any portion of the RDDs associated with that executor are gone, however the spark context is still alive on the edge node? [again assuming that the spark context lives with the driver.])
>>>> 
>>>> 2) Any I/O between my spark job and the outside world… (e.g. walking through the data set and writing out a data set to a file) will occur on the edge node where the driver is located?  (This may seem kinda silly, but what happens when you want to expose the result set to the world… ? )
>>>> 
>>>> Now for something slightly different…
>>>> 
>>>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  I realize that I could run Hive on top of spark, but that’s a separate issue. Here the RDD will reside on the client only.  (That is I could in theory run this as a single spark instance.)
>>>> If I were to run this on the cluster… then the result set would stream thru the beeline gate way and would reside back on the cluster sitting in RDDs within each executor?
>>>> 
>>>> I realize that these are silly questions but I need to make sure that I know the flow of the data and where it ultimately resides.  There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-)
>>>> 
>>>> TIA,
>>>> 
>>>> -Mike
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>>> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best Regards,
>>>> Ayan Guha
>>> 
>> 
> 
> 


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks Mike for clarification.

I think there is another option: get the data out of the RDBMS with some form of
SELECT of all columns, tab separated or similar, and put it in a flat file or
files. Then scp that file from the RDBMS host to a private directory on an HDFS
client node and push it into HDFS. That will bypass the JDBC reliability
problem, and I guess in this case one is in more control.
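
Once the flat file has been pushed into HDFS, reading it back in Spark is straightforward. A sketch with an invented path:

val raw  = sc.textFile("hdfs:///staging/rdbms_export/lookup_codes.tsv")
val rows = raw.map(_.split("\t", -1))   // tab separated; -1 keeps empty trailing fields
println(rows.count())                    // sanity-check the row count against the source table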

I do concur that there are security issues with this. For example, that
local file system may have to have encryption etc., which will make it
tedious. I believe Jörn mentioned this somewhere.

HTH

Dr Mich Talebzadeh



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 15:59, Michael Segel <ms...@hotmail.com> wrote:

> Hi,
>
> Just to clear a few things up…
>
> First I know its hard to describe some problems because they deal with
> client confidential information.
> (Also some basic ‘dead hooker’ thought problems to work through before
> facing them at a client.)
> The questions I pose here are very general and deal with some basic design
> issues/consideration.
>
>
> W.R.T JDBC / Beeline:
>
> There are many use cases where you don’t want to migrate some or all data
> to HDFS.  This is why tools like Apache Drill exist. At the same time…
> there are different cluster design patterns.  One such pattern is a
> storage/compute model where you have multiple clusters acting either as
> compute clusters which pull data from storage clusters. An example would be
> spinning up an EMR cluster and running a M/R job where you read from S3 and
> output to S3.  Or within your enterprise you have your Data Lake (Data
> Sewer) and then a compute cluster for analytics.
>
> In addition, you have some very nasty design issues to deal with like
> security. Yes, that’s the very dirty word nobody wants to deal with and in
> most of these tools, security is an afterthought.
> So you may not have direct access to the cluster or an edge node. You only
> have access to a single port on a single machine through the firewall,
> which is running beeline so you can pull data from your storage cluster.
>
> Its very possible that you have to pull data from the cluster thru beeline
> to store the data within a spark job running on the cluster. (Oh the irony!
> ;-)
>
> Its important to understand that due to design constraints, options like
> sqoop or running a query directly against Hive may not be possible and
> these use cases do exist when dealing with PII information.
>
> Its also important to realize that you may have to pull data from multiple
> data sources for some not so obvious but important reasons…
> So your spark app has to be able to generate an RDD from data in an RDBS
> (Oracle, DB2, etc …) persist it for local lookups and then pull data from
> the cluster and then output it either back to the cluster, another cluster,
> or someplace else.  All of these design issues occur when you’re dealing
> with large enterprises.
>
>
> But I digress…
>
> The reason I started this thread was to get a better handle on where
> things run when we make decisions to run either in client or cluster mode
> in YARN.
> Issues like starting/stopping long running apps are an issue in
> production.  Tying up cluster resources while your application remains
> dormant waiting for the next batch of events to come through.
> Trying to understand the gotchas of each design consideration so that we
> can weigh the pros and cons of a decision when designing a solution….
>
>
> Spark is relatively new and its a bit disruptive because it forces you to
> think in terms of storage/compute models or Jim Scott’s holistic ‘Zeta
> Architecture’.  In addition, it forces you to rethink your cluster hardware
> design too.
>
> HTH to clarify the questions and of course thanks for the replies.
>
> -Mike
>
> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> If you are going to get data out of an RDBMS like Oracle then the correct
> procedure is:
>
>
>    1. Use Hive on Spark execution engine. That improves Hive performance
>    2. You can use JDBC through Spark itself. No issue there. It will use
>    JDBC provided by HiveContext
>    3. JDBC is fine. Every vendor has a utility to migrate a full database
>    from one to another using JDBC. For example SAP relies on JDBC to migrate a
>    whole Oracle schema to SAP ASE
>    4. I have imported an Oracle table of 1 billion rows through Spark
>    into Hive ORC table. It works fine. Actually I use Spark to do the job for
>    these type of imports. Register a tempTable from DF and use it to put data
>    into Hive table. You can create Hive table explicitly in Spark and do an
>    INSERT/SELECT into it rather than save etc.
>    5. You can access Hive tables through HiveContext. Beeline is a client
>    tool that connects to Hive thrift server. I don't think it comes into
>    equation here
>    6. Finally one experiment worth multiples of these speculation. Try
>    for yourself and fine out.
>    7. If you want to use JDBC for an RDBMS table then you will need to
>    download the relevant JAR file. For example for Oracle it is ojdbc6.jar etc
>
> Like anything else your mileage varies and need to experiment with it.
> Otherwise these are all opinions.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 June 2016 at 06:46, Jörn Franke <jo...@gmail.com> wrote:
>
>> I would import data via sqoop and put it on HDFS. It has some mechanisms
>> to handle the lack of reliability by jdbc.
>>
>> Then you can process the data via Spark. You could also use jdbc rdd but
>> I do not recommend to use it, because you do not want to pull data all the
>> time out of the database when you execute your application. Furthermore,
>> you have to handle connection interruptions, the multiple
>> serialization/deserialization efforts, if one executor crashes you have to
>> repull some or all of the data from the database etc
>>
>> Within the cluster it does not make sense to me to pull data via jdbc
>> from hive. All the benefits such as data locality, reliability etc would be
>> gone.
>>
>> Hive supports different execution engines (TEZ, Spark), formats (Orc,
>> parquet) and further optimizations to make the analysis fast. It always
>> depends on your use case.
>>
>> On 22 Jun 2016, at 05:47, Michael Segel <ms...@hotmail.com>
>> wrote:
>>
>>
>> Sorry, I think you misunderstood.
>> Spark can read from JDBC sources so to say using beeline as a way to
>> access data is not a spark application isn’t really true.  Would you say
>> the same if you were pulling data in to spark from Oracle or DB2?
>> There are a couple of different design patterns and use cases where data
>> could be stored in Hive yet your only access method is via a JDBC or
>> Thift/Rest service.  Think also of compute / storage cluster
>> implementations.
>>
>> WRT to #2, not exactly what I meant, by exposing the data… and there are
>> limitations to the thift service…
>>
>> On Jun 21, 2016, at 5:44 PM, ayan guha <gu...@gmail.com> wrote:
>>
>> 1. Yes, in the sense you control number of executors from spark
>> application config.
>> 2. Any IO will be done from executors (never ever on driver, unless you
>> explicitly call collect()). For example, connection to a DB happens one for
>> each worker (and used by local executors). Also, if you run a reduceByKey
>> job and write to hdfs, you will find a bunch of files were written from
>> various executors. What happens when you want to expose the data to world:
>> Spark Thrift Server (STS), which is a long running spark application (ie
>> spark context) which can serve data from RDDs.
>>
>> Suppose I have a data source… like a couple of hive tables and I access
>> the tables via beeline. (JDBC)  -
>> This is NOT a spark application, and there is no RDD created. Beeline is
>> just a jdbc client tool. You use beeline to connect to HS2 or STS.
>>
>> In this case… Hive generates a map/reduce job and then would stream the
>> result set back to the client node where the RDD result set would be built.
>>  --
>> This is never true. When you connect Hive from spark, spark actually
>> reads hive metastore and streams data directly from HDFS. Hive MR jobs do
>> not play any role here, making spark faster than hive.
>>
>> HTH....
>>
>> Ayan
>>
>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_hadoop@hotmail.com
>> > wrote:
>>
>>> Ok, its at the end of the day and I’m trying to make sure I understand
>>> the locale of where things are running.
>>>
>>> I have an application where I have to query a bunch of sources, creating
>>> some RDDs and then I need to join off the RDDs and some other lookup tables.
>>>
>>>
>>> Yarn has two modes… client and cluster.
>>>
>>> I get it that in cluster mode… everything is running on the cluster.
>>> But in client mode, the driver is running on the edge node while the
>>> workers are running on the cluster.
>>>
>>> When I run a sparkSQL command that generates a new RDD, does the result
>>> set live on the cluster with the workers, and gets referenced by the
>>> driver, or does the result set get migrated to the driver running on the
>>> client? (I’m pretty sure I know the answer, but its never safe to assume
>>> anything…)
>>>
>>> The follow up questions:
>>>
>>> 1) If I kill the  app running the driver on the edge node… will that
>>> cause YARN to free up the cluster’s resources? (In cluster mode… that
>>> doesn’t happen) What happens and how quickly?
>>>
>>> 1a) If using the client mode… can I spin up and spin down the number of
>>> executors on the cluster? (Assuming that when I kill an executor any
>>> portion of the RDDs associated with that executor are gone, however the
>>> spark context is still alive on the edge node? [again assuming that the
>>> spark context lives with the driver.])
>>>
>>> 2) Any I/O between my spark job and the outside world… (e.g. walking
>>> through the data set and writing out a data set to a file) will occur on
>>> the edge node where the driver is located?  (This may seem kinda silly, but
>>> what happens when you want to expose the result set to the world… ? )
>>>
>>> Now for something slightly different…
>>>
>>> Suppose I have a data source… like a couple of hive tables and I access
>>> the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce
>>> job and then would stream the result set back to the client node where the
>>> RDD result set would be built.  I realize that I could run Hive on top of
>>> spark, but that’s a separate issue. Here the RDD will reside on the client
>>> only.  (That is I could in theory run this as a single spark instance.)
>>> If I were to run this on the cluster… then the result set would stream
>>> thru the beeline gate way and would reside back on the cluster sitting in
>>> RDDs within each executor?
>>>
>>> I realize that these are silly questions but I need to make sure that I
>>> know the flow of the data and where it ultimately resides.  There really is
>>> a method to my madness, and if I could explain it… these questions really
>>> would make sense. ;-)
>>>
>>> TIA,
>>>
>>> -Mike
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>>
>
>

Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Michael Segel <ms...@hotmail.com>.
Hi, 

Just to clear a few things up… 

First, I know it’s hard to describe some problems because they deal with client-confidential information. 
(Also some basic ‘dead hooker’ thought problems to work through before facing them at a client.) 
The questions I pose here are very general and deal with some basic design issues/considerations. 


W.R.T JDBC / Beeline:
There are many use cases where you don’t want to migrate some or all data to HDFS.  This is why tools like Apache Drill exist. At the same time, there are different cluster design patterns.  One such pattern is a storage/compute model where you have separate clusters: compute clusters that pull data from storage clusters. An example would be spinning up an EMR cluster and running a M/R job that reads from S3 and writes back to S3.  Or, within your enterprise, you have your Data Lake (Data Sewer) and then a compute cluster for analytics. 

In addition, you have some very nasty design issues to deal with, like security. Yes, that’s the dirty word nobody wants to deal with, and in most of these tools security is an afterthought.  
So you may not have direct access to the cluster or an edge node. You may only have access to a single port on a single machine through the firewall, and beeline running against that port is how you pull data from your storage cluster. 

It’s very possible that you have to pull data from the cluster through beeline just to materialize it within a Spark job running on the cluster. (Oh, the irony! ;-) 

It’s important to understand that, due to design constraints, options like Sqoop or running a query directly against Hive may not be possible, and these use cases do exist when dealing with PII.

It’s also important to realize that you may have to pull data from multiple data sources for some not-so-obvious but important reasons… 
So your Spark app has to be able to generate an RDD from data in an RDBMS (Oracle, DB2, etc.), persist it for local lookups, then pull data from the cluster, and then write the output either back to the cluster, to another cluster, or someplace else.  All of these design issues come up when you’re dealing with large enterprises. 
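Roughly, that flow might look like the sketch below. This is a sketch only, not anyone’s actual code: the JDBC URL, credentials and table names are hypothetical, and it assumes a Spark 1.x spark-shell where sc is already defined.

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)

    // Small reference table pulled from the RDBMS over JDBC, cached for local lookups.
    val lookupDF = hc.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // hypothetical Oracle endpoint
      .option("dbtable", "REF.LOOKUP_TABLE")                   // hypothetical table
      .option("user", "app_user")
      .option("password", "secret")
      .load()
      .cache()

    // Larger data set already sitting on the cluster (here, a hypothetical Hive table).
    val factsDF = hc.table("warehouse.facts")

    // Join on the cluster and write the result back out (here, to another Hive table).
    val enriched = factsDF.join(lookupDF, "lookup_key")
    enriched.write.mode("overwrite").saveAsTable("warehouse.facts_enriched")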


But I digress… 

The reason I started this thread was to get a better handle on where things run when we choose either client or cluster mode in YARN. 
Starting and stopping long-running apps is an issue in production, as is tying up cluster resources while your application sits dormant waiting for the next batch of events to come through. 
I’m trying to understand the gotchas of each design consideration so that we can weigh the pros and cons when designing a solution…. 
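One concrete knob that speaks to the “dormant app holding resources” worry is dynamic allocation on YARN. A minimal sketch, assuming the external shuffle service is enabled on the NodeManagers and with made-up limits:

    import org.apache.spark.{SparkConf, SparkContext}

    // Let YARN reclaim idle executors while the long-running driver waits for the
    // next batch, and grow the executor count again when work arrives.
    val conf = new SparkConf()
      .setAppName("long-running-app")                              // hypothetical app name
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")                // required for dynamic allocation on YARN
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "50")           // made-up ceiling
      .set("spark.dynamicAllocation.executorIdleTimeout", "120s")

    val sc = new SparkContext(conf)
    // In client mode the driver (and SparkContext) lives on the edge node; in cluster
    // mode it lives in the YARN application master. Either way, idle executors are
    // released back to the cluster after the timeout above.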


Spark is relatively new and it’s a bit disruptive because it forces you to think in terms of storage/compute models or Jim Scott’s holistic ‘Zeta Architecture’.  In addition, it forces you to rethink your cluster hardware design. 

HTH to clarify the questions and of course thanks for the replies. 

-Mike

> On Jun 21, 2016, at 11:17 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> If you are going to get data out of an RDBMS like Oracle then the correct procedure is:
> 
> Use Hive on Spark execution engine. That improves Hive performance
> You can use JDBC through Spark itself. No issue there. It will use JDBC provided by HiveContext
> JDBC is fine. Every vendor has a utility to migrate a full database from one to another using JDBC. For example SAP relies on JDBC to migrate a whole Oracle schema to SAP ASE
> I have imported an Oracle table of 1 billion rows through Spark into Hive ORC table. It works fine. Actually I use Spark to do the job for these type of imports. Register a tempTable from DF and use it to put data into Hive table. You can create Hive table explicitly in Spark and do an INSERT/SELECT into it rather than save etc.  
> You can access Hive tables through HiveContext. Beeline is a client tool that connects to Hive thrift server. I don't think it comes into equation here
> Finally one experiment worth multiples of these speculation. Try for yourself and fine out.
> If you want to use JDBC for an RDBMS table then you will need to download the relevant JAR file. For example for Oracle it is ojdbc6.jar etc
> Like anything else your mileage varies and need to experiment with it. Otherwise these are all opinions.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 22 June 2016 at 06:46, Jörn Franke <jornfranke@gmail.com <ma...@gmail.com>> wrote:
> I would import data via sqoop and put it on HDFS. It has some mechanisms to handle the lack of reliability by jdbc. 
> 
> Then you can process the data via Spark. You could also use jdbc rdd but I do not recommend to use it, because you do not want to pull data all the time out of the database when you execute your application. Furthermore, you have to handle connection interruptions, the multiple serialization/deserialization efforts, if one executor crashes you have to repull some or all of the data from the database etc
> 
> Within the cluster it does not make sense to me to pull data via jdbc from hive. All the benefits such as data locality, reliability etc would be gone.
> 
> Hive supports different execution engines (TEZ, Spark), formats (Orc, parquet) and further optimizations to make the analysis fast. It always depends on your use case.
> 
> On 22 Jun 2016, at 05:47, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
> 
>> 
>> Sorry, I think you misunderstood. 
>> Spark can read from JDBC sources so to say using beeline as a way to access data is not a spark application isn’t really true.  Would you say the same if you were pulling data in to spark from Oracle or DB2? 
>> There are a couple of different design patterns and use cases where data could be stored in Hive yet your only access method is via a JDBC or Thift/Rest service.  Think also of compute / storage cluster implementations. 
>> 
>> WRT to #2, not exactly what I meant, by exposing the data… and there are limitations to the thift service…
>> 
>>> On Jun 21, 2016, at 5:44 PM, ayan guha <guha.ayan@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> 1. Yes, in the sense you control number of executors from spark application config. 
>>> 2. Any IO will be done from executors (never ever on driver, unless you explicitly call collect()). For example, connection to a DB happens one for each worker (and used by local executors). Also, if you run a reduceByKey job and write to hdfs, you will find a bunch of files were written from various executors. What happens when you want to expose the data to world: Spark Thrift Server (STS), which is a long running spark application (ie spark context) which can serve data from RDDs. 
>>> 
>>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  -  
>>> This is NOT a spark application, and there is no RDD created. Beeline is just a jdbc client tool. You use beeline to connect to HS2 or STS. 
>>> 
>>> In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  -- 
>>> This is never true. When you connect Hive from spark, spark actually reads hive metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, making spark faster than hive. 
>>> 
>>> HTH....
>>> 
>>> Ayan
>>> 
>>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
>>> Ok, its at the end of the day and I’m trying to make sure I understand the locale of where things are running.
>>> 
>>> I have an application where I have to query a bunch of sources, creating some RDDs and then I need to join off the RDDs and some other lookup tables.
>>> 
>>> 
>>> Yarn has two modes… client and cluster.
>>> 
>>> I get it that in cluster mode… everything is running on the cluster.
>>> But in client mode, the driver is running on the edge node while the workers are running on the cluster.
>>> 
>>> When I run a sparkSQL command that generates a new RDD, does the result set live on the cluster with the workers, and gets referenced by the driver, or does the result set get migrated to the driver running on the client? (I’m pretty sure I know the answer, but its never safe to assume anything…)
>>> 
>>> The follow up questions:
>>> 
>>> 1) If I kill the  app running the driver on the edge node… will that cause YARN to free up the cluster’s resources? (In cluster mode… that doesn’t happen) What happens and how quickly?
>>> 
>>> 1a) If using the client mode… can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor any portion of the RDDs associated with that executor are gone, however the spark context is still alive on the edge node? [again assuming that the spark context lives with the driver.])
>>> 
>>> 2) Any I/O between my spark job and the outside world… (e.g. walking through the data set and writing out a data set to a file) will occur on the edge node where the driver is located?  (This may seem kinda silly, but what happens when you want to expose the result set to the world… ? )
>>> 
>>> Now for something slightly different…
>>> 
>>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  I realize that I could run Hive on top of spark, but that’s a separate issue. Here the RDD will reside on the client only.  (That is I could in theory run this as a single spark instance.)
>>> If I were to run this on the cluster… then the result set would stream thru the beeline gate way and would reside back on the cluster sitting in RDDs within each executor?
>>> 
>>> I realize that these are silly questions but I need to make sure that I know the flow of the data and where it ultimately resides.  There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-)
>>> 
>>> TIA,
>>> 
>>> -Mike
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
>>> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
>> 
> 


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Mich Talebzadeh <mi...@gmail.com>.
If you are going to get data out of an RDBMS like Oracle then the correct
procedure is:


   1. Use Hive on Spark execution engine. That improves Hive performance
   2. You can use JDBC through Spark itself. No issue there. It will use
   JDBC provided by HiveContext
   3. JDBC is fine. Every vendor has a utility to migrate a full database
   from one to another using JDBC. For example SAP relies on JDBC to migrate a
   whole Oracle schema to SAP ASE
   4. I have imported an Oracle table of 1 billion rows through Spark into a
   Hive ORC table. It works fine. Actually, I use Spark to do the job for these
   types of imports: register a tempTable from the DF and use it to put data
   into the Hive table. You can create the Hive table explicitly in Spark and
   do an INSERT/SELECT into it rather than save etc. (a sketch follows after
   this list).
   5. You can access Hive tables through HiveContext. Beeline is a client
   tool that connects to Hive thrift server. I don't think it comes into
   equation here
   6. Finally, one experiment is worth multiples of these speculations. Try it
   for yourself and find out.
   7. If you want to use JDBC for an RDBMS table then you will need to
   download the relevant JDBC driver JAR file. For example, for Oracle it is
   ojdbc6.jar, etc.
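A minimal sketch of points 2, 4 and 7 together, assuming Spark 1.x APIs, a spark-shell where sc is already defined, ojdbc6.jar on the classpath, and hypothetical table names, columns and credentials:

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)

    // Pull the Oracle table into a DataFrame over JDBC.
    val oracleDF = hc.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//orahost:1521/ORCL")   // hypothetical
      .option("dbtable", "SCOTT.BIG_TABLE")                     // hypothetical
      .option("user", "scott")
      .option("password", "tiger")
      .load()

    // Register a tempTable from the DF, then INSERT/SELECT into an ORC-backed Hive table.
    oracleDF.registerTempTable("tmp_big_table")
    hc.sql("CREATE TABLE IF NOT EXISTS mydb.big_table_orc (id INT, name STRING) STORED AS ORC")  // made-up columns
    hc.sql("INSERT INTO TABLE mydb.big_table_orc SELECT id, name FROM tmp_big_table")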

Like anything else, your mileage may vary and you need to experiment with it.
Otherwise these are all opinions.

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 22 June 2016 at 06:46, Jörn Franke <jo...@gmail.com> wrote:

> I would import data via sqoop and put it on HDFS. It has some mechanisms
> to handle the lack of reliability by jdbc.
>
> Then you can process the data via Spark. You could also use jdbc rdd but I
> do not recommend to use it, because you do not want to pull data all the
> time out of the database when you execute your application. Furthermore,
> you have to handle connection interruptions, the multiple
> serialization/deserialization efforts, if one executor crashes you have to
> repull some or all of the data from the database etc
>
> Within the cluster it does not make sense to me to pull data via jdbc from
> hive. All the benefits such as data locality, reliability etc would be gone.
>
> Hive supports different execution engines (TEZ, Spark), formats (Orc,
> parquet) and further optimizations to make the analysis fast. It always
> depends on your use case.
>
> On 22 Jun 2016, at 05:47, Michael Segel <ms...@hotmail.com> wrote:
>
>
> Sorry, I think you misunderstood.
> Spark can read from JDBC sources so to say using beeline as a way to
> access data is not a spark application isn’t really true.  Would you say
> the same if you were pulling data in to spark from Oracle or DB2?
> There are a couple of different design patterns and use cases where data
> could be stored in Hive yet your only access method is via a JDBC or
> Thift/Rest service.  Think also of compute / storage cluster
> implementations.
>
> WRT to #2, not exactly what I meant, by exposing the data… and there are
> limitations to the thift service…
>
> On Jun 21, 2016, at 5:44 PM, ayan guha <gu...@gmail.com> wrote:
>
> 1. Yes, in the sense you control number of executors from spark
> application config.
> 2. Any IO will be done from executors (never ever on driver, unless you
> explicitly call collect()). For example, connection to a DB happens one for
> each worker (and used by local executors). Also, if you run a reduceByKey
> job and write to hdfs, you will find a bunch of files were written from
> various executors. What happens when you want to expose the data to world:
> Spark Thrift Server (STS), which is a long running spark application (ie
> spark context) which can serve data from RDDs.
>
> Suppose I have a data source… like a couple of hive tables and I access
> the tables via beeline. (JDBC)  -
> This is NOT a spark application, and there is no RDD created. Beeline is
> just a jdbc client tool. You use beeline to connect to HS2 or STS.
>
> In this case… Hive generates a map/reduce job and then would stream the
> result set back to the client node where the RDD result set would be built.
>  --
> This is never true. When you connect Hive from spark, spark actually reads
> hive metastore and streams data directly from HDFS. Hive MR jobs do not
> play any role here, making spark faster than hive.
>
> HTH....
>
> Ayan
>
> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <ms...@hotmail.com>
> wrote:
>
>> Ok, its at the end of the day and I’m trying to make sure I understand
>> the locale of where things are running.
>>
>> I have an application where I have to query a bunch of sources, creating
>> some RDDs and then I need to join off the RDDs and some other lookup tables.
>>
>>
>> Yarn has two modes… client and cluster.
>>
>> I get it that in cluster mode… everything is running on the cluster.
>> But in client mode, the driver is running on the edge node while the
>> workers are running on the cluster.
>>
>> When I run a sparkSQL command that generates a new RDD, does the result
>> set live on the cluster with the workers, and gets referenced by the
>> driver, or does the result set get migrated to the driver running on the
>> client? (I’m pretty sure I know the answer, but its never safe to assume
>> anything…)
>>
>> The follow up questions:
>>
>> 1) If I kill the  app running the driver on the edge node… will that
>> cause YARN to free up the cluster’s resources? (In cluster mode… that
>> doesn’t happen) What happens and how quickly?
>>
>> 1a) If using the client mode… can I spin up and spin down the number of
>> executors on the cluster? (Assuming that when I kill an executor any
>> portion of the RDDs associated with that executor are gone, however the
>> spark context is still alive on the edge node? [again assuming that the
>> spark context lives with the driver.])
>>
>> 2) Any I/O between my spark job and the outside world… (e.g. walking
>> through the data set and writing out a data set to a file) will occur on
>> the edge node where the driver is located?  (This may seem kinda silly, but
>> what happens when you want to expose the result set to the world… ? )
>>
>> Now for something slightly different…
>>
>> Suppose I have a data source… like a couple of hive tables and I access
>> the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce
>> job and then would stream the result set back to the client node where the
>> RDD result set would be built.  I realize that I could run Hive on top of
>> spark, but that’s a separate issue. Here the RDD will reside on the client
>> only.  (That is I could in theory run this as a single spark instance.)
>> If I were to run this on the cluster… then the result set would stream
>> thru the beeline gate way and would reside back on the cluster sitting in
>> RDDs within each executor?
>>
>> I realize that these are silly questions but I need to make sure that I
>> know the flow of the data and where it ultimately resides.  There really is
>> a method to my madness, and if I could explain it… these questions really
>> would make sense. ;-)
>>
>> TIA,
>>
>> -Mike
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>

Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Jörn Franke <jo...@gmail.com>.
I would import the data via Sqoop and put it on HDFS. Sqoop has some mechanisms to handle the lack of reliability of JDBC. 

Then you can process the data via Spark. You could also use the JDBC RDD, but I do not recommend it, because you do not want to pull data out of the database every time you execute your application. Furthermore, you have to handle connection interruptions and the multiple serialization/deserialization efforts, and if one executor crashes you have to re-pull some or all of the data from the database, etc.
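A rough sketch of that split follows; the Sqoop command appears only as a comment, and the paths, connection string and delimiter are made up (again assuming a spark-shell where sc exists):

    // Hypothetical one-off import, run outside Spark:
    //   sqoop import --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
    //     --username scott --password-file /user/scott/.pw \
    //     --table BIG_TABLE --target-dir /data/landing/big_table \
    //     --num-mappers 8 --fields-terminated-by '\t'

    // Spark then reads the landed files from HDFS instead of hitting the database on every run.
    val raw  = sc.textFile("hdfs:///data/landing/big_table")
    val rows = raw.map(_.split("\t"))      // matches the delimiter used by the import above
    println(rows.count())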

Within the cluster it does not make sense to me to pull data via JDBC from Hive. All the benefits, such as data locality, reliability, etc., would be gone.

Hive supports different execution engines (Tez, Spark), formats (ORC, Parquet) and further optimizations to make the analysis fast. It always depends on your use case.

> On 22 Jun 2016, at 05:47, Michael Segel <ms...@hotmail.com> wrote:
> 
> 
> Sorry, I think you misunderstood. 
> Spark can read from JDBC sources so to say using beeline as a way to access data is not a spark application isn’t really true.  Would you say the same if you were pulling data in to spark from Oracle or DB2? 
> There are a couple of different design patterns and use cases where data could be stored in Hive yet your only access method is via a JDBC or Thift/Rest service.  Think also of compute / storage cluster implementations. 
> 
> WRT to #2, not exactly what I meant, by exposing the data… and there are limitations to the thift service…
> 
>> On Jun 21, 2016, at 5:44 PM, ayan guha <gu...@gmail.com> wrote:
>> 
>> 1. Yes, in the sense you control number of executors from spark application config. 
>> 2. Any IO will be done from executors (never ever on driver, unless you explicitly call collect()). For example, connection to a DB happens one for each worker (and used by local executors). Also, if you run a reduceByKey job and write to hdfs, you will find a bunch of files were written from various executors. What happens when you want to expose the data to world: Spark Thrift Server (STS), which is a long running spark application (ie spark context) which can serve data from RDDs. 
>> 
>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  -  
>> This is NOT a spark application, and there is no RDD created. Beeline is just a jdbc client tool. You use beeline to connect to HS2 or STS. 
>> 
>> In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  -- 
>> This is never true. When you connect Hive from spark, spark actually reads hive metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, making spark faster than hive. 
>> 
>> HTH....
>> 
>> Ayan
>> 
>>> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <ms...@hotmail.com> wrote:
>>> Ok, its at the end of the day and I’m trying to make sure I understand the locale of where things are running.
>>> 
>>> I have an application where I have to query a bunch of sources, creating some RDDs and then I need to join off the RDDs and some other lookup tables.
>>> 
>>> 
>>> Yarn has two modes… client and cluster.
>>> 
>>> I get it that in cluster mode… everything is running on the cluster.
>>> But in client mode, the driver is running on the edge node while the workers are running on the cluster.
>>> 
>>> When I run a sparkSQL command that generates a new RDD, does the result set live on the cluster with the workers, and gets referenced by the driver, or does the result set get migrated to the driver running on the client? (I’m pretty sure I know the answer, but its never safe to assume anything…)
>>> 
>>> The follow up questions:
>>> 
>>> 1) If I kill the  app running the driver on the edge node… will that cause YARN to free up the cluster’s resources? (In cluster mode… that doesn’t happen) What happens and how quickly?
>>> 
>>> 1a) If using the client mode… can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor any portion of the RDDs associated with that executor are gone, however the spark context is still alive on the edge node? [again assuming that the spark context lives with the driver.])
>>> 
>>> 2) Any I/O between my spark job and the outside world… (e.g. walking through the data set and writing out a data set to a file) will occur on the edge node where the driver is located?  (This may seem kinda silly, but what happens when you want to expose the result set to the world… ? )
>>> 
>>> Now for something slightly different…
>>> 
>>> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  I realize that I could run Hive on top of spark, but that’s a separate issue. Here the RDD will reside on the client only.  (That is I could in theory run this as a single spark instance.)
>>> If I were to run this on the cluster… then the result set would stream thru the beeline gate way and would reside back on the cluster sitting in RDDs within each executor?
>>> 
>>> I realize that these are silly questions but I need to make sure that I know the flow of the data and where it ultimately resides.  There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-)
>>> 
>>> TIA,
>>> 
>>> -Mike
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> Ayan Guha
> 

Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by ayan guha <gu...@gmail.com>.
I may be wrong here, but beeline is basically a JDBC client. So you
"connect" to STS and/or HS2 using beeline.

Spark connecting over JDBC is a different discussion and is in no way related
to beeline. When you read data from a DB (Oracle, DB2, etc.), you do not use
beeline; you use a JDBC connection to the DB.
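To make the distinction concrete, a Spark-side JDBC read of an RDBMS table looks roughly like this, with no beeline involved (driver class, URL and table are hypothetical, and sqlContext is the one a Spark 1.x shell provides):

    // Spark pulls directly from the database through its JDBC data source.
    val db2DF = sqlContext.read.format("jdbc")
      .option("url", "jdbc:db2://db2host:50000/SAMPLE")     // hypothetical DB2 endpoint
      .option("driver", "com.ibm.db2.jcc.DB2Driver")        // driver jar must be on the classpath
      .option("dbtable", "MYSCHEMA.CUSTOMERS")              // hypothetical table
      .option("user", "db2user")
      .option("password", "secret")
      .load()

    db2DF.show(5)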

#2: agreed on the Thrift limitations, but I am not sure there is any other
mechanism to share data with other systems such as APIs/BI tools, etc.
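One mechanism worth mentioning for that last point: Spark can start a HiveThriftServer2 inside a running application, so cached temp tables become queryable over JDBC/ODBC by BI tools. A hedged sketch (query and table name are made up; assumes a HiveContext named hiveContext and the Spark 1.x hive-thriftserver module on the classpath):

    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // Build and cache an in-memory result set.
    val results = hiveContext.sql("SELECT region, count(*) AS n FROM events GROUP BY region")  // hypothetical
    results.registerTempTable("region_counts")
    hiveContext.cacheTable("region_counts")

    // External JDBC/ODBC clients (beeline, BI tools) can now query region_counts
    // through this application's Thrift endpoint.
    HiveThriftServer2.startWithContext(hiveContext)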

On Wed, Jun 22, 2016 at 1:47 PM, Michael Segel <ms...@hotmail.com>
wrote:

>
> Sorry, I think you misunderstood.
> Spark can read from JDBC sources so to say using beeline as a way to
> access data is not a spark application isn’t really true.  Would you say
> the same if you were pulling data in to spark from Oracle or DB2?
> There are a couple of different design patterns and use cases where data
> could be stored in Hive yet your only access method is via a JDBC or
> Thift/Rest service.  Think also of compute / storage cluster
> implementations.
>
> WRT to #2, not exactly what I meant, by exposing the data… and there are
> limitations to the thift service…
>
> On Jun 21, 2016, at 5:44 PM, ayan guha <gu...@gmail.com> wrote:
>
> 1. Yes, in the sense you control number of executors from spark
> application config.
> 2. Any IO will be done from executors (never ever on driver, unless you
> explicitly call collect()). For example, connection to a DB happens one for
> each worker (and used by local executors). Also, if you run a reduceByKey
> job and write to hdfs, you will find a bunch of files were written from
> various executors. What happens when you want to expose the data to world:
> Spark Thrift Server (STS), which is a long running spark application (ie
> spark context) which can serve data from RDDs.
>
> Suppose I have a data source… like a couple of hive tables and I access
> the tables via beeline. (JDBC)  -
> This is NOT a spark application, and there is no RDD created. Beeline is
> just a jdbc client tool. You use beeline to connect to HS2 or STS.
>
> In this case… Hive generates a map/reduce job and then would stream the
> result set back to the client node where the RDD result set would be built.
>  --
> This is never true. When you connect Hive from spark, spark actually reads
> hive metastore and streams data directly from HDFS. Hive MR jobs do not
> play any role here, making spark faster than hive.
>
> HTH....
>
> Ayan
>
> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <ms...@hotmail.com>
> wrote:
>
>> Ok, its at the end of the day and I’m trying to make sure I understand
>> the locale of where things are running.
>>
>> I have an application where I have to query a bunch of sources, creating
>> some RDDs and then I need to join off the RDDs and some other lookup tables.
>>
>>
>> Yarn has two modes… client and cluster.
>>
>> I get it that in cluster mode… everything is running on the cluster.
>> But in client mode, the driver is running on the edge node while the
>> workers are running on the cluster.
>>
>> When I run a sparkSQL command that generates a new RDD, does the result
>> set live on the cluster with the workers, and gets referenced by the
>> driver, or does the result set get migrated to the driver running on the
>> client? (I’m pretty sure I know the answer, but its never safe to assume
>> anything…)
>>
>> The follow up questions:
>>
>> 1) If I kill the  app running the driver on the edge node… will that
>> cause YARN to free up the cluster’s resources? (In cluster mode… that
>> doesn’t happen) What happens and how quickly?
>>
>> 1a) If using the client mode… can I spin up and spin down the number of
>> executors on the cluster? (Assuming that when I kill an executor any
>> portion of the RDDs associated with that executor are gone, however the
>> spark context is still alive on the edge node? [again assuming that the
>> spark context lives with the driver.])
>>
>> 2) Any I/O between my spark job and the outside world… (e.g. walking
>> through the data set and writing out a data set to a file) will occur on
>> the edge node where the driver is located?  (This may seem kinda silly, but
>> what happens when you want to expose the result set to the world… ? )
>>
>> Now for something slightly different…
>>
>> Suppose I have a data source… like a couple of hive tables and I access
>> the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce
>> job and then would stream the result set back to the client node where the
>> RDD result set would be built.  I realize that I could run Hive on top of
>> spark, but that’s a separate issue. Here the RDD will reside on the client
>> only.  (That is I could in theory run this as a single spark instance.)
>> If I were to run this on the cluster… then the result set would stream
>> thru the beeline gate way and would reside back on the cluster sitting in
>> RDDs within each executor?
>>
>> I realize that these are silly questions but I need to make sure that I
>> know the flow of the data and where it ultimately resides.  There really is
>> a method to my madness, and if I could explain it… these questions really
>> would make sense. ;-)
>>
>> TIA,
>>
>> -Mike
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
>
>


-- 
Best Regards,
Ayan Guha

Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by Michael Segel <ms...@hotmail.com>.
Sorry, I think you misunderstood. 
Spark can read from JDBC sources, so saying that using beeline (i.e. a JDBC endpoint) as a way to access data is not a Spark application isn’t really true.  Would you say the same if you were pulling data into Spark from Oracle or DB2? 
There are a couple of different design patterns and use cases where data could be stored in Hive yet your only access method is via a JDBC or Thrift/REST service.  Think also of compute/storage cluster implementations. 

WRT #2, that’s not exactly what I meant by exposing the data… and there are limitations to the Thrift service…
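For what it’s worth, Spark’s generic JDBC data source can in principle be pointed at the same HiveServer2/STS endpoint that beeline talks to. The sketch below is illustrative only (hypothetical gateway host and table); the Hive JDBC driver has to be on the classpath, and this path gives up data locality and is far slower than reading the underlying files:

    // Reading through the JDBC gateway, i.e. the only port the firewall exposes in the
    // scenario described earlier in the thread. Sketch only.
    val overJdbc = sqlContext.read.format("jdbc")      // sqlContext as provided by a Spark 1.x shell
      .option("url", "jdbc:hive2://gateway.example.com:10000/default")  // hypothetical endpoint
      .option("driver", "org.apache.hive.jdbc.HiveDriver")
      .option("dbtable", "sales_summary")                               // hypothetical table
      .load()

    println(overJdbc.count())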

> On Jun 21, 2016, at 5:44 PM, ayan guha <gu...@gmail.com> wrote:
> 
> 1. Yes, in the sense you control number of executors from spark application config. 
> 2. Any IO will be done from executors (never ever on driver, unless you explicitly call collect()). For example, connection to a DB happens one for each worker (and used by local executors). Also, if you run a reduceByKey job and write to hdfs, you will find a bunch of files were written from various executors. What happens when you want to expose the data to world: Spark Thrift Server (STS), which is a long running spark application (ie spark context) which can serve data from RDDs. 
> 
> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  -  
> This is NOT a spark application, and there is no RDD created. Beeline is just a jdbc client tool. You use beeline to connect to HS2 or STS. 
> 
> In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  -- 
> This is never true. When you connect Hive from spark, spark actually reads hive metastore and streams data directly from HDFS. Hive MR jobs do not play any role here, making spark faster than hive. 
> 
> HTH....
> 
> Ayan
> 
> On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <msegel_hadoop@hotmail.com <ma...@hotmail.com>> wrote:
> Ok, its at the end of the day and I’m trying to make sure I understand the locale of where things are running.
> 
> I have an application where I have to query a bunch of sources, creating some RDDs and then I need to join off the RDDs and some other lookup tables.
> 
> 
> Yarn has two modes… client and cluster.
> 
> I get it that in cluster mode… everything is running on the cluster.
> But in client mode, the driver is running on the edge node while the workers are running on the cluster.
> 
> When I run a sparkSQL command that generates a new RDD, does the result set live on the cluster with the workers, and gets referenced by the driver, or does the result set get migrated to the driver running on the client? (I’m pretty sure I know the answer, but its never safe to assume anything…)
> 
> The follow up questions:
> 
> 1) If I kill the  app running the driver on the edge node… will that cause YARN to free up the cluster’s resources? (In cluster mode… that doesn’t happen) What happens and how quickly?
> 
> 1a) If using the client mode… can I spin up and spin down the number of executors on the cluster? (Assuming that when I kill an executor any portion of the RDDs associated with that executor are gone, however the spark context is still alive on the edge node? [again assuming that the spark context lives with the driver.])
> 
> 2) Any I/O between my spark job and the outside world… (e.g. walking through the data set and writing out a data set to a file) will occur on the edge node where the driver is located?  (This may seem kinda silly, but what happens when you want to expose the result set to the world… ? )
> 
> Now for something slightly different…
> 
> Suppose I have a data source… like a couple of hive tables and I access the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce job and then would stream the result set back to the client node where the RDD result set would be built.  I realize that I could run Hive on top of spark, but that’s a separate issue. Here the RDD will reside on the client only.  (That is I could in theory run this as a single spark instance.)
> If I were to run this on the cluster… then the result set would stream thru the beeline gate way and would reside back on the cluster sitting in RDDs within each executor?
> 
> I realize that these are silly questions but I need to make sure that I know the flow of the data and where it ultimately resides.  There really is a method to my madness, and if I could explain it… these questions really would make sense. ;-)
> 
> TIA,
> 
> -Mike
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org <ma...@spark.apache.org>
> For additional commands, e-mail: user-help@spark.apache.org <ma...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha


Re: Silly question about Yarn client vs Yarn cluster modes...

Posted by ayan guha <gu...@gmail.com>.
1. Yes, in the sense that you control the number of executors from the Spark
application config.
2. Any I/O will be done from the executors (never on the driver, unless you
explicitly call collect()). For example, a connection to a DB is opened once
for each worker (and used by the local executors). Also, if you run a
reduceByKey job and write to HDFS, you will find that a bunch of files were
written from various executors (see the sketch below). As for what happens
when you want to expose the data to the world: the Spark Thrift Server (STS),
which is a long-running Spark application (i.e. a Spark context) that can
serve data from its RDDs.
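A tiny illustration of that split, with made-up HDFS paths and assuming a spark-shell where sc exists:

    // Each task writes its own part-file from its executor; nothing funnels through the driver.
    val counts = sc.textFile("hdfs:///data/events")
      .map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/event_counts")   // written by the executors, one file per partition

    // Only an explicit collect() pulls the results back into the driver process.
    val onDriver = counts.collect()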

Suppose I have a data source… like a couple of hive tables and I access the
tables via beeline. (JDBC)  -
This is NOT a Spark application, and there is no RDD created. Beeline is
just a JDBC client tool. You use beeline to connect to HS2 or STS.

In this case… Hive generates a map/reduce job and then would stream the
result set back to the client node where the RDD result set would be built.
 --
This is never true. When you connect to Hive from Spark, Spark actually reads
the Hive metastore and streams data directly from HDFS. Hive MR jobs do not
play any role here, which is part of what makes Spark faster than Hive.
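In other words, the Spark-native path is simply the following (table and column names are hypothetical, hiveContext as in a Spark 1.x shell with Hive support):

    // HiveContext resolves the table through the metastore; the executors then read the
    // underlying HDFS files directly. No Hive MR job and no beeline in the data path.
    val df = hiveContext.table("mydb.some_table")
    df.groupBy("some_column").count().show()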

HTH....

Ayan

On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <ms...@hotmail.com>
wrote:

> Ok, its at the end of the day and I’m trying to make sure I understand the
> locale of where things are running.
>
> I have an application where I have to query a bunch of sources, creating
> some RDDs and then I need to join off the RDDs and some other lookup tables.
>
>
> Yarn has two modes… client and cluster.
>
> I get it that in cluster mode… everything is running on the cluster.
> But in client mode, the driver is running on the edge node while the
> workers are running on the cluster.
>
> When I run a sparkSQL command that generates a new RDD, does the result
> set live on the cluster with the workers, and gets referenced by the
> driver, or does the result set get migrated to the driver running on the
> client? (I’m pretty sure I know the answer, but its never safe to assume
> anything…)
>
> The follow up questions:
>
> 1) If I kill the  app running the driver on the edge node… will that cause
> YARN to free up the cluster’s resources? (In cluster mode… that doesn’t
> happen) What happens and how quickly?
>
> 1a) If using the client mode… can I spin up and spin down the number of
> executors on the cluster? (Assuming that when I kill an executor any
> portion of the RDDs associated with that executor are gone, however the
> spark context is still alive on the edge node? [again assuming that the
> spark context lives with the driver.])
>
> 2) Any I/O between my spark job and the outside world… (e.g. walking
> through the data set and writing out a data set to a file) will occur on
> the edge node where the driver is located?  (This may seem kinda silly, but
> what happens when you want to expose the result set to the world… ? )
>
> Now for something slightly different…
>
> Suppose I have a data source… like a couple of hive tables and I access
> the tables via beeline. (JDBC)  In this case… Hive generates a map/reduce
> job and then would stream the result set back to the client node where the
> RDD result set would be built.  I realize that I could run Hive on top of
> spark, but that’s a separate issue. Here the RDD will reside on the client
> only.  (That is I could in theory run this as a single spark instance.)
> If I were to run this on the cluster… then the result set would stream
> thru the beeline gate way and would reside back on the cluster sitting in
> RDDs within each executor?
>
> I realize that these are silly questions but I need to make sure that I
> know the flow of the data and where it ultimately resides.  There really is
> a method to my madness, and if I could explain it… these questions really
> would make sense. ;-)
>
> TIA,
>
> -Mike
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>


-- 
Best Regards,
Ayan Guha