Posted to user@spark.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2016/02/11 23:30:26 UTC

Question on Spark architecture and DAG

Hi,

I have used Hive on the Spark engine (with Hive tables, of course), and it is
pretty impressive compared with Hive on the MR engine.

 

Let us assume that I use the Spark shell. The Spark shell is a client that
connects to the Spark master running on a given host and port, like below:

spark-shell --master spark://50.140.197.217:7077

Once I connect, I create an RDD to read a text file:

val oralog = sc.textFile("/test/alert_mydb.log")

I then search for the word "Errors" in that file:

oralog.filter(line => line.contains("Errors")).collect().foreach(line => println(line))

 

Questions:

 

1. In order to display the lines (the result set) containing the word
"Errors", the content of the file (i.e. the blocks on HDFS) needs to be read
into memory. Is my understanding correct that, as per the RDD notes, those
blocks from the file will be partitioned across the cluster and each node
will have its share of blocks in memory? (A sketch of how to inspect this
follows the list.)
2. Once the result set is ready, it needs to be sent back to the client
that made the connection to the master. I guess this is a simple TCP
operation, much like any relational database sending a result set back?
3. Once the results are returned, if no request has been made to keep
the data in memory, will those blocks in memory be discarded?
4. Regardless of the storage block size on disk (128MB, 256MB, etc.),
memory pages are typically 2K in relational databases. Is this the case in
Spark as well?
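
A minimal sketch (not from the original mail) of how the partitioning in
question 1 could be inspected from the same spark-shell session, using the
file path above:

// each HDFS block of the file normally becomes one RDD partition
val oralog = sc.textFile("/test/alert_mydb.log")
println(oralog.partitions.length)                  // number of partitions
oralog.partitions.foreach(p => println(p.index))   // one entry per partition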

Thanks,

 Mich Talebzadeh

 

LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only; if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free;
therefore neither Peridale Technology Ltd, its subsidiaries nor their
employees accept any responsibility.

 

 


RE: Question on Spark architecture and DAG

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks Andy, much appreciated.

 

Mich Talebzadeh

 



Re: Question on Spark architecture and DAG

Posted by Andy Davidson <An...@SantaCruzIntegration.com>.

From:  Mich Talebzadeh <mi...@peridale.co.uk>
Date:  Thursday, February 11, 2016 at 2:30 PM
To:  "user @spark" <us...@spark.apache.org>
Subject:  Question on Spark architecture and DAG

> Hi,
> 
> I have used Hive on the Spark engine (with Hive tables, of course), and it is
> pretty impressive compared with Hive on the MR engine.
> 
>  
> 
> Let us assume that I use the Spark shell. The Spark shell is a client that
> connects to the Spark master running on a given host and port, like below:
> 
> spark-shell --master spark://50.140.197.217:7077
> 
> Once I connect, I create an RDD to read a text file:
> 
> val oralog = sc.textFile("/test/alert_mydb.log")
> 
> I then search for the word "Errors" in that file:
> 
> oralog.filter(line => line.contains("Errors")).collect().foreach(line => println(line))
> 
>  
> 
> Questions:
> 
>  
> 1. In order to display the lines (the result set) containing the word
> "Errors", the content of the file (i.e. the blocks on HDFS) needs to be read
> into memory. Is my understanding correct that, as per the RDD notes, those
> blocks from the file will be partitioned across the cluster and each node
> will have its share of blocks in memory?


Typically results are written to disk; for example, look at
rdd.saveAsTextFile(). You can also use "collect" to copy the RDD data into
the driver's local memory, but you need to be careful that all of the data
will fit in that memory.
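
As a minimal sketch (reusing the oralog RDD from the question; the output
path is a made-up example):

val errors = oralog.filter(line => line.contains("Errors"))

errors.saveAsTextFile("/test/errors_out")  // executors write the results to HDFS
val all    = errors.collect()              // copies everything to the driver; must fit in its memory
val sample = errors.take(10)               // safer: brings back only the first 10 matching lines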

> 2. Once the result set is ready, it needs to be sent back to the client that
> has made the connection to the master. I guess this is a simple TCP
> operation, much like any relational database sending a result set back?


I run several Spark Streaming apps. One collects data, does some cleanup,
and publishes the results to downstream systems using ActiveMQ. Some of our
other apps just write to a socket.
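
For illustration only, a sketch of the socket variant (the "lines" DStream
and the downstream host/port are assumptions, not Andy's actual code):

import java.io.PrintWriter
import java.net.Socket

lines.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // one connection per partition, opened on the executor that holds it
    val socket = new Socket("downstream-host", 9999)  // hypothetical endpoint
    val out = new PrintWriter(socket.getOutputStream, true)
    partition.foreach(line => out.println(line))
    out.close()
    socket.close()
  }
}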

> 3. Once the results are returned, if no request has been made to keep the
> data in memory, will those blocks in memory be discarded?

There are a couple of things to consider. For example, when your batch job
completes, all of its memory is returned. Programmatically, you can make an
RDD persistent or cause it to be cached in memory.
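
For example, a minimal sketch of explicit caching, reusing the RDD from the
question:

import org.apache.spark.storage.StorageLevel

val errors = oralog.filter(line => line.contains("Errors"))
errors.persist(StorageLevel.MEMORY_ONLY)  // or simply errors.cache()
errors.count()      // the first action materialises the cached partitions
errors.count()      // later actions reuse the in-memory blocks
errors.unpersist()  // explicitly releases the blocks when finished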

> 4. Regardless of the storage block size on disk (128MB, 256MB, etc.), memory
> pages are typically 2K in relational databases. Is this the case in Spark as
> well?
> 
> Thanks,
> 
>  Mich Talebzadeh
> 
>  