Posted to user@spark.apache.org by Jacob Eisinger <je...@us.ibm.com> on 2014/04/25 17:23:01 UTC

Securing Spark's Network

 Howdy,

We tried running Spark 0.9.1 stand-alone inside docker containers distributed over multiple hosts. This is complicated due to Spark opening up ephemeral / dynamic ports for the workers and the CLI.  To ensure our docker solution doesn't break Spark in unexpected ways and maintains a secure cluster, I am interested in understanding more about Spark's network architecture. I'd appreciate it if you could point us to any documentation!

A couple specific questions:

1. What are these ports being used for?
Checking out the code / experiments, it looks like asynchronous communication for shuffling around results. Anything else?

2. How do you secure the network?
Network administrators tend to secure and monitor the network at the port level. If these ports are dynamic and open randomly, firewalls are not easily configured and security alarms are raised. Is there a way to limit the range easily? (We did investigate setting the kernel parameter ip_local_reserved_ports, but this is broken [1] on some versions of Linux's cgroups.)

Thanks,
Jacob

[1] https://github.com/lxc/lxc/issues/97 

Jacob D. Eisinger
IBM Emerging Technologies
jeising@us.ibm.com - (512) 286-6075

Re: File list read into single RDD

Posted by Pat Ferrel <pa...@gmail.com>.
Thanks, this really helps.

As long as I stick to HDFS paths and files, I’m good. I do know that code a bit, but I have never used it to, say, take input from one cluster via “hdfs://server:port/path” and output to another via “hdfs://another-server:another-port/path”. This seems to be supported by Spark, so I’ll have to go back and look at how to do this in the HDFS API.

Specifically, I’ll need to examine the directory/file structure on one cluster and then check some things on what is potentially another cluster before writing output. I have usually assumed only one HDFS instance, so it may just be a matter of me being more careful and preserving full URIs. In the past I may have assumed that output goes to the same dir tree as the input; maybe it’s a matter of being more scrupulous about that assumption.

It’s a bit hard to test this case since I have never really had access to two clusters, so I’ll have to develop some new habits at least.

On May 18, 2014, at 11:13 AM, Andrew Ash <an...@andrewash.com> wrote:

Spark's sc.textFile() method delegates to sc.hadoopFile(), which uses Hadoop's FileInputFormat.setInputPaths() call.  There is no alternate storage system; Spark just delegates to Hadoop for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so you can use Spark on data in S3 using s3:/// just the same as you would with HDFS.  See Apache's documentation on S3 for more details.

As far as interacting with a FileSystem (HDFS or other) to list files, delete files, navigate paths, etc. from your driver program, you should be able to just instantiate a FileSystem object and use the normal Hadoop APIs from there.  The Apache getting started docs on reading/writing from Hadoop DFS should work the same for non-HDFS examples too.
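For example, a minimal sketch of that approach in Scala (the paths, regex, and variable names are purely illustrative; it assumes the Hadoop configuration for the target cluster is available to the driver and that sc is your SparkContext):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Resolve a FileSystem from the URI's scheme (hdfs://, s3://, file://, ...),
    // list a directory, keep the files whose names match a regex, and hand the
    // matching paths to Spark as one comma-separated string.
    val dir = new Path("hdfs://namenode:8020/path/to/dir")   // hypothetical URI
    val fs  = FileSystem.get(dir.toUri, new Configuration())

    val matching = fs.listStatus(dir)                        // non-recursive listing
      .map(_.getPath.toString)
      .filter(_.matches(".*\\.log"))                         // hypothetical pattern

    val rdd = sc.textFile(matching.mkString(","))            // one RDD, many files

The same code should work against another cluster (or another scheme) as long as full URIs are used, since FileSystem.get picks the implementation from the URI's scheme and authority.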

I do think we could use a little "recipe" in our documentation to make interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind sharing, we can format it for including in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel <pa...@gmail.com> wrote:
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI. Since Spark supports several FS schemes I’m unclear about how much to assume about using the hadoop file systems APIs and conventions. Concretely if I pass a pattern in with a HTTPS file system, will the pattern work? 

How does Spark implement its storage system? This seems to be an abstraction level beyond what is available in HDFS. In order to preserve that flexibility what APIs should I be using? It would be easy to say, HDFS only and use HDFS APIs but that would seem to limit things. Especially where you would like to read from one cluster and write to another. This is not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m ok but if I need to examine the structure of the file system, I’m unclear how I should do it without sacrificing Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud <ch...@kelkoo.com> wrote:

Hi,

You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile('{/path/to/file1,/path/to/file2}')
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:
> Not that I know of. We were discussing it on another thread and it came up. 
> 
> I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs. 
> 
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
> 
> But that's not obvious.
> 
> Nick
> 
> On Monday, April 28, 2014, Pat Ferrel <pa...@gmail.com> wrote:
> Perfect. 
> 
> BTW just so I know where to look next time, was that in some docs?
> 
> On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <ni...@gmail.com> wrote:
> 
> Yep, as I just found out, you can also provide 
> sc.textFile() with a comma-delimited string of all the files you want to load.
> 
> For example:
> 
> sc.textFile('/path/to/file1,/path/to/file2')
> So once you have your list of files, concatenate their paths like that and pass the single string to 
> textFile().
> 
> Nick
> 
> 
> 
> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pa...@gmail.com> wrote:
> sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs then a union RDD can be created. Is there a way to create a single RDD from a URI list?
> 
> 






Re: File list read into single RDD

Posted by Andrew Ash <an...@andrewash.com>.
Spark's sc.textFile() <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456>
method delegates to sc.hadoopFile(), which uses Hadoop's
FileInputFormat.setInputPaths() <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L546> call.
 There is no alternate storage system; Spark just delegates to Hadoop
for the .textFile() call.

Hadoop can also support multiple URI schemes, not just hdfs:/// paths, so
you can use Spark on data in S3 using s3:/// just the same as you would
with HDFS.  See Apache's documentation on
S3 <https://wiki.apache.org/hadoop/AmazonS3> for
more details.

As far as interacting with a FileSystem (HDFS or other) to list files,
delete files, navigate paths, etc. from your driver program, you should be
able to just instantiate a FileSystem object and use the normal Hadoop APIs
from there.  The Apache getting started docs on reading/writing from Hadoop
DFS <https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample> should work
the same for non-HDFS examples too.

I do think we could use a little "recipe" in our documentation to make
interacting with HDFS a bit more straightforward.

Pat, if you get something that covers your case that you don't mind
sharing, we can format it for including in future Spark docs.

Cheers!
Andrew


On Sun, May 18, 2014 at 9:13 AM, Pat Ferrel <pa...@gmail.com> wrote:

> Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI.
> Since Spark supports several FS schemes I’m unclear about how much to
> assume about using the hadoop file systems APIs and conventions. Concretely
> if I pass a pattern in with a HTTPS file system, will the pattern work?
>
> How does Spark implement its storage system? This seems to be an
> abstraction level beyond what is available in HDFS. In order to preserve
> that flexibility what APIs should I be using? It would be easy to say, HDFS
> only and use HDFS APIs but that would seem to limit things. Especially
> where you would like to read from one cluster and write to another. This is
> not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.
>
> If I can stick to passing URIs to sc.textFile() I’m ok but if I need to
> examine the structure of the file system, I’m unclear how I should do it
> without sacrificing Spark’s flexibility.
>
> On Apr 29, 2014, at 12:55 AM, Christophe Préaud <
> christophe.preaud@kelkoo.com> wrote:
>
>  Hi,
>
> You can also use any path pattern as defined here:
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
>
> e.g.:
>
> sc.textFile('{/path/to/file1,/path/to/file2}')
>
> Christophe.
>
> On 29/04/2014 05:07, Nicholas Chammas wrote:
>
> Not that I know of. We were discussing it on another thread and it came
> up.
>
>  I think if you look up the Hadoop FileInputFormat API (which Spark uses)
> you'll see it mentioned there in the docs.
>
>
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
>
>  But that's not obvious.
>
>  Nick
>
> On Monday, April 28, 2014, Pat Ferrel <pa...@gmail.com> wrote:
>
>> Perfect.
>>
>>  BTW just so I know where to look next time, was that in some docs?
>>
>>   On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <
>> nicholas.chammas@gmail.com> wrote:
>>
>>  Yep, as I just found out, you can also provide sc.textFile() with a
>> comma-delimited string of all the files you want to load.
>>
>> For example:
>>
>> sc.textFile('/path/to/file1,/path/to/file2')
>>
>> So once you have your list of files, concatenate their paths like that
>> and pass the single string to textFile().
>>
>> Nick
>>
>>
>> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pa...@gmail.com> wrote:
>>
>>> sc.textFile(URI) supports reading multiple files in parallel but only
>>> with a wildcard. I need to walk a dir tree, match a regex to create a list
>>> of files, then I’d like to read them into a single RDD in parallel. I
>>> understand these could go into separate RDDs then a union RDD can be
>>> created. Is there a way to create a single RDD from a URI list?
>>
>>
>>
>>
>
>
>

Re: File list read into single RDD

Posted by Pat Ferrel <pa...@gmail.com>.
Doesn’t using an HDFS path pattern then restrict the URI to an HDFS URI? Since Spark supports several FS schemes, I’m unclear about how much to assume about using the Hadoop file system APIs and conventions. Concretely, if I pass in a pattern with an HTTPS file system, will the pattern work?

How does Spark implement its storage system? This seems to be an abstraction level beyond what is available in HDFS. In order to preserve that flexibility, what APIs should I be using? It would be easy to say HDFS only and use HDFS APIs, but that would seem to limit things, especially where you would like to read from one cluster and write to another. This is not so easy to do inside the HDFS APIs, or is advanced beyond my knowledge.

If I can stick to passing URIs to sc.textFile() I’m OK, but if I need to examine the structure of the file system, I’m unclear how I should do it without sacrificing Spark’s flexibility.
 
On Apr 29, 2014, at 12:55 AM, Christophe Préaud <ch...@kelkoo.com> wrote:

Hi,

You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:
sc.textFile('{/path/to/file1,/path/to/file2}')
Christophe.

On 29/04/2014 05:07, Nicholas Chammas wrote:
> Not that I know of. We were discussing it on another thread and it came up. 
> 
> I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs. 
> 
> http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html
> 
> But that's not obvious.
> 
> Nick
> 
> On Monday, April 28, 2014, Pat Ferrel <pa...@gmail.com> wrote:
> Perfect. 
> 
> BTW just so I know where to look next time, was that in some docs?
> 
> On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <ni...@gmail.com> wrote:
> 
> Yep, as I just found out, you can also provide 
> sc.textFile() with a comma-delimited string of all the files you want to load.
> 
> For example:
> 
> sc.textFile('/path/to/file1,/path/to/file2')
> So once you have your list of files, concatenate their paths like that and pass the single string to 
> textFile().
> 
> Nick
> 
> 
> 
> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pa...@gmail.com> wrote:
> sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs then a union RDD can be created. Is there a way to create a single RDD from a URI list?
> 
> 




Re: File list read into single RDD

Posted by Christophe Préaud <ch...@kelkoo.com>.
Hi,

You can also use any path pattern as defined here: http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29

e.g.:

sc.textFile('{/path/to/file1,/path/to/file2}')

Christophe.
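If you want the expanded file list in the driver (rather than handing the pattern straight to sc.textFile()), a small sketch along the same lines could call FileSystem.globStatus, which implements this pattern syntax (paths here are purely illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Expand a glob pattern into concrete paths, then read them as one RDD.
    val pattern = new Path("hdfs://namenode:8020/path/to/{file1,file2}")  // hypothetical
    val fs = FileSystem.get(pattern.toUri, new Configuration())
    val files = fs.globStatus(pattern).map(_.getPath.toString)
    val rdd = sc.textFile(files.mkString(","))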

On 29/04/2014 05:07, Nicholas Chammas wrote:
Not that I know of. We were discussing it on another thread and it came up.

I think if you look up the Hadoop FileInputFormat API (which Spark uses) you'll see it mentioned there in the docs.

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html

But that's not obvious.

Nick

On Monday, April 28, 2014, Pat Ferrel <pa...@gmail.com> wrote:
Perfect.

BTW just so I know where to look next time, was that in some docs?

On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:


Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load.

For example:

sc.textFile('/path/to/file1,/path/to/file2')


So once you have your list of files, concatenate their paths like that and pass the single string to textFile().

Nick


On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs then a union RDD can be created. Is there a way to create a single RDD from a URI list?





Re: File list read into single RDD

Posted by Nicholas Chammas <ni...@gmail.com>.
Not that I know of. We were discussing it on another thread and it came up.

I think if you look up the Hadoop FileInputFormat API (which Spark uses)
you'll see it mentioned there in the docs.

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/FileInputFormat.html

But that's not obvious.

Nick

On Monday, April 28, 2014, Pat Ferrel <pa...@gmail.com> wrote:

> Perfect.
>
> BTW just so I know where to look next time, was that in some docs?
>
> On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <nicholas.chammas@gmail.com>
> wrote:
>
> Yep, as I just found out, you can also provide sc.textFile() with a
> comma-delimited string of all the files you want to load.
>
> For example:
>
> sc.textFile('/path/to/file1,/path/to/file2')
>
> So once you have your list of files, concatenate their paths like that and
> pass the single string to textFile().
>
> Nick
>
>
> On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pat.ferrel@gmail.com>
> > wrote:
>
>> sc.textFile(URI) supports reading multiple files in parallel but only
>> with a wildcard. I need to walk a dir tree, match a regex to create a list
>> of files, then I’d like to read them into a single RDD in parallel. I
>> understand these could go into separate RDDs then a union RDD can be
>> created. Is there a way to create a single RDD from a URI list?
>
>
>
>

Re: File list read into single RDD

Posted by Pat Ferrel <pa...@gmail.com>.
Perfect. 

BTW just so I know where to look next time, was that in some docs?

On Apr 28, 2014, at 7:04 PM, Nicholas Chammas <ni...@gmail.com> wrote:

Yep, as I just found out, you can also provide sc.textFile() with a comma-delimited string of all the files you want to load.

For example:

sc.textFile('/path/to/file1,/path/to/file2')
So once you have your list of files, concatenate their paths like that and pass the single string to textFile().

Nick



On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pa...@gmail.com> wrote:
sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs then a union RDD can be created. Is there a way to create a single RDD from a URI list?



Re: File list read into single RDD

Posted by Nicholas Chammas <ni...@gmail.com>.
Yep, as I just found out, you can also provide sc.textFile() with a
comma-delimited string of all the files you want to load.

For example:

sc.textFile('/path/to/file1,/path/to/file2')

So once you have your list of files, concatenate their paths like that and
pass the single string to textFile().
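For instance, a quick sketch (where files stands in for whatever list your own directory walk / regex step produced):

    // Join the paths into one comma-separated argument for textFile().
    val files = Seq("/path/to/file1", "/path/to/file2")   // hypothetical list
    val rdd = sc.textFile(files.mkString(","))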

Nick


On Mon, Apr 28, 2014 at 7:23 PM, Pat Ferrel <pa...@gmail.com> wrote:

> sc.textFile(URI) supports reading multiple files in parallel but only with
> a wildcard. I need to walk a dir tree, match a regex to create a list of
> files, then I’d like to read them into a single RDD in parallel. I
> understand these could go into separate RDDs then a union RDD can be
> created. Is there a way to create a single RDD from a URI list?

File list read into single RDD

Posted by Pat Ferrel <pa...@gmail.com>.
sc.textFile(URI) supports reading multiple files in parallel but only with a wildcard. I need to walk a dir tree, match a regex to create a list of files, then I’d like to read them into a single RDD in parallel. I understand these could go into separate RDDs then a union RDD can be created. Is there a way to create a single RDD from a URI list?

Re: Securing Spark's Network

Posted by Jacob Eisinger <je...@us.ibm.com>.
Thanks Akhil!  It definitely looks like all ports are open to all nodes in the Spark cluster on EC2.

The ephemeral ports [1] (the bottom ports you mentioned) are the ones that I am interested in, particularly from the security and containerization perspective.  Opening inbound ports in a large range like this is hard to monitor, secure, and control for a network administrator.  I would like to understand why this architecture was chosen, or whether an alternative design might have more benefits:

A. Port Range Configuration (see the sketch after this list)
Allow for configuration of the range of ports a job can use.
pros - easy to configure; uses standard Spark configuration
cons - large ranges of ports must still be opened; at deployment time, the number of jobs that a worker can process needs to be determined

B. VPN
Requires a dedicated VPN endpoint.
pros - secured
cons - the VPN may become the bottleneck for network communication

C. Proxy Requests
Enable a set of ports for all jobs to use; deploy a proxy that redirects traffic to the correct job, which listens only on the local loopback adapter, lo.
pros - simple for the network administrator to monitor
cons - increases the size of messages by adding the job id to each message instead of the port; the proxy may slow network access a bit and increase CPU on workers

Does the concern about opening ports in a large range make sense to you all?  Do the possible solutions make sense?
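As a rough sketch of what option A could look like from the application side, assuming a Spark release that exposes per-service port settings (spark.driver.port is the one such key I'm aware of in the 0.9.x line; most other *.port settings arrived in later releases, so the worker-side shuffle ports would still be ephemeral here):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: pin the driver's listening port so a firewall rule
    // can name it explicitly. App name, master URL, and port are hypothetical.
    val conf = new SparkConf()
      .setAppName("port-limited-app")
      .setMaster("spark://master:7077")
      .set("spark.driver.port", "51000")
    val sc = new SparkContext(conf)

Even then, this only pins the driver side; the firewall would still need rules for whatever range the workers end up binding, so options B or C might be needed on top of it.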

Thanks,
Jacob

[1] http://en.wikipedia.org/wiki/Ephemeral_port - Linux ports 32768 to 61000

Jacob D. Eisinger
IBM Emerging Technologies
jeising@us.ibm.com - (512) 286-6075

-----akhil@mobipulse.in wrote: -----
To: user@spark.apache.org
From: Akhil Das 
Sent by: akhil@mobipulse.in
Date: 04/25/2014 05:07PM
Subject: Re: Securing Spark's Network

Hi Jacob,

This is what the security groups/ports look like on EC2:

As you can see, it opens all TCP/UDP access to all nodes using the security group sg-a817b70 (these are probably the slaves), and it also opens the following ports to the public:
22: SSH
4040-4045: Workers
5080: Apache (for ganglia)
8080-8081: MasterUI
10000: Shark Server (custom in my case; I integrated the ec2 scripts with a WebUI)
19999, 50030, 50070, 60070: Not sure what these are




On Sat, Apr 26, 2014 at 2:54 AM, Jacob Eisinger <je...@us.ibm.com> wrote:
  
 Howdy Akhil,
 
 Thanks - that did help!  And, it made me think about how the EC2 scripts work [1] to set up security.  From my understanding of EC2 security groups [2], this just sets up external access, right?  (This has no effect on internal communication between the instances, right?)
   
 I am still confused as to why I am seeing the workers open up new ports for each job.
 
 Jacob
 
 [1] https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L230
   [2] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group
  

 
 Jacob D. Eisinger
 IBM Emerging Technologies
 jeising@us.ibm.com - (512) 286-6075
 
   
 From:        Akhil Das <ak...@sigmoidanalytics.com>
 To:        user@spark.apache.org
 Date:        04/25/2014 12:51 PM
 Subject:        Re: Securing Spark's Network
 Sent by:        akhil@mobipulse.in
 

 
 
 Hi Jacob,
 
 This post might give you a brief idea about the ports being used 
 
 https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA
 
 
 
 
 
 On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger <je...@us.ibm.com> wrote: 
Howdy,
 
 We tried running Spark 0.9.1 stand-alone inside docker containers distributed over multiple hosts. This is complicated due to Spark opening up ephemeral / dynamic ports for the workers and the CLI.  To ensure our docker solution doesn't break Spark in unexpected ways and maintains a secure cluster, I am interested in understanding more about Spark's network architecture. I'd appreciate it if you could point us to any documentation!
   
 A couple specific questions:

 1. What are these ports being used for?
 Checking out the code / experiments, it looks like asynchronous communication for shuffling around results. Anything else?

 2. How do you secure the network?
 Network administrators tend to secure and monitor the network at the port level. If these ports are dynamic and open randomly, firewalls are not easily configured and security alarms are raised. Is there a way to limit the range easily? (We did investigate setting the kernel parameter ip_local_reserved_ports, but this is broken [1] on some versions of Linux's cgroups.)
 
 Thanks,
 Jacob
 
 [1] https://github.com/lxc/lxc/issues/97 
 
 Jacob D. Eisinger
 IBM Emerging Technologies
 jeising@us.ibm.com - (512) 286-6075   
  
 

Re: Securing Spark's Network

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Jacob,

This is what the security groups/ports look like on EC2:


[image: Inline image 1]

As you can see, it opens all TCP/UDP access to all nodes using the
security group sg-a817b70 (these are probably the slaves), and it also
opens the following ports to the public:

22: SSH
4040-4045: Workers
5080: Apache (for ganglia)
8080-8081: MasterUI
10000: Shark Server (custom in my case; I integrated the ec2 scripts
with a WebUI)
19999, 50030, 50070, 60070: Not sure what these are




On Sat, Apr 26, 2014 at 2:54 AM, Jacob Eisinger <je...@us.ibm.com> wrote:

> Howdy Akhil,
>
> Thanks - that did help!  And, it made me think about how the EC2 scripts
> work [1] to set up security.  From my understanding of EC2 security groups
> [2], this just sets up external access, right?  (This has no effect on
> internal communication between the instances, right?)
>
> I am still confused as to why I am seeing the workers open up new ports
> for each job.
>
> Jacob
>
> [1] https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L230
> [2]
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group
>
>
> Jacob D. Eisinger
> IBM Emerging Technologies
> jeising@us.ibm.com - (512) 286-6075
>
>
> From: Akhil Das <ak...@sigmoidanalytics.com>
> To: user@spark.apache.org
> Date: 04/25/2014 12:51 PM
> Subject: Re: Securing Spark's Network
> Sent by: akhil@mobipulse.in
> ------------------------------
>
>
>
> Hi Jacob,
>
> This post might give you a brief idea about the ports being used
>
> https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA
>
>
>
>
>
> On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger <j...@us.ibm.com>
> wrote:
>
>    Howdy,
>
>    We tried running Spark 0.9.1 stand-alone inside docker containers
>    distributed over multiple hosts. This is complicated due to Spark opening
>    up ephemeral / dynamic ports for the workers and the CLI.  To ensure our
>    docker solution doesn't break Spark in unexpected ways and maintains a
>    secure cluster, I am interested in understanding more about Spark's network
>    architecture. I'd appreciate it if you could point us to any
>    documentation!
>
>    A couple specific questions:
>    1. What are these ports being used for?
>
>       Checking out the code / experiments, it looks like asynchronous
>       communication for shuffling around results. Anything else?
>
>       2. How do you secure the network?
>
>       Network administrators tend to secure and monitor the network at
>       the port level. If these ports are dynamic and open randomly, firewalls are
>       not easily configured and security alarms are raised. Is there a way to
>       limit the range easily? (We did investigate setting the kernel parameter
>       ip_local_reserved_ports, but this is broken [1] on some versions of Linux's
>       cgroups.)
>
>    Thanks,
>    Jacob
>
>    [1] https://github.com/lxc/lxc/issues/97
>
>
>    Jacob D. Eisinger
>    IBM Emerging Technologies
> jeising@us.ibm.com - (512) 286-6075
>
>
>

Re: Securing Spark's Network

Posted by Jacob Eisinger <je...@us.ibm.com>.
Howdy Akhil,

Thanks - that did help!  And, it made me think about how the EC2 scripts
work [1] to set up security.  From my understanding of EC2 security groups
[2], this just sets up external access, right?  (This has no effect on
internal communication between the instances, right?)

I am still confused as to why I am seeing the workers open up new ports for
each job.

Jacob

[1] https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L230
[2]
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group

Jacob D. Eisinger
IBM Emerging Technologies
jeising@us.ibm.com - (512) 286-6075



From:	Akhil Das <ak...@sigmoidanalytics.com>
To:	user@spark.apache.org
Date:	04/25/2014 12:51 PM
Subject:	Re: Securing Spark's Network
Sent by:	akhil@mobipulse.in



Hi Jacob,

This post might give you a brief idea about the ports being used

https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA





On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger <je...@us.ibm.com> wrote:
  Howdy,

  We tried running Spark 0.9.1 stand-alone inside docker containers
  distributed over multiple hosts. This is complicated due to Spark opening
  up ephemeral / dynamic ports for the workers and the CLI.  To ensure our
  docker solution doesn't break Spark in unexpected ways and maintains a
  secure cluster, I am interested in understanding more about Spark's
  network architecture. I'd appreciate it if you could point us to any
  documentation!

  A couple specific questions:
     1.	What are these ports being used for?
        Checking out the code / experiments, it looks like asynchronous
        communication for shuffling around results. Anything else?
     2.	How do you secure the network?
        Network administrators tend to secure and monitor the network at
        the port level. If these ports are dynamic and open randomly,
        firewalls are not easily configured and security alarms are raised.
        Is there a way to limit the range easily? (We did investigate
        setting the kernel parameter ip_local_reserved_ports, but this is
        broken [1] on some versions of Linux's cgroups.)

  Thanks,
  Jacob

  [1] https://github.com/lxc/lxc/issues/97

  Jacob D. Eisinger
  IBM Emerging Technologies
  jeising@us.ibm.com - (512) 286-6075

Re: Securing Spark's Network

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Jacob,

This post might give you a brief idea about the ports being used

https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA





On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger <je...@us.ibm.com> wrote:

> Howdy,
>
> We tried running Spark 0.9.1 stand-alone inside docker containers
> distributed over multiple hosts. This is complicated due to Spark opening
> up ephemeral / dynamic ports for the workers and the CLI.  To ensure our
> docker solution doesn't break Spark in unexpected ways and maintains a
> secure cluster, I am interested in understanding more about Spark's network
> architecture. I'd appreciate it if you could point us to any
> documentation!
>
> A couple specific questions:
>
>    1. What are these ports being used for?
>    Checking out the code / experiments, it looks like asynchronous
>    communication for shuffling around results. Anything else?
>    2. How do you secure the network?
>    Network administrators tend to secure and monitor the network at the
>    port level. If these ports are dynamic and open randomly, firewalls are not
>    easily configured and security alarms are raised. Is there a way to limit
>    the range easily? (We did investigate setting the kernel parameter
>    ip_local_reserved_ports, but this is broken [1] on some versions of Linux's
>    cgroups.)
>
>
> Thanks,
> Jacob
>
> [1] https://github.com/lxc/lxc/issues/97
>
> Jacob D. Eisinger
> IBM Emerging Technologies
> jeising@us.ibm.com - (512) 286-6075