Posted to user@spark.apache.org by "Wilbert S." <wi...@gmail.com> on 2020/06/01 12:19:55 UTC

Re: Spark Security

Hello,

This is what happens when I load the data using sparklyr::spark_read_csv()
in R. It creates a "derby.log" file that says something along the lines of:

Sun May 31 14:17:02 EDT 2020:
Booting Derby version The Apache Software Foundation - Apache Derby -
10.12.1.1 - (1704137): instance xxxxxxx
on database directory memory:C:\Users\wseoane\2020-05-31 sparklyr on three
rows\databaseName=metastore_db with class loader
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$xxxxxxxxx
Loaded from
file:/C:/Users/wseoane/AppData/Local/spark/spark-2.4.3-bin-hadoop2.7/jars/derby-10.12.1.1.jar
java.vendor=Oracle Corporation
java.runtime.version=1.8.0_241-b07
user.dir=C:\Users\wseoane\2020-05-31 sparklyr on three rows
os.name=Windows 10
os.arch=xxxxx
os.version=10.0
derby.system.home=null
Database Class Loader started - derby.database.classpath=''


I can then click to view details about the Spark connection in my browser
while I have the Spark connection in sparklyr. Here are the results from a
test .tsv file:
Jobs:
[image: Jobs 2020-05-31 142103.png]
SQL:
[image: SQL 2020-05-31 142217.png]
Stages:
[image: Stages 2020-05-31 142217.png]
Storage:
[image: Storage 2020-05-31 142217.png]

So, since sparklyr::spark_read_csv() reads in the data locally and not in
the cloud, security is determined by my company's IT department, correct
(i.e. the firewalls that the IT department has in place on the network, the
antivirus software they have installed on my computer, etc.)? If it were in
the cloud, the cloud would need its own layer of security ("up to whoever
runs the cluster"), but that is not relevant here since I am using
sparklyr::spark_read_csv(), correct?


Thanks,

Wilbert Seoane



On Fri, May 29, 2020 at 3:17 PM Sean Owen <sr...@gmail.com> wrote:

> If you load a file on your computer, that is unrelated to Spark.
> Whatever you load via Spark APIs will at some point live in memory on the
> Spark cluster, or the storage you back it with if you store it.
> Whether the cluster and storage are secure (like, ACLs / auth enabled) is
> up to whoever runs the cluster.
>
> On Fri, May 29, 2020 at 1:54 PM <wi...@gmail.com> wrote:
>
>> Hi Sean
>>
>> I mean that I won’t be opening up my client for any data breaches or
>> anything like that by connecting to Spark and loading in their data using
>> sparklyr in RStudio.
>>
>> Connecting to Spark and loading in a .tsv file on my local computer is
>> secure, correct?
>>
>>
>> Thanks
>>
>> Wilbert J. Seoane
>>
>> Sent from iPhone
>>
>> On May 29, 2020, at 11:25 AM, Sean Owen <sr...@gmail.com> wrote:
>>
>> 
>> What do you mean by secure here?
>>
>> On Fri, May 29, 2020 at 10:21 AM <wi...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I plan to load in a local .tsv file from my hard drive using sparklyr
>>> (an R package). I have figured out how to do this already on small files.
>>>
>>> When I decide to receive my client’s large .tsv file, can I be confident
>>> that loading in data this way will be secure? I know that this creates a
>>> Spark connection to help process the data more quickly, but I want to
>>> verify that the data will be secure after loading it with the Spark
>>> connection and sparklyr.
>>>
>>>
>>> Thanks,
>>>
>>> Wilbert J. Seoane
>>>
>>> Sent from iPhone

Re: Spark Security

Posted by Sean Owen <sr...@gmail.com>.
spark_read_csv() does not read locally; again it is using Spark to read it.

If you are running Spark in local mode on your own machine, then all of
that happens on your machine via Spark, because the driver and executors are
one local process.
Otherwise, it runs wherever the Spark cluster is running: some machines
within your org, or in the cloud, or wherever it was started. You would be
running a driver process somewhere else.
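
One way to tell which of the two cases applies is to look at the master URL
the connection was opened with. The following is a minimal Python sketch, not
part of Spark or sparklyr: the helper name is_local_mode is hypothetical, and
it only recognizes the standard local-mode master URL forms (local, local[N],
local[*], local[N,F]). Anything else, such as spark://host:port or yarn, is
treated as "not local mode".

```python
import re

# Spark local-mode master URLs: "local", "local[4]", "local[*]", and
# "local[4,2]" (worker threads, max task failures).
# Assumption: other forms (e.g. "local-cluster[...]", used for testing)
# are deliberately not treated as local mode here.
_LOCAL_MASTER = re.compile(r"^local(\[(\*|\d+)(,\s*\d+)?\])?$")


def is_local_mode(master: str) -> bool:
    """Return True if the master URL is one of Spark's local-mode forms."""
    return bool(_LOCAL_MASTER.match(master.strip()))


if __name__ == "__main__":
    for master in ["local", "local[*]", "local[4]", "spark://10.0.0.5:7077", "yarn"]:
        where = "this machine" if is_local_mode(master) else "wherever the cluster runs"
        print(f"{master}: data is processed on {where}")
```

In sparklyr, the master is whatever was passed as spark_connect(master = ...),
so a connection opened with master = "local" falls in the first case.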

Yes, what is relevant is the network firewalls on the machines where Spark
runs (and potentially enabling auth in Spark itself).
Of course it also matters where the data is. Spark has nothing to say about
how the data is being stored.
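
For the "enabling auth in Spark itself" part, the knobs are ordinary Spark
configuration properties. Below is a hedged Python sketch that only renders a
few of them as spark-defaults.conf lines. The property names come from
Spark's security documentation; the helper name, the selection of properties,
and the placeholder secret are assumptions for illustration. Which settings
actually apply depends on how the cluster is deployed, which is up to whoever
runs it.

```python
# Spark properties related to authentication and encryption.
# Values are illustrative; the secret below is a placeholder, not a real value.
SECURITY_SETTINGS = {
    # Require a shared secret between Spark processes.
    "spark.authenticate": "true",
    # Placeholder only; how the secret is distributed depends on the deployment.
    "spark.authenticate.secret": "<shared-secret>",
    # AES-based encryption of RPC traffic between Spark processes.
    "spark.network.crypto.enabled": "true",
    # Encrypt temporary shuffle/spill data written to local disk.
    "spark.io.encryption.enabled": "true",
}


def render_spark_defaults(settings: dict) -> str:
    """Render settings as 'key value' lines, the spark-defaults.conf format."""
    return "\n".join(f"{key} {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_spark_defaults(SECURITY_SETTINGS))
```

None of this covers the data at rest: as noted above, how the source files are
stored and protected is outside Spark.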


