Posted to user@spark.apache.org by Jeremy Brent <j....@ieee.org.INVALID> on 2023/05/22 20:17:22 UTC

Setting spark.local.dir, SPARK_LOCAL_DIRS AND java.io.tmpdir has no effect on where PySpark is writing our shuffle data

Hi Spark Community,


We are using PySpark 3.3.1 on a 3 node cluster – 1 master and 2 workers.
All nodes are AWS EC2’s with an Ubuntu OS version 22.04.


We set `SPARK_LOCAL_DIRS` in `conf/spark-env.sh` on all machines in the
cluster

[image: image.png]

We set `spark.local.dir` in `conf/spark-defaults.conf` on all machines in
the cluster

[image: image.png]

and `java.io.tmpdir` during the job itself

[image: image.png]

We can confirm that the directories are set by looking at the history
server:

[image: image.png]

[image: image.png]

And by printing `os.environ.get("SPARK_LOCAL_DIRS")` during the job itself,
which returns the same value as above.


We are running a standalone cluster, but just to be safe, I checked
`os.environ.get("LOCAL_DIRS")` during the job itself to cover the case
mentioned in the documentation ("In Spark 1.0 and later this will be
overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN)
environment variables set by the cluster manager."), but that returned
`None`.
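For context, the documented order of precedence when Spark picks its scratch
directories on a standalone cluster can be sketched in plain Python. The
function below is a simplified illustration of that precedence, not Spark's
actual code:

```python
def resolve_local_dirs(env, conf, jvm_tmpdir="/tmp"):
    """Simplified sketch of how Spark chooses scratch directories on a
    standalone cluster (per the documented precedence; YARN differs).

    Precedence: SPARK_LOCAL_DIRS env var, then spark.local.dir, then the
    JVM's java.io.tmpdir (which defaults to /tmp).
    """
    if env.get("SPARK_LOCAL_DIRS"):
        return env["SPARK_LOCAL_DIRS"].split(",")
    if conf.get("spark.local.dir"):
        return conf["spark.local.dir"].split(",")
    return [jvm_tmpdir]


# Crucially, each executor resolves this against ITS OWN environment --
# the one its worker daemon was started with -- not the driver's.
print(resolve_local_dirs({}, {}))                               # ['/tmp']
print(resolve_local_dirs({}, {"spark.local.dir": "/mnt/tmp"}))  # ['/mnt/tmp']
print(resolve_local_dirs({"SPARK_LOCAL_DIRS": "/mnt/a,/mnt/b"},
                         {"spark.local.dir": "/ignored"}))      # ['/mnt/a', '/mnt/b']
```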


The permissions on the destination directory (/mnt/…./tmp) are the same all
the way through, `drwxrwxrwt   3 root root`, and match the permissions of
`/tmp`.

However, the job is still writing temp shuffle files to the default `/tmp`
folder on our worker executors. [image: image.png]

What are we missing here? Let me know if any clarifications or more
information is needed.

Many thanks in advance,
-- 
Jeremy Brent
Product Engineering Data Scientist
Data Intelligence & Machine Learning
Office: 732-562-6030
Cell:  732-336-0499

Re: Setting spark.local.dir, SPARK_LOCAL_DIRS AND java.io.tmpdir has no effect on where PySpark is writing our shuffle data

Posted by Jeremy Brent <j....@ieee.org.INVALID>.
Hi Mich,

If I'm not mistaken, LOCAL_DIRS needs to be set only when running in YARN
mode. (*Note:* This will be overridden by SPARK_LOCAL_DIRS (Standalone),
MESOS_SANDBOX (Mesos) or LOCAL_DIRS (YARN) environment variables set by the
cluster manager.) [
https://spark.apache.org/docs/latest/configuration.html#application-properties
— spark.local.dir]

On my worker node, the following are written to the /tmp directory:

root@ri-worker-0:~# ls -l /tmp/spark-3c1059dc-5b70-44e2-9901-49a75c493ae9/
total 4
drwx------ 4 root root 4096 May 24 22:04 executor-2aa8588f-aa3d-4160-8f15-bf6691e3b3f2

root@ri-worker-0:~# ls -l /tmp/spark-3c1059dc-5b70-44e2-9901-49a75c493ae9/executor-2aa8588f-aa3d-4160-8f15-bf6691e3b3f2/
total 8
drwxr-xr-x 66 root root 4096 May 24 22:05 blockmgr-998d8895-837b-4b68-af8e-67beeb61f5c2
drwx------  2 root root 4096 May 24 22:04 spark-0f4c4703-df04-4422-a03b-75f44a7f3dca

The shuffle files are stored within
blockmgr-998d8895-837b-4b68-af8e-67beeb61f5c2
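For anyone reproducing this check, a small scan for `blockmgr-*` directories
(the folders that hold shuffle files) can show which scratch root they
actually landed under. The candidate roots below are placeholders; run it on
each worker:

```python
import glob
import os


def find_blockmgr_dirs(roots):
    """Scan candidate scratch roots for Spark block-manager directories
    (the blockmgr-* folders that hold shuffle files)."""
    hits = []
    for root in roots:
        # '**' matches zero or more intermediate directories, so this finds
        # blockmgr-* dirs at any depth (including directly under the root).
        hits.extend(glob.glob(os.path.join(root, "**", "blockmgr-*"),
                              recursive=True))
    return sorted(hits)


# Placeholder roots -- adjust to your configured and default locations.
for path in find_blockmgr_dirs(["/tmp", "/mnt/data_ebs/infrastructure/spark/tmp"]):
    print(path)
```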

On my worker node, the following are written to $SPARK_LOCAL_DIRS:

root@ri-worker-0:~# ls -l /mnt/data_ebs/infrastructure/spark/tmp/
total 152
drwxr-xr-x 20 root root    262 May 24 22:05 blockmgr-fd10c86c-186b-4866-bba2-e755d27d9d64
-rw-r--r--  1 root root 154886 May 24 22:01 liblz4-java-6777456331165790786.so
-rw-r--r--  1 root root      0 May 24 22:01 liblz4-java-6777456331165790786.so.lck
drwx------  4 root root    124 May 24 22:01 spark-a8f23973-ed94-4d35-8bba-af7f90966111

The screenshots in my previous email relay the same information in tree
form.

All the best,
Jeremy


Re: Setting spark.local.dir, SPARK_LOCAL_DIRS AND java.io.tmpdir has no effect on where PySpark is writing our shuffle data

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Jeremy,

In YARN mode on the driver in $SPARK_LOCAL_DIRS, I have

drwx------.  4 hduser hadoop 4096 May 24 22:08 spark-63c1e0cd-1f25-40d8-acb0-2404aab49eba
drwxr-xr-x. 18 hduser hadoop 4096 May 24 22:09 blockmgr-b4963787-e3bd-4868-8396-18d114cc61fb


Now if I look at the $SPARK_LOCAL_DIRS directory on the worker, I don't see
any such results. My assumption is that shuffle results are written to the
driver's $SPARK_LOCAL_DIRS only. This makes sense because otherwise the
monitoring and collection of shuffle results would be a nightmare. On your
*worker node*, what is written to the /tmp directory?

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Setting spark.local.dir, SPARK_LOCAL_DIRS AND java.io.tmpdir has no effect on where PySpark is writing our shuffle data

Posted by Jeremy Brent <j....@ieee.org.INVALID>.
Hi Mich,

Thanks for getting back. Just a heads up, your original response was sent
to j.brent@ieee.org.invalid so I did not receive it.

It looks like you are suggesting to add `export` and quotes around the path
in spark-env.sh.

We made that modification and were able to write successfully to the
defined path when running in LOCAL mode. Note that with the original
spark-env.sh (no export or quotes), the LOCAL job still wrote to the
defined path. That said, neither version works when running the job on the
cluster (we changed the spark-env.sh file on all machines).

When running a job on the full cluster, we are writing some information to
/mnt/..../tmp on our workers (the blockmgr subfolders are empty):
[image: image.png]

but our temp shuffle files are still written to /tmp.

[image: image.png]

I also tested changing SPARK_LOCAL_DIRS to /mnt/tmp in the spark-env.sh
files on all machines and ran a job across the cluster. The driver writes
to /mnt/tmp but our workers still write to /tmp.
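One hypothesis worth ruling out here (my assumption, not something confirmed
in this thread): in standalone mode each executor inherits SPARK_LOCAL_DIRS
from the worker daemon that launched it, so a worker started before
spark-env.sh was edited keeps handing out the old value until that worker is
restarted. On Linux, the running worker's actual environment can be inspected
via /proc, for example:

```python
import subprocess


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated contents of /proc/<pid>/environ."""
    return {
        key.decode(): value.decode()
        for key, _, value in (entry.partition(b"=")
                              for entry in raw.split(b"\0") if entry)
    }


def worker_local_dirs(pattern="org.apache.spark.deploy.worker.Worker"):
    """Look up SPARK_LOCAL_DIRS in the environment of the running
    standalone worker daemon (Linux-only; requires pgrep)."""
    pid = subprocess.check_output(["pgrep", "-f", pattern]).split()[0].decode()
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return parse_environ(fh.read()).get("SPARK_LOCAL_DIRS")


# On a worker node:
#   worker_local_dirs()  # the value the worker was actually started with
```

If the value on a worker turns out to be stale or missing, restarting that
worker (sbin/stop-worker.sh then sbin/start-worker.sh in Spark 3.x) makes it
re-read conf/spark-env.sh.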

Let me know if you need more information.

All the best,
Jeremy Brent


Re: Setting spark.local.dir, SPARK_LOCAL_DIRS AND java.io.tmpdir has no effect on where PySpark is writing our shuffle data

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Jeremy,

This should work.

Tested in LOCAL mode on Spark 3.4.0.

In $SPARK_HOME/conf, add this line to spark-env.sh:

export SPARK_LOCAL_DIRS="/ssd/hduser/spark/tmp"   ## change the location to yours

Do this on the driver and workers

And you can check to verify what is written:

cd /ssd/hduser/spark/tmp
ls -l
drwxr-xr-x. 66 hduser hadoop 4096 May 23 08:37 blockmgr-e4fd3b45-f787-45a0-adf9-b275b0329fdf
drwx------.  4 hduser hadoop 4096 May 23 08:37 spark-ebd0e307-0d5a-4b34-8d33-967c7fe04c21

HTH




