Posted to user@arrow.apache.org by Tanveer Ahmad - EWI <T....@tudelft.nl> on 2020/06/10 21:22:51 UTC

Running plasma_store_server (in background) on each Spark worker node

Hi all,

I want to run an external command (plasma_store_server -m 3000000000 -s /tmp/store0 &) in the background on each worker node of my Spark cluster <https://userinfo.surfsara.nl/systems/cartesius/software/spark>, so that this external process keeps running for the duration of the whole Spark job.

The plasma_store_server process is used for storing and retrieving Apache Arrow data in Apache Spark.

I am using PySpark for Spark programming and SLURM for Spark cluster <https://userinfo.surfsara.nl/systems/cartesius/software/spark> creation.
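
For context, this is roughly what I do on a single node today (a minimal sketch; the memory size and socket path are just the values from the command above):

import subprocess

import pyarrow.plasma as plasma

# Start the Plasma store in the background: ~3 GB of shared memory,
# Unix domain socket at /tmp/store0 (same as the shell command above).
store_proc = subprocess.Popen(
    ["plasma_store_server", "-m", "3000000000", "-s", "/tmp/store0"]
)

# Connect a client to the running store over its Unix socket.
client = plasma.connect("/tmp/store0")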

Any help will be highly appreciated!

Regards,

Tanveer Ahmad


Re: Running plasma_store_server (in background) on each Spark worker node

Posted by Tanveer Ahmad - EWI <T....@tudelft.nl>.
Hi Micah,


Thank you so much.

I am able to run Plasma in the Spark cluster through map() methods on each worker node.
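
For the archive, this is roughly the approach (a sketch, not my exact code; the app name, partition count, and store settings are examples, and it assumes one executor per worker node -- Spark does not strictly guarantee partition placement):

import subprocess

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("start-plasma").getOrCreate()
sc = spark.sparkContext

def start_plasma(partition):
    # Popen returns immediately, so the store keeps running in the
    # background on the executor after this task finishes.
    subprocess.Popen(
        ["plasma_store_server", "-m", "3000000000", "-s", "/tmp/store0"]
    )
    yield "started"

# One partition per worker so each node launches the store once
# (assuming 5 workers with one executor each).
print(sc.parallelize(range(5), 5).mapPartitions(start_plasma).collect())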


Regards,
Tanveer Ahmad



Re: Running plasma_store_server (in background) on each Spark worker node

Posted by Micah Kornfield <em...@gmail.com>.
Hi Tanveer,
How to ensure the server is running probably depends on your cluster
management system (I'm not familiar with Slurm).  But if you only have 6
machines, you could probably SSH into each of them and start the process by
hand.
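
For example, something along these lines might work (an untested sketch; the hostnames are placeholders, and it assumes passwordless SSH to the workers):

import subprocess

# Hypothetical worker hostnames -- substitute your cluster's names.
workers = ["worker1", "worker2", "worker3", "worker4", "worker5"]

for host in workers:
    # nohup + '&' so the store keeps running after the SSH session exits.
    subprocess.run(
        ["ssh", host,
         "nohup plasma_store_server -m 3000000000 -s /tmp/store0"
         " >/tmp/plasma.log 2>&1 &"],
        check=True,
    )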

Ray's cluster management [1] might be another place to look for examples (I
believe Ray spawns a plasma server on each cluster node).

Generally, the scope of Arrow doesn't include cluster management, so there
might not be too much in the way of responses on this list.

Hope this helps.

Micah

[1] https://docs.ray.io/en/master/autoscaling.html


Re: Running plasma_store_server (in background) on each Spark worker node

Posted by Tanveer Ahmad - EWI <T....@tudelft.nl>.
Hi Neal,


Yes, my question is: How can I run the Plasma store on each worker node of the Spark cluster?

Suppose my cluster consists of 6 nodes (1 master plus 5 workers); I want to run the Plasma store on all 5 worker nodes. Thanks.


Regards,
Tanveer Ahmad


Re: Running plasma_store_server (in background) on each Spark worker node

Posted by Neal Richardson <ne...@gmail.com>.
Hi Tanveer,
Do you have any specific questions, or have you encountered trouble with
your setup?

Neal
