Posted to user@spark.apache.org by Steve Lewis <lo...@gmail.com> on 2014/11/14 20:18:43 UTC

How do you force a Spark Application to run in multiple tasks

 I have instrumented word count to track how many machines the code runs
on. I use an accumulator to maintain a Set of MacAddresses. I find that
everything is done on a single machine. This is probably optimal for word
count but not for the larger problems I am working on.
How do I force processing to be split into multiple tasks? How do I access
the task and attempt numbers to track which processing happens in which
attempt? Also, is using the MacAddress a reasonable way to determine which
machine is running the code?
As far as I can tell a simple word count is running in one thread on one
machine and the remainder of the cluster does nothing.
This is consistent with tests where I write to stdout from functions and
see little output on most machines in the cluster.

Re: How do you force a Spark Application to run in multiple tasks

Posted by Daniel Siegmann <da...@velos.io>.
I've never used Mesos, sorry.

On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis <lo...@gmail.com> wrote:

> The cluster runs Mesos and I can see the tasks in the Mesos UI, but most
> are not doing much - any hints about that UI?
>
> On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann <
> daniel.siegmann@velos.io> wrote:
>
>> Most of the information you're asking for can be found on the Spark web
>> UI (see here <http://spark.apache.org/docs/1.1.0/monitoring.html>). You
>> can see which tasks are being processed by which nodes.
>>
>> If you're using HDFS and your file size is smaller than the HDFS block
>> size you will only have one partition (remember, there is exactly one task
>> for each partition in a stage). If you want to force it to have more
>> partitions, you can call RDD.repartition(numPartitions). Note that this
>> will introduce a shuffle you wouldn't otherwise have.
>>
>> Also make sure your job is allocated more than one core in your cluster
>> (you can see this on the web UI).
>>
>> On Fri, Nov 14, 2014 at 2:18 PM, Steve Lewis <lo...@gmail.com>
>> wrote:
>>
>>>  I have instrumented word count to track how many machines the code
>>> runs on. I use an accumulator to maintain a Set of MacAddresses. I find
>>> that everything is done on a single machine. This is probably optimal
>>> for word count but not for the larger problems I am working on.
>>> How do I force processing to be split into multiple tasks? How do I
>>> access the task and attempt numbers to track which processing happens
>>> in which attempt? Also, is using the MacAddress a reasonable way to
>>> determine which machine is running the code?
>>> As far as I can tell a simple word count is running in one thread on
>>> one machine and the remainder of the cluster does nothing.
>>> This is consistent with tests where I write to stdout from functions
>>> and see little output on most machines in the cluster.
>>>
>>>
>>
>>
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 54 W 40th St, New York, NY 10018
>> E: daniel.siegmann@velos.io W: www.velos.io
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegmann@velos.io W: www.velos.io

Re: How do you force a Spark Application to run in multiple tasks

Posted by Steve Lewis <lo...@gmail.com>.
The cluster runs Mesos and I can see the tasks in the Mesos UI, but most are
not doing much - any hints about that UI?

On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann <da...@velos.io>
wrote:

> Most of the information you're asking for can be found on the Spark web UI
> (see here <http://spark.apache.org/docs/1.1.0/monitoring.html>). You can
> see which tasks are being processed by which nodes.
>
> If you're using HDFS and your file size is smaller than the HDFS block
> size you will only have one partition (remember, there is exactly one task
> for each partition in a stage). If you want to force it to have more
> partitions, you can call RDD.repartition(numPartitions). Note that this
> will introduce a shuffle you wouldn't otherwise have.
>
> Also make sure your job is allocated more than one core in your cluster
> (you can see this on the web UI).
>
> On Fri, Nov 14, 2014 at 2:18 PM, Steve Lewis <lo...@gmail.com>
> wrote:
>
>>  I have instrumented word count to track how many machines the code runs
>> on. I use an accumulator to maintain a Set of MacAddresses. I find that
>> everything is done on a single machine. This is probably optimal for word
>> count but not for the larger problems I am working on.
>> How do I force processing to be split into multiple tasks? How do I
>> access the task and attempt numbers to track which processing happens in
>> which attempt? Also, is using the MacAddress a reasonable way to determine
>> which machine is running the code?
>> As far as I can tell a simple word count is running in one thread on one
>> machine and the remainder of the cluster does nothing.
>> This is consistent with tests where I write to stdout from functions and
>> see little output on most machines in the cluster.
>>
>>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: daniel.siegmann@velos.io W: www.velos.io
>



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: How do you force a Spark Application to run in multiple tasks

Posted by Daniel Siegmann <da...@velos.io>.
Most of the information you're asking for can be found on the Spark web UI (see
here <http://spark.apache.org/docs/1.1.0/monitoring.html>). You can see
which tasks are being processed by which nodes.

If you're using HDFS and your file size is smaller than the HDFS block size
you will only have one partition (remember, there is exactly one task for
each partition in a stage). If you want to force it to have more
partitions, you can call RDD.repartition(numPartitions). Note that this
will introduce a shuffle you wouldn't otherwise have.
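
For example, a sketch along those lines (assuming an existing SparkContext
sc; the path and partition counts are placeholders):

    // A file smaller than one HDFS block typically arrives as one partition.
    val words = sc.textFile("hdfs:///data/input.txt")
    println(words.partitions.length)            // often 1 for a small file

    // Ask for a minimum number of partitions when reading...
    val split = sc.textFile("hdfs:///data/input.txt", 16)

    // ...or repartition an existing RDD (this introduces a shuffle).
    val repartitioned = words.repartition(16)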

Also make sure your job is allocated more than one core in your cluster
(you can see this on the web UI).
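
As a sketch of the configuration side (the value is a placeholder): on a
standalone cluster, or on Mesos in coarse-grained mode, spark.cores.max caps
the total cores the application may claim, and the same setting can be
passed as --conf spark.cores.max=16 to spark-submit.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: bound the cores this application uses across the cluster.
    val conf = new SparkConf()
      .setAppName("word-count")
      .set("spark.cores.max", "16")   // placeholder value
    val sc = new SparkContext(conf)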

On Fri, Nov 14, 2014 at 2:18 PM, Steve Lewis <lo...@gmail.com> wrote:

>  I have instrumented word count to track how many machines the code runs
> on. I use an accumulator to maintain a Set of MacAddresses. I find that
> everything is done on a single machine. This is probably optimal for word
> count but not for the larger problems I am working on.
> How do I force processing to be split into multiple tasks? How do I access
> the task and attempt numbers to track which processing happens in which
> attempt? Also, is using the MacAddress a reasonable way to determine which
> machine is running the code?
> As far as I can tell a simple word count is running in one thread on one
> machine and the remainder of the cluster does nothing.
> This is consistent with tests where I write to stdout from functions and
> see little output on most machines in the cluster.
>
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegmann@velos.io W: www.velos.io