Posted to user@mesos.apache.org by Stephen Boesch <ja...@gmail.com> on 2015/09/08 20:46:40 UTC

Help interpreting output from running java test-framework example

I am in the process of learning how to run a Mesos cluster with the intent
for it to be the resource manager for Spark. As a small step in that
direction, I performed a basic test of Mesos, as suggested by the Mesos
Getting Started page.

In the following output we see tasks launched and resources offered on a
20-node cluster:

[stack@yarnmaster-8245 build]$ ./src/examples/java/test-framework $(hostname -s):5050
I0908 18:40:10.900964 31959 sched.cpp:157] Version: 0.23.0
I0908 18:40:10.918957 32000 sched.cpp:254] New master detected at
master@10.64.204.124:5050
I0908 18:40:10.921525 32000 sched.cpp:264] No credentials provided.
Attempting to register without authentication
I0908 18:40:10.928963 31997 sched.cpp:448] Framework registered with
20150908-182014-2093760522-5050-15313-0000
Registered! ID = 20150908-182014-2093760522-5050-15313-0000
Received offer 20150908-182014-2093760522-5050-15313-O0 with cpus: 16.0 and
mem: 119855.0
Launching task 0 using offer 20150908-182014-2093760522-5050-15313-O0
Launching task 1 using offer 20150908-182014-2093760522-5050-15313-O0
Launching task 2 using offer 20150908-182014-2093760522-5050-15313-O0
Launching task 3 using offer 20150908-182014-2093760522-5050-15313-O0
Launching task 4 using offer 20150908-182014-2093760522-5050-15313-O0
Received offer 20150908-182014-2093760522-5050-15313-O1 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O2 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O3 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O4 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O5 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O6 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O7 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O8 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O9 with cpus: 16.0 and
mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O10 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O11 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O12 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O13 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O14 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O15 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O16 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O17 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O18 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O19 with cpus: 16.0
and mem: 119855.0
Received offer 20150908-182014-2093760522-5050-15313-O20 with cpus: 16.0
and mem: 119855.0
Status update: task 0 is in state TASK_LOST
Aborting because task 0 is in unexpected state TASK_LOST with reason
'REASON_EXECUTOR_TERMINATED' from source 'SOURCE_SLAVE' with message
'Executor terminated'
I0908 18:40:12.466081 31996 sched.cpp:1625] Asked to abort the driver
I0908 18:40:12.467051 31996 sched.cpp:861] Aborting framework
'20150908-182014-2093760522-5050-15313-0000'
I0908 18:40:12.468053 31959 sched.cpp:1591] Asked to stop the driver
I0908 18:40:12.468683 31991 sched.cpp:835] Stopping framework
'20150908-182014-2093760522-5050-15313-0000'


Why did the task transition to TASK_LOST? Is there a misconfiguration on
the cluster?
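For reference, the output above can be tallied mechanically. The short sketch below is purely illustrative (it is not part of the test framework); it assumes each "Received offer" record sits on a single line, whereas the archive wraps them:

```python
import re

# Illustrative helper: tally offers and task launches from the
# test-framework's stdout. Offer IDs and resource figures are taken
# straight from lines like:
#   Received offer <id> with cpus: 16.0 and mem: 119855.0
#   Launching task 0 using offer <id>
OFFER_RE = re.compile(r"Received offer (\S+) with cpus: ([\d.]+) and mem: ([\d.]+)")
LAUNCH_RE = re.compile(r"Launching task (\d+) using offer (\S+)")

def summarize(log_text):
    offers, launches = [], []
    for line in log_text.splitlines():
        m = OFFER_RE.search(line)
        if m:
            offers.append((m.group(1), float(m.group(2)), float(m.group(3))))
            continue
        m = LAUNCH_RE.search(line)
        if m:
            launches.append((int(m.group(1)), m.group(2)))
    return {
        "offers": len(offers),
        "total_cpus": sum(o[1] for o in offers),
        "total_mem": sum(o[2] for o in offers),
        "tasks_launched": len(launches),
    }
```

Run against the log above, this would show 21 offers of 16 CPUs each (one per node plus a re-offer) and 5 tasks launched, all against offer O0.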

Re: Help interpreting output from running java test-framework example

Posted by David Greenberg <ds...@gmail.com>.
As you know, MesosCon Europe is fast approaching. At MesosCon Europe, I'll
be giving a talk on Cook, our advanced, preemptive, multi-tenant
Spark-on-Mesos scheduler. Most excitingly, this framework will be fully
open source by then! So, you might be able to switch to Mesos even sooner.

If you're interested in giving it a spin sooner (in the next few days),
email me directly--we could use a new user's eyes on our documentation, to
make sure we didn't leave anything out.

Re: Help interpreting output from running java test-framework example

Posted by Marco Massenzio <ma...@mesosphere.io>.
Thanks, Stephen - feedback much appreciated!

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com


Re: Help interpreting output from running java test-framework example

Posted by Stephen Boesch <ja...@gmail.com>.
Compared to YARN, Mesos is just faster. Mesos has a shorter startup time,
and the delay between tasks is smaller. The run times for a 100 GB terasort
tended towards a median of 110 seconds on Mesos vs. about double that on
YARN.

Unfortunately we require mature multi-tenancy/isolation/queue support,
which is still in the early stages of development for Mesos. So we will
need to use YARN for the near and likely medium term.




Re: Help interpreting output from running java test-framework example

Posted by Marco Massenzio <ma...@mesosphere.io>.
Hey Stephen,

Spark on Mesos is twice as fast as YARN on our 20-node cluster. In
> addition, Mesos handles data sizes that YARN simply dies on. And Mesos
> still takes only linearly increasing time relative to smaller data
> sizes.


Obviously delighted to hear that, BUT me not much like "but" :)
I've added Tim, who is one of the main contributors to our Mesos/Spark
bindings, and it would be great to hear about your use case/experience and
find out whether we can improve on that front too!

As the case may be, we could also jump on a hangout if it makes
conversation easier/faster.

Cheers,

*Marco Massenzio*

*Distributed Systems Engineer*
http://codetrips.com


Re: Help interpreting output from running java test-framework example

Posted by Stephen Boesch <ja...@gmail.com>.
Thanks Vinod. I went back to check the logs and found nothing interesting.
However, in the process I found that my Spark port was pointing to 7077
instead of 5050. After re-running, Spark on Mesos worked!

Spark on Mesos is twice as fast as YARN on our 20-node cluster. In
addition, Mesos handles data sizes that YARN simply dies on. And Mesos
still takes only linearly increasing time relative to smaller data sizes.

We have significant additional work to incorporate Mesos into operations
and support, but given the strong performance and stability characteristics
we are initially seeing here, that effort is likely to get underway.
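For anyone hitting the same snag: port 7077 is the default for a standalone Spark master (spark:// URLs), while 5050 is the default for a Mesos master (mesos:// URLs). A sketch of the relevant setting, with an illustrative hostname:

```properties
# spark-defaults.conf -- point Spark at the Mesos master,
# not at a standalone Spark master (hostname is illustrative):
spark.master    mesos://yarnmaster-8245:5050
# NOT:  spark.master    spark://yarnmaster-8245:7077
```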



2015-09-09 12:54 GMT-07:00 Vinod Kone <vi...@gmail.com>:

> sounds like it. can you see what the slave/agent and executor logs say?
>
> On Tue, Sep 8, 2015 at 11:46 AM, Stephen Boesch <ja...@gmail.com> wrote:
>
>> Why did the task transition to TASK_LOST? Is there a misconfiguration
>> on the cluster?
>>
>
>

Re: Help interpreting output from running java test-framework example

Posted by Vinod Kone <vi...@gmail.com>.
sounds like it. can you see what the slave/agent and executor logs say?
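For reference, a sketch of where those logs typically live. The paths below are assumptions that depend on how the agent was started (`--log_dir`, `--work_dir`); adjust to your configuration:

```shell
# Agent (slave) log, assuming the agent was started with --log_dir=/var/log/mesos:
less /var/log/mesos/mesos-slave.INFO

# Executor stdout/stderr sit in the task sandbox under the agent's
# work directory (default /tmp/mesos), keyed by the framework id
# reported by the scheduler:
FRAMEWORK_ID=20150908-182014-2093760522-5050-15313-0000
find /tmp/mesos/slaves -path "*${FRAMEWORK_ID}*" \
     \( -name stdout -o -name stderr \) -exec tail -n 50 {} +
```

The executor's stderr is usually where a crash that produces `REASON_EXECUTOR_TERMINATED` leaves its trace.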

On Tue, Sep 8, 2015 at 11:46 AM, Stephen Boesch <ja...@gmail.com> wrote:

>
> Why did the task transition to TASK_LOST? Is there a misconfiguration
> on the cluster?
>