Posted to user@storm.apache.org by "Nick R. Katsipoulakis" <ni...@gmail.com> on 2015/06/26 21:53:45 UTC

When does Nimbus decide that an executor is not alive

Hello,

I have been running a sample topology, and I can see in nimbus.log
messages like the following:

2015-06-26T19:46:35.556+0000 b.s.d.nimbus [INFO] Executor
tpch-q5-top-1-1435347835:[5 5] not alive
2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor
tpch-q5-top-1-1435347835:[13 13] not alive
2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor
tpch-q5-top-1-1435347835:[21 21] not alive
2015-06-26T19:46:35.557+0000 b.s.d.nimbus [INFO] Executor
tpch-q5-top-1-1435347835:[29 29] not alive

So, my question is: when does Nimbus come to the above decision? By the
way, none of the above machines has crashed, nor is there an exception in the
code. The only problem is that resource utilization on those machines
reaches high levels. Is high resource utilization a case in which Nimbus
declares an executor "not alive"?

Thanks,
Nick

Re: When does Nimbus decide that an executor is not alive

Posted by Javier Gonzalez <ja...@gmail.com>.
I would advise trying a single worker per machine/supervisor, as that way
communication between Storm components is cheaper (communication within one
JVM is faster than communication between separate JVM processes).
I think we also configured the number of parallel tasks to match the number
of cores available on the machine, to avoid overhead from thread context
switching.
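
For reference, here is a minimal sketch of how that advice might look when
building a topology with the pre-1.0 backtype.storm API that was current at
the time; MySpout, MyBolt, and the specific numbers are placeholders, not
values taken from this thread:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class SingleWorkerPerNodeTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // MySpout / MyBolt stand in for your own spout and bolt classes.
            builder.setSpout("spout", new MySpout(), 1);
            // Parallelism hint chosen to roughly match the cores on a
            // supervisor, so executor threads are not fighting for CPU.
            builder.setBolt("bolt", new MyBolt(), 8).shuffleGrouping("spout");

            Config conf = new Config();
            // Roughly one worker JVM per supervisor machine: with 4 supervisors,
            // 4 workers are usually spread one per node by the scheduler.
            conf.setNumWorkers(4);
            StormSubmitter.submitTopology("sample-topology", conf,
                                          builder.createTopology());
        }
    }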

Regards,
Javier

-- 
Javier González Nicolini

Re: When does Nimbus decide that an executor is not alive

Posted by "Nick R. Katsipoulakis" <ni...@gmail.com>.
Hello again,

Actually, I should give more info about the system load. On each supervisor
machine, I have a number of workers (JVM processes), each executing a number
of executors (Java threads). Therefore, each JVM memory-usage percentage I
get (through Java's Runtime class) really tells me how much of the JVM's
memory is used by all the executors running in that JVM process. So, even if
one thread does not use many resources in one JVM, another JVM on the same
machine may be taking up all the resources, and I end up with a congested
environment.
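
A minimal sketch of that measurement, for what it's worth (the wrapper class
is made up; the point is only the Runtime calls, which report the heap of the
whole worker JVM rather than of a single executor):

    public final class WorkerHeapProbe {
        // Percentage of the worker JVM's maximum heap currently in use.
        // All executor threads in this worker share the same heap, so the
        // number is per worker process, not per executor.
        public static double usedHeapPercent() {
            Runtime rt = Runtime.getRuntime();
            long used = rt.totalMemory() - rt.freeMemory();
            return 100.0 * used / rt.maxMemory();
        }
    }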

I will try to look into the GC activity, but I guess I will have to do some
research on the matter first, because I do not know much about Java GC.

Thank you for your time.

Regards,
Nick


-- 
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate

Re: When does Nimbus decide that an executor is not alive

Posted by Javier Gonzalez <ja...@gmail.com>.
Perhaps you could enable explicit GC logging in the childopts so that you can
see whether you have "GC grinding" in the JVM running the worker that gets
disconnected. I suggested it because you mentioned that the machine is under
heavy load.
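
A sketch of what that could look like, assuming the flags are set through
topology.worker.childopts when the topology is submitted (the HotSpot flags
are standard for JDK 7/8, but the log path and the %ID% placeholder, which
Storm is expected to replace with the worker port, are assumptions to verify
against your Storm and JVM versions):

    import backtype.storm.Config;

    public class GcLoggingConf {
        public static Config withGcLogging() {
            Config conf = new Config();
            // 4 GB heap plus verbose GC logging, one log file per worker port.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
                     "-Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps "
                     + "-Xloggc:/tmp/worker-gc-%ID%.log");
            return conf;
        }
    }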

Another thing that sometimes caused behavior like this was the machine coming
under heavy load from outside processes, since we were testing on a shared
machine. Is that your case?

Regards,
JG

-- 
Javier González Nicolini

Re: When does Nimbus decide that an executor is not alive

Posted by "Nick R. Katsipoulakis" <ni...@gmail.com>.
Javier, thank you for your response.

So, do you suggest that I change "worker.childopts" to give the workers more
memory than they have now? Currently I have it set to 4 GB, and some of the
executors do not use all of it (I monitor the JVM memory usage on each
executor from the Bolt code). But I guess I can try it and see if it works.

Thank you again.

Regards,
Nick

-- 
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate

Re: When does Nimbus decide that an executor is not alive

Posted by Javier Gonzalez <ja...@gmail.com>.
It could be that heavy usage of an executor's machine prevents the executor
from communicating with Nimbus, hence it appears "dead" to Nimbus even though
it is still working. I think we saw something like this at some point during
our PoC development, and it was fixed by allocating more memory to our
workers: having too little memory was causing the workers to incur heavy
GC cycles.
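
For reference, in the 0.9.x line the workers periodically write executor
heartbeats to ZooKeeper, and Nimbus marks an executor "not alive" when its
newest heartbeat is older than nimbus.task.timeout.secs (30 seconds in the
stock defaults.yaml; worth verifying for your version). The snippet below is
only an illustration of that rule, not actual Storm source:

    public final class NotAliveRule {
        // A worker stalled by heavy GC or an overloaded machine misses its
        // heartbeat writes, so this check trips even though the process and
        // its executor threads never crashed.
        public static boolean executorAlive(long lastHeartbeatMillis,
                                            long nowMillis,
                                            int nimbusTaskTimeoutSecs) {
            return nowMillis - lastHeartbeatMillis <= nimbusTaskTimeoutSecs * 1000L;
        }
    }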

Regards,
Javier

-- 
Javier González Nicolini