Posted to dev@ignite.apache.org by Сорокин Дмитрий Владимирович <DV...@sberbank.ru> on 2017/11/28 13:30:20 UTC

Re: [!!Mass Mail]Re: Ignite Enhancement Proposal #7 (Internal problems detection)

Vladimir,

These policies (a single policy, in fact) can be configured in IgniteConfiguration by calling the setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) method.
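
To make the shape of that call concrete, here is a minimal sketch. Only the setter name and the policy names come from this thread. NodeConfig, react() and the NOOP default are hypothetical stand-ins (not Ignite classes), and the real reactions are replaced by plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical stand-ins only; none of these types are taken from the Ignite code base. */
final class FailurePolicySketch {
    /** The three reactions proposed for the first iteration (EXEC is left out). */
    enum FailureProcessingPolicy { NOOP, STOP, RESTART }

    /** Stand-in for IgniteConfiguration; only the setter name comes from this thread. */
    static final class NodeConfig {
        /** NOOP as the default is an assumption made for this sketch. */
        private FailureProcessingPolicy flrPlc = FailureProcessingPolicy.NOOP;

        NodeConfig setFailureProcessingPolicy(FailureProcessingPolicy flrPlc) {
            this.flrPlc = Objects.requireNonNull(flrPlc, "flrPlc");
            return this;
        }

        FailureProcessingPolicy getFailureProcessingPolicy() {
            return flrPlc;
        }
    }

    /** Returns, in order, the actions a node would take for the given failure. */
    static List<String> react(NodeConfig cfg, Throwable failure) {
        List<String> actions = new ArrayList<>();

        // Every policy reports the failure and triggers metrics.
        actions.add("report: " + failure.getMessage());
        actions.add("update metrics");

        // Only STOP and RESTART touch the process itself.
        switch (cfg.getFailureProcessingPolicy()) {
            case STOP:
                actions.add("terminate process");
                break;
            case RESTART:
                actions.add("restart process");
                break;
            case NOOP:
                break;
        }
        return actions;
    }

    public static void main(String[] args) {
        NodeConfig cfg = new NodeConfig()
            .setFailureProcessingPolicy(FailureProcessingPolicy.STOP);

        System.out.println(react(cfg, new IllegalStateException("ExchangeWorker exits with error")));
    }
}
```

With a single policy per node the configuration stays a one-liner. A per-failure-type mapping, as Denis suggested, could be layered on top later if users ask for it.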

--
Дмитрий Сорокин
Tel.: 8-789-13512
Mob.: +7 (916) 560-39-63


On 28.11.17, 10:28, "Vladimir Ozerov" <vo...@gridgain.com> wrote:

    Dmitry,

    How will these policies be configured? Do you have any API in mind?

    On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dm...@apache.org> wrote:

    > No objections here. Additional policies like EXEC might be added later
    > depending on user needs.
    >
    > —
    > Denis
    >
    > > On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sb...@gmail.com>
    > wrote:
    > >
    > > Denis,
    > > I propose starting with the first three policies (they are already
    > > implemented and just await some code cleanup, commit & review).
    > > As for the fourth policy (EXEC), I think it is an additional property
    > > (some script path) rather than a policy.
    > >
    > > 2017-11-23 0:43 GMT+03:00 Denis Magda <dm...@apache.org>:
    > >
    > >> Just provide FailureProcessingPolicy with possible reactions:
    > >> - NOOP - exceptions will be reported, metrics will be triggered but an
    > >> affected Ignite process won’t be touched.
    > >> - HALT (or STOP or KILL) - all the NOOP actions + Ignite
    > >> process termination.
    > >> - RESTART - NOOP actions + process restart.
    > >> - EXEC - execute a custom script provided by the user.
    > >>
    > >> If needed, the policy can be set per known failure type, such as OOM or
    > >> persistence errors, so that the user can act according to the context.
    > >>
    > >> —
    > >> Denis
    > >>
    > >>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <vo...@gridgain.com>
    > >> wrote:
    > >>>
    > >>> In the first iteration I would focus only on reporting facilities, to
    > >>> let administrators spot dangerous situations. In the second phase,
    > >>> when all reporting and metrics are ready, we can think about some
    > >>> automatic actions.
    > >>>
    > >>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherkasov@gridgain.com> wrote:
    > >>>
    > >>>> Hi Anton,
    > >>>>
    > >>>> I don't think we should shut down a node in case of
    > >>>> IgniteOOMException. If one node has no space, then the others probably
    > >>>> don't have it either, so rebalancing will cause IgniteOOM on all the
    > >>>> other nodes and kill the whole cluster. I think that for some
    > >>>> configurations the cluster should survive and allow the user to clean
    > >>>> the cache and/or add more nodes.
    > >>>>
    > >>>> Thanks,
    > >>>> Mikhail.
    > >>>>
    > >>>> On Nov 20, 2017, 6:53 PM, "Anton Vinogradov" <avinogradov@gridgain.com> wrote:
    > >>>>
    > >>>>> Igniters,
    > >>>>>
    > >>>>> Internal problems may, and unfortunately do, cause unexpected cluster
    > >>>>> behavior.
    > >>>>> We should define the behavior for each kind of internal problem.
    > >>>>>
    > >>>>> Well-known internal problems can be split into:
    > >>>>> 1) OOM or any other reason that causes a node crash
    > >>>>>
    > >>>>> 2) Situations requiring graceful node shutdown with a custom notification
    > >>>>> - IgniteOutOfMemoryException
    > >>>>> - Persistence errors
    > >>>>> - ExchangeWorker exits with error
    > >>>>>
    > >>>>> 3) Performance issues that should be covered by metrics
    > >>>>> - GC STW duration
    > >>>>> - Timed out tasks and jobs
    > >>>>> - TX deadlock
    > >>>>> - Hung Tx (waits for some service)
    > >>>>> - Java Deadlocks
    > >>>>>
    > >>>>> I created a special issue [1] to make sure all these metrics will be
    > >>>>> presented in WebConsole or VisorConsole (which is preferred?)
    > >>>>>
    > >>>>> 4) Situations requiring an external monitoring implementation
    > >>>>> - GC STW duration exceeds the maximum permissible length (the node
    > >>>>> should be stopped before the STW pause finishes)
    > >>>>>
    > >>>>> All these problems were reported by different people at different times,
    > >>>>> so we should reanalyze each of them and, possibly, find better ways to
    > >>>>> solve them than those described in the issues.
    > >>>>>
    > >>>>> P.s. IEP-7 [2] already contains 9 issues, feel free to mention
    > >> something
    > >>>>> else :)
    > >>>>>
    > >>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
    > >>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
    > >>>>>
    > >>>>
    > >>
    > >>
    >
    >

