Posted to dev@ignite.apache.org by ткаленко кирилл <tk...@yandex.ru> on 2020/05/04 07:38:09 UTC

Extended logging for rebalance performance analysis

Hi, Igniters!

I'd like to share a small new feature in Apache Ignite [1].

Current rebalance logging does not allow you to quickly answer the following questions:
1) How long did the rebalance take (broken down by supplier)?
2) How many records and bytes were rebalanced per supplier?
3) How many times did the rebalance restart?
4) Which partitions were rebalanced, and from which nodes were they received?
5) When did the rebalance finish for all cache groups?

What you can see in the logs now:

1) Rebalance start, with the order of cache groups.
Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0], top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false, evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]

2) Start of rebalancing a cache group from a specific supplier, with partition ids and mode - historical or full.
Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], supplier=8c525892-703b-4fc4-b28b-b2f139700000, fullPartitions=[0-99], histPartitions=[]]

3) Receiving a portion (or the final portion) of a cache group's partitions.
Completed rebalancing [grp=ignite-sys-cache, supplier=8c525892-703b-4fc4-b28b-b2f139700000, topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
Completed (final) rebalancing [grp=ignite-sys-cache, supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001, topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]

4) End of rebalancing a cache group.
Completed rebalance future: RebalanceFuture [grp=CacheGroupContext [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200, receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1, lastCancelledTime=-1]

Rebalance statistics:

To speed up rebalance analysis, statistics will be output for each cache group, as well as totals for all cache groups.
If the rebalance duration for a cache group exceeds a threshold value, the partition distribution is output as well.
The statistics allow you to analyze the rebalance duration per supplier, to understand which of them has been transmitting data for the longest time.

System properties that control the statistics output:

IGNITE_QUIET - must be set to false for the statistics to be output;
IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold rebalance duration for a cache group, after which the partition distribution is output; set in milliseconds, default value is 10 minutes.
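
For example, a minimal sketch of enabling this output when starting a node programmatically (IGNITE_QUIET is an existing Ignite property; the threshold property is the one introduced by this patch):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class RebalanceStatsExample {
        public static void main(String[] args) {
            // Statistics are printed only when quiet mode is disabled.
            System.setProperty("IGNITE_QUIET", "false");

            // Print the partition distribution for any cache group whose
            // rebalance takes longer than 60 seconds (default is 10 minutes).
            System.setProperty("IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD", "60000");

            Ignite ignite = Ignition.start();
        }
    }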

Statistics examples:

Successful full and historical rebalance of a cache group, without partition distribution.
Rebalance information per cache group (successful rebalance): [id=3181548, name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest0] 
Rebalance information per cache group (successful rebalance): [id=3181547, name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10, he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]

Successful full and historical rebalance of a cache group, with partition distribution.
Rebalance information per cache group (successful rebalance): [id=3181548, name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0, p=5, d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater than 5 ms, printing detailed information about partitions distribution (threshold can be changed by setting number of milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 = [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 = [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 = [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 = [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 = [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases: pr - primary, bu - backup, su - supplier node, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest2] [2=rebalancing.RebalanceStatisticsTest0] 
Rebalance information per cache group (successful rebalance): [id=3181547, name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0, hp=10, he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater than 100 ms, printing detailed information about partitions distribution (threshold can be changed by setting number of milliseconds into IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 = [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 = [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 = [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 = [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases: pr - primary, bu - backup, su - supplier node, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] [1=rebalancing.RebalanceStatisticsTest2] [2=rebalancing.RebalanceStatisticsTest1] 

Interrupted rebalance of a cache group.
Rebalance information per cache group (interrupted rebalance): [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969, finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0] 

Totals for full and historical rebalance across all cache groups.
Rebalance total information (including successful and not rebalances): [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13 10:55:18,780, d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000, d=54 ms] [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1] [1=rebalancing.RebalanceStatisticsTest0] 
Rebalance total information (including successful and not rebalances): [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13 15:01:44,116, d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, hb=50445, d=294 ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h - historical, nodeId mapping (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0] 

[1] - https://issues.apache.org/jira/browse/IGNITE-12080

Re: Extended logging for rebalance performance analysis

Posted by Stanislav Lukyanov <st...@gmail.com>.
Hi Kirill,

I think it would be helpful to add to the log line
    Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
a few more metrics:
    avgSpeed=<bytesRcvd/duration>, partitionsNum=N, histPartitionsNum=M, histBytesRcvd=X KB, histDuration=Yms, histAvgSpeed=<histBytesRcvd/histDuration>, fullPartitionsNum=K, fullBytesRcvd=X KB, fullDuration=Yms, fullAvgSpeed=<fullBytesRcvd/fullDuration>

I believe these additional metrics are good in that a user can easily set up monitoring based on them, and they give a lot of useful info about rebalance performance in general. No need to look for the "Starting rebalance" message, everything is in one place.
If any of these are hard to add then we can skip them (I suspect histDuration and fullDuration can't be measured in practice but don't know for sure). Let's add what we can.

Thanks,
Stan

> On 3 Jul 2020, at 17:21, ткаленко кирилл <tk...@yandex.ru> wrote:
> 
> Sorry, forget.
> 
> [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
> 
> 03.07.2020, 17:20, "ткаленко кирилл" <tk...@yandex.ru>:
>> Hi, Stan!
>> 
>> I don't understand you yet.
>> 
>> Now you can use metrics as it was done in the test [1]. Or can you tell me where to do this, for example when completing rebalancing for all groups?
>> 
>> See what is now available and added in the logs:
>> 1)Which group is rebalanced and which type of rebalance.
>> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>> 
>> 2) Completion of rebalancing from one of the suppliers.
>> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>> 
>> 3) Completion of the entire rebalance.
>> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]
>> 
>> These messages have a common parameter rebalanceId=1.
>> 
>> 03.07.2020, 16:48, "Stanislav Lukyanov" <st...@gmail.com>:
>>>>   On 3 Jul 2020, at 09:51, ткаленко кирилл <tk...@yandex.ru> wrote:
>>>> 
>>>>   To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>> 
>>>  You can calculate it, and I believe that this is the first thing anyone would do when reading these logs and metrics.
>>>  If that's an essential thing then maybe it should be available out of the box?
>>> 
>>>>   This also works correctly with the historical rebalance.
>>>>   Now we can see rebalance type for each group and for each supplier in logs. I don't think we should duplicate this information.
>>>> 
>>>>   [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>> 
>>>  I'm talking about adding info on how much data has been transferred during rebalance.
>>>  When rebalance completes I'd like to know how much data has been transferred, was it historical or full, what was the average rebalance speed.
>>> 
>>>  There are two reasons for having all that.
>>> 
>>>  First, it helps to analyze the issues by searching the logs and looking for anomalies.
>>> 
>>>  Second, this allows to automate alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
>>> 
>>>>   03.07.2020, 02:49, "Stanislav Lukyanov" <st...@gmail.com>:
>>>>>   Kirill,
>>>>> 
>>>>>   I've looked through the patch.
>>>>>   Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by another to get an average speed.
>>>>>   Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>>> 
>>>>>   Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>>>   Maybe even estimate transfer speed in entries and bytes for each rebalance type?
>>>>> 
>>>>>   Thanks,
>>>>>   Stan
>>>>> 
>>>>>>    On 29 Jun 2020, at 11:50, Ivan Rakov <iv...@gmail.com> wrote:
>>>>>> 
>>>>>>    +1 to Alex G.
>>>>>> 
>>>>>>    From my experience, the most interesting cases with Ignite rebalancing
>>>>>>    happen exactly in production. According to the fact that we already have
>>>>>>    detailed rebalancing logging, adding info about rebalance performance looks
>>>>>>    like a reasonable improvement. With new logs we'll be able to detect and
>>>>>>    investigate situations when rebalance is slow due to uneven suppliers
>>>>>>    distribution or network issues.
>>>>>>    Option to disable the feature in runtime shouldn't be used often, but it
>>>>>>    will keep us on the safe side in case something goes wrong.
>>>>>>    The format described in
>>>>>>    https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>>>    good to me.
>>>>>> 
>>>>>>    On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
>>>>>>    wrote:
>>>>>> 
>>>>>>>    Hello, Alexey!
>>>>>>> 
>>>>>>>    Currently there is no way to disable / enable it, but it seems that the
>>>>>>>    logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
>>>>>>>    and compact. Of course, you can add disabling / enabling statistics
>>>>>>>    collection via jmx for example.
>>>>>>> 
>>>>>>>    23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
>>>>>>>>    Hello Maxim, folks,
>>>>>>>> 
>>>>>>>    Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>>>>>>> 
>>>>>>>>>    We won't do performance analysis on the production environment. Each
>>>>>>>>>    time we need performance analysis it will be done on a test
>>>>>>>>>    environment with verbose logging enabled. Thus I suggest moving these
>>>>>>>>>    changes to a separate `profiling` module and extend the logging much
>>>>>>>>>    more without any size limitations. The same as these [2] [3]
>>>>>>>>>    activities do.
>>>>>>>> 
>>>>>>>>    I strongly disagree with this statement. I am not sure who is meant here
>>>>>>>>    by 'we', but I see a strong momentum in increasing observability tooling
>>>>>>>>    that helps people to understand what exactly happens in the production
>>>>>>>>    environment [1]. Not everybody can afford two identical environments for
>>>>>>>>    testing. We should make sure users have enough information to understand
>>>>>>>>    the root cause after the incident happened, and not force them to
>>>>>>>    reproduce
>>>>>>>>    it, let alone make them add another module to the classpath and restart
>>>>>>>    the
>>>>>>>>    nodes.
>>>>>>>>    I think having this functionality in the core module with the ability to
>>>>>>>>    disable/enable it is the right approach. Having the information printed
>>>>>>>    to
>>>>>>>>    log is ok, having it in an event that can be sent to a monitoring/tracing
>>>>>>>>    subsystem is even better.
>>>>>>>> 
>>>>>>>>    Kirill, can we enable and disable this feature in runtime to avoid the
>>>>>>>    very
>>>>>>>>    same nodes restart?
>>>>>>>> 
>>>>>>>>    [1]
>>>>>>>    https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/


Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Discussed in personal correspondence with Stas; we decided to improve the message:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, 
	partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB,
	topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

into:
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, 
	partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, avgSpeed=5,9 KB/sec,
	histPartitions=1, histEntries=30, histBytesRcvd=1 KB,
	fullPartitions=1, fullEntries=30, fullBytesRcvd=3 KB,
	topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

Where:
partitions=2 - total number of partitions received
entries=60 - total number of entries received
duration=8ms - time from the first demand message to this message being output to the log
bytesRcvd=5,9 KB - total number of bytes received, printed in B, KB, MB or GB

avgSpeed = bytesRcvd/duration, in KB/sec

histPartitions=1 - number of partitions received in historical mode
histEntries=30 - number of entries received in historical mode
histBytesRcvd=1 KB - number of bytes received in historical mode, printed in B, KB, MB or GB

fullPartitions=1 - number of partitions received in full mode
fullEntries=30 - number of entries received in full mode
fullBytesRcvd=3 KB - number of bytes received in full mode, printed in B, KB, MB or GB
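
As a sketch, this is how the derived and per-mode fields relate (variable names mirror the message fields above; avgSpeed is computed rather than measured):

    // Totals are the sums of the historical and full modes.
    long histPartitions = 1, fullPartitions = 1;
    long histEntries = 30, fullEntries = 30;

    long partitions = histPartitions + fullPartitions; // partitions=2
    long entries = histEntries + fullEntries;          // entries=60

    // avgSpeed = bytesRcvd / duration, converted to KB/sec.
    long bytesRcvd = 6_042; // ~5,9 KB
    long durationMs = 8;

    double avgSpeedKbSec = (bytesRcvd / 1024.0) / (durationMs / 1000.0);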

> 03.07.2020, 17:21, "ткаленко кирилл" <tk...@yandex.ru>:
>> Sorry, forget.
>>
>> [1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance
>>
>> 03.07.2020, 17:20, "ткаленко кирилл" <tk...@yandex.ru>:
>>>  Hi, Stan!
>>>
>>>  I don't understand you yet.
>>>
>>>  Now you can use metrics as it was done in the test [1]. Or can you tell me where to do this, for example when completing rebalancing for all groups?
>>>
>>>  See what is now available and added in the logs:
>>>  1)Which group is rebalanced and which type of rebalance.
>>>  Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>>>
>>>  2) Completion of rebalancing from one of the suppliers.
>>>  Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>>>
>>>  3) Completion of the entire rebalance.
>>>  Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]
>>>
>>>  These messages have a common parameter rebalanceId=1.
>>>
>>>  03.07.2020, 16:48, "Stanislav Lukyanov" <st...@gmail.com>:
>>>>>    On 3 Jul 2020, at 09:51, ткаленко кирилл <tk...@yandex.ru> wrote:
>>>>>
>>>>>    To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>>>
>>>>   You can calculate it, and I believe that this is the first thing anyone would do when reading these logs and metrics.
>>>>   If that's an essential thing then maybe it should be available out of the box?
>>>>
>>>>>    This also works correctly with the historical rebalance.
>>>>>    Now we can see rebalance type for each group and for each supplier in logs. I don't think we should duplicate this information.
>>>>>
>>>>>    [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>>>
>>>>   I'm talking about adding info on how much data has been transferred during rebalance.
>>>>   When rebalance completes I'd like to know how much data has been transferred, was it historical or full, what was the average rebalance speed.
>>>>
>>>>   There are two reasons for having all that.
>>>>
>>>>   First, it helps to analyze the issues by searching the logs and looking for anomalies.
>>>>
>>>>   Second, this allows to automate alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
>>>>
>>>>>    03.07.2020, 02:49, "Stanislav Lukyanov" <st...@gmail.com>:
>>>>>>    Kirill,
>>>>>>
>>>>>>    I've looked through the patch.
>>>>>>    Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by another to get an average speed.
>>>>>>    Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>>>>
>>>>>>    Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>>>>    Maybe even estimate transfer speed in entries and bytes for each rebalance type?
>>>>>>
>>>>>>    Thanks,
>>>>>>    Stan
>>>>>>
>>>>>>>     On 29 Jun 2020, at 11:50, Ivan Rakov <iv...@gmail.com> wrote:
>>>>>>>
>>>>>>>     +1 to Alex G.
>>>>>>>
>>>>>>>     From my experience, the most interesting cases with Ignite rebalancing
>>>>>>>     happen exactly in production. According to the fact that we already have
>>>>>>>     detailed rebalancing logging, adding info about rebalance performance looks
>>>>>>>     like a reasonable improvement. With new logs we'll be able to detect and
>>>>>>>     investigate situations when rebalance is slow due to uneven suppliers
>>>>>>>     distribution or network issues.
>>>>>>>     Option to disable the feature in runtime shouldn't be used often, but it
>>>>>>>     will keep us on the safe side in case something goes wrong.
>>>>>>>     The format described in
>>>>>>>     https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>>>>     good to me.
>>>>>>>
>>>>>>>     On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
>>>>>>>     wrote:
>>>>>>>
>>>>>>>>     Hello, Alexey!
>>>>>>>>
>>>>>>>>     Currently there is no way to disable / enable it, but it seems that the
>>>>>>>>     logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
>>>>>>>>     and compact. Of course, you can add disabling / enabling statistics
>>>>>>>>     collection via jmx for example.
>>>>>>>>
>>>>>>>>     23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
>>>>>>>>>     Hello Maxim, folks,
>>>>>>>>>
>>>>>>>>>     Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>>>>>>>>
>>>>>>>>>>     We won't do performance analysis on the production environment. Each
>>>>>>>>>>     time we need performance analysis it will be done on a test
>>>>>>>>>>     environment with verbose logging enabled. Thus I suggest moving these
>>>>>>>>>>     changes to a separate `profiling` module and extend the logging much
>>>>>>>>>>     more without any size limitations. The same as these [2] [3]
>>>>>>>>>>     activities do.
>>>>>>>>>
>>>>>>>>>     I strongly disagree with this statement. I am not sure who is meant here
>>>>>>>>>     by 'we', but I see a strong momentum in increasing observability tooling
>>>>>>>>>     that helps people to understand what exactly happens in the production
>>>>>>>>>     environment [1]. Not everybody can afford two identical environments for
>>>>>>>>>     testing. We should make sure users have enough information to understand
>>>>>>>>>     the root cause after the incident happened, and not force them to
>>>>>>>>     reproduce
>>>>>>>>>     it, let alone make them add another module to the classpath and restart
>>>>>>>>     the
>>>>>>>>>     nodes.
>>>>>>>>>     I think having this functionality in the core module with the ability to
>>>>>>>>>     disable/enable it is the right approach. Having the information printed
>>>>>>>>     to
>>>>>>>>>     log is ok, having it in an event that can be sent to a monitoring/tracing
>>>>>>>>>     subsystem is even better.
>>>>>>>>>
>>>>>>>>>     Kirill, can we enable and disable this feature in runtime to avoid the
>>>>>>>>     very
>>>>>>>>>     same nodes restart?
>>>>>>>>>
>>>>>>>>>     [1]
>>>>>>>>     https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/

Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Sorry, I forgot to include this:

[1] - org.apache.ignite.internal.processors.cache.CacheGroupsMetricsRebalanceTest#testCacheGroupRebalance

03.07.2020, 17:20, "ткаленко кирилл" <tk...@yandex.ru>:
> Hi, Stan!
>
> I don't understand you yet.
>
> Now you can use metrics as it was done in the test [1]. Or can you tell me where to do this, for example when completing rebalancing for all groups?
>
> See what is now available and added in the logs:
> 1)Which group is rebalanced and which type of rebalance.
> Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]
>
> 2) Completion of rebalancing from one of the suppliers.
> Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]
>
> 3) Completion of the entire rebalance.
> Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]
>
> These messages have a common parameter rebalanceId=1.
>
> 03.07.2020, 16:48, "Stanislav Lukyanov" <st...@gmail.com>:
>>>   On 3 Jul 2020, at 09:51, ткаленко кирилл <tk...@yandex.ru> wrote:
>>>
>>>   To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>>
>>  You can calculate it, and I believe that this is the first thing anyone would do when reading these logs and metrics.
>>  If that's an essential thing then maybe it should be available out of the box?
>>
>>>   This also works correctly with the historical rebalance.
>>>   Now we can see rebalance type for each group and for each supplier in logs. I don't think we should duplicate this information.
>>>
>>>   [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>>
>>  I'm talking about adding info on how much data has been transferred during rebalance.
>>  When rebalance completes I'd like to know how much data has been transferred, was it historical or full, what was the average rebalance speed.
>>
>>  There are two reasons for having all that.
>>
>>  First, it helps to analyze the issues by searching the logs and looking for anomalies.
>>
>>  Second, this allows to automate alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
>>
>>>   03.07.2020, 02:49, "Stanislav Lukyanov" <st...@gmail.com>:
>>>>   Kirill,
>>>>
>>>>   I've looked through the patch.
>>>>   Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by another to get an average speed.
>>>>   Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>>
>>>>   Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>>   Maybe even estimate transfer speed in entries and bytes for each rebalance type?
>>>>
>>>>   Thanks,
>>>>   Stan
>>>>
>>>>>    On 29 Jun 2020, at 11:50, Ivan Rakov <iv...@gmail.com> wrote:
>>>>>
>>>>>    +1 to Alex G.
>>>>>
>>>>>    From my experience, the most interesting cases with Ignite rebalancing
>>>>>    happen exactly in production. According to the fact that we already have
>>>>>    detailed rebalancing logging, adding info about rebalance performance looks
>>>>>    like a reasonable improvement. With new logs we'll be able to detect and
>>>>>    investigate situations when rebalance is slow due to uneven suppliers
>>>>>    distribution or network issues.
>>>>>    Option to disable the feature in runtime shouldn't be used often, but it
>>>>>    will keep us on the safe side in case something goes wrong.
>>>>>    The format described in
>>>>>    https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>>    good to me.
>>>>>
>>>>>    On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
>>>>>    wrote:
>>>>>
>>>>>>    Hello, Alexey!
>>>>>>
>>>>>>    Currently there is no way to disable / enable it, but it seems that the
>>>>>>    logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
>>>>>>    and compact. Of course, you can add disabling / enabling statistics
>>>>>>    collection via jmx for example.
>>>>>>
>>>>>>    23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
>>>>>>>    Hello Maxim, folks,
>>>>>>>
>>>>>>>    Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>>>>>>
>>>>>>>>    We won't do performance analysis on the production environment. Each
>>>>>>>>    time we need performance analysis it will be done on a test
>>>>>>>>    environment with verbose logging enabled. Thus I suggest moving these
>>>>>>>>    changes to a separate `profiling` module and extend the logging much
>>>>>>>>    more without any size limitations. The same as these [2] [3]
>>>>>>>>    activities do.
>>>>>>>
>>>>>>>    I strongly disagree with this statement. I am not sure who is meant here
>>>>>>>    by 'we', but I see a strong momentum in increasing observability tooling
>>>>>>>    that helps people to understand what exactly happens in the production
>>>>>>>    environment [1]. Not everybody can afford two identical environments for
>>>>>>>    testing. We should make sure users have enough information to understand
>>>>>>>    the root cause after the incident happened, and not force them to
>>>>>>    reproduce
>>>>>>>    it, let alone make them add another module to the classpath and restart
>>>>>>    the
>>>>>>>    nodes.
>>>>>>>    I think having this functionality in the core module with the ability to
>>>>>>>    disable/enable it is the right approach. Having the information printed
>>>>>>    to
>>>>>>>    log is ok, having it in an event that can be sent to a monitoring/tracing
>>>>>>>    subsystem is even better.
>>>>>>>
>>>>>>>    Kirill, can we enable and disable this feature in runtime to avoid the
>>>>>>    very
>>>>>>>    same nodes restart?
>>>>>>>
>>>>>>>    [1]
>>>>>>    https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/

Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Hi, Stan!

I'm not sure I understand you yet.

You can already use the metrics, as is done in the test [1]. Or can you tell me where else this should be done, for example when rebalancing completes for all groups?

Here is what is now available and what has been added to the logs:
1) Which group is being rebalanced and the type of rebalance.
Starting rebalance routine [grp0, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, fullPartitions=[4, 7], histPartitions=[], rebalanceId=1]

2) Completion of rebalancing from one of the suppliers.
Completed rebalancing [grp=grp0, supplier=3f2ae7cf-2bfe-455a-a76a-01fe27a00001, partitions=2, entries=60, duration=8ms, bytesRcvd=5,9 KB, topVer=AffinityTopologyVersion [topVer=4, minorTopVer=0], progress=1/3, rebalanceId=1]

3) Completion of the entire rebalance.
Completed rebalance chain: [rebalanceId=1, partitions=116, entries=400, duration=41ms, bytesRcvd=40,4 KB]

These messages have a common parameter rebalanceId=1.
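
A small sketch of how one might group these lines when reading logs offline (a hypothetical helper, not part of Ignite; it only groups lines by the rebalanceId attribute):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class RebalanceLogGrouper {
        private static final Pattern REBALANCE_ID = Pattern.compile("rebalanceId=(\\d+)");

        /** Groups rebalance-related log lines by their rebalanceId attribute. */
        static Map<Long, List<String>> groupByRebalanceId(Iterable<String> lines) {
            Map<Long, List<String>> byId = new TreeMap<>();

            for (String line : lines) {
                Matcher m = REBALANCE_ID.matcher(line);

                if (m.find())
                    byId.computeIfAbsent(Long.parseLong(m.group(1)), id -> new ArrayList<>()).add(line);
            }

            return byId;
        }
    }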

03.07.2020, 16:48, "Stanislav Lukyanov" <st...@gmail.com>:
>>  On 3 Jul 2020, at 09:51, ткаленко кирилл <tk...@yandex.ru> wrote:
>>
>>  To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
>
> You can calculate it, and I believe that this is the first thing anyone would do when reading these logs and metrics.
> If that's an essential thing then maybe it should be available out of the box?
>
>>  This also works correctly with the historical rebalance.
>>  Now we can see rebalance type for each group and for each supplier in logs. I don't think we should duplicate this information.
>>
>>  [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]
>
> I'm talking about adding info on how much data has been transferred during rebalance.
> When rebalance completes I'd like to know how much data has been transferred, was it historical or full, what was the average rebalance speed.
>
> There are two reasons for having all that.
>
> First, it helps to analyze the issues by searching the logs and looking for anomalies.
>
> Second, this allows to automate alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
>
>>  03.07.2020, 02:49, "Stanislav Lukyanov" <st...@gmail.com>:
>>>  Kirill,
>>>
>>>  I've looked through the patch.
>>>  Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by another to get an average speed.
>>>  Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>>>
>>>  Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>>>  Maybe even estimate transfer speed in entries and bytes for each rebalance type?
>>>
>>>  Thanks,
>>>  Stan
>>>
>>>>   On 29 Jun 2020, at 11:50, Ivan Rakov <iv...@gmail.com> wrote:
>>>>
>>>>   +1 to Alex G.
>>>>
>>>>   From my experience, the most interesting cases with Ignite rebalancing
>>>>   happen exactly in production. According to the fact that we already have
>>>>   detailed rebalancing logging, adding info about rebalance performance looks
>>>>   like a reasonable improvement. With new logs we'll be able to detect and
>>>>   investigate situations when rebalance is slow due to uneven suppliers
>>>>   distribution or network issues.
>>>>   Option to disable the feature in runtime shouldn't be used often, but it
>>>>   will keep us on the safe side in case something goes wrong.
>>>>   The format described in
>>>>   https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>>   good to me.
>>>>
>>>>   On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
>>>>   wrote:
>>>>
>>>>>   Hello, Alexey!
>>>>>
>>>>>   Currently there is no way to disable / enable it, but it seems that the
>>>>>   logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
>>>>>   and compact. Of course, you can add disabling / enabling statistics
>>>>>   collection via jmx for example.
>>>>>
>>>>>   23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
>>>>>>   Hello Maxim, folks,
>>>>>>
>>>>>>   Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>>>>>
>>>>>>>   We won't do performance analysis on the production environment. Each
>>>>>>>   time we need performance analysis it will be done on a test
>>>>>>>   environment with verbose logging enabled. Thus I suggest moving these
>>>>>>>   changes to a separate `profiling` module and extend the logging much
>>>>>>>   more without any size limitations. The same as these [2] [3]
>>>>>>>   activities do.
>>>>>>
>>>>>>   I strongly disagree with this statement. I am not sure who is meant here
>>>>>>   by 'we', but I see a strong momentum in increasing observability tooling
>>>>>>   that helps people to understand what exactly happens in the production
>>>>>>   environment [1]. Not everybody can afford two identical environments for
>>>>>>   testing. We should make sure users have enough information to understand
>>>>>>   the root cause after the incident happened, and not force them to
>>>>>   reproduce
>>>>>>   it, let alone make them add another module to the classpath and restart
>>>>>   the
>>>>>>   nodes.
>>>>>>   I think having this functionality in the core module with the ability to
>>>>>>   disable/enable it is the right approach. Having the information printed
>>>>>   to
>>>>>>   log is ok, having it in an event that can be sent to a monitoring/tracing
>>>>>>   subsystem is even better.
>>>>>>
>>>>>>   Kirill, can we enable and disable this feature in runtime to avoid the
>>>>>   very
>>>>>>   same nodes restart?
>>>>>>
>>>>>>   [1]
>>>>>   https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/

Re: Extended logging for rebalance performance analysis

Posted by Stanislav Lukyanov <st...@gmail.com>.
> On 3 Jul 2020, at 09:51, ткаленко кирилл <tk...@yandex.ru> wrote:
> 
> To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".

You can calculate it, and I believe that this is the first thing anyone would do when reading these logs and metrics.
If that's an essential thing then maybe it should be available out of the box?

> 
> This also works correctly with the historical rebalance.
> Now we can see rebalance type for each group and for each supplier in logs. I don't think we should duplicate this information.
> 
> [2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]

I'm talking about adding info on how much data has been transferred during rebalance.
When rebalance completes I'd like to know how much data has been transferred, was it historical or full, what was the average rebalance speed.

There are two reasons for having all that.

First, it helps to analyze issues by searching the logs and looking for anomalies.

Second, it makes it possible to automate alerts: e.g. if you know your typical historical rebalance speed, you can trigger an alert if it drops below that.
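
For example, a minimal sketch of such an alert (the 50% threshold and the baseline value are assumptions; the observed numbers could come from the "Completed rebalancing" fields or from the Rebalancing* metrics):

    /** Returns true if the observed historical rebalance speed dropped below half of the baseline. */
    static boolean histRebalanceTooSlow(long histBytesRcvd, long histDurationMs, double baselineBytesPerSec) {
        if (histDurationMs <= 0)
            return false; // Nothing to compare yet.

        double observedBytesPerSec = histBytesRcvd / (histDurationMs / 1000.0);

        return observedBytesPerSec < 0.5 * baselineBytesPerSec;
    }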

> 
> 03.07.2020, 02:49, "Stanislav Lukyanov" <st...@gmail.com>:
>> Kirill,
>> 
>> I've looked through the patch.
>> Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by another to get an average speed.
>> Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>> 
>> Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
>> Maybe even estimate transfer speed in entries and bytes for each rebalance type?
>> 
>> Thanks,
>> Stan
>> 
>>>  On 29 Jun 2020, at 11:50, Ivan Rakov <iv...@gmail.com> wrote:
>>> 
>>>  +1 to Alex G.
>>> 
>>>  From my experience, the most interesting cases with Ignite rebalancing
>>>  happen exactly in production. According to the fact that we already have
>>>  detailed rebalancing logging, adding info about rebalance performance looks
>>>  like a reasonable improvement. With new logs we'll be able to detect and
>>>  investigate situations when rebalance is slow due to uneven suppliers
>>>  distribution or network issues.
>>>  Option to disable the feature in runtime shouldn't be used often, but it
>>>  will keep us on the safe side in case something goes wrong.
>>>  The format described in
>>>  https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>>  good to me.
>>> 
>>>  On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
>>>  wrote:
>>> 
>>>>  Hello, Alexey!
>>>> 
>>>>  Currently there is no way to disable / enable it, but it seems that the
>>>>  logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
>>>>  and compact. Of course, you can add disabling / enabling statistics
>>>>  collection via jmx for example.
>>>> 
>>>>  23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
>>>>>  Hello Maxim, folks,
>>>>> 
>>>>>  Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>>>> 
>>>>>>  We won't do performance analysis on the production environment. Each
>>>>>>  time we need performance analysis it will be done on a test
>>>>>>  environment with verbose logging enabled. Thus I suggest moving these
>>>>>>  changes to a separate `profiling` module and extend the logging much
>>>>>>  more without any size limitations. The same as these [2] [3]
>>>>>>  activities do.
>>>>> 
>>>>>  I strongly disagree with this statement. I am not sure who is meant here
>>>>>  by 'we', but I see a strong momentum in increasing observability tooling
>>>>>  that helps people to understand what exactly happens in the production
>>>>>  environment [1]. Not everybody can afford two identical environments for
>>>>>  testing. We should make sure users have enough information to understand
>>>>>  the root cause after the incident happened, and not force them to
>>>>  reproduce
>>>>>  it, let alone make them add another module to the classpath and restart
>>>>  the
>>>>>  nodes.
>>>>>  I think having this functionality in the core module with the ability to
>>>>>  disable/enable it is the right approach. Having the information printed
>>>>  to
>>>>>  log is ok, having it in an event that can be sent to a monitoring/tracing
>>>>>  subsystem is even better.
>>>>> 
>>>>>  Kirill, can we enable and disable this feature in runtime to avoid the
>>>>  very
>>>>>  same nodes restart?
>>>>> 
>>>>>  [1]
>>>>  https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/


Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
        import org.apache.ignite.spi.metric.LongMetric;

        // mreg is the rebalance metric registry of the cache group.
        LongMetric startTime = mreg.findMetric("RebalancingStartTime");
        LongMetric lastCancelledTime = mreg.findMetric("RebalancingLastCancelledTime");
        LongMetric endTime = mreg.findMetric("RebalancingEndTime");
        LongMetric partitionsLeft = mreg.findMetric("RebalancingPartitionsLeft");
        LongMetric receivedKeys = mreg.findMetric("RebalancingReceivedKeys");
        LongMetric receivedBytes = mreg.findMetric("RebalancingReceivedBytes");
Hi, Stas!

To calculate the average value, you can use the existing metrics "RebalancingStartTime", "RebalancingLastCancelledTime", "RebalancingEndTime", "RebalancingPartitionsLeft", "RebalancingReceivedKeys" and "RebalancingReceivedBytes".
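
For instance, a sketch of that calculation using the metric handles from the snippet above (the derived average speed is what this thread proposes, not an out-of-the-box metric):

    long durationMs = endTime.value() - startTime.value();

    if (durationMs > 0) {
        // Average rebalance speed in KB/sec, derived from the existing metrics.
        double avgSpeedKbSec = (receivedBytes.value() / 1024.0) / (durationMs / 1000.0);

        System.out.printf("Average rebalance speed: %.1f KB/sec%n", avgSpeedKbSec);
    }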

This also works correctly with the historical rebalance.
Now we can see rebalance type for each group and for each supplier in logs. I don't think we should duplicate this information.

[2020-07-03 09:49:31,481][INFO ][sys-#160%rebalancing.RebalanceStatisticsTest2%][root] Starting rebalance routine [ignite-sys-cache, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0], supplier=a8be67b8-8ec7-4175-aa04-a59577100000, fullPartitions=[0, 2, 4, 6, 8], histPartitions=[], rebalanceId=1]

03.07.2020, 02:49, "Stanislav Lukyanov" <st...@gmail.com>:
> Kirill,
>
> I've looked through the patch.
> Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by another to get an average speed.
> Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?
>
> Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier?
> Maybe even estimate transfer speed in entries and bytes for each rebalance type?
>
> Thanks,
> Stan
>
>>  On 29 Jun 2020, at 11:50, Ivan Rakov <iv...@gmail.com> wrote:
>>
>>  +1 to Alex G.
>>
>>  From my experience, the most interesting cases with Ignite rebalancing
>>  happen exactly in production. According to the fact that we already have
>>  detailed rebalancing logging, adding info about rebalance performance looks
>>  like a reasonable improvement. With new logs we'll be able to detect and
>>  investigate situations when rebalance is slow due to uneven suppliers
>>  distribution or network issues.
>>  Option to disable the feature in runtime shouldn't be used often, but it
>>  will keep us on the safe side in case something goes wrong.
>>  The format described in
>>  https://issues.apache.org/jira/browse/IGNITE-12080 looks
>>  good to me.
>>
>>  On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
>>  wrote:
>>
>>>  Hello, Alexey!
>>>
>>>  Currently there is no way to disable / enable it, but it seems that the
>>>  logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
>>>  and compact. Of course, you can add disabling / enabling statistics
>>>  collection via jmx for example.
>>>
>>>  23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
>>>>  Hello Maxim, folks,
>>>>
>>>>  Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>>>
>>>>>  We won't do performance analysis on the production environment. Each
>>>>>  time we need performance analysis it will be done on a test
>>>>>  environment with verbose logging enabled. Thus I suggest moving these
>>>>>  changes to a separate `profiling` module and extend the logging much
>>>>>  more without any size limitations. The same as these [2] [3]
>>>>>  activities do.
>>>>
>>>>  I strongly disagree with this statement. I am not sure who is meant here
>>>>  by 'we', but I see a strong momentum in increasing observability tooling
>>>>  that helps people to understand what exactly happens in the production
>>>>  environment [1]. Not everybody can afford two identical environments for
>>>>  testing. We should make sure users have enough information to understand
>>>>  the root cause after the incident happened, and not force them to
>>>  reproduce
>>>>  it, let alone make them add another module to the classpath and restart
>>>  the
>>>>  nodes.
>>>>  I think having this functionality in the core module with the ability to
>>>>  disable/enable it is the right approach. Having the information printed
>>>  to
>>>>  log is ok, having it in an event that can be sent to a monitoring/tracing
>>>>  subsystem is even better.
>>>>
>>>>  Kirill, can we enable and disable this feature in runtime to avoid the
>>>  very
>>>>  same nodes restart?
>>>>
>>>>  [1]
>>>  https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/

Re: Extended logging for rebalance performance analysis

Posted by Stanislav Lukyanov <st...@gmail.com>.
Kirill,

I've looked through the patch.
Looks good, but it feels like the first thing someone will try to do given bytesRcvd and duration is to divide one by another to get an average speed.
Do you think it's reasonable to also add it to the logs? Maybe even to the metrics?

Also, this works with historical rebalance, right? Can we specify how much data was transferred via historical or full rebalance from each supplier? 
Maybe even estimate transfer speed in entries and bytes for each rebalance type?

Thanks,
Stan

> On 29 Jun 2020, at 11:50, Ivan Rakov <iv...@gmail.com> wrote:
> 
> +1 to Alex G.
> 
> From my experience, the most interesting cases with Ignite rebalancing
> happen exactly in production. According to the fact that we already have
> detailed rebalancing logging, adding info about rebalance performance looks
> like a reasonable improvement. With new logs we'll be able to detect and
> investigate situations when rebalance is slow due to uneven suppliers
> distribution or network issues.
> Option to disable the feature in runtime shouldn't be used often, but it
> will keep us on the safe side in case something goes wrong.
> The format described in
> https://issues.apache.org/jira/browse/IGNITE-12080 looks
> good to me.
> 
> On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
> wrote:
> 
>> Hello, Alexey!
>> 
>> Currently there is no way to disable / enable it, but it seems that the
>> logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
>> and compact. Of course, you can add disabling / enabling statistics
>> collection via jmx for example.
>> 
>> 23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
>>> Hello Maxim, folks,
>>> 
>>> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>> 
>>>> We won't do performance analysis on the production environment. Each
>>>> time we need performance analysis it will be done on a test
>>>> environment with verbose logging enabled. Thus I suggest moving these
>>>> changes to a separate `profiling` module and extend the logging much
>>>> more without any size limitations. The same as these [2] [3]
>>>> activities do.
>>> 
>>> I strongly disagree with this statement. I am not sure who is meant here
>>> by 'we', but I see a strong momentum in increasing observability tooling
>>> that helps people to understand what exactly happens in the production
>>> environment [1]. Not everybody can afford two identical environments for
>>> testing. We should make sure users have enough information to understand
>>> the root cause after the incident happened, and not force them to
>> reproduce
>>> it, let alone make them add another module to the classpath and restart
>> the
>>> nodes.
>>> I think having this functionality in the core module with the ability to
>>> disable/enable it is the right approach. Having the information printed
>> to
>>> log is ok, having it in an event that can be sent to a monitoring/tracing
>>> subsystem is even better.
>>> 
>>> Kirill, can we enable and disable this feature in runtime to avoid the
>> very
>>> same nodes restart?
>>> 
>>> [1]
>> https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
>> 


Re: Extended logging for rebalance performance analysis

Posted by Ivan Rakov <iv...@gmail.com>.
+1 to Alex G.

From my experience, the most interesting cases with Ignite rebalancing
happen exactly in production. Given that we already have
detailed rebalancing logging, adding info about rebalance performance looks
like a reasonable improvement. With new logs we'll be able to detect and
investigate situations when rebalance is slow due to uneven suppliers
distribution or network issues.
Option to disable the feature in runtime shouldn't be used often, but it
will keep us on the safe side in case something goes wrong.
The format described in
https://issues.apache.org/jira/browse/IGNITE-12080 looks
good to me.

On Tue, Jun 23, 2020 at 7:01 PM ткаленко кирилл <tk...@yandex.ru>
wrote:

> Hello, Alexey!
>
> Currently there is no way to disable / enable it, but it seems that the
> logs will not be overloaded, since Alexei Scherbakov offer seems reasonable
> and compact. Of course, you can add disabling / enabling statistics
> collection via jmx for example.
>
> 23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
> > Hello Maxim, folks,
> >
> > Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
> >
> >>  We won't do performance analysis on the production environment. Each
> >>  time we need performance analysis it will be done on a test
> >>  environment with verbose logging enabled. Thus I suggest moving these
> >>  changes to a separate `profiling` module and extend the logging much
> >>  more without any size limitations. The same as these [2] [3]
> >>  activities do.
> >
> >  I strongly disagree with this statement. I am not sure who is meant here
> > by 'we', but I see a strong momentum in increasing observability tooling
> > that helps people to understand what exactly happens in the production
> > environment [1]. Not everybody can afford two identical environments for
> > testing. We should make sure users have enough information to understand
> > the root cause after the incident happened, and not force them to
> reproduce
> > it, let alone make them add another module to the classpath and restart
> the
> > nodes.
> > I think having this functionality in the core module with the ability to
> > disable/enable it is the right approach. Having the information printed
> to
> > log is ok, having it in an event that can be sent to a monitoring/tracing
> > subsystem is even better.
> >
> > Kirill, can we enable and disable this feature in runtime to avoid the
> very
> > same nodes restart?
> >
> > [1]
> https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/
>

Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Hello, Alexey!

Currently there is no way to disable/enable it, but it seems that the logs will not be overloaded, since Alexei Scherbakov's suggestion seems reasonable and compact. Of course, we could add disabling/enabling of statistics collection via JMX, for example.

23.06.2020, 18:47, "Alexey Goncharuk" <al...@gmail.com>:
> Hello Maxim, folks,
>
> Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>
>>  We won't do performance analysis on the production environment. Each
>>  time we need performance analysis it will be done on a test
>>  environment with verbose logging enabled. Thus I suggest moving these
>>  changes to a separate `profiling` module and extend the logging much
>>  more without any size limitations. The same as these [2] [3]
>>  activities do.
>
>  I strongly disagree with this statement. I am not sure who is meant here
> by 'we', but I see a strong momentum in increasing observability tooling
> that helps people to understand what exactly happens in the production
> environment [1]. Not everybody can afford two identical environments for
> testing. We should make sure users have enough information to understand
> the root cause after the incident happened, and not force them to reproduce
> it, let alone make them add another module to the classpath and restart the
> nodes.
> I think having this functionality in the core module with the ability to
> disable/enable it is the right approach. Having the information printed to
> log is ok, having it in an event that can be sent to a monitoring/tracing
> subsystem is even better.
>
> Kirill, can we enable and disable this feature in runtime to avoid the very
> same nodes restart?
>
> [1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/

Re: Extended logging for rebalance performance analysis

Posted by Alexey Goncharuk <al...@gmail.com>.
Hello Maxim, folks,

Wed, 6 May 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:

> We won't do performance analysis on the production environment. Each
> time we need performance analysis it will be done on a test
> environment with verbose logging enabled. Thus I suggest moving these
> changes to a separate `profiling` module and extend the logging much
> more without any size limitations. The same as these [2] [3]
> activities do.
>

 I strongly disagree with this statement. I am not sure who is meant here
by 'we', but I see a strong momentum in increasing observability tooling
that helps people to understand what exactly happens in the production
environment [1]. Not everybody can afford two identical environments for
testing. We should make sure users have enough information to understand
the root cause after the incident happened, and not force them to reproduce
it, let alone make them add another module to the classpath and restart the
nodes.
I think having this functionality in the core module with the ability to
disable/enable it is the right approach. Having the information printed to
log is ok, having it in an event that can be sent to a monitoring/tracing
subsystem is even better.

Kirill, can we enable and disable this feature in runtime to avoid the very
same nodes restart?

[1] https://www.honeycomb.io/blog/yes-i-test-in-production-and-so-do-you/

Re: Extended logging for rebalance performance analysis

Posted by Alexei Scherbakov <al...@gmail.com>.
Hi, Kirill.

Looks good to me.

Tue, Jun 23, 2020 at 18:05, ткаленко кирилл <tk...@yandex.ru>:

> Hello, Alexey!
>
> I suggest that we decide what we can do within ticket [1].
>
> Add "rebalanceId" and "topVer" related to rebalance to all messages.
>
> Add statistical information to a log message:
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [rebalanceId=1, grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100,
> duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Add a message to the log after rebalancing for all cache groups is complete:
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M]
>
> Any comments or suggestions?
>
> [1] - https://issues.apache.org/jira/browse/IGNITE-12080
>
> 20.05.2020, 23:08, "ткаленко кирилл" <tk...@yandex.ru>:
> > Hello, Alexey! Unfortunately, my response was delayed.
> >
> > Point 2: We can do it as you suggested; I think it is still worth
> specifying how many partitions were obtained.
> >
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5,
> entries=100, duration=12ms,
> > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > progress=1/2]
> >
> > Point 3: Is it "rebalanceId"?
> >
> > Point 5: I think we can output a summary for each supplier, so that we
> don't have to keep it in mind.
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> > [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> partitions=5, entries=100, duration=12ms, bytesRcvd=5M],
> > [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00002, partitions=5,
> entries=100, duration=12ms, bytesRcvd=5M]]
> >
> > I can add "rebalanceId" to each message that you gave above.
> >
> > A detailed message will help us understand whether the suppliers were
> selected correctly.
> >
> > 06.05.2020, 22:08, "Alexei Scherbakov" <al...@gmail.com>:
> >>  Hello.
> >>
> >>  Let's look at existing rebalancing log for a single group:
> >>
> >>  [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
> >>  [order=[ignite-sys-cache, cache1, cache2, default],
> >>  top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d00002]
> >>  ...
> >>  [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
> >>  mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> >>  partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
> >>  minorTopVer=1]]
> >>  [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
> >>  mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> >>  partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
> >>  minorTopVer=1]]
> >>  [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine
> [cache1,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, fullPartitions=[1, 5,
> 7, 9,
> >>  11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
> >>  [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine
> [cache1,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, fullPartitions=[6, 8,
> 10,
> >>  16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
> >>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [grp=cache1,
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
> >>  [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
> >>  [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
> >>  [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
> >>  RebalanceFuture [grp=CacheGroupContext [grp=cache1],
> >>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> rebalanceId=2,
> >>  routines=2]
> >>
> >>  From these logs I can already get answers to 1 and 4.
> >>  The logs look concise and easy to read and understand, and should
> >>  remain that way.
> >>
> >>  But I think some proposed improvements can be done here without harm.
> >>
> >>  2. OK, let's add it to supplier info per cache with additional info:
> >>
> >>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing
> [grp=cache1,
> >>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, entries=100,
> duration=12ms,
> >>  bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> >>  progress=1/2]
> >>
> >>  3. This information is already printed to log.
> >>
> >>  5. OK, let's add a line which is printed after all groups in the chain
> are
> >>  rebalanced or cancelled with a summary:
> >>
> >>  [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> >>  [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
> >>
> >>  Another thing I would suggest: add rebalanceId to all messages for
> ease of
> >>  grepping and to distinguish multiple rebalances for the same topology
> version.
> >>
> >>  Regarding the "detailed" message - I wasn't able to figure out any case
> >>  where it is useful. Not sure if we need it at all.
> >>  Kirill Tkalenko, can you give me an example?
> >>
> >>  Wed, May 6, 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
> >>
> >>>   Kirill,
> >>>
> >>>   Thank you for raising this topic. It's true that the rebalance
> process
> >>>   still requires additional information for analyzing issues. Please,
> >>>   don't think that I'm against your changes :-)
> >>>
> >>>   * My short answer. *
> >>>
> >>>   We won't do performance analysis on the production environment. Each
> >>>   time we need performance analysis it will be done on a test
> >>>   environment with verbose logging enabled. Thus I suggest moving these
> >>>   changes to a separate `profiling` module and extend the logging much
> >>>   more without any size limitations. The same as these [2] [3]
> >>>   activities do.
> >>>
> >>>   Let's keep the `core` module as simple as possible.
> >>>   Let's design the right API for accessing rebalance internals for
> >>>   profiling tools.
> >>>   Can you, please, remove all changes from your PR [6] which are not
> >>>   related to the proposed topic? (e.g. variable renamings).
> >>>
> >>>   * The long answer. *
> >>>
> >>>   Here are my thoughts on this. There are two different types of issues
> >>>   in the rebalance process. The first case must be covered by daily
> >>>   monitoring subsystems, the second case must be covered by additional
> >>>   profiling tools.
> >>>   1. errors during the rebalancing (e.g. rebalance not happens when
> required)
> >>>   2. performance rebalancing issues (e.g. rebalance is slow)
> >>>
> >>>   Daily monitoring tools (JMX, Logging) are always turned on and
> >>>   shouldn't require additional system resources themselves. Since
> these
> >>>   metrics must be lightweight any internal aggregation machinery is not
> >>>   used on them. All these metrics are collected from each node
> >>>   independently. Please, take a look at this issue [1] which covers
> most
> >>>   of your questions mentioned above.
> >>>
> >>>   For all available metrics, we can configure LogExporterSpi, so they
> >>>   will be available in logs.
> >>>
> >>>   > 1)How long was the rebalance (divided by supplier)?
> >>>   rebalancingStartTime, rebalancingEndTime already exist for cache
> groups
> >>>   [4].
> >>>   We can add the same for suppliers.
> >>>
> >>>   > 2)How many records and bytes per supplier were rebalanced?
> >>>   We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
> >>>   We will have rebalancingExpectedKeys [5].
> >>>   We can add a new metric per cache keys, per supplier.
> >>>
> >>>   > 3)How many times did the rebalance restart?
> >>>   rebalancingLastCancelledTime [4] metric exists.
> >>>   Do we need to keep historical data on it?
> >>>
> >>>   > 4)Which partitions were rebalanced and from which nodes did they
> receive
> >>>   them?
> >>>   Let's print this information to log prior to the rebalance process
> >>>   starts. This will be helpful information and does not require a lot of
> >>>   changes.
> >>>
> >>>   > 5)When did rebalance for all cache groups end?
> >>>   This metric may be aggregated from rebalancingEndTime [4] by pulling
> >>>   it from all nodes for all caches.
> >>>
> >>>   Performance rebalancing issues are related to profiling tools. They
> >>>   may require additional system resources and definitely require a
> >>>   dedicated environment for tests. We can't do performance analysis on
> >>>   production environments due to performance impact.
> >>>   I see some disadvantages of adding such tools to production code:
> >>>   - verbose logging may affect performance.
> >>>   - the problematic process may become even worse if an automatic
> >>>   threshold suddenly turns on.
> >>>   - new code changes will require additional efforts to keep logging
> >>>   up-to-date.
> >>>
> >>>   [1] https://issues.apache.org/jira/browse/IGNITE-12183
> >>>   [2] https://issues.apache.org/jira/browse/IGNITE-12666
> >>>   [3]
> >>>
> https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
> >>>   [4] https://issues.apache.org/jira/browse/IGNITE-12193
> >>>   [5] https://issues.apache.org/jira/browse/IGNITE-12194
> >>>   [6] https://github.com/apache/ignite/pull/7705/files
> >>>
> >>>   On Wed, 6 May 2020 at 19:50, Ivan Rakov <iv...@gmail.com>
> wrote:
> >>>   >
> >>>   > Hi,
> >>>   >
> >>>   > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> >>>   > > duration rebalance of cache group after which partitions
> distribution
> >>>   is
> >>>   > > output, set in milliseconds, default value is 10 minutes.
> >>>   >
> >>>   > Does it mean that if the rebalancing process took less than 10
> minutes,
> >>>   > only a short version of the message (with supplier statistics)
> will show
> >>>   up?
> >>>   >
> >>>   > In general, I have no objections.
> >>>   >
> >>>   >
> >>>   > On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <
> tkalkirill@yandex.ru>
> >>>   > wrote:
> >>>   >
> >>>   > > Hi, Igniters!
> >>>   > >
> >>>   > > I'd like to share a new small feature in AI [1].
> >>>   > >
> >>>   > > Current rebalance logging does not allow you to quickly answer
> >>>   following
> >>>   > > questions:
> >>>   > > 1)How long was the rebalance (divided by supplier)?
> >>>   > > 2)How many records and bytes per supplier were rebalanced?
> >>>   > > 3)How many times did rebalance restart?
> >>>   > > 4)Which partitions were rebalanced and from which nodes did they
> >>>   receive
> >>>   > > them?
> >>>   > > 5)When did rebalance for all cache groups end?
> >>>   > >
> >>>   > > What you can see in logs now:
> >>>   > >
> >>>   > > 1)Starting rebalance with order of cache groups.
> >>>   > > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> >>>   > > top=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> force=false,
> >>>   > > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
> >>>   > >
> >>>   > > 2)Start rebalance of cache group from a specific supplier,
> specifying
> >>>   > > partition ids and mode - historical or full.
> >>>   > > Starting rebalance routine [ignite-sys-cache,
> >>>   > > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> >>>   > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> fullPartitions=[0-99],
> >>>   > > histPartitions=[]]
> >>>   > >
> >>>   > > 3)Getting partial or complete partitions of cache group.
> >>>   > > Completed rebalancing [grp=ignite-sys-cache,
> >>>   > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> >>>   > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=1/2]
> >>>   > > Completed (final) rebalancing [grp=ignite-sys-cache,
> >>>   > > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
> >>>   > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=2/2]
> >>>   > >
> >>>   > > 4)End rebalance of cache group.
> >>>   > > Completed rebalance future: RebalanceFuture
> [grp=CacheGroupContext
> >>>   > > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> >>>   > > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> >>>   > > receivedKeys=0, partitionsLeft=0, startTime=1588519707607,
> endTime=-1,
> >>>   > > lastCancelledTime=-1]
> >>>   > >
> >>>   > > Rebalance statistics:
> >>>   > >
> >>>   > > To speed up rebalance analysis, statistics will be output for
> each
> >>>   cache
> >>>   > > group and total for all cache groups.
> >>>   > > If duration rebalance for cache group is greater than threshold
> value,
> >>>   > > partition distribution is output.
> >>>   > > Statistics will help you to analyze duration of the rebalance for each
> >>>   supplier
> >>>   > > to understand which of them has been transmitting data for
> longest
> >>>   time.
> >>>   > >
> >>>   > > System properties are used to output statistics:
> >>>   > >
> >>>   > > IGNITE_QUIET - to output statistics, value must be false;
> >>>   > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD -
> threshold
> >>>   > > duration rebalance of cache group after which partitions
> distribution
> >>>   is
> >>>   > > output, set in milliseconds, default value is 10 minutes.
> >>>   > >
> >>>   > > Statistics examples:
> >>>   > >
> >>>   > > Successful full and historical rebalance of group cache, without
> >>>   > > partitions distribution.
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181548,
> >>>   > > name=grp1, startTime=2020-04-13 10:55:16,117,
> finishTime=2020-04-13
> >>>   > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   p=5,
> >>>   > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>>   entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest0]
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181547,
> >>>   > > name=grp0, startTime=2020-04-13 15:01:44,000,
> finishTime=2020-04-13
> >>>   > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   hp=10,
> >>>   > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e -
> entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>>   > >
> >>>   > > Successful full and historical rebalance of group cache, with
> >>>   partitions
> >>>   > > distribution.
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181548,
> >>>   > > name=grp1, startTime=2020-04-13 10:55:16,117,
> finishTime=2020-04-13
> >>>   > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   p=5,
> >>>   > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>>   entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was
> greater
> >>>   > > than 5 ms, printing detailed information about partitions
> distribution
> >>>   > > (threshold can be changed by setting number of milliseconds into
> >>>   > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
> >>>   > > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su]
> Aliases: pr -
> >>>   > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>>   mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest2]
> >>>   > > [2=rebalancing.RebalanceStatisticsTest0]
> >>>   > > Rebalance information per cache group (successful rebalance):
> >>>   [id=3181547,
> >>>   > > name=grp0, startTime=2020-04-13 15:01:44,000,
> finishTime=2020-04-13
> >>>   > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>>   hp=10,
> >>>   > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e -
> entries, b -
> >>>   > > bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>>   Rebalance
> >>>   > > duration was greater than 100 ms, printing detailed information
> about
> >>>   > > partitions distribution (threshold can be changed by setting
> number of
> >>>   > > milliseconds into
> >>>   IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
> >>>   > > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
> >>>   > > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
> >>>   > > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
> >>>   > > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
> >>>   > > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases:
> pr -
> >>>   > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>>   mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest2]
> >>>   > > [2=rebalancing.RebalanceStatisticsTest1]
> >>>   > >
> >>>   > > Interrupted rebalance of group cache.
> >>>   > > Rebalance information per cache group (interrupted rebalance):
> >>>   > > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
> >>>   > > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
> >>>   > >
> >>>   > > Total full and historical rebalance for all cache groups.
> >>>   > > Rebalance total information (including successful and not
> rebalances):
> >>>   > > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13
> 10:55:18,780,
> >>>   > > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000,
> d=54 ms]
> >>>   > > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p -
> partitions, e -
> >>>   > > entries, b - bytes, d - duration, h - historical, nodeId mapping
> >>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>>   > > [1=rebalancing.RebalanceStatisticsTest0]
> >>>   > > Rebalance total information (including successful and not
> rebalances):
> >>>   > > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13
> 15:01:44,116,
> >>>   > > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500,
> hb=50445,
> >>>   d=294
> >>>   > > ms] Aliases: p - partitions, e - entries, b - bytes, d -
> duration, h -
> >>>   > > historical, nodeId mapping (nodeId=id,consistentId)
> >>>   > > [0=rebalancing.RebalanceStatisticsTest0]
> >>>   > >
> >>>   > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080
> >>
> >>  --
> >>
> >>  Best regards,
> >>  Alexei Scherbakov
>


-- 

Best regards,
Alexei Scherbakov

Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Hello, Alexey!

I suggest that we decide what we can do within ticket [1].

Add "rebalanceId" and "topVer" related to rebalance to all messages.

Add statistical information to a log message:
[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [rebalanceId=1, grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100, duration=12ms,
bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
progress=1/2]

Add a message to the log after rebalancing for all cache groups is complete:
[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
[rebalanceId=2, partitions=10, entries=200, duration=50ms, bytesRcvd=10M]

Any comments or suggestions?

[1] - https://issues.apache.org/jira/browse/IGNITE-12080
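
Once "rebalanceId" is in every message, correlating all lines of a single rebalance becomes a simple grep; a rough sketch in Java (the message format above is still a proposal):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class RebalanceLogGrep {
    /** Prints every log line carrying the given rebalanceId, e.g. args = {"ignite.log", "2"}. */
    public static void main(String[] args) throws IOException {
        Pattern marker = Pattern.compile("rebalanceId=" + Long.parseLong(args[1]) + "\\b");

        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.filter(line -> marker.matcher(line).find()).forEach(System.out::println);
        }
    }
}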

20.05.2020, 23:08, "ткаленко кирилл" <tk...@yandex.ru>:
> Hello, Alexey! Unfortunately, my response was delayed.
>
> Point 2: We can do it as you suggested; I think it is still worth specifying how many partitions were obtained.
>
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100, duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Point 3: Is it "rebalanceId"?
>
> Point 5: I think we can output a summary for each supplier, so that we don't have to keep it in mind.
>
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100, duration=12ms, bytesRcvd=5M],
> [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00002, partitions=5, entries=100, duration=12ms, bytesRcvd=5M]]
>
> I can add "rebalanceId" to each message that you gave above.
>
> A detailed message will help us understand whether the suppliers were selected correctly.
>
> 06.05.2020, 22:08, "Alexei Scherbakov" <al...@gmail.com>:
>>  Hello.
>>
>>  Let's look at existing rebalancing log for a single group:
>>
>>  [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
>>  [order=[ignite-sys-cache, cache1, cache2, default],
>>  top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d00002]
>>  ...
>>  [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
>>  mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
>>  partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
>>  minorTopVer=1]]
>>  [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
>>  mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
>>  partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
>>  minorTopVer=1]]
>>  [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, fullPartitions=[1, 5, 7, 9,
>>  11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
>>  [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, fullPartitions=[6, 8, 10,
>>  16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
>>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
>>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
>>  [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
>>  [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
>>  [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
>>  RebalanceFuture [grp=CacheGroupContext [grp=cache1],
>>  topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
>>  routines=2]
>>
>>  From these logs I can already get answers to 1 and 4.
>>  The logs look concise and easy to read and understand, and should
>>  remain that way.
>>
>>  But I think some proposed improvements can be done here without harm.
>>
>>  2. OK, let's add it to supplier info per cache with additional info:
>>
>>  [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
>>  supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, entries=100, duration=12ms,
>>  bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
>>  progress=1/2]
>>
>>  3. This information is already printed to log.
>>
>>  5. OK, let's add a line which is printed after all groups in the chain are
>>  rebalanced or cancelled with a summary:
>>
>>  [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
>>  [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
>>
>>  Another thing I would suggest: add rebalanceId to all messages for ease of
>>  grepping and to distinguish multiple rebalances for the same topology version.
>>
>>  Regarding the "detailed" message - I wasn't able to figure out any case
>>  where it is useful. Not sure if we need it at all.
>>  Kirill Tkalenko, can you give me an example?
>>
>>  Wed, May 6, 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
>>
>>>   Kirill,
>>>
>>>   Thank you for raising this topic. It's true that the rebalance process
>>>   still requires additional information for analyzing issues. Please,
>>>   don't think that I'm against your changes :-)
>>>
>>>   * My short answer. *
>>>
>>>   We won't do performance analysis on the production environment. Each
>>>   time we need performance analysis it will be done on a test
>>>   environment with verbose logging enabled. Thus I suggest moving these
>>>   changes to a separate `profiling` module and extend the logging much
>>>   more without any size limitations. The same as these [2] [3]
>>>   activities do.
>>>
>>>   Let's keep the `core` module as simple as possible.
>>>   Let's design the right API for accessing rebalance internals for
>>>   profiling tools.
>>>   Can you, please, remove all changes from your PR [6] which are not
>>>   related to the proposed topic? (e.g. variable renamings).
>>>
>>>   * The long answer. *
>>>
>>>   Here are my thoughts on this. There are two different types of issues
>>>   in the rebalance process. The first case must be covered by daily
>>>   monitoring subsystems, the second case must be covered by additional
>>>   profiling tools.
>>>   1. errors during the rebalancing (e.g. rebalance not happens when required)
>>>   2. performance rebalancing issues (e.g. rebalance is slow)
>>>
>>>   Daily monitoring tools (JMX, Logging) are always turned on and
>>>   shouldn't require additional system resources themselves. Since these
>>>   metrics must be lightweight any internal aggregation machinery is not
>>>   used on them. All these metrics are collected from each node
>>>   independently. Please, take a look at this issue [1] which covers most
>>>   of your questions mentioned above.
>>>
>>>   For all available metrics, we can configure LogExporterSpi, so they
>>>   will be available in logs.
>>>
>>>   > 1)How long was the rebalance (divided by supplier)?
>>>   rebalancingStartTime, rebalancingEndTime already exist for cache groups
>>>   [4].
>>>   We can add the same for suppliers.
>>>
>>>   > 2)How many records and bytes per supplier were rebalanced?
>>>   We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
>>>   We will have rebalancingExpectedKeys [5].
>>>   We can add a new metric per cache keys, per supplier.
>>>
>>>   > 3)How many times did the rebalance restart?
>>>   rebalancingLastCancelledTime [4] metric exists.
>>>   Do we need to keep historical data on it?
>>>
>>>   > 4)Which partitions were rebalanced and from which nodes did they receive
>>>   them?
>>>   Let's print this information to log prior to the rebalance process
>>>   starts. This will be helpful information and does not require a lot of
>>>   changes.
>>>
>>>   > 5)When did rebalance for all cache groups end?
>>>   This metric may be aggregated from rebalancingEndTime [4] by pulling
>>>   it from all nodes for all caches.
>>>
>>>   Performance rebalancing issues are related to profiling tools. They
>>>   may require additional system resources and definitely require a
>>>   dedicated environment for tests. We can't do performance analysis on
>>>   production environments due to performance impact.
>>>   I see some disadvantages of adding such tools to production code:
>>>   - verbose logging may affect performance.
>>>   - the problematic process may become even worse if an automatic
>>>   threshold suddenly turns on.
>>>   - new code changes will require additional efforts to keep logging
>>>   up-to-date.
>>>
>>>   [1] https://issues.apache.org/jira/browse/IGNITE-12183
>>>   [2] https://issues.apache.org/jira/browse/IGNITE-12666
>>>   [3]
>>>   https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
>>>   [4] https://issues.apache.org/jira/browse/IGNITE-12193
>>>   [5] https://issues.apache.org/jira/browse/IGNITE-12194
>>>   [6] https://github.com/apache/ignite/pull/7705/files
>>>
>>>   On Wed, 6 May 2020 at 19:50, Ivan Rakov <iv...@gmail.com> wrote:
>>>   >
>>>   > Hi,
>>>   >
>>>   > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>>   > > duration rebalance of cache group after which partitions distribution
>>>   is
>>>   > > output, set in milliseconds, default value is 10 minutes.
>>>   >
>>>   > Does it mean that if the rebalancing process took less than 10 minutes,
>>>   > only a short version of the message (with supplier statistics) will show
>>>   up?
>>>   >
>>>   > In general, I have no objections.
>>>   >
>>>   >
>>>   > On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <tk...@yandex.ru>
>>>   > wrote:
>>>   >
>>>   > > Hi, Igniters!
>>>   > >
>>>   > > I'd like to share a new small feature in AI [1].
>>>   > >
>>>   > > Current rebalance logging does not allow you to quickly answer
>>>   following
>>>   > > questions:
>>>   > > 1)How long was the rebalance (divided by supplier)?
>>>   > > 2)How many records and bytes per supplier were rebalanced?
>>>   > > 3)How many times did rebalance restart?
>>>   > > 4)Which partitions were rebalanced and from which nodes did they
>>>   receive
>>>   > > them?
>>>   > > 5)When did rebalance for all cache groups end?
>>>   > >
>>>   > > What you can see in logs now:
>>>   > >
>>>   > > 1)Starting rebalance with order of cache groups.
>>>   > > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
>>>   > > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
>>>   > > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
>>>   > >
>>>   > > 2)Start rebalance of cache group from a specific supplier, specifying
>>>   > > partition ids and mode - historical or full.
>>>   > > Starting rebalance routine [ignite-sys-cache,
>>>   > > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
>>>   > > supplier=8c525892-703b-4fc4-b28b-b2f139700000, fullPartitions=[0-99],
>>>   > > histPartitions=[]]
>>>   > >
>>>   > > 3)Getting partial or complete partitions of cache group.
>>>   > > Completed rebalancing [grp=ignite-sys-cache,
>>>   > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
>>>   > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=1/2]
>>>   > > Completed (final) rebalancing [grp=ignite-sys-cache,
>>>   > > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
>>>   > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0], progress=2/2]
>>>   > >
>>>   > > 4)End rebalance of cache group.
>>>   > > Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
>>>   > > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
>>>   > > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
>>>   > > receivedKeys=0, partitionsLeft=0, startTime=1588519707607, endTime=-1,
>>>   > > lastCancelledTime=-1]
>>>   > >
>>>   > > Rebalance statistics:
>>>   > >
>>>   > > To speed up rebalance analysis, statistics will be output for each
>>>   cache
>>>   > > group and total for all cache groups.
>>>   > > If duration rebalance for cache group is greater than threshold value,
>>>   > > partition distribution is output.
>>>   > > Statistics will help you to analyze duration of the rebalance for each
>>>   supplier
>>>   > > to understand which of them has been transmitting data for longest
>>>   time.
>>>   > >
>>>   > > System properties are used to output statistics:
>>>   > >
>>>   > > IGNITE_QUIET - to output statistics, value must be false;
>>>   > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>>   > > duration rebalance of cache group after which partitions distribution
>>>   is
>>>   > > output, set in milliseconds, default value is 10 minutes.
>>>   > >
>>>   > > Statistics examples:
>>>   > >
>>>   > > Successful full and historical rebalance of group cache, without
>>>   > > partitions distribution.
>>>   > > Rebalance information per cache group (successful rebalance):
>>>   [id=3181548,
>>>   > > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>>   > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0,
>>>   p=5,
>>>   > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
>>>   entries, b -
>>>   > > bytes, d - duration, h - historical, nodeId mapping
>>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>>   > > [1=rebalancing.RebalanceStatisticsTest0]
>>>   > > Rebalance information per cache group (successful rebalance):
>>>   [id=3181547,
>>>   > > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
>>>   > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0,
>>>   hp=10,
>>>   > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
>>>   > > bytes, d - duration, h - historical, nodeId mapping
>>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>>   > >
>>>   > > Successful full and historical rebalance of group cache, with
>>>   partitions
>>>   > > distribution.
>>>   > > Rebalance information per cache group (successful rebalance):
>>>   [id=3181548,
>>>   > > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
>>>   > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0,
>>>   p=5,
>>>   > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
>>>   entries, b -
>>>   > > bytes, d - duration, h - historical, nodeId mapping
>>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>>   > > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was greater
>>>   > > than 5 ms, printing detailed information about partitions distribution
>>>   > > (threshold can be changed by setting number of milliseconds into
>>>   > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
>>>   > > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
>>>   > > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
>>>   > > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
>>>   > > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
>>>   > > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases: pr -
>>>   > > primary, bu - backup, su - supplier node, h - historical, nodeId
>>>   mapping
>>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>>   > > [1=rebalancing.RebalanceStatisticsTest2]
>>>   > > [2=rebalancing.RebalanceStatisticsTest0]
>>>   > > Rebalance information per cache group (successful rebalance):
>>>   [id=3181547,
>>>   > > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
>>>   > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics: [nodeId=0,
>>>   hp=10,
>>>   > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries, b -
>>>   > > bytes, d - duration, h - historical, nodeId mapping
>>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>>   Rebalance
>>>   > > duration was greater than 100 ms, printing detailed information about
>>>   > > partitions distribution (threshold can be changed by setting number of
>>>   > > milliseconds into
>>>   IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
>>>   > > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
>>>   > > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
>>>   > > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
>>>   > > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
>>>   > > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases: pr -
>>>   > > primary, bu - backup, su - supplier node, h - historical, nodeId
>>>   mapping
>>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
>>>   > > [1=rebalancing.RebalanceStatisticsTest2]
>>>   > > [2=rebalancing.RebalanceStatisticsTest1]
>>>   > >
>>>   > > Interrupted rebalance of group cache.
>>>   > > Rebalance information per cache group (interrupted rebalance):
>>>   > > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
>>>   > > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
>>>   > >
>>>   > > Total full and historical rebalance for all cache groups.
>>>   > > Rebalance total information (including successful and not rebalances):
>>>   > > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13 10:55:18,780,
>>>   > > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000, d=54 ms]
>>>   > > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions, e -
>>>   > > entries, b - bytes, d - duration, h - historical, nodeId mapping
>>>   > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
>>>   > > [1=rebalancing.RebalanceStatisticsTest0]
>>>   > > Rebalance total information (including successful and not rebalances):
>>>   > > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13 15:01:44,116,
>>>   > > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, hb=50445,
>>>   d=294
>>>   > > ms] Aliases: p - partitions, e - entries, b - bytes, d - duration, h -
>>>   > > historical, nodeId mapping (nodeId=id,consistentId)
>>>   > > [0=rebalancing.RebalanceStatisticsTest0]
>>>   > >
>>>   > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080
>>
>>  --
>>
>>  Best regards,
>>  Alexei Scherbakov

Re: Extended logging for rebalance performance analysis

Posted by Alexei Scherbakov <al...@gmail.com>.
Hi, Kirill.

2. Ok for me.

3. We already have a log message about rebalance cancellation for a specific
rebalanceId and a log message that a new rebalance has started with the
next rebalanceId.

5. We already have summary information per group for each supplier, added in
point 2. I would avoid duplication here or put it under debug logging.

Information about selected suppliers can already be obtained from logs:
[2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]

so I still do not see any value in the "detailed" message.

I think this is enough to understand the rebalancing flow using proper grepping
by topVer and rebalanceId.

All additional aggregation should be done by external tools using
rebalancing metrics.
I'm on the same page with Maxim Muzafarov here.
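
As a rough sketch of what such an external tool could do, assuming the rebalancing metrics (e.g. RebalancingReceivedKeys, RebalancingReceivedBytes from IGNITE-12193) are exported over JMX; the wildcard ObjectName and the "Rebalancing" attribute prefix below are assumptions, since the exact layout depends on the configured exporter SPI:

import java.lang.management.ManagementFactory;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RebalanceMetricsDump {
    /** Dumps every locally exported JMX attribute whose name starts with "Rebalancing". */
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        for (ObjectName bean : srv.queryNames(new ObjectName("org.apache.ignite:*"), null)) {
            for (MBeanAttributeInfo attr : srv.getMBeanInfo(bean).getAttributes()) {
                if (attr.getName().startsWith("Rebalancing"))
                    System.out.println(bean + ": " + attr.getName() + '=' +
                        srv.getAttribute(bean, attr.getName()));
            }
        }
    }
}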



Wed, May 20, 2020 at 23:08, ткаленко кирилл <tk...@yandex.ru>:

> Hello, Alexey! Unfortunately, my response was delayed.
>
> Point 2: We can do it as you suggested; I think it is still worth specifying
> how many partitions were obtained.
>
> [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100,
> duration=12ms,
> bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> progress=1/2]
>
> Point 3: Is it "rebalanceId"?
>
> Point 5: I think we can output a summary for each supplier, so that we
> don't have to keep it in mind.
>
> [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> [rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> partitions=5, entries=100, duration=12ms, bytesRcvd=5M],
> [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00002, partitions=5, entries=100,
> duration=12ms, bytesRcvd=5M]]
>
> I can add "rebalanceId" to each message that you gave above.
>
> A detailed message will help us understand whether the suppliers were
> selected correctly.
>
> 06.05.2020, 22:08, "Alexei Scherbakov" <al...@gmail.com>:
> > Hello.
> >
> > Let's look at existing rebalancing log for a single group:
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
> > [order=[ignite-sys-cache, cache1, cache2, default],
> > top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d00002]
> > ...
> > [2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
> > mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> > partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
> > minorTopVer=1]]
> > [2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
> > mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> > partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
> > minorTopVer=1]]
> > [2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, fullPartitions=[1, 5, 7,
> 9,
> > 11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
> > [2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, fullPartitions=[6, 8, 10,
> > 16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
> > [2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
> > [grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
> > [2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
> > RebalanceFuture [grp=CacheGroupContext [grp=cache1],
> > topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
> > routines=2]
> >
> > From these logs I can already get answers to 1 and 4.
> > The logs look concise and easy to read and understand, and should
> > remain that way.
> >
> > But I think some proposed improvements can be done here without harm.
> >
> > 2. OK, let's add it to supplier info per cache with additional info:
> >
> > [2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
> > supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, entries=100,
> duration=12ms,
> > bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
> > progress=1/2]
> >
> > 3. This information is already printed to log.
> >
> > 5. OK, let's add a line which is printed after all groups in the chain
> are
> > rebalanced or cancelled with a summary:
> >
> > [2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
> > [rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]
> >
> > Another thing I would suggest: add rebalanceId to all messages for ease
> of
> > grepping and to distinguish multiple rebalances for the same topology
> version.
> >
> > Regarding the "detailed" message - I wasn't able to figure out any case
> > where it is useful. Not sure if we need it at all.
> > Kirill Tkalenko, can you give me an example?
> >
> > Wed, May 6, 2020 at 21:01, Maxim Muzafarov <mm...@apache.org>:
> >
> >>  Kirill,
> >>
> >>  Thank you for raising this topic. It's true that the rebalance process
> >>  still requires additional information for analyzing issues. Please,
> >>  don't think that I'm against your changes :-)
> >>
> >>  * My short answer. *
> >>
> >>  We won't do performance analysis on the production environment. Each
> >>  time we need performance analysis it will be done on a test
> >>  environment with verbose logging enabled. Thus I suggest moving these
> >>  changes to a separate `profiling` module and extend the logging much
> >>  more without any size limitations. The same as these [2] [3]
> >>  activities do.
> >>
> >>  Let's keep the `core` module as simple as possible.
> >>  Let's design the right API for accessing rebalance internals for
> >>  profiling tools.
> >>  Can you, please, remove all changes from your PR [6] which are not
> >>  related to the proposed topic? (e.g. variable renamings).
> >>
> >>  * The long answer. *
> >>
> >>  Here are my thoughts on this. There are two different types of issues
> >>  in the rebalance process. The first case must be covered by daily
> >>  monitoring subsystems, the second case must be covered by additional
> >>  profiling tools.
> >>  1. errors during the rebalancing (e.g. rebalance not happens when
> required)
> >>  2. performance rebalancing issues (e.g. rebalance is slow)
> >>
> >>  Daily monitoring tools (JMX, Logging) are always turned on and
> >>  shouldn't require additional system resources themselves. Since these
> >>  metrics must be lightweight any internal aggregation machinery is not
> >>  used on them. All these metrics are collected from each node
> >>  independently. Please, take a look at this issue [1] which covers most
> >>  of your questions mentioned above.
> >>
> >>  For all available metrics, we can configure LogExporterSpi, so they
> >>  will be available in logs.
> >>
> >>  > 1)How long was the rebalance (divided by supplier)?
> >>  rebalancingStartTime, rebalancingEndTime already exist for cache
> groups
> >>  [4].
> >>  We can add the same for suppliers.
> >>
> >>  > 2)How many records and bytes per supplier were rebalanced?
> >>  We already have rebalancingReceivedKeys, rebalancingReceivedBytes [4]
> >>  We will have rebalancingExpectedKeys [5].
> >>  We can add a new metric per cache keys, per supplier.
> >>
> >>  > 3)How many times did the rebalance restart?
> >>  rebalancingLastCancelledTime [4] metric exists.
> >>  Do we need to keep historical data on it?
> >>
> >>  > 4)Which partitions were rebalanced and from which nodes did they
> receive
> >>  them?
> >>  Let's print this information to log prior to the rebalance process
> >>  starts. This will be helpful information and does not require a lot of
> >>  changes.
> >>
> >>  > 5)When did rebalance for all cache groups end?
> >>  This metric may be aggregated from rebalancingEndTime [4] by pulling
> >>  it from all nodes for all caches.
> >>
> >>  Performance rebalancing issues are related to profiling tools. They
> >>  may require additional system resources and definitely require a
> >>  dedicated environment for tests. We can't do performance analysis on
> >>  production environments due to performance impact.
> >>  I see some disadvantages of adding such tools to production code:
> >>  - verbose logging may affect performance.
> >>  - the problematic process may become even worse if an automatic
> >>  threshold suddenly turns on.
> >>  - new code changes will require additional efforts to keep logging
> >>  up-to-date.
> >>
> >>  [1] https://issues.apache.org/jira/browse/IGNITE-12183
> >>  [2] https://issues.apache.org/jira/browse/IGNITE-12666
> >>  [3]
> >>
> https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
> >>  [4] https://issues.apache.org/jira/browse/IGNITE-12193
> >>  [5] https://issues.apache.org/jira/browse/IGNITE-12194
> >>  [6] https://github.com/apache/ignite/pull/7705/files
> >>
> >>  On Wed, 6 May 2020 at 19:50, Ivan Rakov <iv...@gmail.com> wrote:
> >>  >
> >>  > Hi,
> >>  >
> >>  > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> >>  > > duration rebalance of cache group after which partitions
> distribution
> >>  is
> >>  > > output, set in milliseconds, default value is 10 minutes.
> >>  >
> >>  > Does it mean that if the rebalancing process took less than 10
> minutes,
> >>  > only a short version of the message (with supplier statistics) will
> show
> >>  up?
> >>  >
> >>  > In general, I have no objections.
> >>  >
> >>  >
> >>  > On Mon, May 4, 2020 at 10:38 AM ткаленко кирилл <
> tkalkirill@yandex.ru>
> >>  > wrote:
> >>  >
> >>  > > Hi, Igniters!
> >>  > >
> >>  > > I'd like to share a new small feature in AI [1].
> >>  > >
> >>  > > Current rebalance logging does not allow you to quickly answer
> >>  following
> >>  > > questions:
> >>  > > 1)How long was the rebalance (divided by supplier)?
> >>  > > 2)How many records and bytes per supplier were rebalanced?
> >>  > > 3)How many times did rebalance restart?
> >>  > > 4)Which partitions were rebalanced and from which nodes did they
> >>  receive
> >>  > > them?
> >>  > > 5)When did rebalance for all cache groups end?
> >>  > >
> >>  > > What you can see in logs now:
> >>  > >
> >>  > > 1)Starting rebalance with order of cache groups.
> >>  > > Rebalancing scheduled [order=[ignite-sys-cache, grp1, grp0],
> >>  > > top=AffinityTopologyVersion [topVer=2, minorTopVer=0], force=false,
> >>  > > evt=NODE_JOINED, node=c2146a04-dc23-4bc9-870d-dfbb55c00001]
> >>  > >
> >>  > > 2)Start rebalance of cache group from a specific supplier,
> specifying
> >>  > > partition ids and mode - historical or full.
> >>  > > Starting rebalance routine [ignite-sys-cache,
> >>  > > topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0],
> >>  > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> fullPartitions=[0-99],
> >>  > > histPartitions=[]]
> >>  > >
> >>  > > 3)Getting partial or complete partitions of cache group.
> >>  > > Completed rebalancing [grp=ignite-sys-cache,
> >>  > > supplier=8c525892-703b-4fc4-b28b-b2f139700000,
> >>  > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=1/2]
> >>  > > Completed (final) rebalancing [grp=ignite-sys-cache,
> >>  > > supplier=c2146a04-dc23-4bc9-870d-dfbb55c00001,
> >>  > > topVer=AffinityTopologyVersion [topVer=5, minorTopVer=0],
> progress=2/2]
> >>  > >
> >>  > > 4)End rebalance of cache group.
> >>  > > Completed rebalance future: RebalanceFuture [grp=CacheGroupContext
> >>  > > [grp=ignite-sys-cache], topVer=AffinityTopologyVersion [topVer=2,
> >>  > > minorTopVer=0], rebalanceId=1, routines=1, receivedBytes=1200,
> >>  > > receivedKeys=0, partitionsLeft=0, startTime=1588519707607,
> endTime=-1,
> >>  > > lastCancelledTime=-1]
> >>  > >
> >>  > > Rebalance statistics:
> >>  > >
> >>  > > To speed up rebalance analysis, statistics will be output for each
> >>  cache
> >>  > > group and total for all cache groups.
> >>  > > If duration rebalance for cache group is greater than threshold
> value,
> >>  > > partition distribution is output.
> >>  > > Statistics will help you to analyze duration of the rebalance for each
> >>  supplier
> >>  > > to understand which of them has been transmitting data for longest
> >>  time.
> >>  > >
> >>  > > System properties are used to output statistics:
> >>  > >
> >>  > > IGNITE_QUIET - to output statistics, value must be false;
> >>  > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> >>  > > duration rebalance of cache group after which partitions
> distribution
> >>  is
> >>  > > output, set in milliseconds, default value is 10 minutes.
> >>  > >
> >>  > > Statistics examples:
> >>  > >
> >>  > > Successful full and historical rebalance of group cache, without
> >>  > > partitions distribution.
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181548,
> >>  > > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> >>  > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0,
> >>  p=5,
> >>  > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>  entries, b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest0]
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181547,
> >>  > > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> >>  > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>  hp=10,
> >>  > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries,
> b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>  > >
> >>  > > Successful full and historical rebalance of group cache, with
> >>  partitions
> >>  > > distribution.
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181548,
> >>  > > name=grp1, startTime=2020-04-13 10:55:16,117, finishTime=2020-04-13
> >>  > > 10:55:16,127, d=10 ms, restarted=0] Supplier statistics: [nodeId=0,
> >>  p=5,
> >>  > > d=10 ms] [nodeId=1, p=5, d=10 ms] Aliases: p - partitions, e -
> >>  entries, b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest0] Rebalance duration was
> greater
> >>  > > than 5 ms, printing detailed information about partitions
> distribution
> >>  > > (threshold can be changed by setting number of milliseconds into
> >>  > > IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD) 0 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 1 = [0,bu,su],[1,bu],[2,pr,su] 2 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 3 = [0,bu,su],[1,bu],[2,pr,su] 4 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 5 = [0,bu,su],[1,bu],[2,pr,su] 6 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 7 = [0,bu,su],[1,bu],[2,pr,su] 8 =
> >>  > > [0,bu,su],[1,bu],[2,pr,su] 9 = [0,bu,su],[1,bu],[2,pr,su] Aliases:
> pr -
> >>  > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>  mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest2]
> >>  > > [2=rebalancing.RebalanceStatisticsTest0]
> >>  > > Rebalance information per cache group (successful rebalance):
> >>  [id=3181547,
> >>  > > name=grp0, startTime=2020-04-13 15:01:44,000, finishTime=2020-04-13
> >>  > > 15:01:44,116, d=116 ms, restarted=0] Supplier statistics:
> [nodeId=0,
> >>  hp=10,
> >>  > > he=300, hb=30267, d=116 ms] Aliases: p - partitions, e - entries,
> b -
> >>  > > bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>  Rebalance
> >>  > > duration was greater than 100 ms, printing detailed information
> about
> >>  > > partitions distribution (threshold can be changed by setting
> number of
> >>  > > milliseconds into
> >>  IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD)
> >>  > > 0 = [0,pr,su],[1,bu],[2,bu] h 1 = [0,bu,su],[1,bu],[2,pr] h 2 =
> >>  > > [0,pr,su],[1,bu],[2,bu] h 3 = [0,bu,su],[1,bu],[2,pr] h 4 =
> >>  > > [0,pr,su],[1,bu],[2,bu] h 5 = [0,pr,su],[1,bu],[2,bu] h 6 =
> >>  > > [0,bu,su],[1,bu],[2,pr] h 7 = [0,pr,su],[1,bu],[2,bu] h 8 =
> >>  > > [0,pr,su],[1,bu],[2,bu] h 9 = [0,pr,su],[1,bu],[2,bu] h Aliases:
> pr -
> >>  > > primary, bu - backup, su - supplier node, h - historical, nodeId
> >>  mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest0]
> >>  > > [1=rebalancing.RebalanceStatisticsTest2]
> >>  > > [2=rebalancing.RebalanceStatisticsTest1]
> >>  > >
> >>  > > Interrupted rebalance of group cache.
> >>  > > Rebalance information per cache group (interrupted rebalance):
> >>  > > [id=644280849, name=default2, startTime=2020-04-13 14:55:24,969,
> >>  > > finishTime=2020-04-13 14:55:24,969, d=0 ms, restarted=0]
> >>  > >
> >>  > > Total full and historical rebalance for all cache groups.
> >>  > > Rebalance total information (including successful and not
> rebalances):
> >>  > > [startTime=2020-04-13 10:55:18,726, finishTime=2020-04-13
> 10:55:18,780,
> >>  > > d=54 ms] Supplier statistics: [nodeId=0, p=60, e=250, b=25000,
> d=54 ms]
> >>  > > [nodeId=1, p=60, e=250, b=24945, d=54 ms] Aliases: p - partitions,
> e -
> >>  > > entries, b - bytes, d - duration, h - historical, nodeId mapping
> >>  > > (nodeId=id,consistentId) [0=rebalancing.RebalanceStatisticsTest1]
> >>  > > [1=rebalancing.RebalanceStatisticsTest0]
> >>  > > Rebalance total information (including successful and not
> rebalances):
> >>  > > [startTime=2020-04-13 15:01:43,822, finishTime=2020-04-13
> 15:01:44,116,
> >>  > > d=294 ms] Supplier statistics: [nodeId=0, hp=20, he=500, hb=50445,
> >>  d=294
> >>  > > ms] Aliases: p - partitions, e - entries, b - bytes, d - duration,
> h -
> >>  > > historical, nodeId mapping (nodeId=id,consistentId)
> >>  > > [0=rebalancing.RebalanceStatisticsTest0]
> >>  > >
> >>  > > [1] - https://issues.apache.org/jira/browse/IGNITE-12080
> >
> > --
> >
> > Best regards,
> > Alexei Scherbakov
>


-- 

Best regards,
Alexei Scherbakov

Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Hello, Alexey! Unfortunately, my response was delayed.

Point 2: Let's do it as you suggested, but I think it is still worth specifying how many partitions were received.

[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100, duration=12ms,
bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
progress=1/2]

Point 3: Is it "rebalanceId"?

Point 5: I think we can also output a summary for each supplier, so that we do not have to keep those numbers in mind.

[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
[rebalanceId=2, [supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, partitions=5, entries=100, duration=12ms, bytesRcvd=5M], 
[supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00002, partitions=5, entries=100, duration=12ms, bytesRcvd=5M]]

I can add "rebalanceId" to each message that you gave at above.

The detailed message will help us understand whether the suppliers were selected correctly.


Re: Extended logging for rebalance performance analysis

Posted by Alexei Scherbakov <al...@gmail.com>.
Hello.

Let's look at the existing rebalancing log for a single group:

[2020-05-06 20:56:36,999][INFO ][...] Rebalancing scheduled
[order=[ignite-sys-cache, cache1, cache2, default],
top=AffinityTopologyVersion [topVer=3, minorTopVer=1],
evt=DISCOVERY_CUSTOM_EVT, node=9d9edb7b-eb01-47a1-8ff9-fef715d00002]
...
[2020-05-06 20:56:37,034][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
partitionsCount=11, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Prepared rebalancing [grp=cache1,
mode=ASYNC, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
partitionsCount=10, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1]]
[2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine [cache1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, fullPartitions=[1, 5, 7, 9,
11, 13, 15, 23, 27, 29, 31], histPartitions=[]]
[2020-05-06 20:56:37,037][INFO ][...] Starting rebalance routine [cache1,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000, fullPartitions=[6, 8, 10,
16, 18, 20, 22, 24, 26, 28], histPartitions=[]]
[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=1/2]
[2020-05-06 20:56:37,046][INFO ][...] Completed (final) rebalancing
[grp=cache1, supplier=b3f3aeeb-5fa0-42f7-a74e-cf39fa500000,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], progress=2/2]
[2020-05-06 20:56:37,048][INFO ][...] Completed rebalance future:
RebalanceFuture [grp=CacheGroupContext [grp=cache1],
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], rebalanceId=2,
routines=2]

From these logs I can already get answers to 1 and 4.
The logs look concise, easy to read and understand, and should
remain that way.

But I think some proposed improvements can be done here without harm.

2. OK, let's add it to the per-cache supplier info, with additional details:

[2020-05-06 20:56:37,044][INFO ][...] Completed rebalancing [grp=cache1,
supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, entries=100, duration=12ms,
bytesRcvd=5M, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1],
progress=1/2]

3. This information is already printed to the log.

5. OK, let's add a summary line which is printed after all groups in the
chain are rebalanced or cancelled:

[2020-05-06 20:56:36,999][INFO ][...] Completed rebalance chain:
[rebalanceId=2, entries=200, duration=50ms, bytesRcvd=10M]

Another thing I would suggest: add rebalanceId to all messages for ease of
grepping and to distinguish multiple rebalances for the same topology version.
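
For illustration only (a purely hypothetical format), a routine-start
message would then look like:

[2020-05-06 20:56:37,036][INFO ][...] Starting rebalance routine
[rebalanceId=2, cache1, topVer=AffinityTopologyVersion [topVer=3,
minorTopVer=1], supplier=94a3fcbc-18d5-4c64-b0ab-4313aba00001, ...]

so that grepping for "rebalanceId=2" returns every message of that
particular rebalance.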

Regarding the "detailed" message - I wasn't able to figure out any case
where it is useful. Not sure if we need it at all.
Kirill Tkalenko, can you give me an example?



-- 

Best regards,
Alexei Scherbakov

Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Hi, Maxim! Unfortunately, my answer was delayed.

The production environment may not be ideal, and problems may occur in it. I only suggest adding a little logging to help understand some of the cases that may occur during operation. In addition, this functionality can be disabled via a system property. As for the variable renamings, they were made to improve code readability.

We could do it via metrics, but whether we need to store them for all suppliers, and for how long, is unclear. In my case, I do a simple aggregation and clear it after output.
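
For clarity, a minimal sketch of enabling the statistics in a test (the
property names are the ones from this thread; the surrounding wiring is
only illustrative):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class RebalanceStatsSetup {
    public static void main(String[] args) {
        // Quiet mode must be off for the statistics to be printed.
        System.setProperty("IGNITE_QUIET", "false");

        // Print the partition distribution for any cache group whose rebalance
        // took longer than 100 ms (the default threshold is 10 minutes).
        System.setProperty("IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD", "100");

        try (Ignite ignite = Ignition.start(new IgniteConfiguration())) {
            // Start more nodes and load data here to trigger rebalancing.
        }
    }
}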



Re: Extended logging for rebalance performance analysis

Posted by Maxim Muzafarov <mm...@apache.org>.
Kirill,


Thank you for raising this topic. It's true that the rebalance process
still requires additional information for analyzing issues. Please,
don't think that I'm against your changes :-)

* My short answer. *

We won't do performance analysis on a production environment. Each
time we need performance analysis, it will be done on a test
environment with verbose logging enabled. Thus I suggest moving these
changes to a separate `profiling` module and extending the logging
much more without any size limitations, the same as these [2] [3]
activities do.

Let's keep the `core` module as simple as possible.
Let's design the right API for accessing rebalance internals for
profiling tools.
Can you please remove all changes from your PR [6] which are not
related to the proposed topic (e.g. variable renamings)?


* The long answer. *

Here are my thoughts on this. There are two different types of issues
in the rebalance process. The first case must be covered by daily
monitoring subsystems, the second case must be covered by additional
profiling tools.
1. errors during rebalancing (e.g. rebalance does not happen when required)
2. rebalancing performance issues (e.g. rebalance is slow)

Daily monitoring tools (JMX, logging) are always turned on and
shouldn't require additional system resources themselves. Since these
metrics must be lightweight, no internal aggregation machinery is
used on them. All these metrics are collected from each node
independently. Please, take a look at this issue [1] which covers most
of your questions mentioned above.

For all available metrics, we can configure LogExporterSpi, so they
will be available in logs.
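
For example, a minimal sketch of such a configuration (assuming the new
metrics framework from Ignite 2.8; the period value is illustrative):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.metric.log.LogExporterSpi;

// Dump all metric registries, including rebalance metrics, to the log
// once a minute.
LogExporterSpi logExporter = new LogExporterSpi();
logExporter.setPeriod(60_000L);

IgniteConfiguration cfg = new IgniteConfiguration()
    .setMetricExporterSpi(logExporter);

Ignition.start(cfg);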

> 1)How long was the balance(divided by supplier)?
rebalancingStartTime and rebalancingEndTime already exist for cache groups [4].
We can add the same for suppliers.

> 2)How many records and bytes per supplier were rebalanced?
We already have rebalancingReceivedKeys and rebalancingReceivedBytes [4],
and we will have rebalancingExpectedKeys [5].
We can add a new keys metric per cache, per supplier.

> 3)How many times did the rebalance restart?
The rebalancingLastCancelledTime metric [4] already exists.
Do we need to keep historical data on it?

> 4)Which partitions were rebalanced and from which nodes did they receive them?
Let's print this information to the log before the rebalance process
starts. This will be helpful information and does not require a lot of
changes.

> 5)When did rebalance for all cache groups end?
This metric may be aggregated from rebalancingEndTime [4] by pulling
it from all nodes for all caches.
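
A rough sketch of such pulling via JMX (assumptions: a JMX metric
exporter is enabled on each node, and the attribute name below matches
the metric name discussed here - the "org.apache" domain filter and the
attribute casing are guesses):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RebalanceEndTimeProbe {
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // Scan MBeans and print the rebalancing end time wherever exposed.
        for (ObjectName name : srv.queryNames(new ObjectName("org.apache:*"), null)) {
            try {
                Object endTime = srv.getAttribute(name, "RebalancingEndTime");
                System.out.println(name + " -> RebalancingEndTime=" + endTime);
            }
            catch (Exception ignored) {
                // The MBean does not expose this attribute - skip it.
            }
        }
        // Repeating this on every node and taking the maximum gives the
        // moment when rebalance for all cache groups ended.
    }
}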


Rebalancing performance issues are related to profiling tools. They
may require additional system resources and definitely require a
dedicated environment for tests. We can't do performance analysis on
production environments due to performance impact.
I see some disadvantages of adding such tools to production code:
- verbose logging may affect performance.
- the problematic process may become even worse if an automatic
threshold suddenly turns on.
- new code changes will require additional efforts to keep logging up-to-date.


[1] https://issues.apache.org/jira/browse/IGNITE-12183
[2] https://issues.apache.org/jira/browse/IGNITE-12666
[3] https://cwiki.apache.org/confluence/display/IGNITE/Cluster+performance+profiling+tool
[4] https://issues.apache.org/jira/browse/IGNITE-12193
[5] https://issues.apache.org/jira/browse/IGNITE-12194
[6] https://github.com/apache/ignite/pull/7705/files


Re: Extended logging for rebalance performance analysis

Posted by ткаленко кирилл <tk...@yandex.ru>.
Hi! Yes, that's right.

06.05.2020, 19:50, "Ivan Rakov" <iv...@gmail.com>:
> Hi,
>
> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
>>  duration rebalance of cache group after which partitions distribution is
>>  output, set in milliseconds, default value is 10 minutes.
>
>  Does it mean that if the rebalancing process took less than 10 minutes,
> only a short version of the message (with supplier statistics) will show up?
>
> In general, I have no objections.

Re: Extended logging for rebalance performance analysis

Posted by Ivan Rakov <iv...@gmail.com>.
Hi,

> IGNITE_WRITE_REBALANCE_PARTITION_DISTRIBUTION_THRESHOLD - threshold
> duration rebalance of cache group after which partitions distribution is
> output, set in milliseconds, default value is 10 minutes.

 Does it mean that if the rebalancing process took less than 10 minutes,
only a short version of the message (with supplier statistics) will show up?

In general, I have no objections.

