You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "Grant Henke (Jira)" <ji...@apache.org> on 2020/06/03 14:20:00 UTC

[jira] [Updated] (KUDU-2681) Account for already-running tasks before sending new ones upon processing tablet reports

     [ https://issues.apache.org/jira/browse/KUDU-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke updated KUDU-2681:
------------------------------
    Labels: scalability supportability  (was: )

> Account for already-running tasks before sending new ones upon processing tablet reports
> ----------------------------------------------------------------------------------------
>
>                 Key: KUDU-2681
>                 URL: https://issues.apache.org/jira/browse/KUDU-2681
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.1
>            Reporter: Andrew Wong
>            Priority: Major
>              Labels: scalability, supportability
>
> I've seen a case where the master will reschedule the same delete tablet task for a given tablet multiple times, e.g. because it received a new tablet report that the tablet still exists on a given tserver. This results in significant log-spam, and ends up sending excessive RPCs to the tablet servers. Here are some master logs demonstrating this (note the repeated attempt numbers):
>  
> {{I0129 05:09:43.918886 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))}}
> {{W0129 05:09:43.919509 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet}}
> {{I0129 05:09:43.919517 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8226 ms (attempt = 10)}}
> {{I0129 05:09:43.960479 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))}}
> {{W0129 05:09:43.961150 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet}}
> {{I0129 05:09:43.961158 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8235 ms (attempt = 10)}}
> {{I0129 05:09:44.016152 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))}}
> {{W0129 05:09:44.016383 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet}}
> {{I0129 05:09:44.016391 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8206 ms (attempt = 10)}}
> {{I0129 05:09:44.226428 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))}}
> {{W0129 05:09:44.226753 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet}}
> {{I0129 05:09:44.226773 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8207 ms (attempt = 10)}}
> {{I0129 05:09:44.234709 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))}}
> {{W0129 05:09:44.234923 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet}}
> {{I0129 05:09:44.234936 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8199 ms (attempt = 10)}}
> {{I0129 05:09:44.322351 22190 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 1d75a2458b544c6ea01fb6ccb238ebbb on 90369522338b4763ae25dd0161d6e548 (server:7050) (Replica with old config index 3048677 (current committed config index is 3054594))}}
> {{W0129 05:09:44.322854 22190 catalog_manager.cc:2892] TS 90369522338b4763ae25dd0161d6e548 (server:7050): delete failed for tablet 1d75a2458b544c6ea01fb6ccb238ebbb with error code TABLET_NOT_RUNNING: Already present: State transition of tablet 1d75a2458b544c6ea01fb6ccb238ebbb already in progress: opening tablet}}
> {{I0129 05:09:44.322860 22190 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb Delete Tablet RPC for TS=90369522338b4763ae25dd0161d6e548 with a delay of 8207 ms (attempt = 10)}}
>  
> Here is a similar log for alter schema requests:
>  
> {{I0129 05:09:50.798943 22188 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60047 ms (attempt = 27)}}
> {{W0129 05:09:50.798954 22188 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e}}
> {{I0129 05:09:50.889256 22191 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60044 ms (attempt = 27)}}
> {{W0129 05:09:50.889267 22191 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e}}
> {{I0129 05:09:50.928457 22188 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60001 ms (attempt = 27)}}
> {{W0129 05:09:50.928469 22188 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e}}
> {{I0129 05:09:51.024364 22188 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60036 ms (attempt = 27)}}
> {{W0129 05:09:51.024372 22188 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e}}
> {{I0129 05:09:51.061228 22191 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60001 ms (attempt = 27)}}
> {{W0129 05:09:51.061239 22191 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e}}
> {{I0129 05:09:51.080605 22191 catalog_manager.cc:2700] Scheduling retry of 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC with a delay of 60038 ms (attempt = 27)}}
> {{W0129 05:09:51.080643 22191 catalog_manager.cc:2719] Async tablet task 1d75a2458b544c6ea01fb6ccb238ebbb (table <table_name> [id=7149156d1c2f4f0d88e131848a907d8a]) Alter Table RPC failed: Not found: failed to reset TS proxy: Could not find TS for UUID 94b8851c35914acca3ebea8c0ba6e06e}}
>  
> Kudu should take into account existing, retrying tasks running on the master to make sure we don't stack up in this way.
> I've seen this confirmed with Kudu 1.7.1, but looking through the code at CatalogManager::ProcessTabletReport, this seems possible on master too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)