You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "Mitch Barnett (JIRA)" <ji...@apache.org> on 2019/04/12 00:51:00 UTC

[jira] [Updated] (KUDU-2762) Improve write transaction tracing

     [ https://issues.apache.org/jira/browse/KUDU-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mitch Barnett updated KUDU-2762:
--------------------------------
    Status: In Review  (was: In Progress)

> Improve write transaction tracing
> ---------------------------------
>
>                 Key: KUDU-2762
>                 URL: https://issues.apache.org/jira/browse/KUDU-2762
>             Project: Kudu
>          Issue Type: Improvement
>            Reporter: Will Berkeley
>            Assignee: Mitch Barnett
>            Priority: Major
>
> Here's a write transaction trace:
> {noformat}
> W0331 02:16:27.859648 27337 rpcz_store.cc:244] Call kudu.tserver.TabletServerService.Write from 10.80.34.74:58250 (ReqId={client: 6a904be8604b482989e3d1592f8824f2, seq_no=66108, attempt_no=1}) took 7855ms (client timeout 10000).
> W0331 02:16:27.859728 27337 rpcz_store.cc:248] Trace:
> 0331 02:16:20.003835 (+     0us) service_pool.cc:159] Inserting onto call queue
> 0331 02:16:20.003844 (+     9us) service_pool.cc:218] Handling call
> 0331 02:16:27.859641 (+7855797us) inbound_call.cc:157] Queueing success response
> Related trace 'txn':
> 0331 02:16:20.003988 (+     0us) write_transaction.cc:101] PREPARE: Starting
> 0331 02:16:20.004087 (+    99us) write_transaction.cc:268] Acquiring schema lock in shared mode
> 0331 02:16:20.004088 (+     1us) write_transaction.cc:271] Acquired schema lock
> 0331 02:16:20.004088 (+     0us) tablet.cc:400] PREPARE: Decoding operations
> 0331 02:16:20.004130 (+    42us) tablet.cc:422] PREPARE: Acquiring locks for 1 operations
> 0331 02:16:20.004137 (+     7us) tablet.cc:426] PREPARE: locks acquired
> 0331 02:16:20.004138 (+     1us) write_transaction.cc:126] PREPARE: finished.
> 0331 02:16:20.004154 (+    16us) write_transaction.cc:136] Start()
> 0331 02:16:20.004157 (+     3us) write_transaction.cc:141] Timestamp: P: 1554016580004153 usec, L: 0
> 0331 02:16:20.004192 (+    35us) log.cc:582] Serialized 3741 byte log entry
> 0331 02:16:27.859496 (+7855304us) write_transaction.cc:149] APPLY: Starting
> 0331 02:16:27.859608 (+   112us) tablet_metrics.cc:365] ProbeStats: bloom_lookups=2,key_file_lookups=2,delta_file_lookups=4,mrs_lookups=0
> 0331 02:16:27.859614 (+     6us) log.cc:582] Serialized 28 byte log entry
> 0331 02:16:27.859622 (+     8us) write_transaction.cc:309] Releasing row and schema locks
> 0331 02:16:27.859623 (+     1us) write_transaction.cc:277] Released schema lock
> 0331 02:16:27.859625 (+     2us) write_transaction.cc:196] FINISH: updating metrics
> Metrics: {"tcmalloc_contention_cycles":7552,"child_traces":[["txn",{"apply.queue_time_us":7854429,"cfile_cache_hit":12,"cfile_cache_hit_bytes":72423,"delta_iterators_relevant":2,"num_ops":1,"prepare.queue_time_us":11,"prepare.run_cpu_time_us":226,"prepare.run_wall_time_us":225,"raft.queue_time_us":12,"raft.run_cpu_time_us":92,"raft.run_wall_time_us":91,"replication_time_us":898,"spinlock_wait_cycles":18688}]]}
> {noformat}
> It could use some polish. Here's a few things to fix:
> 1. Nit: the casing and punctuation of {{PREPARE: finished.}} and {{APPLY: Starting}} is inconsistent.
> 2. {{PREPARE}} has a start and end; {{APPLY}} has a start but no end; replication has neither. We should add the missing trace events.
> 3. nit: There's no need for the {{PREPARE:}} preamble to trace events within {{PREPARE: Starting}} and {{PREPARE: finished.}}.
> It's potentially misleading to read this trace. Almost all of the time is spent between the trace event reporting the local WAL was written and the trace event reporting apply started. Naively, this should mean the operation took a long time to replicate, as that's the most obvious thing that could take a long time. However, the metrics show the time was taken waiting in the apply queue, held up by some other slow activity. This operation was fast to perform-- it took a long time because it was held up, idling, by something else.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)