You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@kudu.apache.org by Zhen Zhang <zh...@gmail.com> on 2018/01/26 08:49:30 UTC

Scan timeout and slow sync call on TS side

Hi Kudu Users,

We use Kudu1.3 and encountered the following exception yesterday:

2018-01-25,20:14:42,001 WARN org.apache.kudu.client.AsyncKuduScanner:
1042e82c70594e3aaca741b686aa91fa pretends to not know
KuduScanner(table=b2c_i18n.kudu_i18n_web_mistat_data,
tablet=1042e82c70594e3aaca741b686aa91fa,
scannerId="14f926d1384f4d848f62a7e701d2d623", scanRequestTimeout=30000)
org.apache.kudu.client.NonRecoverableException: Invalid call sequence ID in
scan request
at
org.apache.kudu.client.TabletClient.dispatchTSErrorOrReturnException(TabletClient.java:518)
at
org.apache.kudu.client.TabletClient.messageReceived(TabletClient.java:450)
at
org.apache.kudu.client.shaded.org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.apache.kudu.client.TabletClient.handleUpstream(TabletClient.java:595)


The following seems to be the corresponding log on TS side:

W0125 20:14:43.624604 69076 rpcz_store.cc:234] Call
kudu.tserver.TabletServerService.Scan from 10.68.146.36:35782 (request call
id 57) took 11635ms (client timeout 10000).
W0125 20:14:43.624671 69076 rpcz_store.cc:238] Trace:
0125 20:14:31.989201 (+     0us) service_pool.cc:143] Inserting onto call
queue
0125 20:14:31.989249 (+    48us) service_pool.cc:202] Handling call
0125 20:14:31.989532 (+   283us) tablet_service.cc:1669] Found scanner
14f926d1384f4d848f62a7e701d2d623
0125 20:14:43.621814 (+11632282us) tablet_service.cc:1728] Deadline expired
- responding early
0125 20:14:43.624597 (+  2783us) inbound_call.cc:130] Queueing success
response
Metrics:
{"cfile_cache_hit":21,"cfile_cache_hit_bytes":180754,"cfile_cache_miss":28,"cfile_cache_miss_bytes":780542,"lbm_read_time_us":11601328,"lbm_reads_1-10_ms":3,"lbm_reads_10-100_ms":6,"lbm_reads_gt_100_ms":14,"lbm_reads_lt_1ms":5,"spinlock_wait_cycles":21888,"tcmalloc_contention_cycles":220544}


It seems kudu spend too much time processing the scan request. The CPU load
then was about 7 ~ 10, and the machine has 36 cores.

I also found some warning logs on the server side like:

W0125 20:14:31.228502 68713 env_posix.cc:682] Time spent sync call for
/home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/f24c68a1bd1a4ed9be6b4d612487ba42.metadata:
real 1.799s   user 0.000s     sys 0.000s
W0125 20:14:32.640887 68713 env_posix.cc:682] Time spent sync call for
/home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/6f29e46d493a40e3a3610e7cff8023b6.metadata:
real 1.410s   user 0.000s     sys 0.000s
W0125 20:14:34.116614 68716 env_posix.cc:682] Time spent sync call for
/home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/tablet-meta/21b28a6ca38548b7afab09324ecf559a.kudutmp.wVZgBz:
real 5.792s      user 0.000s     sys 0.001s
W0125 20:14:35.665977 68713 env_posix.cc:682] Time spent sync call for
/home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/a1ce36a69a22427d802f7a856f1f6582.metadata:
real 2.900s   user 0.000s     sys 0.000s
W0125 20:14:39.278427 68713 env_posix.cc:682] Time spent sync call for
/home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/72f36af948684ca2a519e910f76ca91f.metadata:
real 1.650s   user 0.000s     sys 0.000s


Some sync call also seems slow, for example
21b28a6ca38548b7afab09324ecf559a takes more than 5 seconds to sync. I
checked more about this tablet and found it has about 110GB on-disk data.

I wonder if this huge tablet cause all the problems, including slow sync
and slow scan, but these are tablets of two different tables, and the cpu
load seems not high, so I'm not sure about it. What's your idea?

Thanks,
Zhen

Re: Scan timeout and slow sync call on TS side

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Zhen,

Yea, that trace does seem to indicate just slow IO on that machine. I'm not
sure why the iowait would be localized to certain cores -- I've never
actually looked at tracking iowait on a per-core basis before, and not sure
whether it actually carries any meaning. If a thread is waiting on IO, it
is not scheduled to any core, so I guess the iowait per core is just
measuring whichever core that thread happened to run on most recently.

I'd double check that host for whether there is anything wrong with its
disks. It may also just be a case of an overloaded machine with not enough
RAM for effective caching, etc.

-Todd

On Fri, Jan 26, 2018 at 1:01 AM, Zhen Zhang <zh...@gmail.com> wrote:

>
> ​
> ​Attach iowait of all 32 cores. It seems only several cores' iowait is
> high.
>
> 2018-01-26 16:49 GMT+08:00 Zhen Zhang <zh...@gmail.com>:
>
>> Hi Kudu Users,
>>
>> We use Kudu1.3 and encountered the following exception yesterday:
>>
>> 2018-01-25,20:14:42,001 WARN org.apache.kudu.client.AsyncKuduScanner:
>> 1042e82c70594e3aaca741b686aa91fa pretends to not know
>> KuduScanner(table=b2c_i18n.kudu_i18n_web_mistat_data,
>> tablet=1042e82c70594e3aaca741b686aa91fa, scannerId="14f926d1384f4d848f62a7e701d2d623",
>> scanRequestTimeout=30000)
>> org.apache.kudu.client.NonRecoverableException: Invalid call sequence ID
>> in scan request
>> at org.apache.kudu.client.TabletClient.dispatchTSErrorOrReturnE
>> xception(TabletClient.java:518)
>> at org.apache.kudu.client.TabletClient.messageReceived(TabletCl
>> ient.java:450)
>> at org.apache.kudu.client.shaded.org.jboss.netty.channel.Simple
>> ChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamH
>> andler.java:70)
>> at org.apache.kudu.client.TabletClient.handleUpstream(TabletCli
>> ent.java:595)
>>
>>
>> The following seems to be the corresponding log on TS side:
>>
>> W0125 20:14:43.624604 69076 rpcz_store.cc:234] Call
>> kudu.tserver.TabletServerService.Scan from 10.68.146.36:35782 (request
>> call id 57) took 11635ms (client timeout 10000).
>> W0125 20:14:43.624671 69076 rpcz_store.cc:238] Trace:
>> 0125 20:14:31.989201 (+     0us) service_pool.cc:143] Inserting onto call
>> queue
>> 0125 20:14:31.989249 (+    48us) service_pool.cc:202] Handling call
>> 0125 20:14:31.989532 (+   283us) tablet_service.cc:1669] Found scanner
>> 14f926d1384f4d848f62a7e701d2d623
>> 0125 20:14:43.621814 (+11632282us) tablet_service.cc:1728] Deadline
>> expired - responding early
>> 0125 20:14:43.624597 (+  2783us) inbound_call.cc:130] Queueing success
>> response
>> Metrics: {"cfile_cache_hit":21,"cfile_cache_hit_bytes":180754,"cfile_
>> cache_miss":28,"cfile_cache_miss_bytes":780542,"lbm_read_
>> time_us":11601328,"lbm_reads_1-10_ms
>> <https://maps.google.com/?q=1-10_ms%22:3&entry=gmail&source=g>
>> ":3,"lbm_reads_10-100_ms":6
>> <https://maps.google.com/?q=10-100_ms%22:6&entry=gmail&source=g>
>> ,"lbm_reads_gt_100_ms":14,"lbm_reads_lt_1ms":5,"
>> spinlock_wait_cycles":21888,"tcmalloc_contention_cycles":220544}
>>
>>
>> It seems kudu spend too much time processing the scan request. The CPU
>> load then was about 7 ~ 10, and the machine has 36 cores.
>>
>> I also found some warning logs on the server side like:
>>
>> W0125 20:14:3
>> <https://maps.google.com/?q=1-10_ms%22:3&entry=gmail&source=g>1.228502
>> 68713 env_posix.cc:682] Time spent sync call for
>> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/
>> data/f24c68a1bd1a4ed9be6b4d612487ba42.metadata: real 1.799s   user
>> 0.000s     sys 0.000s
>> W0125 20:14:32.640887 68713 env_posix.cc:682] Time spent sync call for
>> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/
>> data/6f29e46d493a40e3a3610e7cff8023b6.metadata: real 1.410s   user
>> 0.000s     sys 0.000s
>> W0125 20:14:34.116614 68716 env_posix.cc:682] Time spent sync call for
>> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/
>> tablet-meta/21b28a6ca38548b7afab09324ecf559a.kudutmp.wVZgBz: real
>> 5.792s      user 0.000s     sys 0.001s
>> W0125 20:14:35.665977 68713 env_posix.cc:682] Time spent sync call for
>> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/
>> data/a1ce36a69a22427d802f7a856f1f6582.metadata: real 2.900s   user
>> 0.000s     sys 0.000s
>> W0125 20:14:39.278427 68713 env_posix.cc:682] Time spent sync call for
>> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/
>> data/72f36af948684ca2a519e910f76ca91f.metadata: real 1.650s   user
>> 0.000s     sys 0.000s
>>
>>
>> Some sync call also seems slow, for example 21b28a6ca38548b7afab09324ecf559a
>> takes more than 5 seconds to sync. I checked more about this tablet and
>> found it has about 110GB on-disk data.
>>
>> I wonder if this huge tablet cause all the problems, including slow sync
>> and slow scan, but these are tablets of two different tables, and the cpu
>> load seems not high, so I'm not sure about it. What's your idea?
>>
>> Thanks,
>> Zhen
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Scan timeout and slow sync call on TS side

Posted by Zhen Zhang <zh...@gmail.com>.
​
​Attach iowait of all 32 cores. It seems only several cores' iowait is high.

2018-01-26 16:49 GMT+08:00 Zhen Zhang <zh...@gmail.com>:

> Hi Kudu Users,
>
> We use Kudu1.3 and encountered the following exception yesterday:
>
> 2018-01-25,20:14:42,001 WARN org.apache.kudu.client.AsyncKuduScanner:
> 1042e82c70594e3aaca741b686aa91fa pretends to not know
> KuduScanner(table=b2c_i18n.kudu_i18n_web_mistat_data, tablet=
> 1042e82c70594e3aaca741b686aa91fa, scannerId="
> 14f926d1384f4d848f62a7e701d2d623", scanRequestTimeout=30000)
> org.apache.kudu.client.NonRecoverableException: Invalid call sequence ID
> in scan request
> at org.apache.kudu.client.TabletClient.dispatchTSErrorOrReturnExcepti
> on(TabletClient.java:518)
> at org.apache.kudu.client.TabletClient.messageReceived(
> TabletClient.java:450)
> at org.apache.kudu.client.shaded.org.jboss.netty.channel.
> SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.
> java:70)
> at org.apache.kudu.client.TabletClient.handleUpstream(
> TabletClient.java:595)
>
>
> The following seems to be the corresponding log on TS side:
>
> W0125 20:14:43.624604 69076 rpcz_store.cc:234] Call kudu.tserver.TabletServerService.Scan
> from 10.68.146.36:35782 (request call id 57) took 11635ms (client timeout
> 10000).
> W0125 20:14:43.624671 69076 rpcz_store.cc:238] Trace:
> 0125 20:14:31.989201 (+     0us) service_pool.cc:143] Inserting onto call
> queue
> 0125 20:14:31.989249 (+    48us) service_pool.cc:202] Handling call
> 0125 20:14:31.989532 (+   283us) tablet_service.cc:1669] Found scanner
> 14f926d1384f4d848f62a7e701d2d623
> 0125 20:14:43.621814 (+11632282us) tablet_service.cc:1728] Deadline
> expired - responding early
> 0125 20:14:43.624597 (+  2783us) inbound_call.cc:130] Queueing success
> response
> Metrics: {"cfile_cache_hit":21,"cfile_cache_hit_bytes":180754,"
> cfile_cache_miss":28,"cfile_cache_miss_bytes":780542,"lbm_
> read_time_us":11601328,"lbm_reads_1-10_ms":3,"lbm_reads_
> 10-100_ms":6,"lbm_reads_gt_100_ms":14,"lbm_reads_lt_1ms":
> 5,"spinlock_wait_cycles":21888,"tcmalloc_contention_cycles":220544}
>
>
> It seems kudu spend too much time processing the scan request. The CPU
> load then was about 7 ~ 10, and the machine has 36 cores.
>
> I also found some warning logs on the server side like:
>
> W0125 20:14:31.228502 68713 env_posix.cc:682] Time spent sync call for
> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/
> f24c68a1bd1a4ed9be6b4d612487ba42.metadata: real 1.799s   user 0.000s
>  sys 0.000s
> W0125 20:14:32.640887 68713 env_posix.cc:682] Time spent sync call for
> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/
> 6f29e46d493a40e3a3610e7cff8023b6.metadata: real 1.410s   user 0.000s
>  sys 0.000s
> W0125 20:14:34.116614 68716 env_posix.cc:682] Time spent sync call for
> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/tablet-meta/
> 21b28a6ca38548b7afab09324ecf559a.kudutmp.wVZgBz: real 5.792s      user
> 0.000s     sys 0.001s
> W0125 20:14:35.665977 68713 env_posix.cc:682] Time spent sync call for
> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/
> a1ce36a69a22427d802f7a856f1f6582.metadata: real 2.900s   user 0.000s
>  sys 0.000s
> W0125 20:14:39.278427 68713 env_posix.cc:682] Time spent sync call for
> /home/work/app/kudu/azsgprc-xiaomi/kudu_tablet_server/hdd2/data/
> 72f36af948684ca2a519e910f76ca91f.metadata: real 1.650s   user 0.000s
>  sys 0.000s
>
>
> Some sync call also seems slow, for example 21b28a6ca38548b7afab09324ecf559a
> takes more than 5 seconds to sync. I checked more about this tablet and
> found it has about 110GB on-disk data.
>
> I wonder if this huge tablet cause all the problems, including slow sync
> and slow scan, but these are tablets of two different tables, and the cpu
> load seems not high, so I'm not sure about it. What's your idea?
>
> Thanks,
> Zhen
>
>