You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafficserver.apache.org by "AdunGaos (JIRA)" <ji...@apache.org> on 2011/07/02 04:57:28 UTC

[jira] [Commented] (TS-852) hostdb and dns system hangup

    [ https://issues.apache.org/jira/browse/TS-852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058974#comment-13058974 ] 

AdunGaos commented on TS-852:
-----------------------------

We have do lots of work at this issue, found some information about it:

Default: 
{code}
proxy.config.dns.lookup_timeout=20
proxy.config.hostdb.ttl_mode=0
proxy.config.hostdb.verify_after=720
{code}
Our dns ttl is 300 seconds, this cause when a hostdb entry timeout, it will send a dns requst to dns system, ATS CAN NOT DO ANYTHING about this domain's request until dns system return an answer. BUT a dns request may be timeout in some status. this cause ATS DON'T WORK a long time(proxy.config.dns.lookup_timeout=20). In our test, about 3000qps, when this issue appear, will cause ATS stack so many request, even after dns timeout, a dns retry answer right, OLD requst take a long time to dequeue from hostdb. so it seems like hostdb hungup.

DNS requst may timeout because next issue:
Our nameserver is 127.0.0.1, it will answer requst so quickly less than 0.4ms. When a thread send dns requst, and it not ready for receive answer, the answer received by anthoer thread, It NOT FOUND correct DNS id from dnsProcessor.handler, cause this answer droped, until dns timeout, it do a retry.

Next log is our set proxy.config.dns.lookup_timeout=2:

step 1. send request:
{code}
[Jun 26 03:29:12.714] Server {1103427904} DEBUG: (dns) received query img.taobaocdn.com type = 1, timeout = ,0
[Jun 26 03:29:12.714] Server {1103427904} DEBUG: (dns) enqueing query img.taobaocdn.com
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) adding first to collapsing queue
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) send query for img.taobaocdn.com to fd 51
{code}
step 2. answer received by an other thread but drop it.
{code}
[Jun 26 03:29:12.715] Server {47576670086256} DEBUG: (dns) received packet size = 157
[Jun 26 03:29:12.715] Server {47576670086256} DEBUG: (dns) primary DNS response code = 0
[Jun 26 03:29:12.715] Server {47576670086256} DEBUG: (dns) unknown DNS id = 62816
{code}
step 3. thread 1 sent finished after answer has been droped.
{code}
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) sent qname = img.taobaocdn.com, id = 62816, nameserver = 0
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) sent_one: failover_number for resolver 0 is 1
{code}
step 4. timeout for query, do a retry and get right answer this time.
{code}
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) timeout for query img.taobaocdn.com
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) marking img.taobaocdn.com as not-written
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) doing retry for img.taobaocdn.com
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) send query for img.taobaocdn.com to fd 51
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) sent qname = img.taobaocdn.com, id = 39679, nameserver = 0
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) sent_one: failover_number for resolver 0 is 2
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received packet size = 157
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) primary DNS response code = 0
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received cname = taobao
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A name = taobao
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A = 122.224.194.211
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A = 122.224.194.212
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A = 110.75.3.141
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A = 110.75.3.142
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) SUCCESS result for img.taobaocdn.com = 122.224.194.211 retry 0
[Jun 26 03:29:14.710] Server {1103427904} DEBUG: (dns) called back continuation for img.taobaocdn.com
{code}

We do a small patch(change some code's order) about this issue, it seems this behavior not appeared again. Any one can take a look at this patch, declare this patch is right or no, or it may has other issue?

P.S. after this patch, TS-833 seems gone. I don't clear how this two issue seems is related.


> hostdb and dns system hangup
> ----------------------------
>
>                 Key: TS-852
>                 URL: https://issues.apache.org/jira/browse/TS-852
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core, DNS
>    Affects Versions: 3.1.0
>            Reporter: Zhao Yongming
>            Priority: Critical
>              Labels: dns, hostdb
>             Fix For: 3.1.0
>
>
> in my testing, all request that need go OS, fails with 30s timeout, and the slow query shows data from dns missed: 
> {code}
> [Jun 22 15:33:47.183] Server {1106880832} ERROR: [8122411] Slow Request: url: http://download.im.alisoft.com/aliim/AliIM6/update/packages//2011-5-4-10-10-40/0/modulesproxy/8203.xml.wwzip status: 0 unique id:  bytes: 0 fd: 0 client state: 6 server state: 0 ua_begin: 0.000 ua_read_header_done: 0.000 cache_open_read_begin: 0.000 cache_open_read_end: 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000 server_connect: -1.000 server_first_read: -1.000 server_read_header_done: -1.000 server_close: -1.000 ua_close: 30.667 sm_finish: 30.667
> [Jun 22 15:33:47.220] Server {1099663680} ERROR: [8122422] Slow Request: url: http://img.uu1001.cn/materials/original/2010-11-22/22-48/a302876929a9c40a8272ac439a16ad25e74edf71.png status: 0 unique id:  bytes: 0 fd: 0 client state: 6 server state: 0 ua_begin: 0.000 ua_read_header_done: 0.000 cache_open_read_begin: 0.000 cache_open_read_end: 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000 server_connect: -1.000 server_first_read: -1.000 server_read_header_done: -1.000 server_close: -1.000 ua_close: 30.318 sm_finish: 30.318
> [Jun 22 15:33:47.441] Server {1098611008} ERROR: [8122421] Slow Request: url: http://img02.taobaocdn.com/bao/uploaded/i2/T1mp8QXopdXXblNIZ6_061203.jpg_b.jpg status: 0 unique id:  bytes: 0 fd: 0 client state: 6 server state: 0 ua_begin: 0.000 ua_read_header_done: 0.000 cache_open_read_begin: 0.000 cache_open_read_end: 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000 server_connect: -1.000 server_first_read: -1.000 server_read_header_done: -1.000 server_close: -1.000 ua_close: 30.597 sm_finish: 30.597
> [Jun 22 15:33:47.441] Server {1098611008} ERROR: [8122409] Slow Request: url: http://img04.taobaocdn.com/tps/i4/T1EM9gXlttXXXXXXXX-440-135.jpg status: 0 unique id:  bytes: 0 fd: 0 client state: 6 server state: 0 ua_begin: 0.000 ua_read_header_done: 0.000 cache_open_read_begin: 0.000 cache_open_read_end: 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000 server_connect: -1.000 server_first_read: -1.000 server_read_header_done: -1.000 server_close: -1.000 ua_close: 30.948 sm_finish: 30.948
> {code}
> all http SM show from {http} in DNS_LOOKUP
> and traffic.out show nothing with error, when I enable debug on hostdb.*|dns.*:
> {code}
> [Jun 22 15:27:42.391] Server {1108986176} DEBUG: (hostdb) probe img03.taobaocdn.com DB357D60B8247DC7 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.391] Server {1108986176} DEBUG: (hostdb) timeout 64204 1308663404 300
> [Jun 22 15:27:42.391] Server {1108986176} DEBUG: (hostdb) delaying force 0 answer for img03.taobaocdn.com [timeout 0]
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) probe img03.taobaocdn.com DB357D60B8247DC7 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) timeout 64204 1308663404 300
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) DNS img03.taobaocdn.com
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) enqueuing additional request
> [Jun 22 15:27:42.416] Server {47177168876656} DEBUG: (hostdb) probe 127.0.0.1 E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.416] Server {47177168876656} DEBUG: (hostdb) immediate answer for 127.0.0.1
> [Jun 22 15:27:42.422] Server {1098611008} DEBUG: (hostdb) probe img08.taobaocdn.com 945198AE9AE37481 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.422] Server {1098611008} DEBUG: (hostdb) timeout 64281 1308663327 300
> [Jun 22 15:27:42.422] Server {1098611008} DEBUG: (hostdb) delaying force 0 answer for img08.taobaocdn.com [timeout 0]
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) probe img08.taobaocdn.com 945198AE9AE37481 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) timeout 64281 1308663327 300
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) DNS img08.taobaocdn.com
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) enqueuing additional request
> [Jun 22 15:27:42.470] Server {1099663680} DEBUG: (hostdb) probe 127.0.0.1 E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.470] Server {1099663680} DEBUG: (hostdb) immediate answer for 127.0.0.1
> [Jun 22 15:27:42.490] Server {1106880832} DEBUG: (hostdb) probe 127.0.0.1 E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.490] Server {1106880832} DEBUG: (hostdb) immediate answer for 127.0.0.1
> [Jun 22 15:27:42.594] Server {1108986176} DEBUG: (hostdb) probe 127.0.0.1 E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.594] Server {1108986176} DEBUG: (hostdb) immediate answer for 127.0.0.1
> [Jun 22 15:27:42.818] Server {1108986176} DEBUG: (hostdb) probe 127.0.0.1 E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.818] Server {1108986176} DEBUG: (hostdb) immediate answer for 127.0.0.1
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) probe img.taobao.com 55D857190294CA66 1 [ignore_timeout = 0]
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) timeout 64134 1308663475 300
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) DNS img.taobao.com
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (dns) received query img.taobao.com type = 1, timeout = 0
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (dns) enqueing query img.taobao.com
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (dns) adding first to collapsing queue
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) probe img.taobao.com 55D857190294CA66 1 [ignore_timeout = 0]
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) timeout 64134 1308663475 300
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) DNS img.taobao.com
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) enqueuing additional request
> {code}
> this is a critical fail for me!

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira