Posted to issues@trafficserver.apache.org by "Zhao Yongming (JIRA)" <ji...@apache.org> on 2010/06/23 04:51:50 UTC

[jira] Created: (TS-394) taffic_server process sig abort in full cluster mode

taffic_server process sig abort in full cluster mode
----------------------------------------------------

                 Key: TS-394
                 URL: https://issues.apache.org/jira/browse/TS-394
             Project: Traffic Server
          Issue Type: Bug
          Components: Core
         Environment: ATS in full cluster mode is unusable: the traffic_server process is aborted (SIGABRT) on every request. Tested with code from trunk.
            Reporter: Zhao Yongming


I am trying to set up full cluster mode, but the connection is aborted on every request. A tcpdump capture suggests that ATS gets the correct source file from the backend but does not send the full file to the client (the HTTP transfer ends with a TCP reset), so I tried to find the root cause.

With debug logging enabled in records.config:
CONFIG proxy.config.diags.debug.enabled INT 1
CONFIG proxy.config.diags.debug.tags STRING http.*|cluster.*

I got the following log from traffic.out:

[Jun 22 15:19:11.296] Manager {139809315796768} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 6: Aborted
[Jun 22 15:19:11.296] Manager {139809315796768} ERROR:  (last system error 2: No such file or directory)
[Jun 22 15:19:11.296] Manager {139809315796768} ERROR: [Alarms::signalAlarm] Server Process was reset
[Jun 22 15:19:11.296] Manager {139809315796768} ERROR:  (last system error 2: No such file or directory)

After running strace on traffic_server, I got the following:

[pid 19830]      0.000306 <... epoll_wait resumed> {}, 32768, 10) = 0
[pid 19830]      0.000031 gettimeofday({1277256740, 532763}, NULL) = 0
[pid 19830]      0.000181 write(2, "FATAL: ClusterHandler.cc:2047: failed assert `ntodo >= 0`\n", 58) = 58
[pid 19830]      0.000076 gettimeofday({1277256740, 533020}, NULL) = 0
[pid 19830]      0.000071 socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 101
[pid 19830]      0.000062 connect(101, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 ENOENT (No such file or directory)
[pid 19830]      0.000074 close(101)    = 0
[pid 19830]      0.000056 write(2, "/usr/bin/traffic_server", 23) = 23
[pid 19830]      0.000050 write(2, " - STACK TRACE: \n", 17) = 17
[pid 19830]      0.000289 futex(0x2b3a08e6a5b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 19830]      0.000251 futex(0x2b3a08b11190, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid 19830]      0.000656 writev(2, [{"/usr/bin/traffic_server", 23}, {"(", 1}, {"ink_fatal_va", 12}, {"+0x", 3}, {"ab", 2}, {")", 1}, {"[0x", 3}, {"6dcdab", 6}, {"]\n", 2}], 9) = 53
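
The stderr message is split across several write() and writev() calls in the trace above. A minimal sketch for stitching it back together, assuming the strace output format shown here (string arguments in double quotes, C-style escapes such as \n printed literally); the function name stderr_text is hypothetical and not part of any ATS or strace tooling:

```python
import re

# A C-style quoted string as printed by strace, allowing escaped characters.
QUOTED = r'"((?:[^"\\]|\\.)*)"'
# write(2, "...", N): a single buffer written to file descriptor 2 (stderr).
WRITE_RE = re.compile(r'\bwrite\(2, ' + QUOTED + r', \d+\)')
# writev(2, [{...}, {...}], N): several buffers written to stderr in one call.
WRITEV_RE = re.compile(r'\bwritev\(2, \[(.*)\], \d+\)')
# One {"...", N} element of the writev vector.
IOVEC_RE = re.compile(r'\{' + QUOTED + r', \d+\}')


def stderr_text(strace_lines):
    """Rebuild what the traced process wrote to file descriptor 2."""
    chunks = []
    for line in strace_lines:
        m = WRITE_RE.search(line)
        if m:
            chunks.append(m.group(1))
            continue
        m = WRITEV_RE.search(line)
        if m:
            chunks.extend(IOVEC_RE.findall(m.group(1)))
    # Turn the literal escapes (for example \n) back into real characters.
    return "".join(
        c.encode("ascii", "backslashreplace").decode("unicode_escape")
        for c in chunks
    )
```

Fed the lines of the trace above, this returns the failed-assert line followed by the stack trace header and the ink_fatal_va frame as continuous text. Calls on other file descriptors are ignored.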

I have fixed the bug by commenting out the assert at ClusterHandler.cc:2047. A patch will follow.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhao Yongming updated TS-394:
-----------------------------

    Attachment: traffic_full_cluster_sig_abort.patch

Remove the assert at ClusterHandler.cc:2047, which makes full cluster mode work again.

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Updated: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Hedstrom updated TS-394:
-----------------------------


I'm going to move this out to a 2.3.0 target, unless someone is interested in working on this right now.

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Updated: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Hedstrom updated TS-394:
-----------------------------

    Fix Version/s: 2.1.2

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>             Fix For: 2.1.2
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929822#action_12929822 ] 

Leif Hedstrom commented on TS-394:
----------------------------------

Can this bug be closed?

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 3.1
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895653#action_12895653 ] 

Zhao Yongming commented on TS-394:
----------------------------------

Looking back through the code changes, it is clear that there was not much change across the Yahoo internal versions up to 1.18, and that there were big changes after the code was open sourced. So I tested v2.0.0, and full cluster mode works as expected there.

So things may be narrowed down a little: the code changes between 2.0.0 and 2.1.1 have affected the cluster function. As there was a big I/O-level change, it may take much effort to get things clear, though.

As full clustering is a big concern for our CDN system, I want to find a fix for this issue, and help is really needed.

thanks.

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896263#action_12896263 ] 

Leif Hedstrom commented on TS-394:
----------------------------------

I'm a little confused. Since you're a Yahoo!, have you confirmed that clustering works in YTS v1.18 or v1.19? I was under the impression that it was broken in v1.17, and ATS v2.0.0 is basically the same as v1.17.

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Updated: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Hedstrom updated TS-394:
-----------------------------

    Fix Version/s: 2.3.0
                       (was: 2.1.2)

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896351#action_12896351 ] 

Zhao Yongming commented on TS-394:
----------------------------------

After 2+ days of intensive testing, I have had no luck finding any information that helps identify the crash point. It seems the dev branch was broken from the start. It is hard to track all the changes in svn after the merge, which is painful. So far, this is what I have collected:
version 2.0.0, or branch 2.0.x: cluster mode works as expected.
branch trunk: almost working; only this bug, where a cache object transfer crashes the traffic_server process.
version 2.1.0, or tag 2.1.0: completely non-functional, cannot set up the cluster connection.
version 2.1.1, or tag 2.1.1: completely non-functional, cannot set up the cluster connection.
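
The results above bracket the regression between a known-good version (2.0.0) and later known-bad ones, which is the setup for a bisection over the revisions in between. A minimal sketch of that search, not tied to svn or any real tooling: first_bad and is_good are hypothetical names, and is_good stands for whatever manual check (build the revision, start a two-node cluster, send a request) decides whether a revision works. It assumes the first revision in the list is good, the last is bad, and that once clustering breaks it stays broken. Since 2.1.0 and 2.1.1 fail differently from trunk, that last assumption may not hold here.

```python
def first_bad(revisions, is_good):
    """Return the earliest revision for which is_good() is false.

    Assumes revisions is ordered oldest to newest, revisions[0] is good,
    revisions[-1] is bad, and there is a single good-to-bad transition.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(revisions[mid]):
            lo = mid  # the break happened after mid
        else:
            hi = mid  # the break happened at or before mid
    return revisions[hi]
```

Each step halves the range, so about 10 build-and-test rounds are enough to cover 1000 revisions.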

I have tried merging trunk and the 2.0.x branch, but that does not help either.

It seems I have run into a dead end.

As for YTS, I have not tested the cluster function after v1.16; I will do some tests in the following days. I am nearly sure that all of these versions have a working cluster stack.

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Updated: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhao Yongming updated TS-394:
-----------------------------

    Attachment:     (was: cap.tgz)

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.1.2
>
>         Attachments: traffic_full_cluster_sig_abort.patch

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895739#action_12895739 ] 

Zhao Yongming commented on TS-394:
----------------------------------

Just ran a short test; it shows that v2.1 cluster mode is deeply broken.

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896449#action_12896449 ] 

Zhao Yongming commented on TS-394:
----------------------------------

Just tested v1.18/v1.19; both are broken. The cluster connection cannot be set up, and port 8086 is not listening. That is strange, as there is not much change in the cluster code.

In the diff of iocore/cluster between v1.16 and v1.18, there is just one noticeable change: the cluster RPC change in InkAPIWireless.h, which looks more like a rewrite of something. Maybe that is the root cause. That change indeed started in v1.17.

So we are lucky to have working cluster code in v2.0.0.
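
One way to confirm the "port 8086 is not listening" symptom from another process is to attempt a TCP connect. Below is a minimal sketch using POSIX sockets, assuming an IPv4 address; `port_is_listening` is a hypothetical helper and not part of Traffic Server.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Hypothetical helper (not Traffic Server code): returns true if a TCP
// connection to ipv4:port succeeds. A false result means nothing accepted
// the connection; it can also be caused by a firewall or an invalid address.
bool port_is_listening(const char *ipv4, std::uint16_t port)
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return false;
  }

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port   = htons(port);
  if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
    close(fd); // not a valid dotted-quad IPv4 address
    return false;
  }

  bool ok = connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0;
  close(fd);
  return ok;
}
```

On a node showing the symptom described above, `port_is_listening("127.0.0.1", 8086)` would be expected to return false.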

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Updated: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhao Yongming updated TS-394:
-----------------------------

    Comment: was deleted

(was: A comment with security level 'jira-users' was removed.)

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.1.2
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896236#action_12896236 ] 

Zhao Yongming commented on TS-394:
----------------------------------

After tracking back from the v2.1 code, I am now looking at the first commit on trunk, 'TS-196:Merged traffic-branchdev(trafficserver/traffic/branches/dev) changes r891822:915884 into trunk. Tested: ubuntu904, forward and reverse proxy.' This is a merge from the dev branch, and it is what broke cluster communication. I will test the dev branch to get a clearer view.
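
Narrowing down the offending change inside a merged revision range is essentially a bisection. Below is a minimal sketch of that idea, assuming the lower bound works, the upper bound is broken, and the breakage persists once introduced; `first_broken_revision` and the `is_broken` callback (for example "build this revision and check whether the cluster connects") are hypothetical names, not an existing tool.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch: binary search for the first broken revision in the
// half-open range (good, bad]. Precondition: `good` works and `bad` is broken.
long first_broken_revision(long good, long bad,
                           const std::function<bool(long)> &is_broken)
{
  while (bad - good > 1) {
    long mid = good + (bad - good) / 2;
    if (is_broken(mid)) {
      bad = mid;  // breakage was introduced at or before mid
    } else {
      good = mid; // mid still works, so the breakage came later
    }
  }
  return bad;
}
```

With roughly 24,000 revisions in r891822:915884, this needs about 15 test builds rather than one per revision.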

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884526#action_12884526 ] 

Leif Hedstrom commented on TS-394:
----------------------------------

I'm curious, does disabling that assert actually make clustering work? Or does it just avoid the crasher, leaving clustering running but non functional?
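
To illustrate the distinction raised here, the following is a minimal sketch and not the actual ClusterHandler.cc code: `remaining_after_transfer`, its parameters, and the `check_invariant` flag are hypothetical. It shows that skipping a check like `ntodo >= 0` only avoids the abort; the inconsistent value is still computed and handed back to the caller.

```cpp
#include <cassert>

// Hypothetical stand-in for transfer bookkeeping (not Traffic Server code).
// Returns the number of bytes still outstanding. A negative result means more
// bytes were accounted for than were expected, i.e. the state is inconsistent.
int remaining_after_transfer(int expected, int transferred, bool check_invariant)
{
  int ntodo = expected - transferred;
  if (check_invariant) {
    // In a debug build a violated assert aborts the process (SIGABRT).
    assert(ntodo >= 0);
  }
  // With the check skipped, execution continues with whatever value ntodo has.
  return ntodo;
}
```

Calling `remaining_after_transfer(100, 120, false)` returns -20 without crashing, which corresponds to the "running but non functional" case: the process stays up, but the underlying accounting problem is untouched.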

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>             Fix For: 2.1.2
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882022#action_12882022 ] 

Zhao Yongming commented on TS-394:
----------------------------------

Yes, I am confused about the code too, but that is far beyond my capability; I am not a coding guy at all. :(

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Closed: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhao Yongming closed TS-394.
----------------------------

    Backport to Version: 2.1.4

closing

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.1.4
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Updated: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Hedstrom updated TS-394:
-----------------------------

    Priority: Critical  (was: Major)

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.1.2
>
>         Attachments: cap.tgz, traffic_full_cluster_sig_abort.patch
>
>

[jira] Resolved: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhao Yongming resolved TS-394.
------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 3.1)
                   2.1.4

Fixed again after the stats change made before 2.1.3.

Fixed in revision: 1032788

commit 2781293d2203aaa053810be6c75589297c49fed8
Author: zwoop <zw...@13f79535-47bb-0310-9956-ffa450edef68>
Date:   Mon Nov 8 23:46:39 2010 +0000

    TS-519 Fixes for clustering that broke with stats rewrite.
    
    git-svn-id: https://svn.apache.org/repos/asf/trafficserver/traffic/trunk@1032788 13f79535-47bb-0310-9956-ffa450edef68


> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.1.4
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>
> I am trying to setup full cluster mode, but geting connection abort during every request. after tcpdump, it seems that ATS got the correct source file from backend, but do not send out the full file( with tcp reset during http transfer to client), then i am trying to figure out the root cause.
> with debug log enabled in records.config:
> CONFIG proxy.config.diags.debug.enabled INT 1
> CONFIG proxy.config.diags.debug.tags STRING http.*|cluster.*
> I got the following log from traffic.out:
> [Jun 22 15:19:11.296] Manager {139809315796768} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 6: Aborted
> [Jun 22 15:19:11.296] Manager {139809315796768} ERROR:  (last system error 2: No such file or directory)
> [Jun 22 15:19:11.296] Manager {139809315796768} ERROR: [Alarms::signalAlarm] Server Process was reset
> [Jun 22 15:19:11.296] Manager {139809315796768} ERROR:  (last system error 2: No such file or directory)
> after strace traffic_server, I got the following info:
> [pid 19830]      0.000306 <... epoll_wait resumed> {}, 32768, 10) = 0
> [pid 19830]      0.000031 gettimeofday({1277256740, 532763}, NULL) = 0
> [pid 19830]      0.000181 write(2, "FATAL: ClusterHandler.cc:2047: failed assert `ntodo >= 0`\n", 58) = 58
> [pid 19830]      0.000076 gettimeofday({1277256740, 533020}, NULL) = 0
> [pid 19830]      0.000071 socket(PF_FILE, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 101
> [pid 19830]      0.000062 connect(101, {sa_family=AF_FILE, path="/dev/log"}, 110) = -1 ENOENT (No such file or directory)
> [pid 19830]      0.000074 close(101)    = 0
> [pid 19830]      0.000056 write(2, "/usr/bin/traffic_server", 23) = 23
> [pid 19830]      0.000050 write(2, " - STACK TRACE: \n", 17) = 17
> [pid 19830]      0.000289 futex(0x2b3a08e6a5b0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> [pid 19830]      0.000251 futex(0x2b3a08b11190, FUTEX_WAKE_PRIVATE, 2147483647) = 0
> [pid 19830]      0.000656 writev(2, [{"/usr/bin/traffic_server", 23}, {"(", 1}, {"ink_fatal_va", 12}, {"+0x", 3}, {"ab", 2}, {")", 1}, {"[0x", 3}, {"6dcdab", 6}, {"]\n", 2}], 9) = 53
> I have fixed the bug by commenting out ClusterHandler.cc:2047. A patch will follow.
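
For anyone repeating this diagnosis, the failed assertion can be pulled out of a saved strace capture mechanically. The following is a minimal sketch, not part of Traffic Server: it assumes the `FATAL: <file>:<line>: failed assert` message format seen in the trace above, and the function name `find_failed_asserts` is made up for illustration.

```python
import re

# Payload of a write(2, "...") call, i.e. text written to stderr.
# "writev(" lines are intentionally not matched.
WRITE_STDERR = re.compile(r'write\(2, "((?:[^"\\]|\\.)*)"')

# Assertion message format as it appears in the trace quoted above.
FATAL = re.compile(
    r"FATAL: (?P<file>[\w.]+):(?P<line>\d+): failed assert `(?P<expr>[^`]+)`"
)


def find_failed_asserts(strace_text):
    """Return (file, line, expression) for each failed assert written to stderr."""
    hits = []
    for payload in WRITE_STDERR.findall(strace_text):
        m = FATAL.search(payload)
        if m:
            hits.append((m.group("file"), int(m.group("line")), m.group("expr")))
    return hits


# Two lines copied from the strace output in this report.
sample = r'''[pid 19830]      0.000181 write(2, "FATAL: ClusterHandler.cc:2047: failed assert `ntodo >= 0`\n", 58) = 58
[pid 19830]      0.000056 write(2, "/usr/bin/traffic_server", 23) = 23'''

print(find_failed_asserts(sample))
```

Only plain `write(2, ...)` calls are inspected, so the stack-trace fragments emitted through `writev` are skipped.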

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887413#action_12887413 ] 

Zhao Yongming commented on TS-394:
----------------------------------

Confirmed that the patch is not a fix: objects that need to be written to other cluster members may still fail to be written.
We need a patch with a real fix.

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.1.2
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882008#action_12882008 ] 

Leif Hedstrom commented on TS-394:
----------------------------------

Hmmm, I think we need to try to figure out why that assert triggers, and not just remove it?
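
One generic way to find out why an invariant fails, rather than deleting the check, is to keep it and make the failure report its inputs. The sketch below is purely illustrative and is not the ClusterHandler.cc code: this thread does not say what `ntodo` counts, so the sketch assumes a hypothetical remaining-work counter computed as `total - done`, and the function name `remaining` is invented.

```python
def remaining(total, done):
    """Return the work still to do, failing with context if it goes negative.

    Hypothetical stand-in for whatever computes ntodo; the real
    computation in ClusterHandler.cc is not shown in this thread.
    """
    ntodo = total - done
    if ntodo < 0:
        # Same invariant as the failing assert, but the message records
        # the inputs so the over-count can be traced back to its source.
        raise AssertionError(
            f"ntodo >= 0 violated: total={total} done={done} ntodo={ntodo}"
        )
    return ntodo


print(remaining(100, 40))
```

With the inputs in the failure message, a crash points at which side of the accounting is wrong instead of only at the line number.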

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>

[jira] Commented: (TS-394) taffic_server process sig abort in full cluster mode

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898235#action_12898235 ] 

Zhao Yongming commented on TS-394:
----------------------------------

I retested with the recent changes and was surprised to find that trunk now has working cluster code. Here is the change that fixed the breakage:

commit fef40a0c5e4d0362c759f37d19dc47208798e38c
Author: zwoop <zw...@13f79535-47bb-0310-9956-ffa450edef68>
Date:   Wed Jun 16 23:00:40 2010 +0000

    TS-320: Do some cleanup on Connection::fast_connect and Connection::bind_connect
    
    Tested: FC-13 64-bit
    Author: Alan M. Carroll
    Review and comments: John Plevyak
    
    git-svn-id: https://svn.apache.org/repos/asf/trafficserver/traffic/trunk@955421 13f79535-47bb-0310-9956-ffa450edef68

It works so far. Thanks, all :D

> taffic_server process sig abort in full cluster mode
> ----------------------------------------------------
>
>                 Key: TS-394
>                 URL: https://issues.apache.org/jira/browse/TS-394
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>         Environment: ATS in full cluster mode is unusable, the traffic_server process will get sig abort by every request. code in trunk tested. 
>            Reporter: Zhao Yongming
>            Priority: Critical
>             Fix For: 2.3.0
>
>         Attachments: traffic_full_cluster_sig_abort.patch
>
>