Posted to issues@trafficserver.apache.org by "Ricky Chan (JIRA)" <ji...@apache.org> on 2010/10/14 10:07:32 UTC

[jira] Created: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Seg Fault with Connection_Collapsing and clustering enabled.
------------------------------------------------------------

Key: TS-489
URL: https://issues.apache.org/jira/browse/TS-489
Project: Traffic Server
Issue Type: Bug
Affects Versions: 2.0.0
Environment: Debian Lenny.
2.6.26-2-amd-64
Sun Blade X6240 (2 x Six-Core AMD Opteron(tm) Processor 2439 SE)
64G Memory
Reporter: Ricky Chan

Bug is easily reproduced, with the following setup.

Traffic Server 2.0.0
Enable Clustering (so you'll need two machines; make sure the cluster is actually working) (LOCAL proxy.local.cluster.type INT 1)
Enable Connection Collapsing (CONFIG proxy.config.connection_collapsing.hashtable_enabled INT 1)

Other changes to records.config which may or may not affect it are changes to heuristics:

CONFIG proxy.config.http.cache.heuristic_min_lifetime INT 5
CONFIG proxy.config.http.cache.heuristic_max_lifetime INT 86400
CONFIG proxy.config.http.cache.heuristic_lm_factor FLOAT 0.000100
CONFIG proxy.config.http.cache.fuzz.time INT 240
CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.000005

Using a 3rd machine, run Apache Benchmark (ab) with, say, -n 1000000, keep-alive (-k) and -c 8000. I found it happens all the time above 8000. I just fetched a file from an origin running lighttpd which sent a Cache-Control header of max-age=86400, so as to reduce hits on the origin. Size of the file is 9 bytes only.

Note: You need to set ulimit -n very high and set sysctl ip_local_port_range to larger than the defaults to be able to run the test. I did ulimit -n 1000000 and sysctl -w net.ipv4.ip_local_port_range="1024 65000" to be able to run ab.
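
As a rough illustration of the load described above, the ab invocation can be assembled as follows. This is only a sketch: the proxy URL is a placeholder (the report does not give one) and build_ab_command is a hypothetical helper. Only the -n, -c and -k options come from the report.

```python
import shlex


def build_ab_command(url, requests=1000000, concurrency=8000, keep_alive=True):
    """Return the argv list for an Apache Benchmark run like the one described."""
    cmd = ["ab", "-n", str(requests), "-c", str(concurrency)]
    if keep_alive:
        cmd.append("-k")  # reuse connections (HTTP keep-alive)
    cmd.append(url)
    return cmd


# Placeholder URL (assumption): substitute the Traffic Server node under test.
command = build_ab_command("http://ts-node.example.test:8080/small.txt")
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run once ulimit -n and ip_local_port_range have been raised as noted above.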

With either clustering or connection collapsing disabled, the program no longer seg faults.

I then added a GDB wrapper around traffic_server, and it clearly shows it's the connection collapsing API which is at fault here.

I'll add these traces as attachments.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Ricky Chan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925381#action_12925381 ] 

Ricky Chan commented on TS-489:
-------------------------------

Okay, I did a quick re-run of my tests using the 2.0.1 stable release.

I ran this for about 30 minutes, with -c 20000.

Here are my results:

Firstly, it no longer crashes and ab shows successful results. However, looking at the origin logs, I can see that caching is not working as expected. First, I get 1,000s of queries to the origin, i.e. connection collapsing is no longer obeyed. Second, every few seconds it makes more queries, even though max-age is set to 86400 and the heuristics are set to a rate where it shouldn't come back that quickly.

I then examined the TS logs, and I can see a load of TCP_MISS entries.

I then repeated the test with clustering disabled. No hits to the origin at all (as it has a cached version already). I then purged the content from the cache, and I see 1 connection to the origin as expected, even though I am requesting 20,000 copies of it (so collapsing is working as expected).

So these 2 features in 2.0.1 (albeit without crashing) are not working properly together in my tests; hitting the origin in those volumes is not acceptable for my downstream origin owners.

Mohan, check your logs (origin and TS) and make sure they have behaved as expected, because I suspect not.
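
For readers unfamiliar with the expected behaviour (one origin connection no matter how many clients ask for the same uncached object at once), here is a toy sketch of request collapsing. It is not Traffic Server's implementation; CollapsingCache, slow_origin and demo are hypothetical names, and the 9-byte body simply mirrors the test file size mentioned in the report.

```python
import threading
import time


class CollapsingCache:
    """Toy cache: concurrent misses for one URL share a single origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._cache = {}      # url -> body
        self._in_flight = {}  # url -> Event set when the fetch finishes

    def get(self, url):
        with self._lock:
            if url in self._cache:
                return self._cache[url]
            done = self._in_flight.get(url)
            leader = done is None
            if leader:
                done = threading.Event()
                self._in_flight[url] = done
        if not leader:
            done.wait()           # another request is already fetching
            return self.get(url)  # normally a cache hit now
        try:
            body = self._fetch(url)
            with self._lock:
                self._cache[url] = body
            return body
        finally:
            with self._lock:
                del self._in_flight[url]
            done.set()


def demo(clients=50):
    """Fire `clients` simultaneous requests; return (origin hits, bodies)."""
    origin_hits = []

    def slow_origin(url):
        origin_hits.append(url)
        time.sleep(0.05)     # simulate origin latency
        return b"123456789"  # 9-byte body

    cache = CollapsingCache(slow_origin)
    bodies = []
    threads = [
        threading.Thread(target=lambda: bodies.append(cache.get("/small.txt")))
        for _ in range(clients)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(origin_hits), bodies
```

With collapsing working, demo() should report a single origin hit however many clients are started; thousands of origin hits for one object, as seen with clustering enabled, is the symptom described in this comment.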










[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920993#action_12920993 ] 

Leif Hedstrom commented on TS-489:
----------------------------------

Couple of thoughts here:

1) I'm fairly certain that clustering never worked at all in v2.0.x. Can you run the same tests with v2.1.3 or "trunk" from SVN?

2) Did you increase the max number of connections? You are encountering the problem right around the time I suspect you are hitting the default limits. Not saying that's an excuse to segfault, but it'll help debugging.


You can increase the max number of connections in records.config, with

CONFIG proxy.config.net.connections_throttle INT 50000


TS should automatically (as root) increase the rlimits accordingly, so the "manual" use of ulimit should not be necessary. However, this might be an area that has been improved / fixed since the v2.0.x release, so again it'd be great to try v2.1.3.
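
To double-check which values are actually present in records.config, the CONFIG and LOCAL lines can be read back with a few lines of Python. This is a minimal sketch, assuming the four-field "SCOPE name TYPE value" layout seen in this report; parse_records is a hypothetical helper, not part of Traffic Server, and the sample text is assembled from settings quoted in this thread.

```python
def parse_records(text):
    """Map setting name -> value for CONFIG/LOCAL lines in records.config text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 3)
        if len(parts) != 4 or parts[0] not in ("CONFIG", "LOCAL"):
            continue  # not a setting line in the assumed layout
        _scope, name, kind, value = parts
        try:
            if kind == "INT":
                value = int(value)
            elif kind == "FLOAT":
                value = float(value)
        except ValueError:
            pass  # keep the raw string if it is not a plain number
        settings[name] = value
    return settings


SAMPLE = """\
LOCAL proxy.local.cluster.type INT 1
CONFIG proxy.config.connection_collapsing.hashtable_enabled INT 1
CONFIG proxy.config.net.connections_throttle INT 50000
CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.000005
"""

settings = parse_records(SAMPLE)
print(settings["proxy.config.net.connections_throttle"])
```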


[jira] Updated: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Leif Hedstrom (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Hedstrom updated TS-489:
-----------------------------

    Fix Version/s: 2.3.0

Moving this to v2.3.0, since we don't expect / claim Clustering to be fully functional until then.


[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Ricky Chan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925351#action_12925351 ] 

Ricky Chan commented on TS-489:
-------------------------------

I'll retest with 2.0.1, although I did do a diff between 2.0.0 and 2.0.1 when I first encountered this issue and didn't see any code changes which, IMO, were related to this. I mainly just saw the 3 fixes mentioned for the release.

Personally I'll repeat the test with -c 20000 and have it run for a long period.

Also keep an eye on the origin, as a telltale sign is a sudden rush of 1000s of direct connections to the origin.

I'll retest with 2.0.1 with my setup and I'll get back to you.


Ricky

p.s. Zhao, I'm happy to test later versions when I get a chance.


[jira] Updated: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "mohan_zl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mohan_zl updated TS-489:
------------------------

    Attachment: ts_489_testing.txt

We did an experiment with the ts2.0.1 stable version, following the way you set up the cluster and changing the relevant arguments, and the results demonstrate that the full cluster mode and the connection_collapsing feature can be used together in ts2.0.1. The attached file shows how each change affects the results.


[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Ricky Chan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921285#action_12921285 ] 

Ricky Chan commented on TS-489:
-------------------------------

FYI:

https://issues.apache.org/jira/browse/TS-394

It talks more about clustering, which is known to work in 2.0.0 but is broken in other versions.






[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Ricky Chan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920887#action_12920887 ] 

Ricky Chan commented on TS-489:
-------------------------------

An additional side effect shows up even while it is running (not yet crashed).

The logs indicate 10,000s of cache misses. On the origin I can then see 1,000s of connections coming in for the file.

When I disable clustering (leaving collapsing on), this now behaves as expected (1 connection to origin only).

Seems collapsing and clustering do not want to play together.
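
The miss counts mentioned above can be tallied from an access log with a short script. This is a sketch under assumptions: it expects Squid-style lines in which the cache result appears as a token such as TCP_MISS/200, the sample lines are made up for illustration, and count_cache_results is a hypothetical helper. Result codes that do not start with TCP_ are ignored.

```python
import re
from collections import Counter

# Matches tokens such as "TCP_MISS/200" and captures the result code.
RESULT_CODE = re.compile(r"\b(TCP_[A-Z_]+)/\d{3}\b")


def count_cache_results(lines):
    """Count cache result codes (TCP_HIT, TCP_MISS, ...) across log lines."""
    counts = Counter()
    for line in lines:
        match = RESULT_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines in the assumed layout.
SAMPLE_LOG = [
    "1287043652.101 12 10.0.0.9 TCP_MISS/200 280 GET http://origin.example.test/small.txt - DIRECT/origin.example.test text/plain",
    "1287043652.102 0 10.0.0.9 TCP_HIT/200 280 GET http://origin.example.test/small.txt - NONE/- text/plain",
    "1287043652.103 11 10.0.0.9 TCP_MISS/200 280 GET http://origin.example.test/small.txt - DIRECT/origin.example.test text/plain",
]

counts = count_cache_results(SAMPLE_LOG)
print(counts.most_common())
```

A large TCP_MISS count for a single cacheable URL, together with a matching burst of requests in the origin's own log, is the pattern reported here.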


[jira] Updated: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Ricky Chan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ricky Chan updated TS-489:
--------------------------

    Attachment: collapse2.trace
                collapse1.trace

Traces generated by:

Rename traffic_server to traffic_server.real.

Create a shell script named traffic_server with:

gdb -ex 'r' -ex 'bt' -ex 'q' --args /usr/bin/traffic_server.real "$@" >>/tmp/ts.trace 2>&1

TS was then started up via traffic_cop as per normal.


[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925360#action_12925360 ] 

Zhao Yongming commented on TS-489:
----------------------------------

Cool.
Mohan and I have set up a test according to your directions on v2.0.1, but we don't see the problem you got.

BTW, the cluster function may not work in trunk due to TS-390.


[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Ricky Chan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920882#action_12920882 ] 

Ricky Chan commented on TS-489:
-------------------------------

Just re-read my submission; I need to make 1 point very clear.

With clustering disabled (i.e. proxy.local.cluster.type set to 3) or connection collapsing disabled (i.e. proxy.config.connection_collapsing.hashtable_enabled set to 0), traffic server no longer seg faults.

Many Thanks.





[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Ricky Chan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921278#action_12921278 ] 

Ricky Chan commented on TS-489:
-------------------------------

Thanks for your comments.

1. My testing with 2.0.0 showed that cluster cache and config distribution worked, but they were broken in the 2.1.x development version. The tests were simple:

  * traffic_line cluster commands worked, and cluster.config showed the participating members.
  * A new object cached in one TS node *eventually* showed up in another TS node.
  * The only issue was that cache management is broken and, more likely than not, ends in a SIGABRT or SIGSEGV even after fixing the double-free bugs, but a search of Jira shows this is a known issue.

proxy.config.net.connections_throttle is set to 1000000.

I can see the resources were fine by looking at the process limits via /proc after it had started up (e.g. cat /proc/<pid>/limits), and open files was at 1000000.
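
The same check can be scripted. The following is a minimal sketch, assuming the usual Linux layout of /proc/<pid>/limits (a "Max open files" row followed by soft and hard values); the function names are made up for illustration and are not part of Traffic Server.

```python
from pathlib import Path

ROW_NAME = "Max open files"


def _parse_value(field):
    """Convert one limit field to int, or None when the kernel prints 'unlimited'."""
    return None if field == "unlimited" else int(field)


def parse_open_files_limit(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith(ROW_NAME):
            # The row name itself contains spaces, so strip it before splitting.
            soft, hard = line[len(ROW_NAME):].split()[:2]
            return _parse_value(soft), _parse_value(hard)
    raise ValueError("no 'Max open files' row found")


def open_files_limit_for(pid):
    """Read and parse the limits file of a running process (Linux only)."""
    return parse_open_files_limit(Path(f"/proc/{pid}/limits").read_text())
```

Calling open_files_limit_for with the traffic_server pid should return the soft and hard limits that the running process actually received, which is what matters here rather than the limit of the shell that launched it.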

Anyway if I turned clustering off or collapsing off, it was able to handle it.

Because connection collapsing is a MUST for my evaluation, I will evaluate without clustering.

As clustering is broken in my tests on >= 2.1, I guess this bug can't be tested until clustering is working, so I'll re-test when we have working clustering in 2.3.x (time permitting).



Ricky



> Seg Fault with Connection_Collapsing and clustering enabled.
> ------------------------------------------------------------
>
>                 Key: TS-489
>                 URL: https://issues.apache.org/jira/browse/TS-489
>             Project: Traffic Server
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: Debian Lenny.
> 2.6.26-2-amd-64
> Sun Blade X6240 (2 x Six-Core AMD Opteron(tm) Processor 2439 SE)
> 64G Memory
>            Reporter: Ricky Chan
>             Fix For: 2.3.0
>
>         Attachments: collapse1.trace, collapse2.trace


[jira] Commented: (TS-489) Seg Fault with Connection_Collapsing and clustering enabled.

Posted by "Zhao Yongming (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921888#action_12921888 ] 

Zhao Yongming commented on TS-489:
----------------------------------

I don't think the cluster function will be fixed unless someone is interested in using it. If you need to use the cluster function, I'd like to get more feedback from your testing:
please help test v2.1.3 or the svn trunk version if possible.
That will help us understand the condition of the cluster function :D

