Posted to issues@trafficserver.apache.org by "John Plevyak (JIRA)" <ji...@apache.org> on 2013/05/30 01:50:21 UTC

[jira] [Commented] (TS-1648) Segmentation fault in dir_clear_range()

    [ https://issues.apache.org/jira/browse/TS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669893#comment-13669893 ] 

John Plevyak commented on TS-1648:
----------------------------------

Rather than long we should be using int64, since "long" is not well defined (it is platform-dependent). Are those 10TB RAIDs? If so, you are better off using them as JBOD, since ATS assumes there is a single disk arm (or an equal fraction of one) for each "disk" in storage.config. Because of the size of your "disk", it is possible that you have more than 2^31 directory entries, which would account for the overflow. Also, given the size, the "clear" may take a long time. Your trace is not long enough for me to see if it repeats. However, if it does repeat, it may be because dir_in_bucket also takes an int which is then multiplied to get a directory number. The other possibility is (of course) memory corruption: the directory is the single largest memory user, and it contains a linked list which can be circularized by corruption, but let's concentrate on the other issues first.

I would suggest that we change all the bucket/entry/etc. offsets to int64 (I can build a patch, but I would appreciate a review). Second, I would suggest (after testing to ensure that the patch fixes your problem) that you move to JBOD rather than RAID-0, or to multiple NAS volumes corresponding approximately to the number of underlying disks, since ATS will only have one outstanding write (although multiple reads) for each "disk" in storage.config.
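For the JBOD suggestion, a storage.config along these lines would give ATS one "disk" (and hence one outstanding write) per physical spindle; the device names below are purely illustrative, not taken from the reporter's machine:

```
# Hypothetical storage.config: one raw device per physical disk (JBOD),
# instead of a single 10TB RAID-0 device.
/dev/sdb
/dev/sdc
/dev/sdd
```

Each line becomes a separate cache "disk", so writes to the different spindles can proceed in parallel instead of being serialized behind one logical device.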
                
> Segmentation fault in dir_clear_range()
> ---------------------------------------
>
>                 Key: TS-1648
>                 URL: https://issues.apache.org/jira/browse/TS-1648
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache
>    Affects Versions: 3.3.0, 3.2.0
>         Environment: reverse proxy
>            Reporter: Tomasz Kuzemko
>            Assignee: weijin
>              Labels: A
>             Fix For: 3.3.3
>
>         Attachments: 0001-Fix-for-TS-1648-Segmentation-fault-in-dir_clear_rang.patch
>
>
> I use ATS as a reverse proxy. I have a fairly large disk cache consisting of 2x 10TB raw disks. I do not use cache compression. After a few days of running (this is a dev machine - not handling any traffic) ATS begins to crash with a segfault shortly after start:
> [Jan 11 16:11:00.690] Server {0x7ffff2bb8700} DEBUG: (rusage) took rusage snap 1357917060690487000
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff20ad700 (LWP 17292)]
> 0x0000000000696a71 in dir_clear_range (start=640, end=17024, vol=0x16057d0) at CacheDir.cc:382
> 382	CacheDir.cc: No such file or directory.
> 	in CacheDir.cc
> (gdb) p i
> $1 = 214748365
> (gdb) l
> 377	in CacheDir.cc
> (gdb) p dir_index(vol, i)
> $2 = (Dir *) 0x7ff997a04002
> (gdb) p dir_index(vol, i-1)
> $3 = (Dir *) 0x7ffa97a03ff8
> (gdb) p *dir_index(vol, i-1)
> $4 = {w = {0, 0, 0, 0, 0}}
> (gdb) p *dir_index(vol, i-2)
> $5 = {w = {0, 0, 52431, 52423, 0}}
> (gdb) p *dir_index(vol, i)
> Cannot access memory at address 0x7ff997a04002
> (gdb) p *dir_index(vol, i+2)
> Cannot access memory at address 0x7ff997a04016
> (gdb) p *dir_index(vol, i+1)
> Cannot access memory at address 0x7ff997a0400c
> (gdb) p vol->buckets * DIR_DEPTH * vol->segments
> $6 = 1246953472
> (gdb) bt
> #0  0x0000000000696a71 in dir_clear_range (start=640, end=17024, vol=0x16057d0) at CacheDir.cc:382
> #1  0x000000000068aba2 in Vol::handle_recover_from_data (this=0x16057d0, event=3900, data=0x16058a0) at Cache.cc:1384
> #2  0x00000000004e8e1c in Continuation::handleEvent (this=0x16057d0, event=3900, data=0x16058a0) at ../iocore/eventsystem/I_Continuation.h:146
> #3  0x0000000000692385 in AIOCallbackInternal::io_complete (this=0x16058a0, event=1, data=0x135afc0) at ../../iocore/aio/P_AIO.h:80
> #4  0x00000000004e8e1c in Continuation::handleEvent (this=0x16058a0, event=1, data=0x135afc0) at ../iocore/eventsystem/I_Continuation.h:146
> #5  0x0000000000700fec in EThread::process_event (this=0x7ffff36c4010, e=0x135afc0, calling_code=1) at UnixEThread.cc:142
> #6  0x00000000007011ff in EThread::execute (this=0x7ffff36c4010) at UnixEThread.cc:191
> #7  0x00000000006ff8c2 in spawn_thread_internal (a=0x1356040) at Thread.cc:88
> #8  0x00007ffff797e8ca in start_thread () from /lib/libpthread.so.0
> #9  0x00007ffff55c6b6d in clone () from /lib/libc.so.6
> #10 0x0000000000000000 in ?? ()
> This is fixed by running "traffic_server -Kk" to clear the cache. But after a few days the issue reappears.
> I will keep the current faulty setup as-is in case you need me to provide more data. I tried to make a core dump but it took a couple of GB even after gzip (I can however provide it on request).
> *Edit*
> OS is Debian GNU/Linux 6.0.6 with custom built kernel 3.2.13-grsec-xxxx-grs-ipv6-64

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira