You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Julian Foad <ju...@wandisco.com> on 2010/08/04 12:38:38 UTC

Re: fs_fs core dumps in checksum code

Hi Blair.  I had a look but can't see how this can occur.

On Sat, 2010-07-31, Blair Zajac wrote:
> We run long running, persistent processes that expose the svn_fs and svn_repos 
> APIs via RPC (using ZeroC's Ice).  We've seen some occasional core dumps after 
> the processes have been up for two weeks or more.  We're running against 1.6.5.
> 
> The processes have LRU caches of fs_t and repos_t that they use if there is an 
> RPC request onto the same repos or revision.
> 
> The core dumps only appear to happen during the commit phase in checksum related 
> code.  In the first, svn_checksum_dup() is getting a pointer to address 0x48 
> which is odd and the other svn_checksum__from_digest() is getting a pointer to 
> an out of bounds address.
> 
> We haven't had the cycles to investigate them because they happen rarely and we 
> have more pressing issues.  I'm posting them here just in case anybody else has 
> seen these or has an idea.
> 
> Regards,
> Blair
> 
> #0  svn_checksum_dup (checksum=0x48, pool=0x19730448) at 
> subversion/libsvn_subr/checksum.c:190
> #1  0x00002abf50dabf87 in read_representation (contents_p=0x2aaadf514c50, 
> fs=0x2aab0af5a730, rep=0x19731478, pool=0x19730448)
>      at subversion/libsvn_fs_fs/fs_fs.c:2859

fs_fs.c:2859 is in rep_read_get_baton() which is actually a subroutine
of read_representation(), presumably in-lined by the compiler and
therefore not visible in the debug info.

The pointer 'rep->md5_checksum' is apparently 0x48 at this time.

> #2  0x00002abf50dac762 in svn_fs_fs__rep_contents_dir (entries_p=0x2aaadf514cd0, 
> fs=0x2aab0af5a730, noderev=0x19940428, pool=0x19730448)
>      at subversion/libsvn_fs_fs/fs_fs.c:3351

(Similarly, fs_fs.c:3351 is in get_dir_contents(), a subroutine of
svn_fs_fs__rep_contents_dir().)

(FWIW, we can see from the location of line 3351 that rep->txn_id is
null.)

> #3  0x00002abf50dadc83 in svn_fs_fs__set_entry (fs=0x2aab0af5a730, 
> txn_id=0x15ac3ca8 "683229-ewby", parent_noderev=0x19940428, name=0x1bf5a8d0 
> "hash:28",
>      id=0x2aab1316b378, kind=svn_node_dir, pool=0x1723ae78) at 
> subversion/libsvn_fs_fs/fs_fs.c:4651

(FWIW, at this level the rep of interest has been passed in as
'parent_noderev->data_rep'.)

> #4  0x00002abf50da1dee in set_entry (parent=0x2aab127a9378, name=0x1bf5a8d0 
> "hash:28", id=0x2aab1316b378, kind=svn_node_dir,
>      txn_id=0x15ac3ca8 "683229-ewby", pool=0x1723ae78) at 
> subversion/libsvn_fs_fs/dag.c:346
> #5  0x00002abf50db33a7 in merge (conflict_p=0x2aab0d1b4de8, 
> target_path=0x2abf50db7530 "/", target=0x2aab127a9378, source=0x2aab18bde098,
>      ancestor=0x27a17c98, txn_id=0x15ac3ca8 "683229-ewby", 
> mergeinfo_increment_out=0x0, pool=0x2aaaf76aed28) at 
> subversion/libsvn_fs_fs/tree.c:1422
> #6  0x00002abf50db34fa in merge_changes (ancestor_node=0x27a17c98, 
> source_node=0x2aab18bde098, txn=<value optimized out>, conflict=0x2aab0d1b4de8,
>      pool=0x2aaaf76aed28) at subversion/libsvn_fs_fs/tree.c:1602
> #7  0x00002abf50db3cae in svn_fs_fs__commit_txn (conflict_p=0x0, 
> new_rev=0x2aaadf515078, txn=0x15ac3c80, pool=0x2aab14b7afc8)
>      at subversion/libsvn_fs_fs/tree.c:1701
> #8  0x00002abf509782cb in svn_repos_fs_commit_txn (conflict_p=0x0, 
> repos=0x2aab10c99df8, new_rev=0x2aaadf515078, txn=0x15ac3c80, pool=0x2aab14b7afc8)
>      at subversion/libsvn_repos/fs-wrap.c:51
> #9  0x00000000004fd819 in SvnUtil::commit_transaction (repos=0x2aab10c99df8, 
> fs=0x2aab0af5a730, txn=0x15ac3c80, client_info=<value optimized out>,
>      pool=@0x2aab309374f8) at SvnUtil.cpp:276
> #10 0x00000000004f10e8 in SvnTransactionI::commit (this=0x2aab309374f0, 
> clientInfo=@0x2aaadf5153d0, current=<value optimized out>)
>      at SvnTransactionI.cpp:964
> #11 0x000000000049ef47 in VnpIce::SvnTransaction::___commit 
> (this=0x2aab309374f0, __inS=@0x2aaadf515580, __current=@0x2aaadf515580)
>      at vnp_svn_backend.cpp:6796
> #12 0x00002abf50099ea0 in IceInternal::Incoming::invoke (this=0x2aaadf515580, 
> servantManager=@0x2aaadf515930) at Incoming.cpp:447
> #13 0x00002abf5006a9ab in Ice::ConnectionI::invokeAll (this=0x23f96b50, 
> stream=@0x2aaadf515cb0, invokeNum=1, requestId=4873841, compress=0 '\0',
>      servantManager=@0x2aaadf515930, adapter=@0x2aaadf515920) at 
> ConnectionI.cpp:2440
> #14 0x00002abf5006eac7 in Ice::ConnectionI::message (this=0x23f96b50, 
> stream=@0x2aaadf515cb0, threadPool=<value optimized out>) at ConnectionI.cpp:1110
> #15 0x00002abf5017f92a in IceInternal::ThreadPool::run (this=0x150ee070) at 
> ThreadPool.cpp:519
> #16 0x00002abf5018013f in IceInternal::ThreadPool::EventHandlerThread::run 
> (this=0x151016d0) at ThreadPool.cpp:759
> #17 0x00002abf50448e02 in startHook (arg=0x151016d0) at Thread.cpp:368
> #18 0x0000003ae6a062f7 in start_thread () from /lib64/libpthread.so.0
> #19 0x00000033b0ad1b6d in clone () from /lib64/libc.so.6
> Current language:  auto; currently c

These two back-traces appear to be showing the very same code path.

Again due to in-lining, I presume, fs_fs.c:2859 calls svn_checksum_dup()
but the debugger shows, in this case, only its subroutine
svn_checksum__from_digest().

> #0  0x00002ac15f8bd28c in svn_checksum__from_digest (
>      digest=0x646c6f663a706e76 <Address 0x646c6f663a706e76 out of bounds>,
>      kind=1949987428, result_pool=<value optimized out>)
>      at subversion/libsvn_subr/checksum.c:77

Here, digest is ASCII "dmof:pnv" and kind is hex 743a7264 is ASCII
"t:rd"...

> 77	  memcpy((unsigned char *)checksum->digest, digest, DIGESTSIZE(kind));
> #1  0x00002ac15f057f87 in read_representation (contents_p=0x2aaab92d7c50,
>      fs=0x2aab80d53590, rep=0x2aabbd57e748, pool=0x2aab1c15a0e8)
>      at subversion/libsvn_fs_fs/fs_fs.c:2859

... meaning that 'rep->md5_checksum' points to readable memory but not
to a valid checksum object.

Thus, in both cases, it looks like that pointer was either uninitialized
or subsequently corrupted.

I searched in 1.6.5 code for creation of a representation_t structure
and found nowhere that leaves it uninitialized.  At least one place
allocates the structure with apr_pcalloc (zero-initialized) and doesn't
subsequently fill in ->md5_checksum, but that's not what you're seeing.
I searched for creation of node_revision_t structures and found no
places where the data_rep member is uninitialized.

Thus I can only suggest looking for memory corruption.

- Julian


> #2  0x00002ac15f058762 in svn_fs_fs__rep_contents_dir (
>      entries_p=0x2aaab92d7cd0, fs=0x2aab80d53590, noderev=0x2aab60107a48,
>      pool=0x2aab1c15a0e8) at subversion/libsvn_fs_fs/fs_fs.c:3351
> #3  0x00002ac15f059c83 in svn_fs_fs__set_entry (fs=0x2aab80d53590,
>      txn_id=0x2aabccccda58 "532073-bkk4", parent_noderev=0x2aab60107a48,
>      name=0x2aab4fa6e0c0 "hash:18", id=0x2aab6921d868, kind=svn_node_dir,
>      pool=0x1b49bd78) at subversion/libsvn_fs_fs/fs_fs.c:4651
> #4  0x00002ac15f04ddee in set_entry (parent=0x2aaaf1ff7ee8,
>      name=0x2aab4fa6e0c0 "hash:18", id=0x2aab6921d868, kind=svn_node_dir,
>      txn_id=0x2aabccccda58 "532073-bkk4", pool=0x1b49bd78)
>      at subversion/libsvn_fs_fs/dag.c:346
> #5  0x00002ac15f05f3a7 in merge (conflict_p=0x2aabc568f168,
>      target_path=0x2ac15f063530 "/", target=0x2aaaf1ff7ee8,
>      source=0x2aab23fea2c8, ancestor=0x2aab482efc18,
>      txn_id=0x2aabccccda58 "532073-bkk4", mergeinfo_increment_out=0x0,
>      pool=0x2aab41ef8b38) at subversion/libsvn_fs_fs/tree.c:1422
> #6  0x00002ac15f05f4fa in merge_changes (ancestor_node=0x2aab482efc18,
>      source_node=0x2aab23fea2c8, txn=<value optimized out>,
>      conflict=0x2aabc568f168, pool=0x2aab41ef8b38)
>      at subversion/libsvn_fs_fs/tree.c:1602
> #7  0x00002ac15f05fcae in svn_fs_fs__commit_txn (conflict_p=0x0,
>      new_rev=0x2aaab92d8078, txn=0x2aabccccda30, pool=0x2aab81433cc8)
>      at subversion/libsvn_fs_fs/tree.c:1701
> #8  0x00002ac15ec242cb in svn_repos_fs_commit_txn (conflict_p=0x0,
>      repos=0x12f71318, new_rev=0x2aaab92d8078, txn=0x2aabccccda30,
>      pool=0x2aab81433cc8) at subversion/libsvn_repos/fs-wrap.c:51
> #9  0x00000000004fd819 in SvnUtil::commit_transaction (repos=0x12f71318,
>      fs=0x2aab80d53590, txn=0x2aabccccda30, client_info=<value optimized out>,
>      pool=@0x2aab709cf448) at SvnUtil.cpp:276
> #10 0x00000000004f10e8 in SvnTransactionI::commit (this=0x2aab709cf440,
>      clientInfo=@0x2aaab92d83d0, current=<value optimized out>)
>      at SvnTransactionI.cpp:964
> #11 0x000000000049ef47 in VnpIce::SvnTransaction::___commit (
>      this=0x2aab709cf440, __inS=@0x2aaab92d8580, __current=@0x2aaab92d8580)
>      at vnp_svn_backend.cpp:6796
> #12 0x00002ac15e345ea0 in IceInternal::Incoming::invoke (this=0x2aaab92d8580,
>      servantManager=@0x2aaab92d8930) at Incoming.cpp:447
> #13 0x00002ac15e3169ab in Ice::ConnectionI::invokeAll (this=0x2aabd00c0920,
>      stream=@0x2aaab92d8cb0, invokeNum=1, requestId=18708638, compress=0 '\0',
>      servantManager=@0x2aaab92d8930, adapter=@0x2aaab92d8920)
>      at ConnectionI.cpp:2440
> #14 0x00002ac15e31aac7 in Ice::ConnectionI::message (this=0x2aabd00c0920,
>      stream=@0x2aaab92d8cb0, threadPool=<value optimized out>)
>      at ConnectionI.cpp:1110
> #15 0x00002ac15e42b92a in IceInternal::ThreadPool::run (this=0x11e58ff0)
>      at ThreadPool.cpp:519
> #16 0x00002ac15e42c13f in IceInternal::ThreadPool::EventHandlerThread::run (
>      this=0x11e65a50) at ThreadPool.cpp:759
> #17 0x00002ac15e6f4e02 in startHook (arg=0x11e65a50) at Thread.cpp:368
> #18 0x0000003ae6a062f7 in start_thread () from /lib64/libpthread.so.0
> #19 0x00000033b0ad1b6d in clone () from /lib64/libc.so.6


Re: fs_fs core dumps in checksum code

Posted by Blair Zajac <bl...@orcaware.com>.
On 04/13/2012 06:33 PM, Daniel Shahaf wrote:
> Blair Zajac wrote on Fri, Apr 13, 2012 at 13:08:12 -0700:
>> To do valgrind well, do I need to recompile APR with specific flags to
>> enable pool debugging?
>>
>
> Other than --enable-pool-debugging=N (aka -DAPR_POOL_DEBUG=N) (where
> N is placeholder for a number) ?

Yes, besides that?

Besides valgrind and enabling pool debugging, does anybody have any 
recommendations for memory debugging besides valgrind in a production 
environment, something like using a library that you LD_PRELOAD to 
intercept malloc(), free(), etc?

These core dumps occur infrequently so if there's a way to leave some 
debugging in that doesn't cost too much of a hit, that would be great.

Blair

Re: fs_fs core dumps in checksum code

Posted by Daniel Shahaf <da...@elego.de>.
Blair Zajac wrote on Fri, Apr 13, 2012 at 13:08:12 -0700:
> To do valgrind well, do I need to recompile APR with specific flags to  
> enable pool debugging?
>

Other than --enable-pool-debugging=N (aka -DAPR_POOL_DEBUG=N) (where
N is placeholder for a number) ?

Re: fs_fs core dumps in checksum code

Posted by Blair Zajac <bl...@orcaware.com>.
On 04/16/2012 02:16 AM, Julian Foad wrote:
> Blair Zajac wrote:
>> On 04/13/2012 12:45 AM, Julian Foad wrote:
>>>   Blair Zajac wrote:
>>>>   Having the empty files, such as changes, is that odd?  Could that be a
>>>> hint?
>>>
>>>   No, that's not interesting, that's just the result of crashing out
>>> at the point where it did -- in the middle of doing a commit.
>>
>> The 'changes' is created during the commit process and not building the
>> transaction?  If so, then having an empty changes file is odd and probably only
>> possible through the RPCS API we wrote that wraps svn_fs.h and svn_repos.h, in
>> which case, could there be a bug with trying to commit empty transactions in a
>> multithreaded environment?
>
> Or maybe the commit just got as far as creating the 'changes' file but crashed out before it got around to writing the list of changes into that file and flushing/closing it.

It appears that the changes file is only modified upon any modification 
to the txn and only read during the actual commit of the txn:

#!/usr/bin/python

import os
import sys
import time

import svn.fs
import svn.repos

try:
     repo = svn.repos.create('repo', None, None, {}, {})
except Exception:
     repo = svn.repos.open('repo')

fs = svn.repos.fs(repo)
txn = svn.fs.begin_txn2(fs, svn.fs.youngest_rev(fs), 0)
fs_root = svn.fs.txn_root(txn)

txn_name = svn.fs.txn_name(txn)

changes_path = os.path.join('repo', 'db', 'transactions', '%s.txn' % 
txn_name,
                             'changes')
print 1, os.stat(changes_path)[6], svn.fs.paths_changed(fs_root)

svn.fs.make_dir(fs_root, '%s' % (1000*1000*time.time()))

print 2, os.stat(changes_path)[6], svn.fs.paths_changed(fs_root)

os.system("strace -p %s -o strace.out &" % os.getpid())

print svn.repos.fs_commit_txn(repo, txn)


Re: fs_fs core dumps in checksum code

Posted by Julian Foad <ju...@btopenworld.com>.
Blair Zajac wrote:
> On 04/13/2012 12:45 AM, Julian Foad wrote:
>>  Blair Zajac wrote:
>>>  Having the empty files, such as changes, is that odd?  Could that be a 
>>> hint?
>> 
>>  No, that's not interesting, that's just the result of crashing out 
>> at the point where it did -- in the middle of doing a commit.
> 
> The 'changes' is created during the commit process and not building the 
> transaction?  If so, then having an empty changes file is odd and probably only 
> possible through the RPCS API we wrote that wraps svn_fs.h and svn_repos.h, in 
> which case, could there be a bug with trying to commit empty transactions in a 
> multithreaded environment?

Or maybe the commit just got as far as creating the 'changes' file but crashed out before it got around to writing the list of changes into that file and flushing/closing it.

- Julian

Re: fs_fs core dumps in checksum code

Posted by Blair Zajac <bl...@orcaware.com>.
On 04/13/2012 12:45 AM, Julian Foad wrote:
> Blair Zajac wrote:
>
>> Since we discussed this, we moved the Subversion server to a new box and
>> from RAID to FusionIO storage and we're still getting the core dumps
>> with the same stack trace, so I don't think its memory corruption.
>
> I meant I suspect corruption of this process's state by any mechanism, which could be buffer overflows, bad multi-threading, and so on.  You wrote before, "I'll run our our backend severs on the dev cluster in valgrind and see if we pick up anything there."  Were you able to try that?  Or just load the core dump files into GDB and see what you can see?

We didn't do valgrind, the box is in production and it would be too 
slow.  Maybe I can set up a dev process on the production server to test.

To do valgrind well, do I need to recompile APR with specific flags to 
enable pool debugging?

>> Yesterday, we got two core dumps within 30 minutes of each other.
>>
>> Would looking at the txn files in progress tell us anything?
> [...]
>> Having the empty files, such as changes, is that odd?  Could that be a hint?
>
> No, that's not interesting, that's just the result of crashing out at the point where it did -- in the middle of doing a commit.

The 'changes' is created during the commit process and not building the 
transaction?  If so, then having an empty changes file is odd and 
probably only possible through the RPCS API we wrote that wraps svn_fs.h 
and svn_repos.h, in which case, could there be a bug with trying to 
commit empty transactions in a multithreaded environment?

Blair

Re: fs_fs core dumps in checksum code

Posted by Julian Foad <ju...@btopenworld.com>.
Blair Zajac wrote:

> Since we discussed this, we moved the Subversion server to a new box and 
> from RAID to FusionIO storage and we're still getting the core dumps 
> with the same stack trace, so I don't think its memory corruption.

I meant I suspect corruption of this process's state by any mechanism, which could be buffer overflows, bad multi-threading, and so on.  You wrote before, "I'll run our our backend severs on the dev cluster in valgrind and see if we pick up anything there."  Were you able to try that?  Or just load the core dump files into GDB and see what you can see?

> Yesterday, we got two core dumps within 30 minutes of each other.
> 
> Would looking at the txn files in progress tell us anything?
[...]
> Having the empty files, such as changes, is that odd?  Could that be a hint?

No, that's not interesting, that's just the result of crashing out at the point where it did -- in the middle of doing a commit.

- Julian


> [1] 
> http://mail-archives.apache.org/mod_mbox/subversion-dev/201008.mbox/%3C4C59960C.90409@orcaware.com%3E
> 

Re: fs_fs core dumps in checksum code

Posted by Blair Zajac <bl...@orcaware.com>.
On 08/04/2010 09:32 AM, Blair Zajac wrote:
> On 08/04/2010 05:38 AM, Julian Foad wrote:
>> Again due to in-lining, I presume, fs_fs.c:2859 calls svn_checksum_dup()
>> but the debugger shows, in this case, only its subroutine
>> svn_checksum__from_digest().
>>
>>> #0 0x00002ac15f8bd28c in svn_checksum__from_digest (
>>> digest=0x646c6f663a706e76<Address 0x646c6f663a706e76 out of bounds>,
>>> kind=1949987428, result_pool=<value optimized out>)
>>> at subversion/libsvn_subr/checksum.c:77
>>
>> Here, digest is ASCII "dmof:pnv" and kind is hex 743a7264 is ASCII
>> "t:rd"...
>>
>>> 77 memcpy((unsigned char *)checksum->digest, digest, DIGESTSIZE(kind));
>>> #1 0x00002ac15f057f87 in read_representation (contents_p=0x2aaab92d7c50,
>>> fs=0x2aab80d53590, rep=0x2aabbd57e748, pool=0x2aab1c15a0e8)
>>> at subversion/libsvn_fs_fs/fs_fs.c:2859
>>
>> ... meaning that 'rep->md5_checksum' points to readable memory but not
>> to a valid checksum object.
>>
>> Thus, in both cases, it looks like that pointer was either uninitialized
>> or subsequently corrupted.
>>
>> I searched in 1.6.5 code for creation of a representation_t structure
>> and found nowhere that leaves it uninitialized. At least one place
>> allocates the structure with apr_pcalloc (zero-initialized) and doesn't
>> subsequently fill in ->md5_checksum, but that's not what you're seeing.
>> I searched for creation of node_revision_t structures and found no
>> places where the data_rep member is uninitialized.
>>
>> Thus I can only suggest looking for memory corruption.
>
> Hi Julian,
>
> Thanks for taking your time to look at this, I appreciate it.
>
> I haven't followed the code path in detail, but my gut agrees with you
> on memory corruption.

Hi Julian,

Resurrecting this thread [1] from 1.5 years ago ;)

Since we discussed this, we moved the Subversion server to a new box and 
from RAID to FusionIO storage and we're still getting the core dumps 
with the same stack trace, so I don't think its memory corruption.

Yesterday, we got two core dumps within 30 minutes of each other.

Would looking at the txn files in progress tell us anything?

bash-3.2# ls -l transactions/5653610-3f4jm.txn/
total 16
-rw-r--r-- 1 tomcat games    0 Apr 11 18:40 changes
-rw-r--r-- 1 tomcat games    4 Apr 11 18:41 next-ids
-rw-r--r-- 1 tomcat games  156 Apr 11 18:41 node.0.0
-rw-r--r-- 1 tomcat games 2035 Apr 11 18:41 node.0.0.children
-rw-r--r-- 1 tomcat games 2366 Apr 11 18:41 props

bash-3.2# ls -l txn-protorevs/5653610-3f4jm.rev*
-rw-r--r-- 1 tomcat games 0 Apr 11 18:40 txn-protorevs/5653610-3f4jm.rev
-rw-r--r-- 1 tomcat games 0 Apr 11 18:40 
txn-protorevs/5653610-3f4jm.rev-lock

Having the empty files, such as changes, is that odd?  Could that be a hint?

Blair

[1] 
http://mail-archives.apache.org/mod_mbox/subversion-dev/201008.mbox/%3C4C59960C.90409@orcaware.com%3E

Re: fs_fs core dumps in checksum code

Posted by Blair Zajac <bl...@orcaware.com>.
On 08/04/2010 05:38 AM, Julian Foad wrote:
> Hi Blair.  I had a look but can't see how this can occur.
>
> On Sat, 2010-07-31, Blair Zajac wrote:
>> We run long running, persistent processes that expose the svn_fs and svn_repos
>> APIs via RPC (using ZeroC's Ice).  We've seen some occasional core dumps after
>> the processes have been up for two weeks or more.  We're running against 1.6.5.
>>
>> The processes have LRU caches of fs_t and repos_t that they use if there is an
>> RPC request onto the same repos or revision.
>>
>> The core dumps only appear to happen during the commit phase in checksum related
>> code.  In the first, svn_checksum_dup() is getting a pointer to address 0x48
>> which is odd and the other svn_checksum__from_digest() is getting a pointer to
>> an out of bounds address.
>>
>> We haven't had the cycles to investigate them because they happen rarely and we
>> have more pressing issues.  I'm posting them here just in case anybody else has
>> seen these or has an idea.
>>
>> Regards,
>> Blair
>>
>> #0  svn_checksum_dup (checksum=0x48, pool=0x19730448) at
>> subversion/libsvn_subr/checksum.c:190
>> #1  0x00002abf50dabf87 in read_representation (contents_p=0x2aaadf514c50,
>> fs=0x2aab0af5a730, rep=0x19731478, pool=0x19730448)
>>       at subversion/libsvn_fs_fs/fs_fs.c:2859
>
> fs_fs.c:2859 is in rep_read_get_baton() which is actually a subroutine
> of read_representation(), presumably in-lined by the compiler and
> therefore not visible in the debug info.
>
> The pointer 'rep->md5_checksum' is apparently 0x48 at this time.
>
>> #2  0x00002abf50dac762 in svn_fs_fs__rep_contents_dir (entries_p=0x2aaadf514cd0,
>> fs=0x2aab0af5a730, noderev=0x19940428, pool=0x19730448)
>>       at subversion/libsvn_fs_fs/fs_fs.c:3351
>
> (Similarly, fs_fs.c:3351 is in get_dir_contents(), a subroutine of
> svn_fs_fs__rep_contents_dir().)
>
> (FWIW, we can see from the location of line 3351 that rep->txn_id is
> null.)
>
>> #3  0x00002abf50dadc83 in svn_fs_fs__set_entry (fs=0x2aab0af5a730,
>> txn_id=0x15ac3ca8 "683229-ewby", parent_noderev=0x19940428, name=0x1bf5a8d0
>> "hash:28",
>>       id=0x2aab1316b378, kind=svn_node_dir, pool=0x1723ae78) at
>> subversion/libsvn_fs_fs/fs_fs.c:4651
>
> (FWIW, at this level the rep of interest has been passed in as
> 'parent_noderev->data_rep'.)
>
>> #4  0x00002abf50da1dee in set_entry (parent=0x2aab127a9378, name=0x1bf5a8d0
>> "hash:28", id=0x2aab1316b378, kind=svn_node_dir,
>>       txn_id=0x15ac3ca8 "683229-ewby", pool=0x1723ae78) at
>> subversion/libsvn_fs_fs/dag.c:346
>> #5  0x00002abf50db33a7 in merge (conflict_p=0x2aab0d1b4de8,
>> target_path=0x2abf50db7530 "/", target=0x2aab127a9378, source=0x2aab18bde098,
>>       ancestor=0x27a17c98, txn_id=0x15ac3ca8 "683229-ewby",
>> mergeinfo_increment_out=0x0, pool=0x2aaaf76aed28) at
>> subversion/libsvn_fs_fs/tree.c:1422
>> #6  0x00002abf50db34fa in merge_changes (ancestor_node=0x27a17c98,
>> source_node=0x2aab18bde098, txn=<value optimized out>, conflict=0x2aab0d1b4de8,
>>       pool=0x2aaaf76aed28) at subversion/libsvn_fs_fs/tree.c:1602
>> #7  0x00002abf50db3cae in svn_fs_fs__commit_txn (conflict_p=0x0,
>> new_rev=0x2aaadf515078, txn=0x15ac3c80, pool=0x2aab14b7afc8)
>>       at subversion/libsvn_fs_fs/tree.c:1701
>> #8  0x00002abf509782cb in svn_repos_fs_commit_txn (conflict_p=0x0,
>> repos=0x2aab10c99df8, new_rev=0x2aaadf515078, txn=0x15ac3c80, pool=0x2aab14b7afc8)
>>       at subversion/libsvn_repos/fs-wrap.c:51
>> #9  0x00000000004fd819 in SvnUtil::commit_transaction (repos=0x2aab10c99df8,
>> fs=0x2aab0af5a730, txn=0x15ac3c80, client_info=<value optimized out>,
>>       pool=@0x2aab309374f8) at SvnUtil.cpp:276
>> #10 0x00000000004f10e8 in SvnTransactionI::commit (this=0x2aab309374f0,
>> clientInfo=@0x2aaadf5153d0, current=<value optimized out>)
>>       at SvnTransactionI.cpp:964
>> #11 0x000000000049ef47 in VnpIce::SvnTransaction::___commit
>> (this=0x2aab309374f0, __inS=@0x2aaadf515580, __current=@0x2aaadf515580)
>>       at vnp_svn_backend.cpp:6796
>> #12 0x00002abf50099ea0 in IceInternal::Incoming::invoke (this=0x2aaadf515580,
>> servantManager=@0x2aaadf515930) at Incoming.cpp:447
>> #13 0x00002abf5006a9ab in Ice::ConnectionI::invokeAll (this=0x23f96b50,
>> stream=@0x2aaadf515cb0, invokeNum=1, requestId=4873841, compress=0 '\0',
>>       servantManager=@0x2aaadf515930, adapter=@0x2aaadf515920) at
>> ConnectionI.cpp:2440
>> #14 0x00002abf5006eac7 in Ice::ConnectionI::message (this=0x23f96b50,
>> stream=@0x2aaadf515cb0, threadPool=<value optimized out>) at ConnectionI.cpp:1110
>> #15 0x00002abf5017f92a in IceInternal::ThreadPool::run (this=0x150ee070) at
>> ThreadPool.cpp:519
>> #16 0x00002abf5018013f in IceInternal::ThreadPool::EventHandlerThread::run
>> (this=0x151016d0) at ThreadPool.cpp:759
>> #17 0x00002abf50448e02 in startHook (arg=0x151016d0) at Thread.cpp:368
>> #18 0x0000003ae6a062f7 in start_thread () from /lib64/libpthread.so.0
>> #19 0x00000033b0ad1b6d in clone () from /lib64/libc.so.6
>> Current language:  auto; currently c
>
> These two back-traces appear to be showing the very same code path.
>
> Again due to in-lining, I presume, fs_fs.c:2859 calls svn_checksum_dup()
> but the debugger shows, in this case, only its subroutine
> svn_checksum__from_digest().
>
>> #0  0x00002ac15f8bd28c in svn_checksum__from_digest (
>>       digest=0x646c6f663a706e76<Address 0x646c6f663a706e76 out of bounds>,
>>       kind=1949987428, result_pool=<value optimized out>)
>>       at subversion/libsvn_subr/checksum.c:77
>
> Here, digest is ASCII "dmof:pnv" and kind is hex 743a7264 is ASCII
> "t:rd"...
>
>> 77	  memcpy((unsigned char *)checksum->digest, digest, DIGESTSIZE(kind));
>> #1  0x00002ac15f057f87 in read_representation (contents_p=0x2aaab92d7c50,
>>       fs=0x2aab80d53590, rep=0x2aabbd57e748, pool=0x2aab1c15a0e8)
>>       at subversion/libsvn_fs_fs/fs_fs.c:2859
>
> ... meaning that 'rep->md5_checksum' points to readable memory but not
> to a valid checksum object.
>
> Thus, in both cases, it looks like that pointer was either uninitialized
> or subsequently corrupted.
>
> I searched in 1.6.5 code for creation of a representation_t structure
> and found nowhere that leaves it uninitialized.  At least one place
> allocates the structure with apr_pcalloc (zero-initialized) and doesn't
> subsequently fill in ->md5_checksum, but that's not what you're seeing.
> I searched for creation of node_revision_t structures and found no
> places where the data_rep member is uninitialized.
>
> Thus I can only suggest looking for memory corruption.

Hi Julian,

Thanks for taking your time to look at this, I appreciate it.

I haven't followed the code path in detail, but my gut agrees with you 
on memory corruption.

The code that uses svn_fs.h and svn_repos.h is C++ with very little raw 
pointer manipulation, so my inclination is that it's not the C++ code 
treading over this memory but something in the svn layer.

This seems like a hard problem to track, as it occurs only after a long 
period of uptime in a high load performance setting so we just can't 
have debuggers or other tools attached to the process.  I'll run our our 
backend severs on the dev cluster in valgrind and see if we pick up 
anything there.

Regards,
Blair