You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@apr.apache.org by cm...@collab.net on 2003/03/10 07:21:44 UTC

Re: RA-DAV tests acting *very* strangely.

(Cc:ing this to the APR Development List in hopes that someone there
 reads my problem description and it triggers some distant memory of a
 long-ago pain, or that perhaps they can otherwise shed some light on
 the situation.  For reference, I am running Red Hat 7.2, and the APR
 from HTTP 2.0.44.)

cmpilato@collab.net writes:

> Hm.  Well debugging led me to believe that 5202 didn't actually break
> something, but perhaps shed light on a previously existing other
> problem.  I went over that change with a fine-toothed comb, and could
> find absolutely no fault with it.

After spending all of Friday night and Saturday morning hunting this
puppy, I was all set to put it out of mind until Monday morning.  But
you know how that goes.  The wife wanted to do a little reading before
going to sleep tonight, so I figured I'd use that time to take a fresh
look at this nasty bug.

Wow, does mental clarity help when debugging.

I think I've finally narrowed the bug down to a small enough problem
area, and let me tell you -- it ain't where I expected it.  Let me
recite the facts I had as of 7:00am Saturday:

   - Subversion as of revision 5202 was not passing 'make check' over
     DAV.  The initial import that happens at the beginning of the
     test run was failing because the directory "A" (in the greek
     tree) was being imported more than once.  In case you don't know,
     5202 was the revision in which I changed 'svn import' back to an
     "infix textdelta" commit process instead of a "postfix textdelta"
     one.  This means that the contents of imported files are now
     transmitted as the files are created in the repository instead of
     having all the files created as part of the directory walk, and
     then all the content for all the files being transmitted at the
     very end of the import.

   - The bug seemed to only occur when 'svn import' was run inside the
     Python test suite, but never when run on the cmdline.  I could
     trigger the bug by running 'basic_tests.py 1 --url
     http://localhost' or by calling actions.guarantee_greek_repository()
     from inside of a Python interpreter.

   - svn_io_dir_read() was definitely, without-a-doubt, returning the
     following ordered list of dirents:  { '.', '..', 'iota', 'A', 'A' } 
     (And the second 'A' triggered the failure.)

   - Reverting the 5202 change caused the bug to go away.

   - Keeping the 5202 change, but commenting out the call to 
     editor->apply_textdelta() that happens during the import would
     also cause the bug to go away (though of course, now the tests
     would fail because no file contents were being transmitted).

I began my search tonight with a focus on that messed-up
svn_io_dir_read() situation.  Recall that svn_io_dir_read() is just a
wrapper around apr_dir_read().  

First, I modified the import_dir() function to simply do the loop over
dirents, but nothing else.  I gdb'd the client during its 'svn import'
from a test run and examined the apr_dir_t structure used in that
call.  The 'dirstruct' member of that structure was particularly
interesting to me, and looked like this after each call (the paths
after the '###' below are the current dirent just read by
svn_io_dir_read()):

   ### "."
   (gdb) p *dir->dirstruct
   $4 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 16, filepos = 1, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### ".."
   (gdb) p *dir->dirstruct
   $6 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 32, filepos = 2, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "iota"
   (gdb) p *dir->dirstruct
   $8 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 48, filepos = 3, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "A"
   (gdb) p *dir->dirstruct
   $10 = {fd = 4, data = 0x80896a0 "C^\b", allocation = 4096, size = 64, 
     offset = 64, filepos = 4, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   
Then, I reverted that temporary change and debugged my client again to
see what happened when my weird bug showed up:

   ### "."
   (gdb) p *dir->dirstruct
   $1 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 16, filepos = 1, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### ".."
   (gdb) p *dir->dirstruct
   $3 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 32, filepos = 2, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "iota"
   (gdb) p *dir->dirstruct
   $5 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 48, filepos = 3, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "A"
   (gdb) p *dir->dirstruct
   $7 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
     offset = 64, filepos = 4, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}
   ### "A" (again)
   (gdb) p *dir->dirstruct
   $10 = {fd = 4, data = 0x80896a0 "èh\b", allocation = 4096, size = 16, 
     offset = 16, filepos = 5, lock = {__m_reserved = 0, __m_count = 0, 
       __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
         __spinlock = 0}}}

Whoa!  I noticed almost immediately that on that second "A" that is
returned, the *dir->dirstruct->data member had suddenly changed (and
the offset member was no longer incrementing by 16 like it had been).
This got me thinking that something down in APR was causing illness.
So I set a watchpoint on *dir->dirstruct->data, and ran the test
again.  The watchpoint was first triggered when the field was first
initialized with directory data, and then *POW* it was triggered again
on a call to readdir_r() (from apr_dir_read()).

So, readdir_r() was actually fouling up somehow, losing track of where
it was in its directory reading or something.  At this point, I
mis-read the documentation for readdir_r().  I recalled that we were
creating a temporary file in the top-level import directory each time
file contents were sent (in the editor->apply_textdelta() call, which
is commit_apply_txdelta in ra_dav).  My misunderstanding of the
readdir_r() manpage caused me to hack a change in ra.c so that instead
of making the temporary file in the top-level import directory, it
always made the file in "/tmp".  *POOF*.  The bug disappeared.

At this point, I called Karl.

We talked through the situation and by the end had decided on a couple
of things:

   - First, that I had misread the readdir_r() manpage, but that
     regardless, something seemed to be quite wrong with how that
     function (or APR's use of that function) was behaving on my
     system.

   - Secondly, the reason that 5202 appeared the cause of the bug was
     because when we were doing postfix textdeltas, we had finished
     doing all the directory reading for the whole import before we
     created our first temporary file, thus our tmpfile creation
     wasn't screwing up dir reads like it is for me now.

Now that I've read the readdir_r() manpage correctly, it would appear
that it should be completely safe to have this kind of processing:

   opendir(DIR)
   while (readdir(DIR))
     {
        create_file(DIR/tmpfile)
        do_stuff()
        delete_file(DIR/tmpfile)
     }

But apparently this isn't working out too well on my machine.  

Does anyone have any reason to believe that this would not be safe,
and have any ideas why running Subversion via a Python popen3() call
would cause some problem to occur which does not run Subversion is run
directly from the shell?  Inquiring (and admittedly boggled) minds
wanna know.

Re: 0.19 release can have revision 5202

Posted by Michael Price <mp...@atl.lmco.com>.
Karl Fogel writes:
 > Michael, in case you're not reading every thread, note that Mike
 > Pilato said:
 > 
 > cmpilato@collab.net writes:
 > > Oh, Release Manager -- can you put the 5202 change back into the
 > > release, pretty please?
 > 
 > It's safe again :-).

OK.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

0.19 release can have revision 5202

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Michael, in case you're not reading every thread, note that Mike
Pilato said:

cmpilato@collab.net writes:
> Oh, Release Manager -- can you put the 5202 change back into the
> release, pretty please?

It's safe again :-).

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

0.19 release can have revision 5202

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Michael, in case you're not reading every thread, note that Mike
Pilato said:

cmpilato@collab.net writes:
> Oh, Release Manager -- can you put the 5202 change back into the
> release, pretty please?

It's safe again :-).

Re: RA-DAV tests acting *very* strangely.

Posted by cm...@collab.net.
Justin Erenkrantz <ju...@erenkrantz.com> writes:

> --On Monday, March 10, 2003 12:21 AM -0600 cmpilato@collab.net wrote:
> 
> > call.  The 'dirstruct' member of that structure was particularly
> > interesting to me, and looked like this after each call (the paths
> > after the '###' below are the current dirent just read by
> > svn_io_dir_read()):
> 
> Can you jump into apr_dir_read and check the return values from
> readdir_r? I'd imagine that we should be hitting the !ret &&
> thedir->entry != retent case on line 187 - that should be returning
> APR_ENOENT.
> 
> Is that the case?  -- justin

First, let me say this:

  http://www.cs.helsinki.fi/linux/linux-kernel/2001-42/0064.html

Then, let me say that yes, this bug only happens for me when I'm
running the tests on tmpfs.

Finally, let me crawl into a deep dark hole and lie there a while.

(Karl ask me Friday afternoon if I had tested on both tmpfs and on the
regular filesystem.  I was confident that I had, but apparently messed
up somewhere.)

Oh, Release Manager -- can you put the 5202 change back into the
release, pretty please?

Re: RA-DAV tests acting *very* strangely.

Posted by cm...@collab.net.
Justin Erenkrantz <ju...@erenkrantz.com> writes:

> --On Monday, March 10, 2003 12:21 AM -0600 cmpilato@collab.net wrote:
> 
> > call.  The 'dirstruct' member of that structure was particularly
> > interesting to me, and looked like this after each call (the paths
> > after the '###' below are the current dirent just read by
> > svn_io_dir_read()):
> 
> Can you jump into apr_dir_read and check the return values from
> readdir_r? I'd imagine that we should be hitting the !ret &&
> thedir->entry != retent case on line 187 - that should be returning
> APR_ENOENT.
> 
> Is that the case?  -- justin

First, let me say this:

  http://www.cs.helsinki.fi/linux/linux-kernel/2001-42/0064.html

Then, let me say that yes, this bug only happens for me when I'm
running the tests on tmpfs.

Finally, let me crawl into a deep dark hole and lie there a while.

(Karl ask me Friday afternoon if I had tested on both tmpfs and on the
regular filesystem.  I was confident that I had, but apparently messed
up somewhere.)

Oh, Release Manager -- can you put the 5202 change back into the
release, pretty please?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: RA-DAV tests acting *very* strangely.

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.
--On Monday, March 10, 2003 12:21 AM -0600 cmpilato@collab.net wrote:

> call.  The 'dirstruct' member of that structure was particularly
> interesting to me, and looked like this after each call (the paths
> after the '###' below are the current dirent just read by
> svn_io_dir_read()):

Can you jump into apr_dir_read and check the return values from readdir_r? 
I'd imagine that we should be hitting the !ret && thedir->entry != retent case 
on line 187 - that should be returning APR_ENOENT.

Is that the case?  -- justin

Re: RA-DAV tests acting *very* strangely.

Posted by "Kirby C. Bohling" <kb...@birddog.com>.
On Mon, 2003-03-10 at 00:21, cmpilato@collab.net wrote:

> Then, I reverted that temporary change and debugged my client again to
> see what happened when my weird bug showed up:
> 
>    ### "."
>    (gdb) p *dir->dirstruct
>    $1 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
>      offset = 16, filepos = 1, lock = {__m_reserved = 0, __m_count = 0, 
>        __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
>          __spinlock = 0}}}
>    ### ".."
>    (gdb) p *dir->dirstruct
>    $3 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
>      offset = 32, filepos = 2, lock = {__m_reserved = 0, __m_count = 0, 
>        __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
>          __spinlock = 0}}}
>    ### "iota"
>    (gdb) p *dir->dirstruct
>    $5 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
>      offset = 48, filepos = 3, lock = {__m_reserved = 0, __m_count = 0, 
>        __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
>          __spinlock = 0}}}
>    ### "A"
>    (gdb) p *dir->dirstruct
>    $7 = {fd = 4, data = 0x80896a0 "çh\b", allocation = 4096, size = 64, 
>      offset = 64, filepos = 4, lock = {__m_reserved = 0, __m_count = 0, 
>        __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
>          __spinlock = 0}}}
>    ### "A" (again)
>    (gdb) p *dir->dirstruct
>    $10 = {fd = 4, data = 0x80896a0 "èh\b", allocation = 4096, size = 16, 
>      offset = 16, filepos = 5, lock = {__m_reserved = 0, __m_count = 0, 
>        __m_owner = 0x0, __m_kind = 0, __m_lock = {__status = 0, 
>          __spinlock = 0}}}
> 
> Whoa!  I noticed almost immediately that on that second "A" that is
> returned, the *dir->dirstruct->data member had suddenly changed (and
> the offset member was no longer incrementing by 16 like it had been).


I'm kinda curious about they 0x80896a0 changes it's interpretation from
run to run, and from the previous print outs, and the last one.  That's
weird.

The size=64 and offset incrementing by 16 looks right.  readdir_r calls
getdent(), which caches up to allocation size directory entries.  So
when you first get there it appears there are only 4 files.  If you read
the glibc code for readdir_r, when (offset <= size) you just grab the
next entry from the cache.  So when size=64 and offset=64 then you are
refreshing the cache (there now appear to be 5 files). The new cache is
16 bytes long, and has only one entry in it.  Sure seems like your doing
this:

opendir()
readdir()
creat() /* screws up the offsets and cache */
readdir()
readdir()
readdir()
readdir()
closedir()

I've read the manual.texi for glibc covering readdir_r/readdir_r64, and
didn't see anything that would lead me to believe that creating and
deleting a file would be safe.  Not sure what doc's your reading. 
readdir_r has no man page on my RH 7.2 machine.

I couldn't follow the code once you got into the getdent() case in
glibc/kernel (not smart enough, and didn't have time to investigate). 
I'm not sure how it works.  I'd be reasonable curious what you get if
you set a breakpoint just before the second "A" file gets read, and run
a program that has just the:

opendir()
while( !readdir_r() )
{
	printf( filename );
}
closedir();

To see how many entries it believes are in there now, and print out the
structure that has size/offset to see how many it attempts to read at
once.

That would miss files that only have an open file count, and don't exist
to anyone but the one process.  When you unlink the temporary, did it
get closed?  Otherwise you'd have a file that only the subversion
process can see (what's this have to do with python though, that doesn't
fit the facts yet).

the readdir_r has some logic for skipping deleted files, I wonder if
creating and deleting the files is screwing up the offsets.


<snip>

> But apparently this isn't working out too well on my machine.  
> 
> Does anyone have any reason to believe that this would not be safe,
> and have any ideas why running Subversion via a Python popen3() call
> would cause some problem to occur which does not run Subversion is run
> directly from the shell?  Inquiring (and admittedly boggled) minds
> wanna know.
> 

Hmmm, the only thing about popen3 I can see, is your environment might
be tweaked a bit.  Not sure how python's works, but the standard popen
says you get passed to the "/bin/sh -c" command line, so things like
LD_PRELOAD, the enviroment variable that puts glibc into compatibility
mode so your not calling the same readdir_r()/getdents() pair might be
mildly different.  Not sure how you could have problems with it, but
possible LFS vs. non-LFS in python versus not in python?

Possibly getting an strace in extremely verbose mode would be more
informative both under python, and not under python to see if there are
significant differences in libraries loading, and the series of syscalls
around this area.

Weird part is why isn't anyone else seeing it?  That doesn't make sense
either.

		Kirby


> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org
> 
> 
-- 
Real Programmers view electronic multimedia files with a hex editor.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: RA-DAV tests acting *very* strangely.

Posted by Justin Erenkrantz <ju...@erenkrantz.com>.
--On Monday, March 10, 2003 12:21 AM -0600 cmpilato@collab.net wrote:

> call.  The 'dirstruct' member of that structure was particularly
> interesting to me, and looked like this after each call (the paths
> after the '###' below are the current dirent just read by
> svn_io_dir_read()):

Can you jump into apr_dir_read and check the return values from readdir_r? 
I'd imagine that we should be hitting the !ret && thedir->entry != retent case 
on line 187 - that should be returning APR_ENOENT.

Is that the case?  -- justin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org