You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@directory.apache.org by Emmanuel Lecharny <el...@apache.org> on 2009/07/10 01:23:24 UTC
AVL vs Array, and DIRSERVER-1377
Hi guys,
I've been working with Kiran for 3 weeks trying to fix DIRSERVER-1377. I
think we have eliminated many differentt erros, and added some needed
synchronization code.
Right now, I can say that I was able to run the server using a
multi-threaded client creating and deleting the same entry (on enetry
per thread) without problem (I have done 2 000 000 updates, with 20
threads).
But the oneLevel index is broken soon after the test has started, and I
can reproduce this behavior quite easily, after only a few thousands of
loops.
So i'm still digging, and now, my perception is that we have a problem
in the current AVL tree implementation. I'm not very proud to admit that
I have spent more than one week trying to implement my own version of
those tree, using a damn smart algorithm, all iterative, with a hell of
optimisation and "pointers" all over the nodes. What a *fool* !
Not only it was useless, but the code I ended with, more than 4500 lines
of code, is not less likely to be bug free than the existing code. NIH
syndrom, I guess.
Anyway, yesterday, I had a kind of illumination in the dark (and smelly
code I'm gently floating in ...). Do we need to use AVLTrees ?
Let's see the pros and cons.
pros :
o From the academic POV, a very interesting data structure. A cool one
to evaluate your students, as soon as they don't ask you to write the code
o From the theorical POV : search in 1.44 O( log( N ) ), add and delete
in O ( Log( N ) ) too : quite efficient
o From the ASF POV : it could be promoted to commons-collection
cons :
o A hell to implement correctly
o A hell to test thoroughly
o 4500 line sof code : a hell to maintain
o Costly memory consumption (4 references to other nodes : left and
right, next and previou, a balance flag, a reference to the value)
o Complicated addition, even more complicated deletion
o Complicated serialization
o Almost impossible to enter in the code without a solid understanding
of the algorithm, a few existing decent web references.
So should we continue to use them ? Is there an alternative ?
No. And *Yes*.
And a damn good alternative : a simple array of sorted elements.
Why do we immediatly think about complex data structure when simple one
are offering all what we need ?
Ok, what do we need ? A place to store a set of ordered values. Thiose
values can be almost any kind of objects, and we may have millions of
them. Let's consider the case we just store 512 objects max, as we will
use another data structure on disk when we are above thise number.
A Object[] is all what we need, assuming it's sorted. And sorting such
an array is easy, as we just have to find the place where we will insert
the new value, copy some part of the array to the right, and insert the
data. O(Log(N)) for the search, plus a copy of all the values above the
given one.
Removing an element : same thing : find it, move all the elements on the
left one slot, and we are done Searching : O(log(N)), as the array is
sorted (use a dichotomic alogorithm. Quite easy)
Memory ? We just store an array of references to Objects, plus some
extra size. Let's say we start with a 16 slots array. Not to much space
is wasted. If we compare that with an AvlNode, when having more than 3
nodes, we save memory using an array even slightly too big.
Serialization is *way* easier, so is deserialization.
The whole code should be no more than a few hundred of commented and
javadoced lines of code.
Last, not least, it's likely to be *faster* than the AvlTree for the
basic search operation.
I'm almost done with this new implementation, just wanted to inform you
about those ideas. And I also think that it will fix DIRSERVER-1377.
Last, not least, I will be MIA for 5 days - unless I find some cheap
internet connection).
Thanks !
PS : Any thoughts ?
--
--
cordialement, regards,
Emmanuel Lécharny
www.iktek.com
directory.apache.org