You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucenenet.apache.org by Shad Storhaug <sh...@shadstorhaug.com> on 2017/07/17 10:04:11 UTC

RE: TestSearcherManager_Mem

Vincent,

I noticed that TestSearcherManager_Mem is failing randomly again and thought it had something to do with the MMapDiretory fix. But I checked out the commit before that was implemented, and it still randomly fails. In fact, swapping out different directory implementations has no effect - it randomly fails on all of them.

I thought we had this one nailed down before, but going through this email thread it looks like this failing test was never fully fixed, just improved. When you have a chance could you take another look, kind sir?


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Sunday, March 26, 2017 6:59 AM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Update

I located and fixed the TestCData() problem already - it turned out to be a bug with the test framework replaceFirst() method matching on a literal instead of a Regex. I also discovered that the implementation of HTMLStripCharFilter was ported from 4.8.1 (which had improvements from 4.8.0).

I am also hot on the trail of TestRegexps(). I have narrowed it down to the constructor of the RegExp class (the Automaton one). So far discovered one method not accounting for surrogate pairs and confirmed fix worked with a specific case, but the test still has other issues (also in the constructor of the RegExp class). Still investigating...

Shad

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Saturday, March 25, 2017 1:59 PM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I took a look at TestSearcherManager_Mem(), and the problem is not related to any one codec or any one directory. My previous theory that it was an issue with Dispose() not being called in the correct order was also incorrect. The files are being created, but not being deleted (confirmed on disk). They are randomly getting an "in use by another process" error, which makes me suspect this is concurrency related (although we really can't rule out the test framework itself). You want to have another look?

I am also starting to see a recurrence of a very old random problem - very rarely, the Lucene.Net.Tests.Search.TestDocTermOrdsRewriteMethod.TestRegexps() test will go into a state where it doesn't finish. I have it capped at 1 minute, but since under normal conditions it only takes a couple of seconds to finish, I suspect it is still misbehaving when it times out (and what do you know, it is also based on Automaton).

Yet another dinosaur that is coming back is Lucene.Net.Analysis.CharFilters.HTMLStripCharFilterTest.TestCData() - fails very rarely, but if you put the Repeat(100) attribute on the test, it will fail constantly. There were quite a few bugs in there because of the difference between Java and .NET on the .Read() method. In Java, all overloads consistently return -1 to show nothing was read. In .NET, the overload with no parameters returns -1 and the one with parameters returns 0 if no characters are read. Might not be the problem, but it is definitely a suspect.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 2:34 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'll have to think about what you write more, but I'm a big fan of consolidation. I'll have to familiarize myself with all the code you're mentioning before making a more coherent statement, so let me give you my uninformed ".NET view" on the matter for now.

First, too much consolidation is not always good.  For example, in ObjectExtensions I see this:

        public static bool ValueEquals<T>(this T a, T b)
        {
            if (a is IEnumerable && b is IEnumerable)
            {
                var iter = (b as IEnumerable).GetEnumerator();
                foreach (object value in a as IEnumerable)
                {
                    iter.MoveNext();
                    if (!object.Equals(value, iter.Current))
                    {
                        return false;
                    }
                }
                return true;
            }

            return a.Equals(b);
        }

This makes red flashing lights and claxons go off in my head: no, you cannot enumerate collections for equality testing. And it's not because the enumerator isn't properly disposed of (another pet peeve of mine). You can't just compare two enumerable collections, since the first one may care about the order (e.g. lists), but the other may not (e.g. sets). What exactly constitutes equality is really whatever makes the code correct.  Remember the curse of Brzozowski!

But for those things that can be consolidated, I agree with "Equatable" as being a better name than "Value" for collections that can be compared. There are two ways to proceed:
- implement EquatableHashSet<T>, EquatableList<T>, and so on, explicitly implementing IEquatable<T> with T being the name of the collection.
- implement IEqualityComparer<T> for evert collection T. Conceptually, this is the cleaner solution (decoupling equality testing from the object implementation is a cleaner solution), but of course this means that every comparison, dictionary or set needs to be created with that comparer as parameter. This opens up a whole new class  of bugs: forget one, and you're hosed.

I don't really know the best way to proceed at this time. Maybe someone else can chime in?

Ever since I looked at the Lucene.net source code, I've been wondering why more time wasn't invested in making the port more ".NET like". Now that I've experienced the joy of tracking down bugs and making tests work, I think I understand why <g>.

You're a very courageous person. 

Vincent




-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 7:18 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

Good find.

Actually, I am thinking that some consolidation is in order:

GETHASHCODE

For comparing collections with GetHashCode() we have:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.GetHashCode()
Support.HashHelpers.GetValueHashode()

These are all doing the same thing (getting a hash code based on a collection), and they are not all doing it the same way.


EQUALS

Similarly, we have several places that are using Equals() to compare collections:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.Equals()
Support.ObjectExtensions.ValueEquals()

Plus in many places we are using:

ISet<T>.SetEquals()
IEnumerable<T>.SequenceEquals() (LINQ)


My thinking is that we should use one of static helper classes (probably Arrays) to make a single implementation of each of these, and then override Equals, GetHashCode, and for that matter ToString() in all of our Support collections and call our helper class in every place.

Not to mention, there are still many Queries (and other such things) that someone decided to "optimize" by making the implementation consider only the first and last item in the set (which is a good recipe for mysterious bugs that are nearly impossible to find as far as I am concerned).

But this is going to take some grace - there are some implementations of GetHashCode and Equals out there that require a very specific algorithm in order to function correctly.

We should probably just merge Support.EquatableList<T> with Support.ValueList<T>, (equatable sounds like a more clear name) and rename ValueHashSet to match (or just call it HashSet<T>, or would that be confusing?). Perhaps there should also be an overload that accepts IList<T> and ISet<T> in the constructor so we can just wrap existing instances easily, without having to worry about whether it inherits from our support classes.

Plus all of our support collections should be able to compare their values, just like in Java - many are missing the overloads. Also we can probably just factor out HashHelpers and ObjectExtensions altogether.

There are some Queries that have protected access to their clauses list (which originally was IList<T>, but has been changed to ValueList<T> or EquatableList<T>). Maybe they should all be reverted back to using IList<T> and the Query itself can call Arrays.GetHashCode() and Arrays.Equals() static methods to compare internally, without worrying about whether the list (that might just be any old list) will compare its contents. The only possible exceptions are where the list is used as a key for a dictionary, but there aren't many of those. WDYT?

Shad


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 12:37 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'd love to give you hope for Brzozowski. The only thing I can give you at this time is the probable correction of a possible bug.
There's a method in AutomatonTestUtil:

	public static void DeterminizeSimple(Automaton a, ISet<State> initialset)

The second argument of that method is later used as key in a dictionary, like so:

	sets[initialset] = initialset;

This implies that the argument must be a ValueHashSet<State>. 
Now look at this method in AutomatonTestUtil.MinimizeSimple: the calls look like this:

            DeterminizeSimple(a, SpecialOperations.Reverse(a));

This means that SpecialOperations.Reverse(a) must return a ValueHashSet<State>. But it doesn't: It returns the accepted state as:

            HashSet<State> accept = new HashSet<State>();

I think this needs to be changed in:

            ValueHashSet<State> accept = new ValueHashSet<State>();

This may open another avenue of inquiry. The error syndrome I get from the failed Brzozowski algorithm are automatons with duplicate states. This may hint at another hash set used as a key but not created as ValueHashSet

But in any case, I'm quite convinced that SpecialOperations.Reverse should be amended as described.

The investigation continues.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 2:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

That's great news!

Looks like it is passing always now. But, for some reason in .NET core it is taking MUCH longer to finish. I am getting about 6 seconds in .NET Framework, and over 1 minute in .NET Core. Maybe using a stopwatch is the right solution to stabilize this behavior?

To run the tests on .NET Core, open the Lucene.Net.Portable.sln. You may need to run "dotnet restore" from the command line at the root of the repository in order to get it to compile (sometimes you have to close and reopen Visual Studio to get the command to take).

I have taken a stab at that IndicNormalizer. It was failing when trying to get a character out of the BitArray that it previously put in there. But it was designed to work with a Java BitSet, not a .NET BitArray. Perhaps there is some difference in the way it works that is causing this (like null character not being stored, or something silly like that). It's a shot in the dark, but since I cannot get the test to fail under controlled conditions, I have replaced it with the OpenBitSet (which is basically a Java BitSet with access to its underlying storage). At the very least, it won't hurt.

I'll also take a closer look at the random "file not closed" failures coming from TestSearcherManager_Mem(). I think you fixed the underlying cause for the main failure. But this is a sign that there is an unexpected exception being thrown that triggers Dispose() too early. Perhaps there is still a broken Codec that is causing this failure, which would explain its randomness.

Is there any hope for Brzozowski? I'll make a compromise - if you can solve that one, I will pretend we are on version 4.9 for TestEarlyTerminationDifferentSorter() so we can put it to bed - it's probably not worth the effort anyway (I have already spent several days chasing after that one).

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Friday, March 24, 2017 4:40 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I have a theory about TestCRTReopen: if you look at the java code (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/search/ControlledRealTimeReopenThread.java), you see there's a relation between the reentrant lock and its condition variable:

ReentrantLock reopenLock = new ReentrantLock(); Condition reopenCond = reopenLock.newCondition();

Maybe there's some subtlety in there that we miss. The lock is used only as a guard around the reopen condition, which maybe is how they rule in the Java Shire, but no such concepts exist as such in C#. 
The closest thing to a "real" ReentrantLock implementation I have ever seen in .NET (complete with condition variables, fair locking, and so on) is https://github.com/spring-projects/spring-net-threading/blob/master/src/Spring/Spring.Threading/Threading/Locks/ReentrantLock.cs

But that's a gorilla. All we really want is a banana, without the gorilla attached to it.

So that got me thinking: we know what ControlledRealTImeReopenThread does. Why don't we implement it in "pure" C# instead of trying to translate it from Java using synchronization primitives that are almost but not quite totally unlike those in .NET? 

You can find the result in the file attached.  I restrained myself and didn't replace Environment.TickCount with Stopwatch (which would be more correct and guard against overflows that occur in TickCount every 24.9 days).

In a fit of altruism and insight, I even let all the related unit tests run, and they all pass.  And finish in time!

But that's in my alternate universe, of course <g>.


Vincent


-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 3:49 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

FYI - TestSearcherManager_Mem() succeeds much more frequently, but still randomly fails.

Also, although I was able to make the error message change for TestCRTReopen(), I didn't manage to get it working. The problem is pretty obvious - the WaitForGeneration() method (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Search/TestControlledRealTimeReopenThread.cs#L680) is WAY too slow. Even if I increase the wait period from 20 to 60 seconds it still doesn't finish in time. I played with a few of the variables in ControlledRealTimeReopenThread, but couldn't get the behavior to change. I verified that PulseAll() gets called, but it doesn't seem like it is having any effect on the Wait().

For TestEarlyTerminationDifferentSorter(), I reviewed all of the code under test in the Misc project line by line, but there was nothing significant to fix. So, still broken (sometimes).

Thanks,
Shad Storhaug (NightOwl888)





-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, March 23, 2017 8:25 PM
To: Van Den Berghe, Vincent; dev@lucenenet.apache.org
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Parallel universe or not, I think you are making progress. I found a similar IncrementAndGet issue in the ThreadedIndexingAndSearchingTestCase that I have already corrected. However, it only mattered in one case, in all other cases the result of IncrementAndGet was not being utilized.

It is like someone intentionally wanted to make it impossible to get all of the bugs out of this code...

Anyway, stupid is as stupid does...I went through and scanned the entire codebase for IncrementAndGet and compared it against Lucene. Sure enough, the Core.Util.RefCount class was refactored from its original. I changed it back to the original code (backed by an AtomicInteger/AtomicInt32), and the TestCRTReopen() test no longer fails almost immediately - after a couple of minutes it now fails with the message "waited too long for commit generation". I don't know if I fixed it or broke it more, but it is definitely behaving differently now.

Now, let me see if I can bring your other changes into my universe...perhaps the new failure has something to do with the reset event.

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, March 23, 2017 7:04 PM
To: dev@lucenenet.apache.org
Cc: Shad Storhaug
Subject: TestSearcherManager_Mem

Even though I seem to live in a parallel universe where 42 isn't 42 and 4.8 isn't 4.8, I'll have a stab at resolving TestSearcherManager_Mem.

First, there is a method in TrackingIndexWriter:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.IncrementAndGet();
        }


The implementation calls the wrong indexGen method: it should call GetAndIncrement(), which doesn't exist in the .NET version. You can add the method to the AtomicLong class.
Too bad there's no Interlocked.PostIncrement, but it's easy enough:

              public long GetAndIncrement()
              {
                     return Interlocked.Increment(ref value) - 1;
              }

And adjust the call accordingly:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.GetAndIncrement();
        }


Next we turn our attention to ControlledRealTimeReopenThread<T>.
There's an event defined as follows:

        private ManualResetEvent reopenCond = new ManualResetEvent(false);

This is not correct, since the remainder of the implementation only Sets and Waits, but never resets. Once the event Set, the wait will never ... uh... wait the second time around. Change this as follows:

        private AutoResetEvent reopenCond = new AutoResetEvent(false);

Next, for some reason, time is counted in nanoseconds, but since Environment.TickCount is in milliseconds, we need to convert it by multiplying by 1000000.
Unfortunately, this is done by multiplication:

Environment.TickCount * 1000000

Since Environment.TickCount is an int and 1000000 is an int, the result is negative unless you just rebooted your computer in a Tardis doing a polka backwards.
Define:

                private const long MS_IN_NS = 1000000;

... and change all other references to 1000000 except one (see below) with MS_IN_NS: this should solve the overflow problem using C#'s promotion rules.
Next, notice that 64-bit integers are sometimes read outside locks:

                searchingGen = refreshStartGen;
                if (targetGen > searchingGen)
                    while (targetGen > searchingGen)


This isn't guaranteed to be atomic, and I'm a curmudgeon when it comes to parallelism and atomicity. Change all these lines by:

                Interlocked.Exchange(ref searchingGen, refreshStartGen);
                if (targetGen > Interlocked.Read(ref searchingGen))
                    while (targetGen > Interlocked.Read(ref searchingGen))


In my own spacetime continuum, the test now passes.

Added bonus points: dispose of the waitable event in ControlledRealTimeReopenThread<T> Dispose method by adding:

                                  reopenCond.Dispose();

(after the Monitor.PulseAll(this); statement)

Extreme added bonus points: the following statement is incorrect, but it works anyway:

                  reopenCond.WaitOne(new TimeSpan(sleepNS / 1000000));//Convert NS to Ticks

(the 1000000 should not be replaced by MS_IN_NS in the new version) The reason why it's incorrect is because the argument to TimeSpan in the call accepts 100-nanoseconds units, and dividing nanoseconds by 1000000 yields milliseconds instead. So you will wait a delay of a factor 10000 shorter. It turns out that the correction (divide by 100) will cause timeouts in the tests, so I left it as-is.


But all of this might be wrong, I may not even exists.


Vincent

RE: TestSearcherManager_Mem

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.

>> Are you getting the "file locked by another process" error, or something else?
Yes, I'm getting the "file locked by another process" error. The number of files is drastically reduced with the correction of that test framework bug I sent you, but I'm still getting it.

>> I am hopeful that fixing the LockFactory implementation will have something to do with this bug.
I hope so too, but signs aren't good. Actually, I've written  a truly remarkable native lock factory implementation, but the margin of this e-mail is too small to contain it. I'll send it separately. Alas, it doesn't correct the TestSearcherManager_Mem bug. But it does correct something that is bothering me when I had to look at it searching for other problems.

The test framework being a piece of work is an understatement. It's worse than working with github <g>!

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, July 21, 2017 12:33 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

Are you getting the "file locked by another process" error, or something else? I am hopeful that fixing the LockFactory implementation will have something to do with this bug.

I really hope you live longer than a couple of hours - you are a big help :). I think I have master caught up to all of the stuff you have been sending. 

Yea, that test framework is a piece of work. I found several bugs that were just due to the fact that someone decided to break all .NET and Java conventions and name the member variables pascal case (which conveniently conflicts with all of the property names). In fact, when I renamed the 2,000 pascal case variables in Lucene.Net core, some of the failing tests magically went away. Who knows how many more similar bugs remain in the test framework. 

I also noticed during the beta release to NuGet that in Java the test framework is actually a component for end users to use to assist them with testing Lucene applications. But our test framework isn't anywhere near the quality it needs to be for end users, and even if it were it assumes the end user is using NUnit. So, I doubt we will ever want to do that, but still it would be nice to clean it up so we know the random failures are coming from the software and not the test framework. Not to mention any failures we are missing because the testing isn't happening right. I found this interesting article (http://blog.mikemccandless.com/2014/04/testing-lucenes-index-durability-after.html) that explains how the test framework (and Fsync) is *supposed* to work.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Thursday, July 20, 2017 9:10 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I've decided to dedicate the remainder of my useful life, or the next couple of hours (whichever comes first) to finding why TestSearcherManager_Mem fails randomly.
First observation: in LuceneTestCase.cs, if you remove all other directory types except NIOFSDirectory:

        /// <summary>
        /// Filesystem-based <seealso cref="Directory"/> implementations. </summary>
        private static readonly IList<string> FS_DIRECTORIES = Arrays.AsList(/*"SimpleFSDirectory",*/ "NIOFSDirectory" /*, "MMapDirectory"*/);

... the test fails almost always immediately, at least for me.
Second, there seems to be an inconsistency in tracking open file handlers for input slicers.
In MockDirectory.CreateSlicer, we see:

            IndexInputSlicer handle = new IndexInputSlicerAnonymousInnerClassHelper(this, name, delegateHandle);
            AddFileHandle(handle, name, Handle.Slice);

But when the slicer is disposed, in IndexInputSlicerAnonymousInnerClassHelper, there's the following:

                    if (disposing)
                    {
                        DelegateHandle.Dispose();
                        OuterInstance.RemoveOpenFile(OuterInstance, Name);
                    }

... where the OuterInstance is the MockDirectoryInstance. This is also IDisposable, so it compiles OK, but it's of course wrong since it will never be found.
I change the last statement to:

                        OuterInstance.RemoveOpenFile(this, Name);

It's one less bug... but the problem remains.


Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, July 17, 2017 12:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I noticed that TestSearcherManager_Mem is failing randomly again and thought it had something to do with the MMapDiretory fix. But I checked out the commit before that was implemented, and it still randomly fails. In fact, swapping out different directory implementations has no effect - it randomly fails on all of them.

I thought we had this one nailed down before, but going through this email thread it looks like this failing test was never fully fixed, just improved. When you have a chance could you take another look, kind sir?


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Sunday, March 26, 2017 6:59 AM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Update

I located and fixed the TestCData() problem already - it turned out to be a bug with the test framework replaceFirst() method matching on a literal instead of a Regex. I also discovered that the implementation of HTMLStripCharFilter was ported from 4.8.1 (which had improvements from 4.8.0).

I am also hot on the trail of TestRegexps(). I have narrowed it down to the constructor of the RegExp class (the Automaton one). So far discovered one method not accounting for surrogate pairs and confirmed fix worked with a specific case, but the test still has other issues (also in the constructor of the RegExp class). Still investigating...

Shad

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Saturday, March 25, 2017 1:59 PM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I took a look at TestSearcherManager_Mem(), and the problem is not related to any one codec or any one directory. My previous theory that it was an issue with Dispose() not being called in the correct order was also incorrect. The files are being created, but not being deleted (confirmed on disk). They are randomly getting an "in use by another process" error, which makes me suspect this is concurrency related (although we really can't rule out the test framework itself). You want to have another look?

I am also starting to see a recurrence of a very old random problem - very rarely, the Lucene.Net.Tests.Search.TestDocTermOrdsRewriteMethod.TestRegexps() test will go into a state where it doesn't finish. I have it capped at 1 minute, but since under normal conditions it only takes a couple of seconds to finish, I suspect it is still misbehaving when it times out (and what do you know, it is also based on Automaton).

Yet another dinosaur that is coming back is Lucene.Net.Analysis.CharFilters.HTMLStripCharFilterTest.TestCData() - fails very rarely, but if you put the Repeat(100) attribute on the test, it will fail constantly. There were quite a few bugs in there because of the difference between Java and .NET on the .Read() method. In Java, all overloads consistently return -1 to show nothing was read. In .NET, the overload with no parameters returns -1 and the one with parameters returns 0 if no characters are read. Might not be the problem, but it is definitely a suspect.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 2:34 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'll have to think about what you write more, but I'm a big fan of consolidation. I'll have to familiarize myself with all the code you're mentioning before making a more coherent statement, so let me give you my uninformed ".NET view" on the matter for now.

First, too much consolidation is not always good.  For example, in ObjectExtensions I see this:

        public static bool ValueEquals<T>(this T a, T b)
        {
            if (a is IEnumerable && b is IEnumerable)
            {
                var iter = (b as IEnumerable).GetEnumerator();
                foreach (object value in a as IEnumerable)
                {
                    iter.MoveNext();
                    if (!object.Equals(value, iter.Current))
                    {
                        return false;
                    }
                }
                return true;
            }

            return a.Equals(b);
        }

This makes red flashing lights and claxons go off in my head: no, you cannot enumerate collections for equality testing. And it's not because the enumerator isn't properly disposed of (another pet peeve of mine). You can't just compare two enumerable collections, since the first one may care about the order (e.g. lists), but the other may not (e.g. sets). What exactly constitutes equality is really whatever makes the code correct.  Remember the curse of Brzozowski!

But for those things that can be consolidated, I agree with "Equatable" as being a better name than "Value" for collections that can be compared. There are two ways to proceed:
- implement EquatableHashSet<T>, EquatableList<T>, and so on, explicitly implementing IEquatable<T> with T being the name of the collection.
- implement IEqualityComparer<T> for evert collection T. Conceptually, this is the cleaner solution (decoupling equality testing from the object implementation is a cleaner solution), but of course this means that every comparison, dictionary or set needs to be created with that comparer as parameter. This opens up a whole new class  of bugs: forget one, and you're hosed.

I don't really know the best way to proceed at this time. Maybe someone else can chime in?

Ever since I looked at the Lucene.net source code, I've been wondering why more time wasn't invested in making the port more ".NET like". Now that I've experienced the joy of tracking down bugs and making tests work, I think I understand why <g>.

You're a very courageous person. 

Vincent




-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 7:18 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

Good find.

Actually, I am thinking that some consolidation is in order:

GETHASHCODE

For comparing collections with GetHashCode() we have:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.GetHashCode()
Support.HashHelpers.GetValueHashode()

These are all doing the same thing (getting a hash code based on a collection), and they are not all doing it the same way.


EQUALS

Similarly, we have several places that are using Equals() to compare collections:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.Equals()
Support.ObjectExtensions.ValueEquals()

Plus in many places we are using:

ISet<T>.SetEquals()
IEnumerable<T>.SequenceEquals() (LINQ)


My thinking is that we should use one of static helper classes (probably Arrays) to make a single implementation of each of these, and then override Equals, GetHashCode, and for that matter ToString() in all of our Support collections and call our helper class in every place.

Not to mention, there are still many Queries (and other such things) that someone decided to "optimize" by making the implementation consider only the first and last item in the set (which is a good recipe for mysterious bugs that are nearly impossible to find as far as I am concerned).

But this is going to take some grace - there are some implementations of GetHashCode and Equals out there that require a very specific algorithm in order to function correctly.

We should probably just merge Support.EquatableList<T> with Support.ValueList<T>, (equatable sounds like a more clear name) and rename ValueHashSet to match (or just call it HashSet<T>, or would that be confusing?). Perhaps there should also be an overload that accepts IList<T> and ISet<T> in the constructor so we can just wrap existing instances easily, without having to worry about whether it inherits from our support classes.

Plus all of our support collections should be able to compare their values, just like in Java - many are missing the overloads. Also we can probably just factor out HashHelpers and ObjectExtensions altogether.

There are some Queries that have protected access to their clauses list (which originally was IList<T>, but has been changed to ValueList<T> or EquatableList<T>). Maybe they should all be reverted back to using IList<T> and the Query itself can call Arrays.GetHashCode() and Arrays.Equals() static methods to compare internally, without worrying about whether the list (that might just be any old list) will compare its contents. The only possible exceptions are where the list is used as a key for a dictionary, but there aren't many of those. WDYT?

Shad


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 12:37 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'd love to give you hope for Brzozowski. The only thing I can give you at this time is the probable correction of a possible bug.
There's a method in AutomatonTestUtil:

	public static void DeterminizeSimple(Automaton a, ISet<State> initialset)

The second argument of that method is later used as key in a dictionary, like so:

	sets[initialset] = initialset;

This implies that the argument must be a ValueHashSet<State>. 
Now look at this method in AutomatonTestUtil.MinimizeSimple: the calls look like this:

            DeterminizeSimple(a, SpecialOperations.Reverse(a));

This means that SpecialOperations.Reverse(a) must return a ValueHashSet<State>. But it doesn't: It returns the accepted state as:

            HashSet<State> accept = new HashSet<State>();

I think this needs to be changed in:

            ValueHashSet<State> accept = new ValueHashSet<State>();

This may open another avenue of inquiry. The error syndrome I get from the failed Brzozowski algorithm are automatons with duplicate states. This may hint at another hash set used as a key but not created as ValueHashSet

But in any case, I'm quite convinced that SpecialOperations.Reverse should be amended as described.

The investigation continues.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 2:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

That's great news!

Looks like it is passing always now. But, for some reason in .NET core it is taking MUCH longer to finish. I am getting about 6 seconds in .NET Framework, and over 1 minute in .NET Core. Maybe using a stopwatch is the right solution to stabilize this behavior?

To run the tests on .NET Core, open the Lucene.Net.Portable.sln. You may need to run "dotnet restore" from the command line at the root of the repository in order to get it to compile (sometimes you have to close and reopen Visual Studio to get the command to take).

I have taken a stab at that IndicNormalizer. It was failing when trying to get a character out of the BitArray that it previously put in there. But it was designed to work with a Java BitSet, not a .NET BitArray. Perhaps there is some difference in the way it works that is causing this (like null character not being stored, or something silly like that). It's a shot in the dark, but since I cannot get the test to fail under controlled conditions, I have replaced it with the OpenBitSet (which is basically a Java BitSet with access to its underlying storage). At the very least, it won't hurt.

I'll also take a closer look at the random "file not closed" failures coming from TestSearcherManager_Mem(). I think you fixed the underlying cause for the main failure. But this is a sign that there is an unexpected exception being thrown that triggers Dispose() too early. Perhaps there is still a broken Codec that is causing this failure, which would explain its randomness.

Is there any hope for Brzozowski? I'll make a compromise - if you can solve that one, I will pretend we are on version 4.9 for TestEarlyTerminationDifferentSorter() so we can put it to bed - it's probably not worth the effort anyway (I have already spent several days chasing after that one).

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Friday, March 24, 2017 4:40 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I have a theory about TestCRTReopen: if you look at the java code (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/search/ControlledRealTimeReopenThread.java), you see there's a relation between the reentrant lock and its condition variable:

ReentrantLock reopenLock = new ReentrantLock(); Condition reopenCond = reopenLock.newCondition();

Maybe there's some subtlety in there that we miss. The lock is used only as a guard around the reopen condition, which maybe is how they rule in the Java Shire, but no such concepts exist as such in C#. 
The closest thing to a "real" ReentrantLock implementation I have ever seen in .NET (complete with condition variables, fair locking, and so on) is https://github.com/spring-projects/spring-net-threading/blob/master/src/Spring/Spring.Threading/Threading/Locks/ReentrantLock.cs

But that's a gorilla. All we really want is a banana, without the gorilla attached to it.

So that got me thinking: we know what ControlledRealTImeReopenThread does. Why don't we implement it in "pure" C# instead of trying to translate it from Java using synchronization primitives that are almost but not quite totally unlike those in .NET? 

You can find the result in the file attached.  I restrained myself and didn't replace Environment.TickCount with Stopwatch (which would be more correct and guard against overflows that occur in TickCount every 24.9 days).

In a fit of altruism and insight, I even let all the related unit tests run, and they all pass.  And finish in time!

But that's in my alternate universe, of course <g>.


Vincent


-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 3:49 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

FYI - TestSearcherManager_Mem() succeeds much more frequently, but still randomly fails.

Also, although I was able to make the error message change for TestCRTReopen(), I didn't manage to get it working. The problem is pretty obvious - the WaitForGeneration() method (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Search/TestControlledRealTimeReopenThread.cs#L680) is WAY too slow. Even if I increase the wait period from 20 to 60 seconds it still doesn't finish in time. I played with a few of the variables in ControlledRealTimeReopenThread, but couldn't get the behavior to change. I verified that PulseAll() gets called, but it doesn't seem like it is having any effect on the Wait().

For TestEarlyTerminationDifferentSorter(), I reviewed all of the code under test in the Misc project line by line, but there was nothing significant to fix. So, still broken (sometimes).

Thanks,
Shad Storhaug (NightOwl888)





-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, March 23, 2017 8:25 PM
To: Van Den Berghe, Vincent; dev@lucenenet.apache.org
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Parallel universe or not, I think you are making progress. I found a similar IncrementAndGet issue in the ThreadedIndexingAndSearchingTestCase that I have already corrected. However, it only mattered in one case, in all other cases the result of IncrementAndGet was not being utilized.

It is like someone intentionally wanted to make it impossible to get all of the bugs out of this code...

Anyway, stupid is as stupid does...I went through and scanned the entire codebase for IncrementAndGet and compared it against Lucene. Sure enough, the Core.Util.RefCount class was refactored from its original. I changed it back to the original code (backed by an AtomicInteger/AtomicInt32), and the TestCRTReopen() test no longer fails almost immediately - after a couple of minutes it now fails with the message "waited too long for commit generation". I don't know if I fixed it or broke it more, but it is definitely behaving differently now.

Now, let me see if I can bring your other changes into my universe...perhaps the new failure has something to do with the reset event.

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, March 23, 2017 7:04 PM
To: dev@lucenenet.apache.org
Cc: Shad Storhaug
Subject: TestSearcherManager_Mem

Even though I seem to live in a parallel universe where 42 isn't 42 and 4.8 isn't 4.8, I'll have a stab at resolving TestSearcherManager_Mem.

First, there is a method in TrackingIndexWriter:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.IncrementAndGet();
        }


The implementation calls the wrong indexGen method: it should call GetAndIncrement(), which doesn't exist in the .NET version. You can add the method to the AtomicLong class.
Too bad there's no Interlocked.PostIncrement, but it's easy enough:

              public long GetAndIncrement()
              {
                     return Interlocked.Increment(ref value) - 1;
              }

And adjust the call accordingly:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.GetAndIncrement();
        }


Next we turn our attention to ControlledRealTimeReopenThread<T>.
There's an event defined as follows:

        private ManualResetEvent reopenCond = new ManualResetEvent(false);

This is not correct, since the remainder of the implementation only Sets and Waits, but never resets. Once the event Set, the wait will never ... uh... wait the second time around. Change this as follows:

        private AutoResetEvent reopenCond = new AutoResetEvent(false);

Next, for some reason, time is counted in nanoseconds, but since Environment.TickCount is in milliseconds, we need to convert it by multiplying by 1000000.
Unfortunately, this is done by multiplication:

Environment.TickCount * 1000000

Since Environment.TickCount is an int and 1000000 is an int, the result is negative unless you just rebooted your computer in a Tardis doing a polka backwards.
Define:

                private const long MS_IN_NS = 1000000;

... and change all other references to 1000000 except one (see below) with MS_IN_NS: this should solve the overflow problem using C#'s promotion rules.
Next, notice that 64-bit integers are sometimes read outside locks:

                searchingGen = refreshStartGen;
                if (targetGen > searchingGen)
                    while (targetGen > searchingGen)


This isn't guaranteed to be atomic, and I'm a curmudgeon when it comes to parallelism and atomicity. Change all these lines by:

                Interlocked.Exchange(ref searchingGen, refreshStartGen);
                if (targetGen > Interlocked.Read(ref searchingGen))
                    while (targetGen > Interlocked.Read(ref searchingGen))


In my own spacetime continuum, the test now passes.

Added bonus points: dispose of the waitable event in ControlledRealTimeReopenThread<T> Dispose method by adding:

                                  reopenCond.Dispose();

(after the Monitor.PulseAll(this); statement)

Extreme added bonus points: the following statement is incorrect, but it works anyway:

                  reopenCond.WaitOne(new TimeSpan(sleepNS / 1000000));//Convert NS to Ticks

(the 1000000 should not be replaced by MS_IN_NS in the new version) The reason why it's incorrect is because the argument to TimeSpan in the call accepts 100-nanoseconds units, and dividing nanoseconds by 1000000 yields milliseconds instead. So you will wait a delay of a factor 10000 shorter. It turns out that the correction (divide by 100) will cause timeouts in the tests, so I left it as-is.


But all of this might be wrong, I may not even exists.


Vincent

RE: TestSearcherManager_Mem

Posted by Shad Storhaug <sh...@shadstorhaug.com>.

Vincent,

Are you getting the "file locked by another process" error, or something else? I am hopeful that fixing the LockFactory implementation will have something to do with this bug.

I really hope you live longer than a couple of hours - you are a big help :). I think I have master caught up to all of the stuff you have been sending. 

Yea, that test framework is a piece of work. I found several bugs that were just due to the fact that someone decided to break all .NET and Java conventions and name the member variables pascal case (which conveniently conflicts with all of the property names). In fact, when I renamed the 2,000 pascal case variables in Lucene.Net core, some of the failing tests magically went away. Who knows how many more similar bugs remain in the test framework. 

I also noticed during the beta release to NuGet that in Java the test framework is actually a component for end users to use to assist them with testing Lucene applications. But our test framework isn't anywhere near the quality it needs to be for end users, and even if it were it assumes the end user is using NUnit. So, I doubt we will ever want to do that, but still it would be nice to clean it up so we know the random failures are coming from the software and not the test framework. Not to mention any failures we are missing because the testing isn't happening right. I found this interesting article (http://blog.mikemccandless.com/2014/04/testing-lucenes-index-durability-after.html) that explains how the test framework (and Fsync) is *supposed* to work.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Thursday, July 20, 2017 9:10 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I've decided to dedicate the remainder of my useful life, or the next couple of hours (whichever comes first) to finding why TestSearcherManager_Mem fails randomly.
First observation: in LuceneTestCase.cs, if you remove all other directory types except NIOFSDirectory:

        /// <summary>
        /// Filesystem-based <seealso cref="Directory"/> implementations. </summary>
        private static readonly IList<string> FS_DIRECTORIES = Arrays.AsList(/*"SimpleFSDirectory",*/ "NIOFSDirectory" /*, "MMapDirectory"*/);

... the test fails almost always immediately, at least for me.
Second, there seems to be an inconsistency in tracking open file handlers for input slicers.
In MockDirectory.CreateSlicer, we see:

            IndexInputSlicer handle = new IndexInputSlicerAnonymousInnerClassHelper(this, name, delegateHandle);
            AddFileHandle(handle, name, Handle.Slice);

But when the slicer is disposed, in IndexInputSlicerAnonymousInnerClassHelper, there's the following:

                    if (disposing)
                    {
                        DelegateHandle.Dispose();
                        OuterInstance.RemoveOpenFile(OuterInstance, Name);
                    }

... where the OuterInstance is the MockDirectoryInstance. This is also IDisposable, so it compiles OK, but it's of course wrong since it will never be found.
I change the last statement to:

                        OuterInstance.RemoveOpenFile(this, Name);

It's one less bug... but the problem remains.


Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, July 17, 2017 12:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I noticed that TestSearcherManager_Mem is failing randomly again and thought it had something to do with the MMapDiretory fix. But I checked out the commit before that was implemented, and it still randomly fails. In fact, swapping out different directory implementations has no effect - it randomly fails on all of them.

I thought we had this one nailed down before, but going through this email thread it looks like this failing test was never fully fixed, just improved. When you have a chance could you take another look, kind sir?


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Sunday, March 26, 2017 6:59 AM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Update

I located and fixed the TestCData() problem already - it turned out to be a bug with the test framework replaceFirst() method matching on a literal instead of a Regex. I also discovered that the implementation of HTMLStripCharFilter was ported from 4.8.1 (which had improvements from 4.8.0).

I am also hot on the trail of TestRegexps(). I have narrowed it down to the constructor of the RegExp class (the Automaton one). So far discovered one method not accounting for surrogate pairs and confirmed fix worked with a specific case, but the test still has other issues (also in the constructor of the RegExp class). Still investigating...

Shad

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Saturday, March 25, 2017 1:59 PM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I took a look at TestSearcherManager_Mem(), and the problem is not related to any one codec or any one directory. My previous theory that it was an issue with Dispose() not being called in the correct order was also incorrect. The files are being created, but not being deleted (confirmed on disk). They are randomly getting an "in use by another process" error, which makes me suspect this is concurrency related (although we really can't rule out the test framework itself). You want to have another look?

I am also starting to see a recurrence of a very old random problem - very rarely, the Lucene.Net.Tests.Search.TestDocTermOrdsRewriteMethod.TestRegexps() test will go into a state where it doesn't finish. I have it capped at 1 minute, but since under normal conditions it only takes a couple of seconds to finish, I suspect it is still misbehaving when it times out (and what do you know, it is also based on Automaton).

Yet another dinosaur that is coming back is Lucene.Net.Analysis.CharFilters.HTMLStripCharFilterTest.TestCData() - fails very rarely, but if you put the Repeat(100) attribute on the test, it will fail constantly. There were quite a few bugs in there because of the difference between Java and .NET on the .Read() method. In Java, all overloads consistently return -1 to show nothing was read. In .NET, the overload with no parameters returns -1 and the one with parameters returns 0 if no characters are read. Might not be the problem, but it is definitely a suspect.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 2:34 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'll have to think about what you write more, but I'm a big fan of consolidation. I'll have to familiarize myself with all the code you're mentioning before making a more coherent statement, so let me give you my uninformed ".NET view" on the matter for now.

First, too much consolidation is not always good.  For example, in ObjectExtensions I see this:

        public static bool ValueEquals<T>(this T a, T b)
        {
            if (a is IEnumerable && b is IEnumerable)
            {
                var iter = (b as IEnumerable).GetEnumerator();
                foreach (object value in a as IEnumerable)
                {
                    iter.MoveNext();
                    if (!object.Equals(value, iter.Current))
                    {
                        return false;
                    }
                }
                return true;
            }

            return a.Equals(b);
        }

This makes red flashing lights and claxons go off in my head: no, you cannot enumerate collections for equality testing. And it's not because the enumerator isn't properly disposed of (another pet peeve of mine). You can't just compare two enumerable collections, since the first one may care about the order (e.g. lists), but the other may not (e.g. sets). What exactly constitutes equality is really whatever makes the code correct.  Remember the curse of Brzozowski!

But for those things that can be consolidated, I agree with "Equatable" as being a better name than "Value" for collections that can be compared. There are two ways to proceed:
- implement EquatableHashSet<T>, EquatableList<T>, and so on, explicitly implementing IEquatable<T> with T being the name of the collection.
- implement IEqualityComparer<T> for evert collection T. Conceptually, this is the cleaner solution (decoupling equality testing from the object implementation is a cleaner solution), but of course this means that every comparison, dictionary or set needs to be created with that comparer as parameter. This opens up a whole new class  of bugs: forget one, and you're hosed.

I don't really know the best way to proceed at this time. Maybe someone else can chime in?

Ever since I looked at the Lucene.net source code, I've been wondering why more time wasn't invested in making the port more ".NET like". Now that I've experienced the joy of tracking down bugs and making tests work, I think I understand why <g>.

You're a very courageous person. 

Vincent




-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 7:18 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

Good find.

Actually, I am thinking that some consolidation is in order:

GETHASHCODE

For comparing collections with GetHashCode() we have:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.GetHashCode()
Support.HashHelpers.GetValueHashode()

These are all doing the same thing (getting a hash code based on a collection), and they are not all doing it the same way.


EQUALS

Similarly, we have several places that are using Equals() to compare collections:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.Equals()
Support.ObjectExtensions.ValueEquals()

Plus in many places we are using:

ISet<T>.SetEquals()
IEnumerable<T>.SequenceEquals() (LINQ)


My thinking is that we should use one of static helper classes (probably Arrays) to make a single implementation of each of these, and then override Equals, GetHashCode, and for that matter ToString() in all of our Support collections and call our helper class in every place.

Not to mention, there are still many Queries (and other such things) that someone decided to "optimize" by making the implementation consider only the first and last item in the set (which is a good recipe for mysterious bugs that are nearly impossible to find as far as I am concerned).

But this is going to take some grace - there are some implementations of GetHashCode and Equals out there that require a very specific algorithm in order to function correctly.

We should probably just merge Support.EquatableList<T> with Support.ValueList<T>, (equatable sounds like a more clear name) and rename ValueHashSet to match (or just call it HashSet<T>, or would that be confusing?). Perhaps there should also be an overload that accepts IList<T> and ISet<T> in the constructor so we can just wrap existing instances easily, without having to worry about whether it inherits from our support classes.

Plus all of our support collections should be able to compare their values, just like in Java - many are missing the overloads. Also we can probably just factor out HashHelpers and ObjectExtensions altogether.

There are some Queries that have protected access to their clauses list (which originally was IList<T>, but has been changed to ValueList<T> or EquatableList<T>). Maybe they should all be reverted back to using IList<T> and the Query itself can call Arrays.GetHashCode() and Arrays.Equals() static methods to compare internally, without worrying about whether the list (that might just be any old list) will compare its contents. The only possible exceptions are where the list is used as a key for a dictionary, but there aren't many of those. WDYT?

Shad


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 12:37 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'd love to give you hope for Brzozowski. The only thing I can give you at this time is the probable correction of a possible bug.
There's a method in AutomatonTestUtil:

	public static void DeterminizeSimple(Automaton a, ISet<State> initialset)

The second argument of that method is later used as key in a dictionary, like so:

	sets[initialset] = initialset;

This implies that the argument must be a ValueHashSet<State>. 
Now look at this method in AutomatonTestUtil.MinimizeSimple: the calls look like this:

            DeterminizeSimple(a, SpecialOperations.Reverse(a));

This means that SpecialOperations.Reverse(a) must return a ValueHashSet<State>. But it doesn't: It returns the accepted state as:

            HashSet<State> accept = new HashSet<State>();

I think this needs to be changed in:

            ValueHashSet<State> accept = new ValueHashSet<State>();

This may open another avenue of inquiry. The error syndrome I get from the failed Brzozowski algorithm are automatons with duplicate states. This may hint at another hash set used as a key but not created as ValueHashSet

But in any case, I'm quite convinced that SpecialOperations.Reverse should be amended as described.

The investigation continues.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 2:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

That's great news!

Looks like it is passing always now. But, for some reason in .NET core it is taking MUCH longer to finish. I am getting about 6 seconds in .NET Framework, and over 1 minute in .NET Core. Maybe using a stopwatch is the right solution to stabilize this behavior?

To run the tests on .NET Core, open the Lucene.Net.Portable.sln. You may need to run "dotnet restore" from the command line at the root of the repository in order to get it to compile (sometimes you have to close and reopen Visual Studio to get the command to take).

I have taken a stab at that IndicNormalizer. It was failing when trying to get a character out of the BitArray that it previously put in there. But it was designed to work with a Java BitSet, not a .NET BitArray. Perhaps there is some difference in the way it works that is causing this (like null character not being stored, or something silly like that). It's a shot in the dark, but since I cannot get the test to fail under controlled conditions, I have replaced it with the OpenBitSet (which is basically a Java BitSet with access to its underlying storage). At the very least, it won't hurt.

I'll also take a closer look at the random "file not closed" failures coming from TestSearcherManager_Mem(). I think you fixed the underlying cause for the main failure. But this is a sign that there is an unexpected exception being thrown that triggers Dispose() too early. Perhaps there is still a broken Codec that is causing this failure, which would explain its randomness.

Is there any hope for Brzozowski? I'll make a compromise - if you can solve that one, I will pretend we are on version 4.9 for TestEarlyTerminationDifferentSorter() so we can put it to bed - it's probably not worth the effort anyway (I have already spent several days chasing after that one).

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Friday, March 24, 2017 4:40 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I have a theory about TestCRTReopen: if you look at the java code (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/search/ControlledRealTimeReopenThread.java), you see there's a relation between the reentrant lock and its condition variable:

ReentrantLock reopenLock = new ReentrantLock(); Condition reopenCond = reopenLock.newCondition();

Maybe there's some subtlety in there that we miss. The lock is used only as a guard around the reopen condition, which maybe is how they rule in the Java Shire, but no such concepts exist as such in C#. 
The closest thing to a "real" ReentrantLock implementation I have ever seen in .NET (complete with condition variables, fair locking, and so on) is https://github.com/spring-projects/spring-net-threading/blob/master/src/Spring/Spring.Threading/Threading/Locks/ReentrantLock.cs

But that's a gorilla. All we really want is a banana, without the gorilla attached to it.

So that got me thinking: we know what ControlledRealTImeReopenThread does. Why don't we implement it in "pure" C# instead of trying to translate it from Java using synchronization primitives that are almost but not quite totally unlike those in .NET? 

You can find the result in the file attached.  I restrained myself and didn't replace Environment.TickCount with Stopwatch (which would be more correct and guard against overflows that occur in TickCount every 24.9 days).

In a fit of altruism and insight, I even let all the related unit tests run, and they all pass.  And finish in time!

But that's in my alternate universe, of course <g>.


Vincent


-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 3:49 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

FYI - TestSearcherManager_Mem() succeeds much more frequently, but still randomly fails.

Also, although I was able to make the error message change for TestCRTReopen(), I didn't manage to get it working. The problem is pretty obvious - the WaitForGeneration() method (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Search/TestControlledRealTimeReopenThread.cs#L680) is WAY too slow. Even if I increase the wait period from 20 to 60 seconds it still doesn't finish in time. I played with a few of the variables in ControlledRealTimeReopenThread, but couldn't get the behavior to change. I verified that PulseAll() gets called, but it doesn't seem like it is having any effect on the Wait().

For TestEarlyTerminationDifferentSorter(), I reviewed all of the code under test in the Misc project line by line, but there was nothing significant to fix. So, still broken (sometimes).

Thanks,
Shad Storhaug (NightOwl888)





-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, March 23, 2017 8:25 PM
To: Van Den Berghe, Vincent; dev@lucenenet.apache.org
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Parallel universe or not, I think you are making progress. I found a similar IncrementAndGet issue in the ThreadedIndexingAndSearchingTestCase that I have already corrected. However, it only mattered in one case, in all other cases the result of IncrementAndGet was not being utilized.

It is like someone intentionally wanted to make it impossible to get all of the bugs out of this code...

Anyway, stupid is as stupid does...I went through and scanned the entire codebase for IncrementAndGet and compared it against Lucene. Sure enough, the Core.Util.RefCount class was refactored from its original. I changed it back to the original code (backed by an AtomicInteger/AtomicInt32), and the TestCRTReopen() test no longer fails almost immediately - after a couple of minutes it now fails with the message "waited too long for commit generation". I don't know if I fixed it or broke it more, but it is definitely behaving differently now.

Now, let me see if I can bring your other changes into my universe...perhaps the new failure has something to do with the reset event.

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, March 23, 2017 7:04 PM
To: dev@lucenenet.apache.org
Cc: Shad Storhaug
Subject: TestSearcherManager_Mem

Even though I seem to live in a parallel universe where 42 isn't 42 and 4.8 isn't 4.8, I'll have a stab at resolving TestSearcherManager_Mem.

First, there is a method in TrackingIndexWriter:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.IncrementAndGet();
        }


The implementation calls the wrong indexGen method: it should call GetAndIncrement(), which doesn't exist in the .NET version. You can add the method to the AtomicLong class.
Too bad there's no Interlocked.PostIncrement, but it's easy enough:

              public long GetAndIncrement()
              {
                     return Interlocked.Increment(ref value) - 1;
              }

And adjust the call accordingly:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.GetAndIncrement();
        }


Next we turn our attention to ControlledRealTimeReopenThread<T>.
There's an event defined as follows:

        private ManualResetEvent reopenCond = new ManualResetEvent(false);

This is not correct, since the remainder of the implementation only Sets and Waits, but never resets. Once the event Set, the wait will never ... uh... wait the second time around. Change this as follows:

        private AutoResetEvent reopenCond = new AutoResetEvent(false);

Next, for some reason, time is counted in nanoseconds, but since Environment.TickCount is in milliseconds, we need to convert it by multiplying by 1000000.
Unfortunately, this is done by multiplication:

Environment.TickCount * 1000000

Since Environment.TickCount is an int and 1000000 is an int, the result is negative unless you just rebooted your computer in a Tardis doing a polka backwards.
Define:

                private const long MS_IN_NS = 1000000;

... and change all other references to 1000000 except one (see below) with MS_IN_NS: this should solve the overflow problem using C#'s promotion rules.
Next, notice that 64-bit integers are sometimes read outside locks:

                searchingGen = refreshStartGen;
                if (targetGen > searchingGen)
                    while (targetGen > searchingGen)


This isn't guaranteed to be atomic, and I'm a curmudgeon when it comes to parallelism and atomicity. Change all these lines by:

                Interlocked.Exchange(ref searchingGen, refreshStartGen);
                if (targetGen > Interlocked.Read(ref searchingGen))
                    while (targetGen > Interlocked.Read(ref searchingGen))


In my own spacetime continuum, the test now passes.

Added bonus points: dispose of the waitable event in ControlledRealTimeReopenThread<T> Dispose method by adding:

                                  reopenCond.Dispose();

(after the Monitor.PulseAll(this); statement)

Extreme added bonus points: the following statement is incorrect, but it works anyway:

                  reopenCond.WaitOne(new TimeSpan(sleepNS / 1000000));//Convert NS to Ticks

(the 1000000 should not be replaced by MS_IN_NS in the new version) The reason why it's incorrect is because the argument to TimeSpan in the call accepts 100-nanoseconds units, and dividing nanoseconds by 1000000 yields milliseconds instead. So you will wait a delay of a factor 10000 shorter. It turns out that the correction (divide by 100) will cause timeouts in the tests, so I left it as-is.


But all of this might be wrong, I may not even exists.


Vincent

RE: TestSearcherManager_Mem

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.

Hello Shad,

I've decided to dedicate the remainder of my useful life, or the next couple of hours (whichever comes first) to finding why TestSearcherManager_Mem fails randomly.
First observation: in LuceneTestCase.cs, if you remove all other directory types except NIOFSDirectory:

        /// <summary>
        /// Filesystem-based <seealso cref="Directory"/> implementations. </summary>
        private static readonly IList<string> FS_DIRECTORIES = Arrays.AsList(/*"SimpleFSDirectory",*/ "NIOFSDirectory" /*, "MMapDirectory"*/);

... the test fails almost always immediately, at least for me.
Second, there seems to be an inconsistency in tracking open file handlers for input slicers.
In MockDirectory.CreateSlicer, we see:

            IndexInputSlicer handle = new IndexInputSlicerAnonymousInnerClassHelper(this, name, delegateHandle);
            AddFileHandle(handle, name, Handle.Slice);

But when the slicer is disposed, in IndexInputSlicerAnonymousInnerClassHelper, there's the following:

                    if (disposing)
                    {
                        DelegateHandle.Dispose();
                        OuterInstance.RemoveOpenFile(OuterInstance, Name);
                    }

... where the OuterInstance is the MockDirectoryInstance. This is also IDisposable, so it compiles OK, but it's of course wrong since it will never be found.
I change the last statement to:

                        OuterInstance.RemoveOpenFile(this, Name);

It's one less bug... but the problem remains.


Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, July 17, 2017 12:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I noticed that TestSearcherManager_Mem is failing randomly again and thought it had something to do with the MMapDiretory fix. But I checked out the commit before that was implemented, and it still randomly fails. In fact, swapping out different directory implementations has no effect - it randomly fails on all of them.

I thought we had this one nailed down before, but going through this email thread it looks like this failing test was never fully fixed, just improved. When you have a chance could you take another look, kind sir?


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Sunday, March 26, 2017 6:59 AM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Update

I located and fixed the TestCData() problem already - it turned out to be a bug with the test framework replaceFirst() method matching on a literal instead of a Regex. I also discovered that the implementation of HTMLStripCharFilter was ported from 4.8.1 (which had improvements from 4.8.0).

I am also hot on the trail of TestRegexps(). I have narrowed it down to the constructor of the RegExp class (the Automaton one). So far discovered one method not accounting for surrogate pairs and confirmed fix worked with a specific case, but the test still has other issues (also in the constructor of the RegExp class). Still investigating...

Shad

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Saturday, March 25, 2017 1:59 PM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I took a look at TestSearcherManager_Mem(), and the problem is not related to any one codec or any one directory. My previous theory that it was an issue with Dispose() not being called in the correct order was also incorrect. The files are being created, but not being deleted (confirmed on disk). They are randomly getting an "in use by another process" error, which makes me suspect this is concurrency related (although we really can't rule out the test framework itself). You want to have another look?

I am also starting to see a recurrence of a very old random problem - very rarely, the Lucene.Net.Tests.Search.TestDocTermOrdsRewriteMethod.TestRegexps() test will go into a state where it doesn't finish. I have it capped at 1 minute, but since under normal conditions it only takes a couple of seconds to finish, I suspect it is still misbehaving when it times out (and what do you know, it is also based on Automaton).

Yet another dinosaur that is coming back is Lucene.Net.Analysis.CharFilters.HTMLStripCharFilterTest.TestCData() - fails very rarely, but if you put the Repeat(100) attribute on the test, it will fail constantly. There were quite a few bugs in there because of the difference between Java and .NET on the .Read() method. In Java, all overloads consistently return -1 to show nothing was read. In .NET, the overload with no parameters returns -1 and the one with parameters returns 0 if no characters are read. Might not be the problem, but it is definitely a suspect.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 2:34 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'll have to think about what you write more, but I'm a big fan of consolidation. I'll have to familiarize myself with all the code you're mentioning before making a more coherent statement, so let me give you my uninformed ".NET view" on the matter for now.

First, too much consolidation is not always good.  For example, in ObjectExtensions I see this:

        public static bool ValueEquals<T>(this T a, T b)
        {
            if (a is IEnumerable && b is IEnumerable)
            {
                var iter = (b as IEnumerable).GetEnumerator();
                foreach (object value in a as IEnumerable)
                {
                    iter.MoveNext();
                    if (!object.Equals(value, iter.Current))
                    {
                        return false;
                    }
                }
                return true;
            }

            return a.Equals(b);
        }

This makes red flashing lights and claxons go off in my head: no, you cannot enumerate collections for equality testing. And it's not because the enumerator isn't properly disposed of (another pet peeve of mine). You can't just compare two enumerable collections, since the first one may care about the order (e.g. lists), but the other may not (e.g. sets). What exactly constitutes equality is really whatever makes the code correct.  Remember the curse of Brzozowski!

But for those things that can be consolidated, I agree with "Equatable" as being a better name than "Value" for collections that can be compared. There are two ways to proceed:
- implement EquatableHashSet<T>, EquatableList<T>, and so on, explicitly implementing IEquatable<T> with T being the name of the collection.
- implement IEqualityComparer<T> for evert collection T. Conceptually, this is the cleaner solution (decoupling equality testing from the object implementation is a cleaner solution), but of course this means that every comparison, dictionary or set needs to be created with that comparer as parameter. This opens up a whole new class  of bugs: forget one, and you're hosed.

I don't really know the best way to proceed at this time. Maybe someone else can chime in?

Ever since I looked at the Lucene.net source code, I've been wondering why more time wasn't invested in making the port more ".NET like". Now that I've experienced the joy of tracking down bugs and making tests work, I think I understand why <g>.

You're a very courageous person. 

Vincent




-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 7:18 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

Good find.

Actually, I am thinking that some consolidation is in order:

GETHASHCODE

For comparing collections with GetHashCode() we have:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.GetHashCode()
Support.HashHelpers.GetValueHashode()

These are all doing the same thing (getting a hash code based on a collection), and they are not all doing it the same way.


EQUALS

Similarly, we have several places that are using Equals() to compare collections:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.Equals()
Support.ObjectExtensions.ValueEquals()

Plus in many places we are using:

ISet<T>.SetEquals()
IEnumerable<T>.SequenceEquals() (LINQ)


My thinking is that we should use one of static helper classes (probably Arrays) to make a single implementation of each of these, and then override Equals, GetHashCode, and for that matter ToString() in all of our Support collections and call our helper class in every place.

Not to mention, there are still many Queries (and other such things) that someone decided to "optimize" by making the implementation consider only the first and last item in the set (which is a good recipe for mysterious bugs that are nearly impossible to find as far as I am concerned).

But this is going to take some grace - there are some implementations of GetHashCode and Equals out there that require a very specific algorithm in order to function correctly.

We should probably just merge Support.EquatableList<T> with Support.ValueList<T>, (equatable sounds like a more clear name) and rename ValueHashSet to match (or just call it HashSet<T>, or would that be confusing?). Perhaps there should also be an overload that accepts IList<T> and ISet<T> in the constructor so we can just wrap existing instances easily, without having to worry about whether it inherits from our support classes.

Plus all of our support collections should be able to compare their values, just like in Java - many are missing the overloads. Also we can probably just factor out HashHelpers and ObjectExtensions altogether.

There are some Queries that have protected access to their clauses list (which originally was IList<T>, but has been changed to ValueList<T> or EquatableList<T>). Maybe they should all be reverted back to using IList<T> and the Query itself can call Arrays.GetHashCode() and Arrays.Equals() static methods to compare internally, without worrying about whether the list (that might just be any old list) will compare its contents. The only possible exceptions are where the list is used as a key for a dictionary, but there aren't many of those. WDYT?

Shad


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 12:37 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'd love to give you hope for Brzozowski. The only thing I can give you at this time is the probable correction of a possible bug.
There's a method in AutomatonTestUtil:

	public static void DeterminizeSimple(Automaton a, ISet<State> initialset)

The second argument of that method is later used as key in a dictionary, like so:

	sets[initialset] = initialset;

This implies that the argument must be a ValueHashSet<State>. 
Now look at this method in AutomatonTestUtil.MinimizeSimple: the calls look like this:

            DeterminizeSimple(a, SpecialOperations.Reverse(a));

This means that SpecialOperations.Reverse(a) must return a ValueHashSet<State>. But it doesn't: It returns the accepted state as:

            HashSet<State> accept = new HashSet<State>();

I think this needs to be changed in:

            ValueHashSet<State> accept = new ValueHashSet<State>();

This may open another avenue of inquiry. The error syndrome I get from the failed Brzozowski algorithm are automatons with duplicate states. This may hint at another hash set used as a key but not created as ValueHashSet

But in any case, I'm quite convinced that SpecialOperations.Reverse should be amended as described.

The investigation continues.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 2:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

That's great news!

Looks like it is passing always now. But, for some reason in .NET core it is taking MUCH longer to finish. I am getting about 6 seconds in .NET Framework, and over 1 minute in .NET Core. Maybe using a stopwatch is the right solution to stabilize this behavior?

To run the tests on .NET Core, open the Lucene.Net.Portable.sln. You may need to run "dotnet restore" from the command line at the root of the repository in order to get it to compile (sometimes you have to close and reopen Visual Studio to get the command to take).

I have taken a stab at that IndicNormalizer. It was failing when trying to get a character out of the BitArray that it previously put in there. But it was designed to work with a Java BitSet, not a .NET BitArray. Perhaps there is some difference in the way it works that is causing this (like null character not being stored, or something silly like that). It's a shot in the dark, but since I cannot get the test to fail under controlled conditions, I have replaced it with the OpenBitSet (which is basically a Java BitSet with access to its underlying storage). At the very least, it won't hurt.

I'll also take a closer look at the random "file not closed" failures coming from TestSearcherManager_Mem(). I think you fixed the underlying cause for the main failure. But this is a sign that there is an unexpected exception being thrown that triggers Dispose() too early. Perhaps there is still a broken Codec that is causing this failure, which would explain its randomness.

Is there any hope for Brzozowski? I'll make a compromise - if you can solve that one, I will pretend we are on version 4.9 for TestEarlyTerminationDifferentSorter() so we can put it to bed - it's probably not worth the effort anyway (I have already spent several days chasing after that one).

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Friday, March 24, 2017 4:40 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I have a theory about TestCRTReopen: if you look at the java code (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/search/ControlledRealTimeReopenThread.java), you see there's a relation between the reentrant lock and its condition variable:

ReentrantLock reopenLock = new ReentrantLock(); Condition reopenCond = reopenLock.newCondition();

Maybe there's some subtlety in there that we miss. The lock is used only as a guard around the reopen condition, which maybe is how they rule in the Java Shire, but no such concepts exist as such in C#. 
The closest thing to a "real" ReentrantLock implementation I have ever seen in .NET (complete with condition variables, fair locking, and so on) is https://github.com/spring-projects/spring-net-threading/blob/master/src/Spring/Spring.Threading/Threading/Locks/ReentrantLock.cs

But that's a gorilla. All we really want is a banana, without the gorilla attached to it.

So that got me thinking: we know what ControlledRealTImeReopenThread does. Why don't we implement it in "pure" C# instead of trying to translate it from Java using synchronization primitives that are almost but not quite totally unlike those in .NET? 

You can find the result in the file attached.  I restrained myself and didn't replace Environment.TickCount with Stopwatch (which would be more correct and guard against overflows that occur in TickCount every 24.9 days).

In a fit of altruism and insight, I even let all the related unit tests run, and they all pass.  And finish in time!

But that's in my alternate universe, of course <g>.


Vincent


-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 3:49 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

FYI - TestSearcherManager_Mem() succeeds much more frequently, but still randomly fails.

Also, although I was able to make the error message change for TestCRTReopen(), I didn't manage to get it working. The problem is pretty obvious - the WaitForGeneration() method (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Search/TestControlledRealTimeReopenThread.cs#L680) is WAY too slow. Even if I increase the wait period from 20 to 60 seconds it still doesn't finish in time. I played with a few of the variables in ControlledRealTimeReopenThread, but couldn't get the behavior to change. I verified that PulseAll() gets called, but it doesn't seem like it is having any effect on the Wait().

For TestEarlyTerminationDifferentSorter(), I reviewed all of the code under test in the Misc project line by line, but there was nothing significant to fix. So, still broken (sometimes).

Thanks,
Shad Storhaug (NightOwl888)





-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, March 23, 2017 8:25 PM
To: Van Den Berghe, Vincent; dev@lucenenet.apache.org
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Parallel universe or not, I think you are making progress. I found a similar IncrementAndGet issue in the ThreadedIndexingAndSearchingTestCase that I have already corrected. However, it only mattered in one case, in all other cases the result of IncrementAndGet was not being utilized.

It is like someone intentionally wanted to make it impossible to get all of the bugs out of this code...

Anyway, stupid is as stupid does...I went through and scanned the entire codebase for IncrementAndGet and compared it against Lucene. Sure enough, the Core.Util.RefCount class was refactored from its original. I changed it back to the original code (backed by an AtomicInteger/AtomicInt32), and the TestCRTReopen() test no longer fails almost immediately - after a couple of minutes it now fails with the message "waited too long for commit generation". I don't know if I fixed it or broke it more, but it is definitely behaving differently now.

Now, let me see if I can bring your other changes into my universe...perhaps the new failure has something to do with the reset event.

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, March 23, 2017 7:04 PM
To: dev@lucenenet.apache.org
Cc: Shad Storhaug
Subject: TestSearcherManager_Mem

Even though I seem to live in a parallel universe where 42 isn't 42 and 4.8 isn't 4.8, I'll have a stab at resolving TestSearcherManager_Mem.

First, there is a method in TrackingIndexWriter:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.IncrementAndGet();
        }


The implementation calls the wrong indexGen method: it should call GetAndIncrement(), which doesn't exist in the .NET version. You can add the method to the AtomicLong class.
Too bad there's no Interlocked.PostIncrement, but it's easy enough:

              public long GetAndIncrement()
              {
                     return Interlocked.Increment(ref value) - 1;
              }

And adjust the call accordingly:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.GetAndIncrement();
        }


Next we turn our attention to ControlledRealTimeReopenThread<T>.
There's an event defined as follows:

        private ManualResetEvent reopenCond = new ManualResetEvent(false);

This is not correct, since the remainder of the implementation only Sets and Waits, but never resets. Once the event Set, the wait will never ... uh... wait the second time around. Change this as follows:

        private AutoResetEvent reopenCond = new AutoResetEvent(false);

Next, for some reason, time is counted in nanoseconds, but since Environment.TickCount is in milliseconds, we need to convert it by multiplying by 1000000.
Unfortunately, this is done by multiplication:

Environment.TickCount * 1000000

Since Environment.TickCount is an int and 1000000 is an int, the result is negative unless you just rebooted your computer in a Tardis doing a polka backwards.
Define:

                private const long MS_IN_NS = 1000000;

... and change all other references to 1000000 except one (see below) with MS_IN_NS: this should solve the overflow problem using C#'s promotion rules.
Next, notice that 64-bit integers are sometimes read outside locks:

                searchingGen = refreshStartGen;
                if (targetGen > searchingGen)
                    while (targetGen > searchingGen)


This isn't guaranteed to be atomic, and I'm a curmudgeon when it comes to parallelism and atomicity. Change all these lines by:

                Interlocked.Exchange(ref searchingGen, refreshStartGen);
                if (targetGen > Interlocked.Read(ref searchingGen))
                    while (targetGen > Interlocked.Read(ref searchingGen))


In my own spacetime continuum, the test now passes.

Added bonus points: dispose of the waitable event in ControlledRealTimeReopenThread<T> Dispose method by adding:

                                  reopenCond.Dispose();

(after the Monitor.PulseAll(this); statement)

Extreme added bonus points: the following statement is incorrect, but it works anyway:

                  reopenCond.WaitOne(new TimeSpan(sleepNS / 1000000));//Convert NS to Ticks

(the 1000000 should not be replaced by MS_IN_NS in the new version) The reason why it's incorrect is because the argument to TimeSpan in the call accepts 100-nanoseconds units, and dividing nanoseconds by 1000000 yields milliseconds instead. So you will wait a delay of a factor 10000 shorter. It turns out that the correction (divide by 100) will cause timeouts in the tests, so I left it as-is.


But all of this might be wrong, I may not even exists.


Vincent

RE: TestSearcherManager_Mem

Posted by Shad Storhaug <sh...@shadstorhaug.com>.

Vincent,

Fixed! Thanks for letting me know - there is a sliced bread shortage and this could have started a global crisis.

Hmm... but shouldn't we be calling base.Dispose() (which flushes) before file.Flush()? Actually, that would be the same order as your original solution.

I was looking into that FIPS compliant issue that was reported a few days ago (http://apache.markmail.org/search/?q=lucenenet+RE%3A+bug+fixed+in+lucene.Net+4.8+%3F#query:lucenenet%20RE%3A%20bug%20fixed%20in%20lucene.Net%204.8%20%3F+page:1+mid:qjmk3hzbg6zad73s+state:results). I was curious why there is a Cryptography class in Lucene.Net 3.0.3 and it turns out they are using a cryptographic file hash to generate a LockId. There are also some other ideas that were a bit different from our current FSDirectory implementation (but I can't say how successful they were). Have a look, perhaps there is something here we can use: https://svn.apache.org/repos/asf/lucene.net/tags/Lucene.Net_3_0_3_RC2_final. Or, if not at least we can point fingers and laugh.


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, July 18, 2017 3:44 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

I'll have a look at it.
Meanwhile, there's a serious performance problem introduced with the recently changed method:

// LUCENENET specific - file.Flush(flushToDisk: true) required in .NET for concurrency // Part of a solution suggested by Vincent Van Den Berghe: http://apache.markmail.org/message/hafnuhq2ydhfjmi2
public override void Flush()
{
	base.Flush();
	file.Flush(flushToDisk: true);
}

This ends up flushing evert 1024 bytes, which results in horrible performance. Really horrible.  So horrible it makes me want to change my name and buy sliced bread.
I've removed the horrible thing and moved the Flush(true) in the Dispose method:

			protected override void Dispose(bool disposing)
			{
				if (disposing)
				{
					parent.OnIndexOutputClosed(this);
					// only close the file if it has not been closed yet
					if (isOpen)
					{
						IOException priorE = null;
						try
						{
							file.Flush(flushToDisk: true);
							base.Dispose(disposing);
						}
						catch (IOException ioe)
						{
							priorE = ioe;
						}
						finally
						{
							isOpen = false;
							IOUtils.DisposeWhileHandlingException(priorE, file);
						}
					}
				}
			}

This restores the performance, and the tests keep running correctly. 

I shall have a look at TestSearcherManager_Mem, but I haven't said the last word on LockServer/StressTest either.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Monday, July 17, 2017 12:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I noticed that TestSearcherManager_Mem is failing randomly again and thought it had something to do with the MMapDiretory fix. But I checked out the commit before that was implemented, and it still randomly fails. In fact, swapping out different directory implementations has no effect - it randomly fails on all of them.

I thought we had this one nailed down before, but going through this email thread it looks like this failing test was never fully fixed, just improved. When you have a chance could you take another look, kind sir?


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Sunday, March 26, 2017 6:59 AM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Update

I located and fixed the TestCData() problem already - it turned out to be a bug with the test framework replaceFirst() method matching on a literal instead of a Regex. I also discovered that the implementation of HTMLStripCharFilter was ported from 4.8.1 (which had improvements from 4.8.0).

I am also hot on the trail of TestRegexps(). I have narrowed it down to the constructor of the RegExp class (the Automaton one). So far discovered one method not accounting for surrogate pairs and confirmed fix worked with a specific case, but the test still has other issues (also in the constructor of the RegExp class). Still investigating...

Shad

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Saturday, March 25, 2017 1:59 PM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I took a look at TestSearcherManager_Mem(), and the problem is not related to any one codec or any one directory. My previous theory that it was an issue with Dispose() not being called in the correct order was also incorrect. The files are being created, but not being deleted (confirmed on disk). They are randomly getting an "in use by another process" error, which makes me suspect this is concurrency related (although we really can't rule out the test framework itself). You want to have another look?

I am also starting to see a recurrence of a very old random problem - very rarely, the Lucene.Net.Tests.Search.TestDocTermOrdsRewriteMethod.TestRegexps() test will go into a state where it doesn't finish. I have it capped at 1 minute, but since under normal conditions it only takes a couple of seconds to finish, I suspect it is still misbehaving when it times out (and what do you know, it is also based on Automaton).

Yet another dinosaur that is coming back is Lucene.Net.Analysis.CharFilters.HTMLStripCharFilterTest.TestCData() - fails very rarely, but if you put the Repeat(100) attribute on the test, it will fail constantly. There were quite a few bugs in there because of the difference between Java and .NET on the .Read() method. In Java, all overloads consistently return -1 to show nothing was read. In .NET, the overload with no parameters returns -1 and the one with parameters returns 0 if no characters are read. Might not be the problem, but it is definitely a suspect.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Saturday, March 25, 2017 2:34 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'll have to think about what you write more, but I'm a big fan of consolidation. I'll have to familiarize myself with all the code you're mentioning before making a more coherent statement, so let me give you my uninformed ".NET view" on the matter for now.

First, too much consolidation is not always good.  For example, in ObjectExtensions I see this:

        public static bool ValueEquals<T>(this T a, T b)
        {
            if (a is IEnumerable && b is IEnumerable)
            {
                var iter = (b as IEnumerable).GetEnumerator();
                foreach (object value in a as IEnumerable)
                {
                    iter.MoveNext();
                    if (!object.Equals(value, iter.Current))
                    {
                        return false;
                    }
                }
                return true;
            }

            return a.Equals(b);
        }

This makes red flashing lights and claxons go off in my head: no, you cannot enumerate collections for equality testing. And it's not because the enumerator isn't properly disposed of (another pet peeve of mine). You can't just compare two enumerable collections, since the first one may care about the order (e.g. lists), but the other may not (e.g. sets). What exactly constitutes equality is really whatever makes the code correct.  Remember the curse of Brzozowski!

But for those things that can be consolidated, I agree with "Equatable" as being a better name than "Value" for collections that can be compared. There are two ways to proceed:
- implement EquatableHashSet<T>, EquatableList<T>, and so on, explicitly implementing IEquatable<T> with T being the name of the collection.
- implement IEqualityComparer<T> for evert collection T. Conceptually, this is the cleaner solution (decoupling equality testing from the object implementation is a cleaner solution), but of course this means that every comparison, dictionary or set needs to be created with that comparer as parameter. This opens up a whole new class  of bugs: forget one, and you're hosed.

I don't really know the best way to proceed at this time. Maybe someone else can chime in?

Ever since I looked at the Lucene.net source code, I've been wondering why more time wasn't invested in making the port more ".NET like". Now that I've experienced the joy of tracking down bugs and making tests work, I think I understand why <g>.

You're a very courageous person. 

Vincent




-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 7:18 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

Good find.

Actually, I am thinking that some consolidation is in order:

GETHASHCODE

For comparing collections with GetHashCode() we have:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.GetHashCode()
Support.HashHelpers.GetValueHashode()

These are all doing the same thing (getting a hash code based on a collection), and they are not all doing it the same way.


EQUALS

Similarly, we have several places that are using Equals() to compare collections:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.Equals()
Support.ObjectExtensions.ValueEquals()

Plus in many places we are using:

ISet<T>.SetEquals()
IEnumerable<T>.SequenceEquals() (LINQ)


My thinking is that we should use one of static helper classes (probably Arrays) to make a single implementation of each of these, and then override Equals, GetHashCode, and for that matter ToString() in all of our Support collections and call our helper class in every place.

Not to mention, there are still many Queries (and other such things) that someone decided to "optimize" by making the implementation consider only the first and last item in the set (which is a good recipe for mysterious bugs that are nearly impossible to find as far as I am concerned).

But this is going to take some grace - there are some implementations of GetHashCode and Equals out there that require a very specific algorithm in order to function correctly.

We should probably just merge Support.EquatableList<T> with Support.ValueList<T>, (equatable sounds like a more clear name) and rename ValueHashSet to match (or just call it HashSet<T>, or would that be confusing?). Perhaps there should also be an overload that accepts IList<T> and ISet<T> in the constructor so we can just wrap existing instances easily, without having to worry about whether it inherits from our support classes.

Plus all of our support collections should be able to compare their values, just like in Java - many are missing the overloads. Also we can probably just factor out HashHelpers and ObjectExtensions altogether.

There are some Queries that have protected access to their clauses list (which originally was IList<T>, but has been changed to ValueList<T> or EquatableList<T>). Maybe they should all be reverted back to using IList<T> and the Query itself can call Arrays.GetHashCode() and Arrays.Equals() static methods to compare internally, without worrying about whether the list (that might just be any old list) will compare its contents. The only possible exceptions are where the list is used as a key for a dictionary, but there aren't many of those. WDYT?

Shad


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Saturday, March 25, 2017 12:37 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'd love to give you hope for Brzozowski. The only thing I can give you at this time is the probable correction of a possible bug.
There's a method in AutomatonTestUtil:

	public static void DeterminizeSimple(Automaton a, ISet<State> initialset)

The second argument of that method is later used as key in a dictionary, like so:

	sets[initialset] = initialset;

This implies that the argument must be a ValueHashSet<State>. 
Now look at this method in AutomatonTestUtil.MinimizeSimple: the calls look like this:

            DeterminizeSimple(a, SpecialOperations.Reverse(a));

This means that SpecialOperations.Reverse(a) must return a ValueHashSet<State>. But it doesn't: It returns the accepted state as:

            HashSet<State> accept = new HashSet<State>();

I think this needs to be changed in:

            ValueHashSet<State> accept = new ValueHashSet<State>();

This may open another avenue of inquiry. The error syndrome I get from the failed Brzozowski algorithm are automatons with duplicate states. This may hint at another hash set used as a key but not created as ValueHashSet

But in any case, I'm quite convinced that SpecialOperations.Reverse should be amended as described.

The investigation continues.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 2:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

That's great news!

Looks like it is passing always now. But, for some reason in .NET core it is taking MUCH longer to finish. I am getting about 6 seconds in .NET Framework, and over 1 minute in .NET Core. Maybe using a stopwatch is the right solution to stabilize this behavior?

To run the tests on .NET Core, open the Lucene.Net.Portable.sln. You may need to run "dotnet restore" from the command line at the root of the repository in order to get it to compile (sometimes you have to close and reopen Visual Studio to get the command to take).

I have taken a stab at that IndicNormalizer. It was failing when trying to get a character out of the BitArray that it previously put in there. But it was designed to work with a Java BitSet, not a .NET BitArray. Perhaps there is some difference in the way it works that is causing this (like null character not being stored, or something silly like that). It's a shot in the dark, but since I cannot get the test to fail under controlled conditions, I have replaced it with the OpenBitSet (which is basically a Java BitSet with access to its underlying storage). At the very least, it won't hurt.

I'll also take a closer look at the random "file not closed" failures coming from TestSearcherManager_Mem(). I think you fixed the underlying cause for the main failure. But this is a sign that there is an unexpected exception being thrown that triggers Dispose() too early. Perhaps there is still a broken Codec that is causing this failure, which would explain its randomness.

Is there any hope for Brzozowski? I'll make a compromise - if you can solve that one, I will pretend we are on version 4.9 for TestEarlyTerminationDifferentSorter() so we can put it to bed - it's probably not worth the effort anyway (I have already spent several days chasing after that one).

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Friday, March 24, 2017 4:40 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I have a theory about TestCRTReopen: if you look at the java code (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/search/ControlledRealTimeReopenThread.java), you see there's a relation between the reentrant lock and its condition variable:

ReentrantLock reopenLock = new ReentrantLock(); Condition reopenCond = reopenLock.newCondition();

Maybe there's some subtlety in there that we miss. The lock is used only as a guard around the reopen condition, which maybe is how they rule in the Java Shire, but no such concepts exist as such in C#. 
The closest thing to a "real" ReentrantLock implementation I have ever seen in .NET (complete with condition variables, fair locking, and so on) is https://github.com/spring-projects/spring-net-threading/blob/master/src/Spring/Spring.Threading/Threading/Locks/ReentrantLock.cs

But that's a gorilla. All we really want is a banana, without the gorilla attached to it.

So that got me thinking: we know what ControlledRealTImeReopenThread does. Why don't we implement it in "pure" C# instead of trying to translate it from Java using synchronization primitives that are almost but not quite totally unlike those in .NET? 

You can find the result in the file attached.  I restrained myself and didn't replace Environment.TickCount with Stopwatch (which would be more correct and guard against overflows that occur in TickCount every 24.9 days).

In a fit of altruism and insight, I even let all the related unit tests run, and they all pass.  And finish in time!

But that's in my alternate universe, of course <g>.


Vincent


-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 3:49 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

FYI - TestSearcherManager_Mem() succeeds much more frequently, but still randomly fails.

Also, although I was able to make the error message change for TestCRTReopen(), I didn't manage to get it working. The problem is pretty obvious - the WaitForGeneration() method (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Search/TestControlledRealTimeReopenThread.cs#L680) is WAY too slow. Even if I increase the wait period from 20 to 60 seconds it still doesn't finish in time. I played with a few of the variables in ControlledRealTimeReopenThread, but couldn't get the behavior to change. I verified that PulseAll() gets called, but it doesn't seem like it is having any effect on the Wait().

For TestEarlyTerminationDifferentSorter(), I reviewed all of the code under test in the Misc project line by line, but there was nothing significant to fix. So, still broken (sometimes).

Thanks,
Shad Storhaug (NightOwl888)





-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, March 23, 2017 8:25 PM
To: Van Den Berghe, Vincent; dev@lucenenet.apache.org
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Parallel universe or not, I think you are making progress. I found a similar IncrementAndGet issue in the ThreadedIndexingAndSearchingTestCase that I have already corrected. However, it only mattered in one case, in all other cases the result of IncrementAndGet was not being utilized.

It is like someone intentionally wanted to make it impossible to get all of the bugs out of this code...

Anyway, stupid is as stupid does...I went through and scanned the entire codebase for IncrementAndGet and compared it against Lucene. Sure enough, the Core.Util.RefCount class was refactored from its original. I changed it back to the original code (backed by an AtomicInteger/AtomicInt32), and the TestCRTReopen() test no longer fails almost immediately - after a couple of minutes it now fails with the message "waited too long for commit generation". I don't know if I fixed it or broke it more, but it is definitely behaving differently now.

Now, let me see if I can bring your other changes into my universe...perhaps the new failure has something to do with the reset event.

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, March 23, 2017 7:04 PM
To: dev@lucenenet.apache.org
Cc: Shad Storhaug
Subject: TestSearcherManager_Mem

Even though I seem to live in a parallel universe where 42 isn't 42 and 4.8 isn't 4.8, I'll have a stab at resolving TestSearcherManager_Mem.

First, there is a method in TrackingIndexWriter:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.IncrementAndGet();
        }


The implementation calls the wrong indexGen method: it should call GetAndIncrement(), which doesn't exist in the .NET version. You can add the method to the AtomicLong class.
Too bad there's no Interlocked.PostIncrement, but it's easy enough:

              public long GetAndIncrement()
              {
                     return Interlocked.Increment(ref value) - 1;
              }

And adjust the call accordingly:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.GetAndIncrement();
        }


Next we turn our attention to ControlledRealTimeReopenThread<T>.
There's an event defined as follows:

        private ManualResetEvent reopenCond = new ManualResetEvent(false);

This is not correct, since the remainder of the implementation only Sets and Waits, but never resets. Once the event Set, the wait will never ... uh... wait the second time around. Change this as follows:

        private AutoResetEvent reopenCond = new AutoResetEvent(false);

Next, for some reason, time is counted in nanoseconds, but since Environment.TickCount is in milliseconds, we need to convert it by multiplying by 1000000.
Unfortunately, this is done by multiplication:

Environment.TickCount * 1000000

Since Environment.TickCount is an int and 1000000 is an int, the result is negative unless you just rebooted your computer in a Tardis doing a polka backwards.
Define:

                private const long MS_IN_NS = 1000000;

... and change all other references to 1000000 except one (see below) with MS_IN_NS: this should solve the overflow problem using C#'s promotion rules.
Next, notice that 64-bit integers are sometimes read outside locks:

                searchingGen = refreshStartGen;
                if (targetGen > searchingGen)
                    while (targetGen > searchingGen)


This isn't guaranteed to be atomic, and I'm a curmudgeon when it comes to parallelism and atomicity. Change all these lines by:

                Interlocked.Exchange(ref searchingGen, refreshStartGen);
                if (targetGen > Interlocked.Read(ref searchingGen))
                    while (targetGen > Interlocked.Read(ref searchingGen))


In my own spacetime continuum, the test now passes.

Added bonus points: dispose of the waitable event in ControlledRealTimeReopenThread<T> Dispose method by adding:

                                  reopenCond.Dispose();

(after the Monitor.PulseAll(this); statement)

Extreme added bonus points: the following statement is incorrect, but it works anyway:

                  reopenCond.WaitOne(new TimeSpan(sleepNS / 1000000));//Convert NS to Ticks

(the 1000000 should not be replaced by MS_IN_NS in the new version) The reason why it's incorrect is because the argument to TimeSpan in the call accepts 100-nanoseconds units, and dividing nanoseconds by 1000000 yields milliseconds instead. So you will wait a delay of a factor 10000 shorter. It turns out that the correction (divide by 100) will cause timeouts in the tests, so I left it as-is.


But all of this might be wrong, I may not even exists.


Vincent

RE: TestSearcherManager_Mem

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.

I'll have a look at it.
Meanwhile, there's a serious performance problem introduced with the recently changed method:

// LUCENENET specific - file.Flush(flushToDisk: true) required in .NET for concurrency
// Part of a solution suggested by Vincent Van Den Berghe: http://apache.markmail.org/message/hafnuhq2ydhfjmi2
public override void Flush()
{
	base.Flush();
	file.Flush(flushToDisk: true);
}

This ends up flushing evert 1024 bytes, which results in horrible performance. Really horrible.  So horrible it makes me want to change my name and buy sliced bread.
I've removed the horrible thing and moved the Flush(true) in the Dispose method:

			protected override void Dispose(bool disposing)
			{
				if (disposing)
				{
					parent.OnIndexOutputClosed(this);
					// only close the file if it has not been closed yet
					if (isOpen)
					{
						IOException priorE = null;
						try
						{
							file.Flush(flushToDisk: true);
							base.Dispose(disposing);
						}
						catch (IOException ioe)
						{
							priorE = ioe;
						}
						finally
						{
							isOpen = false;
							IOUtils.DisposeWhileHandlingException(priorE, file);
						}
					}
				}
			}

This restores the performance, and the tests keep running correctly. 

I shall have a look at TestSearcherManager_Mem, but I haven't said the last word on LockServer/StressTest either.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, July 17, 2017 12:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I noticed that TestSearcherManager_Mem is failing randomly again and thought it had something to do with the MMapDiretory fix. But I checked out the commit before that was implemented, and it still randomly fails. In fact, swapping out different directory implementations has no effect - it randomly fails on all of them.

I thought we had this one nailed down before, but going through this email thread it looks like this failing test was never fully fixed, just improved. When you have a chance could you take another look, kind sir?


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Sunday, March 26, 2017 6:59 AM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Update

I located and fixed the TestCData() problem already - it turned out to be a bug with the test framework replaceFirst() method matching on a literal instead of a Regex. I also discovered that the implementation of HTMLStripCharFilter was ported from 4.8.1 (which had improvements from 4.8.0).

I am also hot on the trail of TestRegexps(). I have narrowed it down to the constructor of the RegExp class (the Automaton one). So far discovered one method not accounting for surrogate pairs and confirmed fix worked with a specific case, but the test still has other issues (also in the constructor of the RegExp class). Still investigating...

Shad

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Saturday, March 25, 2017 1:59 PM
To: Van Den Berghe, Vincent
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

I took a look at TestSearcherManager_Mem(), and the problem is not related to any one codec or any one directory. My previous theory that it was an issue with Dispose() not being called in the correct order was also incorrect. The files are being created, but not being deleted (confirmed on disk). They are randomly getting an "in use by another process" error, which makes me suspect this is concurrency related (although we really can't rule out the test framework itself). You want to have another look?

I am also starting to see a recurrence of a very old random problem - very rarely, the Lucene.Net.Tests.Search.TestDocTermOrdsRewriteMethod.TestRegexps() test will go into a state where it doesn't finish. I have it capped at 1 minute, but since under normal conditions it only takes a couple of seconds to finish, I suspect it is still misbehaving when it times out (and what do you know, it is also based on Automaton).

Yet another dinosaur that is coming back is Lucene.Net.Analysis.CharFilters.HTMLStripCharFilterTest.TestCData() - fails very rarely, but if you put the Repeat(100) attribute on the test, it will fail constantly. There were quite a few bugs in there because of the difference between Java and .NET on the .Read() method. In Java, all overloads consistently return -1 to show nothing was read. In .NET, the overload with no parameters returns -1 and the one with parameters returns 0 if no characters are read. Might not be the problem, but it is definitely a suspect.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 2:34 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'll have to think about what you write more, but I'm a big fan of consolidation. I'll have to familiarize myself with all the code you're mentioning before making a more coherent statement, so let me give you my uninformed ".NET view" on the matter for now.

First, too much consolidation is not always good.  For example, in ObjectExtensions I see this:

        public static bool ValueEquals<T>(this T a, T b)
        {
            if (a is IEnumerable && b is IEnumerable)
            {
                var iter = (b as IEnumerable).GetEnumerator();
                foreach (object value in a as IEnumerable)
                {
                    iter.MoveNext();
                    if (!object.Equals(value, iter.Current))
                    {
                        return false;
                    }
                }
                return true;
            }

            return a.Equals(b);
        }

This makes red flashing lights and claxons go off in my head: no, you cannot enumerate collections for equality testing. And it's not because the enumerator isn't properly disposed of (another pet peeve of mine). You can't just compare two enumerable collections, since the first one may care about the order (e.g. lists), but the other may not (e.g. sets). What exactly constitutes equality is really whatever makes the code correct.  Remember the curse of Brzozowski!

But for those things that can be consolidated, I agree with "Equatable" as being a better name than "Value" for collections that can be compared. There are two ways to proceed:
- implement EquatableHashSet<T>, EquatableList<T>, and so on, explicitly implementing IEquatable<T> with T being the name of the collection.
- implement IEqualityComparer<T> for evert collection T. Conceptually, this is the cleaner solution (decoupling equality testing from the object implementation is a cleaner solution), but of course this means that every comparison, dictionary or set needs to be created with that comparer as parameter. This opens up a whole new class  of bugs: forget one, and you're hosed.

I don't really know the best way to proceed at this time. Maybe someone else can chime in?

Ever since I looked at the Lucene.net source code, I've been wondering why more time wasn't invested in making the port more ".NET like". Now that I've experienced the joy of tracking down bugs and making tests work, I think I understand why <g>.

You're a very courageous person. 

Vincent




-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 7:18 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

Good find.

Actually, I am thinking that some consolidation is in order:

GETHASHCODE

For comparing collections with GetHashCode() we have:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.GetHashCode()
Support.HashHelpers.GetValueHashode()

These are all doing the same thing (getting a hash code based on a collection), and they are not all doing it the same way.


EQUALS

Similarly, we have several places that are using Equals() to compare collections:

Support.EquatableList<T>
Support.HashMap<T>
Support.ValueHashSet<T>
Support.ValueList<T>

Plus:

Support.Arrays.Equals()
Support.ObjectExtensions.ValueEquals()

Plus in many places we are using:

ISet<T>.SetEquals()
IEnumerable<T>.SequenceEquals() (LINQ)


My thinking is that we should use one of static helper classes (probably Arrays) to make a single implementation of each of these, and then override Equals, GetHashCode, and for that matter ToString() in all of our Support collections and call our helper class in every place.

Not to mention, there are still many Queries (and other such things) that someone decided to "optimize" by making the implementation consider only the first and last item in the set (which is a good recipe for mysterious bugs that are nearly impossible to find as far as I am concerned).

But this is going to take some grace - there are some implementations of GetHashCode and Equals out there that require a very specific algorithm in order to function correctly.

We should probably just merge Support.EquatableList<T> with Support.ValueList<T>, (equatable sounds like a more clear name) and rename ValueHashSet to match (or just call it HashSet<T>, or would that be confusing?). Perhaps there should also be an overload that accepts IList<T> and ISet<T> in the constructor so we can just wrap existing instances easily, without having to worry about whether it inherits from our support classes.

Plus all of our support collections should be able to compare their values, just like in Java - many are missing the overloads. Also we can probably just factor out HashHelpers and ObjectExtensions altogether.

There are some Queries that have protected access to their clauses list (which originally was IList<T>, but has been changed to ValueList<T> or EquatableList<T>). Maybe they should all be reverted back to using IList<T> and the Query itself can call Arrays.GetHashCode() and Arrays.Equals() static methods to compare internally, without worrying about whether the list (that might just be any old list) will compare its contents. The only possible exceptions are where the list is used as a key for a dictionary, but there aren't many of those. WDYT?

Shad


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Saturday, March 25, 2017 12:37 AM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I'd love to give you hope for Brzozowski. The only thing I can give you at this time is the probable correction of a possible bug.
There's a method in AutomatonTestUtil:

	public static void DeterminizeSimple(Automaton a, ISet<State> initialset)

The second argument of that method is later used as key in a dictionary, like so:

	sets[initialset] = initialset;

This implies that the argument must be a ValueHashSet<State>. 
Now look at this method in AutomatonTestUtil.MinimizeSimple: the calls look like this:

            DeterminizeSimple(a, SpecialOperations.Reverse(a));

This means that SpecialOperations.Reverse(a) must return a ValueHashSet<State>. But it doesn't: It returns the accepted state as:

            HashSet<State> accept = new HashSet<State>();

I think this needs to be changed in:

            ValueHashSet<State> accept = new ValueHashSet<State>();

This may open another avenue of inquiry. The error syndrome I get from the failed Brzozowski algorithm are automatons with duplicate states. This may hint at another hash set used as a key but not created as ValueHashSet

But in any case, I'm quite convinced that SpecialOperations.Reverse should be amended as described.

The investigation continues.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 2:04 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

That's great news!

Looks like it is passing always now. But, for some reason in .NET core it is taking MUCH longer to finish. I am getting about 6 seconds in .NET Framework, and over 1 minute in .NET Core. Maybe using a stopwatch is the right solution to stabilize this behavior?

To run the tests on .NET Core, open the Lucene.Net.Portable.sln. You may need to run "dotnet restore" from the command line at the root of the repository in order to get it to compile (sometimes you have to close and reopen Visual Studio to get the command to take).

I have taken a stab at that IndicNormalizer. It was failing when trying to get a character out of the BitArray that it previously put in there. But it was designed to work with a Java BitSet, not a .NET BitArray. Perhaps there is some difference in the way it works that is causing this (like null character not being stored, or something silly like that). It's a shot in the dark, but since I cannot get the test to fail under controlled conditions, I have replaced it with the OpenBitSet (which is basically a Java BitSet with access to its underlying storage). At the very least, it won't hurt.

I'll also take a closer look at the random "file not closed" failures coming from TestSearcherManager_Mem(). I think you fixed the underlying cause for the main failure. But this is a sign that there is an unexpected exception being thrown that triggers Dispose() too early. Perhaps there is still a broken Codec that is causing this failure, which would explain its randomness.

Is there any hope for Brzozowski? I'll make a compromise - if you can solve that one, I will pretend we are on version 4.9 for TestEarlyTerminationDifferentSorter() so we can put it to bed - it's probably not worth the effort anyway (I have already spent several days chasing after that one).

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Friday, March 24, 2017 4:40 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Hello Shad,

I have a theory about TestCRTReopen: if you look at the java code (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/search/ControlledRealTimeReopenThread.java), you see there's a relation between the reentrant lock and its condition variable:

ReentrantLock reopenLock = new ReentrantLock(); Condition reopenCond = reopenLock.newCondition();

Maybe there's some subtlety in there that we miss. The lock is used only as a guard around the reopen condition, which maybe is how they rule in the Java Shire, but no such concepts exist as such in C#. 
The closest thing to a "real" ReentrantLock implementation I have ever seen in .NET (complete with condition variables, fair locking, and so on) is https://github.com/spring-projects/spring-net-threading/blob/master/src/Spring/Spring.Threading/Threading/Locks/ReentrantLock.cs

But that's a gorilla. All we really want is a banana, without the gorilla attached to it.

So that got me thinking: we know what ControlledRealTImeReopenThread does. Why don't we implement it in "pure" C# instead of trying to translate it from Java using synchronization primitives that are almost but not quite totally unlike those in .NET? 

You can find the result in the file attached.  I restrained myself and didn't replace Environment.TickCount with Stopwatch (which would be more correct and guard against overflows that occur in TickCount every 24.9 days).

In a fit of altruism and insight, I even let all the related unit tests run, and they all pass.  And finish in time!

But that's in my alternate universe, of course <g>.


Vincent


-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Friday, March 24, 2017 3:49 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Vincent,

FYI - TestSearcherManager_Mem() succeeds much more frequently, but still randomly fails.

Also, although I was able to make the error message change for TestCRTReopen(), I didn't manage to get it working. The problem is pretty obvious - the WaitForGeneration() method (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Search/TestControlledRealTimeReopenThread.cs#L680) is WAY too slow. Even if I increase the wait period from 20 to 60 seconds it still doesn't finish in time. I played with a few of the variables in ControlledRealTimeReopenThread, but couldn't get the behavior to change. I verified that PulseAll() gets called, but it doesn't seem like it is having any effect on the Wait().

For TestEarlyTerminationDifferentSorter(), I reviewed all of the code under test in the Misc project line by line, but there was nothing significant to fix. So, still broken (sometimes).

Thanks,
Shad Storhaug (NightOwl888)





-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, March 23, 2017 8:25 PM
To: Van Den Berghe, Vincent; dev@lucenenet.apache.org
Cc: dev@lucenenet.apache.org
Subject: RE: TestSearcherManager_Mem

Parallel universe or not, I think you are making progress. I found a similar IncrementAndGet issue in the ThreadedIndexingAndSearchingTestCase that I have already corrected. However, it only mattered in one case, in all other cases the result of IncrementAndGet was not being utilized.

It is like someone intentionally wanted to make it impossible to get all of the bugs out of this code...

Anyway, stupid is as stupid does...I went through and scanned the entire codebase for IncrementAndGet and compared it against Lucene. Sure enough, the Core.Util.RefCount class was refactored from its original. I changed it back to the original code (backed by an AtomicInteger/AtomicInt32), and the TestCRTReopen() test no longer fails almost immediately - after a couple of minutes it now fails with the message "waited too long for commit generation". I don't know if I fixed it or broke it more, but it is definitely behaving differently now.

Now, let me see if I can bring your other changes into my universe...perhaps the new failure has something to do with the reset event.

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, March 23, 2017 7:04 PM
To: dev@lucenenet.apache.org
Cc: Shad Storhaug
Subject: TestSearcherManager_Mem

Even though I seem to live in a parallel universe where 42 isn't 42 and 4.8 isn't 4.8, I'll have a stab at resolving TestSearcherManager_Mem.

First, there is a method in TrackingIndexWriter:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.IncrementAndGet();
        }


The implementation calls the wrong indexGen method: it should call GetAndIncrement(), which doesn't exist in the .NET version. You can add the method to the AtomicLong class.
Too bad there's no Interlocked.PostIncrement, but it's easy enough:

              public long GetAndIncrement()
              {
                     return Interlocked.Increment(ref value) - 1;
              }

And adjust the call accordingly:

        public virtual long GetAndIncrementGeneration()
        {
            return indexingGen.GetAndIncrement();
        }


Next we turn our attention to ControlledRealTimeReopenThread<T>.
There's an event defined as follows:

        private ManualResetEvent reopenCond = new ManualResetEvent(false);

This is not correct, since the remainder of the implementation only Sets and Waits, but never resets. Once the event Set, the wait will never ... uh... wait the second time around. Change this as follows:

        private AutoResetEvent reopenCond = new AutoResetEvent(false);

Next, for some reason, time is counted in nanoseconds, but since Environment.TickCount is in milliseconds, we need to convert it by multiplying by 1000000.
Unfortunately, this is done by multiplication:

Environment.TickCount * 1000000

Since Environment.TickCount is an int and 1000000 is an int, the result is negative unless you just rebooted your computer in a Tardis doing a polka backwards.
Define:

                private const long MS_IN_NS = 1000000;

... and change all other references to 1000000 except one (see below) with MS_IN_NS: this should solve the overflow problem using C#'s promotion rules.
Next, notice that 64-bit integers are sometimes read outside locks:

                searchingGen = refreshStartGen;
                if (targetGen > searchingGen)
                    while (targetGen > searchingGen)


This isn't guaranteed to be atomic, and I'm a curmudgeon when it comes to parallelism and atomicity. Change all these lines by:

                Interlocked.Exchange(ref searchingGen, refreshStartGen);
                if (targetGen > Interlocked.Read(ref searchingGen))
                    while (targetGen > Interlocked.Read(ref searchingGen))


In my own spacetime continuum, the test now passes.

Added bonus points: dispose of the waitable event in ControlledRealTimeReopenThread<T> Dispose method by adding:

                                  reopenCond.Dispose();

(after the Monitor.PulseAll(this); statement)

Extreme added bonus points: the following statement is incorrect, but it works anyway:

                  reopenCond.WaitOne(new TimeSpan(sleepNS / 1000000));//Convert NS to Ticks

(the 1000000 should not be replaced by MS_IN_NS in the new version) The reason why it's incorrect is because the argument to TimeSpan in the call accepts 100-nanoseconds units, and dividing nanoseconds by 1000000 yields milliseconds instead. So you will wait a delay of a factor 10000 shorter. It turns out that the correction (divide by 100) will cause timeouts in the tests, so I left it as-is.


But all of this might be wrong, I may not even exists.


Vincent