Posted to dev@lucenenet.apache.org by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com> on 2017/02/07 11:00:17 UTC

Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread starts a Clean() operation while another Clean() on the same collection is replacing the very object being enumerated, you get the exception. Always.
To make the failure deterministic rather than intermittent, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code always executes. You'll get the exception every time.
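The failure mode is the classic "structural modification during enumeration" error. A minimal sketch of the same situation, in Java as a stand-in for the C# WeakDictionary (.NET throws InvalidOperationException where Java throws ConcurrentModificationException):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class CleanRaceDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        boolean failed = false;
        try {
            // Simulates one Clean() enumerating while another mutates the same map:
            // any structural modification mid-iteration invalidates the iterator.
            for (String key : map.keySet()) {
                map.put("d", "4"); // stands in for the second, racing Clean()
            }
        } catch (ConcurrentModificationException e) {
            // The .NET analogue is InvalidOperationException:
            // "Collection was modified; enumeration operation may not execute."
            failed = true;
        }
        System.out.println(failed ? "enumeration failed" : "enumeration survived");
    }
}
```

One conventional remedy would be to guard Clean()/CleanIfNeeded() with a lock, or to snapshot the keys before enumerating; the mail deliberately leaves the actual correction out.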

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer object, it doesn't matter which one. It's meaningless, but it has the side effect of initializing the static fields without problems.
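The workaround forces the lazy attribute caches to be populated on a single thread before any workers start. The general principle — perform one-time initialization before going concurrent, or lean on the runtime's once-only class-initialization guarantee — can be sketched in Java (used here as a stand-in for the C# code; the class names are hypothetical, not Lucene types):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WarmupDemo {
    static final AtomicInteger constructions = new AtomicInteger();

    // Hypothetical expensive shared state (stands in for the attribute cache).
    static class Expensive {
        Expensive() { constructions.incrementAndGet(); }
    }

    // Holder idiom: the JVM runs a class initializer exactly once, under a lock,
    // no matter how many threads rush in simultaneously.
    static class Holder {
        static final Expensive INSTANCE = new Expensive();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < 8; i++) {
            pool.execute(() -> {
                try { start.await(); } catch (InterruptedException ignored) { }
                Expensive e = Holder.INSTANCE; // all threads race to initialize
            });
        }
        start.countDown(); // release all threads at once
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("constructions=" + constructions.get());
    }
}
```

The construction count is 1 regardless of how the threads interleave, which is the guarantee the buggy WeakDictionary path lacked.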


Vincent





RE: [Possibly spoofed] Re: Bug in Lucene static initialization with multiple threads.

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
I'm using 4.8.0.765-beta for the moment.

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar Syn-Hershko
Sent: Tuesday, February 07, 2017 6:25 PM
To: dev@lucenenet.apache.org
Subject: [Possibly spoofed] Re: Bug in Lucene static initialization with multiple threads.

Hi Vincent, what version of Lucene.NET are you using?

--

Itamar Syn-Hershko
Freelance Developer & Consultant
Elasticsearch Consulting Partner
Microsoft MVP | Lucene.NET PMC
http://code972.com | @synhershko <https://twitter.com/synhershko> http://BigDataBoutique.co.il/


Re: Bug in Lucene static initialization with multiple threads.

Posted by Itamar Syn-Hershko <it...@code972.com>.
Hi Vincent, what version of Lucene.NET are you using?

--

Itamar Syn-Hershko
Freelance Developer & Consultant
Elasticsearch Consulting Partner
Microsoft MVP | Lucene.NET PMC
http://code972.com | @synhershko <https://twitter.com/synhershko>
http://BigDataBoutique.co.il/


RE: Bug in Lucene static initialization with multiple threads.

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Vincent,

You have me convinced that I would rather be working on version 4.9 so I wouldn't have to deal with it :).

Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Thursday, March 23, 2017 8:00 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Let me make one more effort to convince you that the test is wrong: it was changed to its "correct" version in version 4.9:

https://github.com/apache/lucene-solr/commit/c59f13f9918faeeb4e69acd41731e674ce88f912

(LUCENE-5623: fix test to be reproducible and more evil)

Ah well, I tried.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Thursday, March 23, 2017 1:28 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sorry for the confusion. You have made it to the right version tag, but it turns out there is more than one TestEarlyTermination class. The one I am referring to is in the Misc subproject, not in the Core project.

But I think that the answer to fixing that test lies somewhere deep in the core project (or perhaps even in a subproject). It is quite possible that finding the answer to one of the other tests will also fix TestEarlyTerminationDifferentSorter(). It's also possible the piece of functionality that is broken has no tests of its own, which would explain why the issue is so hard to find.

I tried making a non-random test for this scenario before, but I couldn't find the correct set of input to make the error happen. The error happens rarely. But, the test does enough random loops to make it happen frequently. I wish there were a tool in Lucene.Net that would spit out the code for the exact same scenario that made the test fail, which would make it far easier to debug. When I stepped through the code, sometimes it failed on the second iteration, sometimes the third, sometimes the fourth, etc. But after it fails, it is nearly impossible to figure out the data that was in the index and the random components that were plugged in to make it fail.

So, we now have a big clue thanks to you - it will never fail on the first iteration. Hard coding it to 2 iterations makes it fail roughly 50% of the time. Now I suppose it is just a matter of going through and finding the random piece of input that makes it fail 100% of the time.

I also just found another clue. When the "different sorter" circumstance happens, the EarlyTerminatingSortingCollector changes its AcceptDocsOutOfOrder property from always being false to being the value of the wrapped ICollector. This is the only difference between the test that always passes and the one that randomly fails. So, either somebody is lying that they accept docs out of order (anybody who returns true is suspect), or the code that accepts the docs out of order is munged.

Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Thursday, March 23, 2017 1:41 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Insanity is knowing the answer to life, the universe and everything, and suddenly realizing that it's not 42.
Insanity is knowing that api-work is based on lucene 4.8.0 but yet it isn't. But it is. But it isn't.

I don't understand it, but I must be living in a parallel universe. Or perhaps you are. Or we both are. Maybe it's the matrix.

You refer to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131 as the code upon which the test is based, and I can't but acknowledge that it is so.

I do the following:
Go to https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java
On the page, click on the 'branch: master' button, click on 'tags', and choose 'release/lucene-solr/4.8.0'
The url switches to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java 

"My" TestEarlyTermination.java is different from yours.

I give up. Stick a fork in me, I'm done.

Turn the lights off when you're leaving.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 6:58 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

Thanks for the explanation. However, you have linked to wrong versions of documents and code. If you look at the version 4.8.0 tag instead of master (which is now at version 6.4.2), you can see that the code was reproduced faithfully: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131

The documentation link you posted was from a previous version, but that comment didn't change in 4.8.0: https://lucene.apache.org/core/4_8_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

I tried what you said and it made the test pass reliably, but that isn't what the test in Java does. Also, I have been seeing random errors from other places and in all cases the code was calling Util.PriorityQueue<T>.UpdateBottom() and then afterward gets a null result from the heap. That said, it is still too early to tell if the entirety of that issue was due to the IndexSearcher issue that was just fixed. If I see another occurrence of that issue, I will send the details your way.

Anyway, it is tempting to just move the instantiation of the collectors inside the loop and be done with it, or, as I had previously discovered, to make the test pass by putting try-catch statements around the offending lines in TopDocsCollector and eating the null reference exception. But that isn't what happens in Lucene, which indicates there is a bug. Sure, it is bad that in Java they didn't program defensively and check for null before accessing its members (bad Java programmer!), and maybe these tests were not written according to the documentation, but the fact of the matter is that Lucene is not supposed to throw this exception in this case.

You have given me more to consider, though. I previously theorized that it was the part of the code that is populating the queue that was failing, but it may be that there is a failure to reset the "full" state somewhere during the next search, which is in turn causing this failure to happen. It is strange that it only happens in the "different sorter" case, though because the tests are nearly identical in every other sense.


> I'm getting something different from TestSearcherManager_Mem --> cannot close: there are still open files: _1.cfs ,_0.cfs

You know what they say, the definition of insanity is doing the same thing twice and expecting a different result. So, I guess that makes me insane because that is exactly what these randomized tests do.

I saw this issue before, and it boils down to not appropriately disposing something, or not disposing things in the right order. It especially happens in the test environment when an assert fails, the Dispose() method doesn't get called as a result, and then the next run or two attempt to recover. Sometimes changing the test to use using blocks works, but other times it screws up the test. Of course you would always use using blocks in production code, but you have to take into consideration that these tests were written by the insane.

Even if you don't fix the test with using blocks, running the test again usually bypasses this message and gets you to another one.
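The point about using blocks translates directly to Java's try-with-resources: the resource is closed even when an assertion fails mid-test, so a failed run doesn't leave files open for the next one. A minimal sketch with hypothetical class names (not Lucene code):

```java
public class DisposeDemo {
    // Stands in for anything that must be closed, e.g. an index directory.
    static class TrackedIndex implements AutoCloseable {
        static boolean closed = false;
        @Override public void close() { closed = true; }
    }

    public static void main(String[] args) {
        try (TrackedIndex index = new TrackedIndex()) {
            // An assert failure mid-test throws here...
            throw new AssertionError("simulated failing test");
        } catch (AssertionError expected) {
            // ...but try-with-resources (the analogue of C#'s using block)
            // still ran close(), so there is no "still open files" complaint
            // on the next run.
        }
        System.out.println("closed=" + TrackedIndex.closed);
    }
}
```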


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:48 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

"King of the one-liners". Hmmm... maybe I'll order a T-shirt with that description.

Regarding TestEarlyTerminationDifferentSorter, I think the error is in the test, not in the code being tested.
Namely, the following:

                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);

What happens is the following: at most numHits results are collected in the collector. This is done with a priority queue of at most numHits size. That priority queue keeps track of the numHits (or fewer) best-matched documents. If there are more than numHits total hits and a document matches better than the lowest-matching document in the priority queue, that lowest-matching document is ousted and the better-matching document is put in the queue instead (at its proper priority).
The fact that the queue is full is tracked with the TopFieldCollector.queueFull Boolean. Once it's true, the queue is full and the above process kicks in.
Now, the call to GetTopDocs() returns the at most numHits top documents. This is done by getting them from the queue in TopDocsCollector (TopFieldCollector's base class):

        protected virtual void PopulateResults(ScoreDoc[] results, int howMany)
        {
            for (int i = howMany - 1; i >= 0; i--)
            {
                results[i] = m_pq.Pop();
            }
        }

If the priority queue contains howMany or fewer documents, the queue is empty after that operation.
However, all the other state variables in the TopFieldCollector (totalHits, queueFull, numHits, docBase) are left as-is. So even though the queue is empty, queueFull is still true from the previous run. The system will try to replace a nonexistent element, causing the null reference.
The moral of the story is "after calling collector.GetTopDocs, the collector object cannot be reused".
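
To make the "popping drains the queue" point concrete, here is a small standalone sketch. It is plain Java mirroring the upstream collector's m_pq, not the Lucene.NET classes themselves, and the class name DrainDemo is made up for illustration:

```java
import java.util.PriorityQueue;

public class DrainDemo {
    // Mimics TopDocsCollector.PopulateResults: pop every result out of the
    // queue, filling the array from the back so the best hit ends up first.
    public static int[] drain(PriorityQueue<Integer> pq, int howMany) {
        int[] results = new int[howMany];
        for (int i = howMany - 1; i >= 0; i--) {
            results[i] = pq.poll(); // poll() removes the element: the queue shrinks
        }
        return results;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(); // min-heap, like Lucene's m_pq
        pq.add(3); pq.add(1); pq.add(2);
        int[] top = drain(pq, 3);
        assert top[0] == 3 && top[1] == 2 && top[2] == 1;
        // The queue is now empty, but a separately tracked queueFull flag
        // (as in TopFieldCollector) would still be stale from before the drain.
        assert pq.isEmpty();
        System.out.println("empty after drain: " + pq.isEmpty());
    }
}
```

After drain() returns, any state outside the queue that claims the queue is full is simply wrong, which is exactly the situation a reused collector finds itself in.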

In fact, this is confirmed by the documentation: https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

"NOTE: you cannot call this method more than once for each search execution. If you need to call it more than once, passing each time a different start, you should call topDocs() and work with the returned TopDocs object, which will contain all the results this search execution collected."

What happens in the test is wrong: the collector should be re-created at every iteration.
Interestingly, if you look at the Java source https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTerminatingSortingCollector.java you will see that testEarlyTermination does exactly that, and that testEarlyTerminationDifferentSorter is quite different.

So I think this should be rewritten as:

        public void TestEarlyTerminationDifferentSorter()
        {
            // test that the collector works correctly when the index was sorted by a
            // different sorter than the one specified in the ctor.
            CreateRandomIndexes(5);

            IndexSearcher searcher = NewSearcher(reader);
            int iters = AtLeast(5);
            for (int i = 0; i < iters; ++i)
            {
                int numHits = TestUtil.NextInt(Random(), 1, numDocs / 10);
                Sort sort = new Sort(new SortField("ndv2", SortFieldType.INT64, false));
                bool fillFields = Random().nextBoolean();
                bool trackDocScores = Random().nextBoolean();
                bool trackMaxScore = Random().nextBoolean();
                bool inOrder = Random().nextBoolean();
                TopFieldCollector collector1 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TopFieldCollector collector2 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TermQuery query = new TermQuery(new Term("s", RandomInts.RandomFrom(Random(), terms)));
                searcher.Search(query, collector1);
                Sort different = new Sort(new SortField("ndv2", SortFieldType.INT64));
                searcher.Search(query, new EarlyTerminatingSortingCollectorHelper(collector2, different, numHits));


                assertTrue(collector1.TotalHits >= collector2.TotalHits);
                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);
            }
        }

... and the same for the other test.

I'm getting something different from TestSearcherManager_Mem: 

Test Name:	TestSearcherManager_Mem
Test Outcome:	Failed
Result Message:	
System.Exception : MockDirectoryWrapper: cannot close: there are still open files: _1.cfs ,_0.cfs
  ----> System.Exception : unclosed IndexSlice: _5.cfs

... but perhaps Github is playing tricks on me again.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 2:00 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many failing tests. I am not sure how I missed that one - I think I was staring at the screen for too long. I moved the return true statement inside the try block, but I suppose it doesn't really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~ 85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter() and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T> somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread class and fixed some minor differences from Lucene, but none of them affected the result. I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager, fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made, and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274. There are several things that depend on Automaton to work, and since none of them have failing tests, I am not all that concerned about this "difference". But it would be nice to find an answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even if an element is available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                    return true; // *********** here  *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note that a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping for the WeakDictionary to see if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492. This is one thing that made it tricky to work with because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java). I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the LimitedConcurrencyLevelTaskScheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing Result without waiting for the task to end (as is done now) will yield an incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you would essentially be fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
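
For reference, the upstream Java code gets "results in completion order" from an ExecutorCompletionService, whose take() blocks until some task has finished, which is precisely what service.Take() is modeled on. A minimal Java sketch (class and method names are illustrative):

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompletionDemo {
    // Collect results in completion order, not submission order.
    public static int sumAsCompleted(int nTasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CompletionService<Integer> service = new ExecutorCompletionService<>(pool);
        for (int i = 1; i <= nTasks; i++) {
            final int v = i;
            service.submit(() -> v * v); // each task computes one partial result
        }
        int sum = 0;
        for (int i = 0; i < nTasks; i++) {
            // take() blocks until *some* task is done; get() then returns immediately.
            // This is the synchronous wait the C# MoveNext() was missing.
            sum += service.take().get();
        }
        pool.shutdown();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumAsCompleted(3)); // 1 + 4 + 9
    }
}
```

The key property is that the consumer never reads a result before its task has completed, while still picking up whichever task finishes first.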

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass, change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator/call to enumerator that is not thread-safe see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the java behavior.
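
The difference between a timed join and an unbounded join is easy to demonstrate. This is a hedged Java sketch (JoinDemo is a made-up name; a latch is used instead of real work so the outcome is deterministic):

```java
import java.util.concurrent.CountDownLatch;

public class JoinDemo {
    public static boolean demo() throws InterruptedException {
        CountDownLatch go = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try { go.await(); } catch (InterruptedException ignored) {}
        });
        worker.start();
        worker.join(50);               // times out: the worker is still blocked on the latch
        boolean aliveAfterTimedJoin = worker.isAlive();
        go.countDown();                // let the worker finish its "work"
        worker.join();                 // unbounded join waits for actual termination
        return aliveAfterTimedJoin && !worker.isAlive();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timed join returned early, plain join waited: " + demo());
    }
}
```

A w.Join(1000L) that silently gives up is exactly how the test ends up inspecting threads that are still running.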

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
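
Java's ConcurrentHashMap makes the same "weakly consistent iteration" promise that the .NET ConcurrentDictionary<K,V> enumerator does, so the guarantee can be demonstrated in a few lines (Java analog, illustrative names):

```java
import java.util.concurrent.ConcurrentHashMap;

public class WeaklyConsistentDemo {
    public static int iterateWhileRemoving() {
        ConcurrentHashMap<Integer, String> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 100; i++) map.put(i, "v" + i);
        int seen = 0;
        // Removing entries mid-iteration is safe here: the iterator is weakly
        // consistent, so no ConcurrentModificationException is ever thrown.
        for (Integer key : map.keySet()) {
            map.remove((key + 50) % 100);
            seen++;
        }
        return seen; // some not-yet-visited entries may be skipped, but iteration completes
    }

    public static void main(String[] args) {
        System.out.println("entries visited: " + iterateWhileRemoving());
    }
}
```

This is why driving the keys enumerator off the concurrent backing store's own enumerator is safe, whereas ElementAt(position) over a mutating collection is not.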

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient, for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): if we had the actual object, no enumerator object would be created on the heap.
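
The same lazy-allocation reap can be sketched end-to-end in Java, using WeakReference.clear() as a deterministic stand-in for the garbage collector (ReapDemo and its shape are illustrative, not the Lucene.NET code):

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReapDemo {
    // Lazily allocate keysToRemove so a reap that finds no dead keys allocates nothing.
    public static int reap(Map<WeakReference<Object>, String> backingStore) {
        List<WeakReference<Object>> keysToRemove = null;
        // Iterate the entry set (the map's own enumerator), not a Keys copy.
        for (Map.Entry<WeakReference<Object>, String> entry : backingStore.entrySet()) {
            if (entry.getKey().get() == null) { // the referent is gone: a zombie key
                if (keysToRemove == null)
                    keysToRemove = new ArrayList<>();
                keysToRemove.add(entry.getKey());
            }
        }
        if (keysToRemove != null)
            for (WeakReference<Object> key : keysToRemove)
                backingStore.remove(key);
        return backingStore.size();
    }

    public static void main(String[] args) {
        Object keepAlive = new Object(); // strong reference keeps this referent live
        WeakReference<Object> live = new WeakReference<>(keepAlive);
        WeakReference<Object> dead = new WeakReference<>(new Object());
        Map<WeakReference<Object>, String> map = new HashMap<>();
        map.put(live, "live");
        map.put(dead, "dead");
        dead.clear(); // stand-in for the referent having been garbage collected
        System.out.println("entries after reap: " + reap(map)
                + ", referent still held: " + (keepAlive != null));
    }
}
```

Removal is deferred until iteration is done, which sidesteps modifying the store while enumerating it.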

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that a future implementation of an enumerator (especially one on a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.
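
Java has the same discipline in the form of try-with-resources over AutoCloseable, which is the direct analog of a C# using block around an enumerator. A toy sketch (DisposeDemo and ClosingIterator are invented for illustration):

```java
public class DisposeDemo {
    // A toy iterator that owns cleanup work, like an IEnumerator<T> whose
    // Dispose() must release synchronization objects.
    static class ClosingIterator implements AutoCloseable {
        boolean closed = false;
        private int next = 0;
        boolean hasNext() { return next < 3; }
        int next() { return next++; }
        @Override
        public void close() { closed = true; }
    }

    public static boolean iterateAndClose() {
        ClosingIterator iter = new ClosingIterator();
        int sum = 0;
        // try-with-resources == a C# using block: close() runs even if the loop throws.
        try (ClosingIterator it = iter) {
            while (it.hasNext()) sum += it.next();
        }
        return iter.closed && sum == 3; // 0 + 1 + 2, and cleanup happened
    }

    public static void main(String[] args) {
        System.out.println("closed after use: " + iterateAndClose());
    }
}
```
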

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread runs Clean() in the middle of another Clean() on the same collection, replacing the object being enumerated, you get the exception. Always.
To avoid the intermittence, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.
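
The failure mode is the classic fail-fast iterator: the .NET "Collection was modified; enumeration operation may not execute" is the counterpart of Java's ConcurrentModificationException, and the same class of error can be reproduced deterministically even single-threaded (FailFastDemo is an illustrative name):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static boolean modifyWhileIterating() {
        Map<String, String> map = new HashMap<>();
        map.put("a", "1");
        map.put("b", "2");
        try {
            for (String key : map.keySet()) {
                map.remove("a");   // structural modification mid-iteration...
                map.put("c", "3"); // ...exactly what a concurrent Clean() does
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true; // the fail-fast iterator detected the modification
        }
    }

    public static void main(String[] args) {
        System.out.println("detected modification: " + modifyWhileIterating());
    }
}
```

With two threads, one Clean() plays the role of the loop body mutating the map out from under the other Clean()'s live enumerator.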

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer object, it doesn't matter which one. It's meaningless, but it has the side effect of initializing the static fields without problems.
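
The workaround works because it forces all the lazy initialization to happen once, on one thread, before any races can start. For comparison, the JVM gives this guarantee for free: static initializers run exactly once under an implicit class-init lock, however many threads race to trigger them. An illustrative Java sketch (EagerInitDemo is a made-up name):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class EagerInitDemo {
    static final AtomicInteger initCount = new AtomicInteger();

    // The holder's static initializer runs exactly once, under the JVM's
    // class-initialization lock, no matter how many threads touch INSTANCE.
    static class Holder {
        static final Object INSTANCE = create();
        static Object create() {
            initCount.incrementAndGet();
            return new Object();
        }
    }

    public static int raceInit() throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> { Object o = Holder.INSTANCE; });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        return initCount.get(); // 1: initialized once despite 8 racing threads
    }

    public static void main(String[] args) throws Exception {
        System.out.println("init count: " + raceInit());
    }
}
```

The .NET workaround above achieves the same effect manually: do the one-time work before spawning the threads, so CleanIfNeeded() never races during static setup.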


Vincent


RE: Bug in Lucene static initialization with multiple threads.

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
Hello Shad,

Let me make one more effort to convince you that the test is wrong: it was changed to its "correct" version in version 4.9:

https://github.com/apache/lucene-solr/commit/c59f13f9918faeeb4e69acd41731e674ce88f912

(LUCENE-5623: fix test to be reproducible and more evil)

Ah well, I tried.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Thursday, March 23, 2017 1:28 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sorry for the confusion. You have made it to the right version tag, but it turns out there are more than one TestEarlyTermination class. The one I am referring to is in the Misc subproject, not in the Core project.

But I think that the answer to fixing that test lies somewhere deep in the core project (or perhaps even in a subproject). It is quite possible that finding the answer to one of the other tests will also fix TestEarlyTerminationDifferentSorter(). It's also possible the piece of functionality that is broken has no tests of its own, which would explain why the issue is so hard to find.

I tried making a non-random test for this scenario before, but I couldn't find the correct set of input to make the error happen. The error happens rarely. But, the test does enough random loops to make it happen frequently. I wish there were a tool in Lucene.Net that would spit out the code for the exact same scenario that made the test fail, which would make it far easier to debug. When I stepped through the code, sometimes it failed on the second iteration, sometimes the third, sometimes the fourth, etc. But after it fails, it is nearly impossible to figure out the data that was in the index and the random components that were plugged in to make it fail.

So, we now have a big clue thanks to you - it will never fail on the first iteration. Hard coding it to 2 iterations makes it fail roughly 50% of the time. Now I suppose it is just a matter of going through and finding the random piece of input that makes it fail 100% of the time.

I also just found another clue. When the "different sorter" circumstance happens, the EarlyTerminatingSortingCollector changes its AcceptDocsOutOfOrder property from always being false to being the value of the wrapped ICollector. This is the only difference between the test that always passes and the one that randomly fails. So, either somebody is lying that they accept docs out of order (anybody who returns true is suspect), or the code that accepts the docs out of order is munged.

Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Thursday, March 23, 2017 1:41 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Insanity is knowing the answer to life, the universe and everything, and suddenly realizing that it's not 42.
Insanity is knowing that api-work is based on lucene 4.8.0 but yet it isn't. But it is. But it isn't.

I don't understand it, but I must be living in a parallel universe. Or perhaps you are. Or we both are. Maybe it's the matrix.

You refer to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131 as the code upon which the test is based, and I can't but acknowledge that it is so.

I do the following:
Go to https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java
On the page, click on the 'branch: master' button, click on 'tags', and choose 'release/lucene-solr/4.8.0'
The url switches to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java 

"My" TestEarlyTermination.java is different from yours.

I give up. Stick a fork in me, I'm done.

Turn the lights off when you're leaving.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 6:58 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

Thanks for the explanation. However, you have linked to wrong versions of documents and code. If you look at the version 4.8.0 tag instead of master (which is now at version 6.4.2), you can see that the code was reproduced faithfully: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131

The documentation link you posted was from a previous version, but that comment didn't change in 4.8.0: https://lucene.apache.org/core/4_8_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

I tried what you said and it made the test pass reliably, but that isn't what the test in Java does. Also, I have been seeing random errors from other places and in all cases the code was calling Util.PriorityQueue<T>.UpdateBottom() and then afterward gets a null result from the heap. That said, it is still too early to tell if the entirety of that issue was due to the IndexSearcher issue that was just fixed. If I see another occurrence of that issue, I will send the details your way.

Anyway, it is tempting to just move the instantiation of the collectors inside the loop and be done with it; or, as I had previously discovered, the test will also pass if you just put try-catch statements around the offending lines in TopDocsCollector and eat the null reference exception. But that isn't what is happening in Lucene, which indicates there is a bug. Sure, it is bad that in Java they didn't program defensively and check for null before accessing its members (bad Java programmer!), and maybe these tests were not written according to the documentation, but the fact of the matter is Lucene is not supposed to throw this exception in this case.

You have given me more to consider, though. I previously theorized that it was the part of the code that is populating the queue that was failing, but it may be that there is a failure to reset the "full" state somewhere during the next search, which is in turn causing this failure to happen. It is strange that it only happens in the "different sorter" case, though because the tests are nearly identical in every other sense.


> I'm getting something different from TestSearcherManager_Mem --> cannot close: there are still open files: _1.cfs ,_0.cfs

You know what they say, the definition of insanity is doing the same thing twice and expecting a different result. So, I guess that makes me insane because that is exactly what these randomized tests do.

I saw this issue before, and it boils down to not appropriately disposing something, or not disposing things in the right order. It especially happens in the test environment when an assert fails and the Dispose() method doesn't get called as a result; the next run or two then attempt to recover. Sometimes changing the test to use using blocks works, but other times it screws up the test. Of course you would always use using blocks in production code, but you have to take into consideration that these tests were written by the insane.

Even if you don't fix the test with using blocks, running the test again usually bypasses this message and gets you to another one.


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:48 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

"King of the one-liners". Hmmm... maybe I'll order a T-shirt with that description.

Regarding TestEarlyTerminationDifferentSorter, I think the error is in the test, not in the code being tested.
Namely, the following:

                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);

What happens is the following: at most numHits results are collected in the collector. This is done with a priority queue of at most numHits size. That priority queue keeps track of the numHits (or fewer) "best matched" documents. If there are more than numHits total hits and there's a better matching document than the lowest matching document in the priority queue, that lowest matching document is ousted and the better matching document is put in the queue instead (at its proper priority).
The fact that the queue is full is tracked with the TopFieldCollector.queueFull Boolean. Once it's true, the queue is full and the above process kicks in.
Now, the call to GetTopDocs() returns the at most numHits top documents. This is done by getting them from the queue in TopDocsCollector (TopFieldCollector's base class):

        protected virtual void PopulateResults(ScoreDoc[] results, int howMany)
        {
            for (int i = howMany - 1; i >= 0; i--)
            {
                results[i] = m_pq.Pop();
            }
        }

If the priority queue contains howMany or fewer documents, the queue is empty after that operation.
However, all the other state variables in the TopFieldCollector (totalHits, queueFull, numHits, docBase) are left as-is. So, even though the queue is empty, queueFull is still true from the previous run. The system will try to replace a nonexistent element, causing the null reference.
The moral of the story is: after calling collector.GetTopDocs(), the collector object cannot be reused.

In fact, this is confirmed by the documentation: https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

"NOTE: you cannot call this method more than once for each search execution. If you need to call it more than once, passing each time a different start, you should call topDocs() and work with the returned TopDocs object, which will contain all the results this search execution collected."

What happens in the test is wrong: the collector should be re-created at every iteration.
Interestingly, if you look at the java source https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTerminatingSortingCollector.java you will see that the testEarlyTermination does exactly that, and that testEarlyTerminationDifferentSorter is quite different.

So I think this should be rewritten as:

        public void TestEarlyTerminationDifferentSorter()
        {
            // test that the collector works correctly when the index was sorted by a
            // different sorter than the one specified in the ctor.
            CreateRandomIndexes(5);

            IndexSearcher searcher = NewSearcher(reader);
            int iters = AtLeast(5);
            for (int i = 0; i < iters; ++i)
            {
                int numHits = TestUtil.NextInt(Random(), 1, numDocs / 10);
                Sort sort = new Sort(new SortField("ndv2", SortFieldType.INT64, false));
                bool fillFields = Random().nextBoolean();
                bool trackDocScores = Random().nextBoolean();
                bool trackMaxScore = Random().nextBoolean();
                bool inOrder = Random().nextBoolean();
                TopFieldCollector collector1 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TopFieldCollector collector2 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TermQuery query = new TermQuery(new Term("s", RandomInts.RandomFrom(Random(), terms)));
                searcher.Search(query, collector1);
                Sort different = new Sort(new SortField("ndv2", SortFieldType.INT64));
                searcher.Search(query, new EarlyTerminatingSortingCollectorHelper(collector2, different, numHits));

                assertTrue(collector1.TotalHits >= collector2.TotalHits);
                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);
            }
        }

... and the same for the other test.

I'm getting something different from TestSearcherManager_Mem: 

Test Name:	TestSearcherManager_Mem
Test Outcome:	Failed
Result Message:	
System.Exception : MockDirectoryWrapper: cannot close: there are still open files: _1.cfs ,_0.cfs
  ----> System.Exception : unclosed IndexSlice: _5.cfs

... but perhaps Github is playing tricks on me again.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 2:00 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many failing tests. I am not sure how I missed that one - I think I was staring at the screen for too long. I moved the return true statement inside the try block, but I suppose it doesn't really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~ 85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter() and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T> somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread class and fixed some minor differences from Lucene, but none of them affected the result. I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager, fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made, and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274. There are several things that depend on Automaton to work, and since none of them have failing tests, I am not all that concerned about this "difference". But it would be nice to find an answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even if an element is available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                    return true; // *********** here *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping for the WeakDictionary to see if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492. This is one thing that made it tricky to work with because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java). I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the limited concurrencylevel task scheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing Result without waiting for the task to end (as is done now) will yield incorrect results.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you would essentially be fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
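The "return results as they become available" idea can at least be sketched outside Lucene with Task.WhenAny (assumed names; this is a standalone illustration, not the Lucene.NET code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class AsCompletedDemo
{
    // Yield results in completion order rather than submission order.
    public static async Task<List<int>> DrainAsCompletedAsync(List<Task<int>> tasks)
    {
        var results = new List<int>();
        var pending = new List<Task<int>>(tasks);
        while (pending.Count > 0)
        {
            Task<int> done = await Task.WhenAny(pending); // first one to finish
            pending.Remove(done);
            results.Add(await done); // propagate its result (or exception)
        }
        return results;
    }

    static async Task Main()
    {
        var slow = Task.Delay(200).ContinueWith(_ => 1);
        var fast = Task.Delay(10).ContinueWith(_ => 2);
        var results = await DrainAsCompletedAsync(new List<Task<int>> { slow, fast });
        Console.WriteLine(string.Join(",", results)); // fast result arrives first: "2,1"
    }
}
```

The hard part the author alludes to is not this loop itself, but reconciling it with the synchronous IEnumerator contract that ExecutionHelper<T> exposes.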

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass, change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator/call to enumerator that is not thread-safe see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the java behavior.
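In other words (a minimal standalone sketch of the suggested change, with dummy worker bodies):

```csharp
using System;
using System.Threading;

class JoinDemo
{
    static void Main()
    {
        var workers = new Thread[4];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(() => Thread.Sleep(100)); // stand-in workload
            workers[i].Start();
        }

        // Join() with no timeout blocks until each thread really finishes,
        // matching Java's awaitTermination loop.
        foreach (var w in workers)
        {
            w.Join();
        }
        Console.WriteLine("all workers finished");
    }
}
```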

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
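That design guarantee of ConcurrentDictionary<K,V> can be checked with a standalone sketch: its enumerator tolerates concurrent additions and removals without throwing, unlike a plain Dictionary.

```csharp
using System;
using System.Collections.Concurrent;

class ConcurrentEnumDemo
{
    static void Main()
    {
        var map = new ConcurrentDictionary<int, string>();
        for (int i = 0; i < 10; i++) map[i] = "v";

        int seen = 0;
        foreach (var entry in map)
        {
            map.TryRemove(entry.Key, out _); // mutate mid-enumeration...
            map[100 + entry.Key] = "new";    // ...and add, too
            seen++;                          // no InvalidOperationException
        }
        Console.WriteLine($"enumerated {seen} entries without an exception");
    }
}
```

Note the enumerator is documented as safe but not as a moment-in-time snapshot, so the exact set of entries seen under concurrent mutation is unspecified; for the WeakIdentityMap use case that is acceptable.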

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient, for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): if we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.
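Point (3) can be sketched as follows (a hypothetical shape of the weak key class; the real Lucene.NET member names may differ):

```csharp
using System;
using System.Runtime.CompilerServices;

// Hedged sketch: an identity-based weak reference whose Equals needs a
// single cast. "as" yields null for non-matching types, so a separate
// "is" test followed by a cast would do the type check twice.
internal sealed class IdentityWeakReference : WeakReference
{
    private readonly int hash;

    public IdentityWeakReference(object target) : base(target)
    {
        // Identity hash code, so the key keeps hashing consistently
        // even after the target has been collected.
        hash = RuntimeHelpers.GetHashCode(target);
    }

    public override int GetHashCode() => hash;

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(this, obj)) return true;
        var other = obj as IdentityWeakReference; // one cast instead of is + cast
        return other != null && ReferenceEquals(Target, other.Target);
    }
}
```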


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that a future implementation of an enumerator (especially one on a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Note that there are also currently a few other concurrency tests failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To make the failure deterministic, create a bunch of threads like this and remove the test "if (currentColCount > _gcCollections)" so that the Clean() code always executes. You'll get the exception every time.
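One way the race could be closed (a sketch, not the project's actual fix) is to serialize Clean() behind a private lock with a double-check, so two threads can never clean the same backing collection concurrently. The class and lock names here are placeholders, and Clean() is a stub standing in for WeakDictionary's real cleanup:

```csharp
using System;

// Sketch of a race-free CleanIfNeeded(), assuming a dedicated lock object
// is acceptable. Clean() is a stub for WeakDictionary's real cleanup.
public class WeakCacheSketch
{
    private readonly object _cleanLock = new object();
    private int _gcCollections;

    private void Clean() { /* remove dead WeakReference entries here */ }

    public void CleanIfNeeded()
    {
        // Cheap check on the common path where no GC has occurred.
        int currentColCount = GC.CollectionCount(0);
        if (currentColCount > _gcCollections)
        {
            lock (_cleanLock)
            {
                // Re-check under the lock: another thread may have
                // already cleaned for this collection count.
                if (currentColCount > _gcCollections)
                {
                    Clean();
                    _gcCollections = currentColCount;
                }
            }
        }
    }
}
```

Note this only serializes the cleaners against each other; readers enumerating the map concurrently would still need their own protection.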

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of any Analyzer; it doesn't matter which one. The call itself is meaningless, but it has the side effect of initializing the static fields without problems.
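Spelled out as a sketch (StandardAnalyzer and the LuceneVersion.LUCENE_48 constant are assumptions based on the 4.8 api-work branch; any Analyzer will do):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Util;

public static class LuceneWarmup
{
    // Run once on the main thread before spawning any workers: building a
    // token stream forces AttributeSource's static caches to populate while
    // only one thread is running, sidestepping the race inside Clean().
    public static void Run()
    {
        var field = new TextField("dummy", "dummyvalue", Field.Store.NO);
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            field.GetTokenStream(analyzer);
        }
    }
}
```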


Vincent


RE: Bug in Lucene static initialization with multiple threads.

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Hi Vincent,

Sorry for the confusion. You have made it to the right version tag, but it turns out there is more than one TestEarlyTermination class. The one I am referring to is in the Misc subproject, not in the Core project.

But I think that the answer to fixing that test lies somewhere deep in the core project (or perhaps even in a subproject). It is quite possible that finding the answer to one of the other tests will also fix TestEarlyTerminationDifferentSorter(). It's also possible the piece of functionality that is broken has no tests of its own, which would explain why the issue is so hard to find.

I tried making a non-random test for this scenario before, but I couldn't find the correct set of input to make the error happen. The error happens rarely. But, the test does enough random loops to make it happen frequently. I wish there were a tool in Lucene.Net that would spit out the code for the exact same scenario that made the test fail, which would make it far easier to debug. When I stepped through the code, sometimes it failed on the second iteration, sometimes the third, sometimes the fourth, etc. But after it fails, it is nearly impossible to figure out the data that was in the index and the random components that were plugged in to make it fail.

So, we now have a big clue thanks to you - it will never fail on the first iteration. Hard coding it to 2 iterations makes it fail roughly 50% of the time. Now I suppose it is just a matter of going through and finding the random piece of input that makes it fail 100% of the time.

I also just found another clue. When the "different sorter" circumstance happens, the EarlyTerminatingSortingCollector changes its AcceptDocsOutOfOrder property from always being false to being the value of the wrapped ICollector. This is the only difference between the test that always passes and the one that randomly fails. So, either somebody is lying that they accept docs out of order (anybody who returns true is suspect), or the code that accepts the docs out of order is munged.

Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Thursday, March 23, 2017 1:41 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Insanity is knowing the answer to life, the universe and everything, and suddenly realizing that it's not 42.
Insanity is knowing that api-work is based on lucene 4.8.0 but yet it isn't. But it is. But it isn't.

I don't understand it, but I must be living in a parallel universe. Or perhaps you are. Or we both are. Maybe it's the matrix.

You refer to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131 as the code upon which the test is based, and I can't but acknowledge that it is so.

I do the following:
Go to https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java
On the page, click on the 'branch: master' button, click on 'tags', and choose 'release/lucene-solr/4.8.0'
The url switches to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java 

"My" TestEarlyTermination.java is different from yours.

I give up. Stick a fork in me, I'm done.

Turn the lights off when you're leaving.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 6:58 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

Thanks for the explanation. However, you have linked to wrong versions of documents and code. If you look at the version 4.8.0 tag instead of master (which is now at version 6.4.2), you can see that the code was reproduced faithfully: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131

The documentation link you posted was from a previous version, but that comment didn't change in 4.8.0: https://lucene.apache.org/core/4_8_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

I tried what you said and it made the test pass reliably, but that isn't what the test in Java does. Also, I have been seeing random errors from other places and in all cases the code was calling Util.PriorityQueue<T>.UpdateBottom() and then afterward gets a null result from the heap. That said, it is still too early to tell if the entirety of that issue was due to the IndexSearcher issue that was just fixed. If I see another occurrence of that issue, I will send the details your way.

Anyway, it is tempting to just move the instantiation of the collectors inside the loop and be done with it, or, as I had previously discovered, to make the test pass by putting try-catch statements around the offending lines in TopDocsCollector and eating the null reference exception. But that isn't what is happening in Lucene, which indicates there is a bug. Sure, it is bad that in Java they didn't program defensively and check for null before accessing its members (bad Java programmer!), and maybe these tests were not written according to the documentation, but the fact of the matter is that Lucene is not supposed to throw this exception in this case.

You have given me more to consider, though. I previously theorized that it was the part of the code that populates the queue that was failing, but it may be that there is a failure to reset the "full" state somewhere during the next search, which is in turn causing this failure to happen. It is strange that it only happens in the "different sorter" case, though, because the tests are nearly identical in every other sense.


> I'm getting something different from TestSearcherManager_Mem --> cannot close: there are still open files: _1.cfs ,_0.cfs

You know what they say, the definition of insanity is doing the same thing twice and expecting a different result. So, I guess that makes me insane because that is exactly what these randomized tests do.

I saw this issue before, and it boils down to not disposing something appropriately, or not disposing things in the right order. It especially happens in the test environment when an assert fails and the Dispose() method doesn't get called as a result, and then the next run or two attempt to recover. Sometimes changing the test to use using blocks works, but other times it screws up the test. Of course you would always use using blocks in production code, but you have to take into consideration that these tests were written by the insane.

Even if you don't fix the test with using blocks, running the test again usually bypasses this message and gets you to another one.


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:48 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

"King of the one-liners". Hmmm... maybe I'll order a T-shirt with that description.

Regarding TestEarlyTerminationDifferentSorter, I think the error is in the test, not in the code being tested.
Namely, the following:

                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);

What happens is the following: At most numHits results are collected in the collector. This is done with a priority queue of at most numHits size. That priority queue keeps track of the numHits (or fewer) "best matched" documents. If there are more than numHits total hits, and there's a better matching document than the lowest matching document in the priority queue, that lowest matching document is ousted from the queue and the better matching document is put in its place (at its proper priority).
The fact that the queue is full is tracked with the TopFieldCollector.queueFull Boolean. Once it's true, the queue is full and the above process kicks in.
Now, the call to GetTopDocs() returns the at most numHits top documents. This is done by getting them from the queue in TopDocsCollector (TopFieldCollector's base class):

        protected virtual void PopulateResults(ScoreDoc[] results, int howMany)
        {
            for (int i = howMany - 1; i >= 0; i--)
            {
                results[i] = m_pq.Pop();
            }
        }

If the priority queue contains howMany or less documents, after that operation the queue is empty.
However, all the other state variables in the TopFieldCollector (totalHits, queueFull, numHits, docBase) are left as-is. So, even though the queue is empty, queueFull is still true from the previous run. The system will try to replace a nonexistent element, causing the null reference.
The moral of the story is "after calling collector.GetTopDocs, the collector object cannot be reused".

In fact, this is confirmed by the documentation: https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

"NOTE: you cannot call this method more than once for each search execution. If you need to call it more than once, passing each time a different start, you should call topDocs() and work with the returned TopDocs object, which will contain all the results this search execution collected."

What happens in the test is wrong: the collector should be re-created at every iteration.
Interestingly, if you look at the java source https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTerminatingSortingCollector.java you will see that the testEarlyTermination does exactly that, and that testEarlyTerminationDifferentSorter is quite different.

So I think this should be rewritten as:

        public void TestEarlyTerminationDifferentSorter()
        {
            // test that the collector works correctly when the index was sorted by a
            // different sorter than the one specified in the ctor.
            CreateRandomIndexes(5);

            IndexSearcher searcher = NewSearcher(reader);
            int iters = AtLeast(5);
            for (int i = 0; i < iters; ++i)
            {
                int numHits = TestUtil.NextInt(Random(), 1, numDocs / 10);
                Sort sort = new Sort(new SortField("ndv2", SortFieldType.INT64, false));
                bool fillFields = Random().NextBoolean();
                bool trackDocScores = Random().NextBoolean();
                bool trackMaxScore = Random().NextBoolean();
                bool inOrder = Random().NextBoolean();
                TopFieldCollector collector1 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TopFieldCollector collector2 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TermQuery query = new TermQuery(new Term("s", RandomInts.RandomFrom(Random(), terms)));
                searcher.Search(query, collector1);
                Sort different = new Sort(new SortField("ndv2", SortFieldType.INT64));
                searcher.Search(query, new EarlyTerminatingSortingCollectorHelper(collector2, different, numHits));

                assertTrue(collector1.TotalHits >= collector2.TotalHits);
                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);
            }
        }

... and the same for the other test.

I'm getting something different from TestSearcherManager_Mem: 

Test Name:	TestSearcherManager_Mem
Test Outcome:	Failed
Result Message:	
System.Exception : MockDirectoryWrapper: cannot close: there are still open files: _1.cfs ,_0.cfs
  ----> System.Exception : unclosed IndexSlice: _5.cfs

... but perhaps Github is playing tricks on me again.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 2:00 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many failing tests. I am not sure how I missed that one - I think I was staring at the screen for too long. I moved the return true statement inside the try block, but I suppose it doesn't really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~ 85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter() and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T> somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread class and fixed some minor differences from Lucene, but none of them affected the result. I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager, fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made, and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274. There are several things that depend on Automaton to work, and since none of them have failing tests, I am not all that concerned about this "difference". But it would be nice to find an answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even if an element is available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                    return true; // *********** here *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping it in for the WeakDictionary to see if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492. This is one thing that made it tricky to work with because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java). I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.
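For illustration, a hedged sketch of what a "reap on read" enumerator could look like in .NET (all names here are hypothetical, not the actual WeakIdentityMap internals): MoveNext() skips entries whose targets were collected, and it caches a strong reference so that Current can be read repeatedly, unlike Java's single-shot next():

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hedged sketch: enumerate only the still-live targets of a weak-key
// collection, treating dead entries as "reaped" while reading.
public class WeakKeyEnumerator : IEnumerator<object>
{
    private readonly IEnumerator<WeakReference> inner;
    private object current;

    public WeakKeyEnumerator(IEnumerable<WeakReference> keys)
    {
        inner = keys.GetEnumerator();
    }

    public bool MoveNext()
    {
        while (inner.MoveNext())
        {
            object target = inner.Current.Target; // strong ref, or null if dead
            if (target != null)
            {
                current = target; // pin it until the next MoveNext()
                return true;
            }
            // dead entry: skip it and keep scanning
        }
        current = null;
        return false;
    }

    public object Current => current
        ?? throw new InvalidOperationException("No current element.");

    object IEnumerator.Current => Current;
    public void Reset() => throw new NotSupportedException();
    public void Dispose() => inner.Dispose();
}
```

Caching the strong reference in Current also prevents the target from being collected between MoveNext() and the read, which is the race the Java version handles by throwing.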


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the limited concurrencylevel task scheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing Result without waiting for the task to end (as is done now) will yield incorrect results.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you will essentially be fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
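As a rough sketch of that "completion order" idea (this is not the actual ExecutionHelper<T> API; the method and parameter names are made up), Task.WhenAny can be used to yield each result as soon as its task finishes, instead of in submission order:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

public static class CompletionOrder
{
    // Yield each task's result as it completes, not in submission order.
    public static IEnumerable<T> InCompletionOrder<T>(IEnumerable<Task<T>> tasks)
    {
        var pending = new List<Task<T>>(tasks);
        while (pending.Count > 0)
        {
            // WhenAny completes with the first finished task; reading
            // Result afterwards cannot block, because it is already done.
            Task<T> done = Task.WhenAny(pending).Result;
            pending.Remove(done);
            yield return done.Result;
        }
    }
}
```

This still blocks the consuming thread between results, but at least no slow task can hold up a faster one behind it.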

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass, change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator/call to enumerator that is not thread-safe see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap, which fails for me as well (in 100% of the cases). Maybe I have something to contribute; I have 3 observations.

First, notice that in TestConcurrentHashMap, 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the java behavior.
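To make the suggestion concrete, here is a minimal self-contained sketch of joining without a timeout (the worker body and the JoinDemo/RunAndJoinAll names are mine, just stand-ins for the real test code):

```csharp
using System;
using System.Threading;

static class JoinDemo
{
    // Start the workers, then join each one WITHOUT a timeout, mirroring
    // Java's awaitTermination loop: control only continues once every
    // thread has actually finished, however long that takes.
    public static bool RunAndJoinAll(int threadCount)
    {
        var workers = new Thread[threadCount];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(() => Thread.Sleep(50)); // stand-in for test work
            workers[i].Start();
        }

        foreach (var w in workers)
        {
            w.Join(); // no 1000L timeout: blocks until the thread terminates
        }

        foreach (var w in workers)
        {
            if (w.IsAlive) return false; // cannot happen after an untimed Join()
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(RunAndJoinAll(8)); // True
    }
}
```

By contrast, w.Join(1000L) returns false after one second whether or not the thread finished, which is how the original test could proceed with workers still running.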

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
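As a stand-alone illustration of that last claim (hypothetical demo code, not from Lucene.NET): the ConcurrentDictionary<K,V> enumerator is documented to tolerate concurrent additions and removals, so mutating the map while an enumerator over it is live never throws.

```csharp
using System;
using System.Collections.Concurrent;

static class EnumeratorDemo
{
    // Remove entries while an enumerator over the same dictionary is live.
    // With Dictionary<K,V> this pattern can throw "Collection was modified";
    // the ConcurrentDictionary<K,V> enumerator handles it by design.
    public static bool SurvivesMutationDuringEnumeration()
    {
        var map = new ConcurrentDictionary<int, string>();
        for (int i = 0; i < 100; i++) map[i] = "value" + i;

        try
        {
            foreach (var kv in map)
            {
                map.TryRemove(kv.Key, out string _); // mutate while enumerating
            }
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(SurvivesMutationDuringEnumeration()); // True
    }
}
```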

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient, for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): if we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.
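A sketch of what point 3 amounts to (the class and field names below are illustrative, not the actual Lucene.NET IdentityWeakReference source): a single "as" cast both tests the type and converts, instead of an "is" test followed by a second cast.

```csharp
using System;
using System.Runtime.CompilerServices;

// Illustrative sketch only: names are assumptions, not the real source.
class IdentityWeakReference : WeakReference
{
    public IdentityWeakReference(object target) : base(target) { }

    public override bool Equals(object other)
    {
        // One cast does both the type test and the conversion;
        // 'as' yields null when 'other' is not an IdentityWeakReference.
        var reference = other as IdentityWeakReference;
        return reference != null && ReferenceEquals(Target, reference.Target);
    }

    public override int GetHashCode()
    {
        var target = Target;
        // Identity hash, so equal-by-identity targets hash alike.
        return target == null ? 0 : RuntimeHelpers.GetHashCode(target);
    }
}

static class Program
{
    static void Main()
    {
        var o = new object();
        // Two weak references to the same (still alive) target are equal.
        Console.WriteLine(new IdentityWeakReference(o).Equals(new IdentityWeakReference(o))); // True
    }
}
```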


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that a future implementation of an enumerator (especially one over a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To take the intermittence out of the repro, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.
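The underlying mechanism is easy to reproduce with a plain Dictionary<K,V> (a stand-alone sketch, single-threaded here so it is deterministic; the real crash is just the two-thread version of the same thing):

```csharp
using System;
using System.Collections.Generic;

static class ModifiedDuringEnumerationDemo
{
    // Structurally modifying a Dictionary<K,V> while an enumerator over it
    // is live invalidates that enumerator; the next MoveNext() throws
    // InvalidOperationException ("Collection was modified; enumeration
    // operation may not execute"), the same inner exception as in the report.
    public static bool Throws()
    {
        var map = new Dictionary<int, int> { [1] = 1, [2] = 2 };
        try
        {
            foreach (var kv in map)
            {
                map[kv.Key + 100] = kv.Value; // add an entry while enumerating
            }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    static void Main()
    {
        Console.WriteLine(Throws()); // True
    }
}
```

Two overlapping Clean() calls do exactly this to the shared backing store, except the "modification" comes from the other thread.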

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer object; it doesn't matter which one. It's meaningless, but it has the side effect of initializing the static fields without problems.


Vincent


RE: Bug in Lucene static initialization with multiple threads.

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
Hello Shad,

Insanity is knowing the answer to life, the universe and everything, and suddenly realizing that it's not 42.
Insanity is knowing that api-work is based on lucene 4.8.0 but yet it isn't. But it is. But it isn't.

I don't understand it, but I must be living in a parallel universe. Or perhaps you are. Or we both are. Maybe it's the matrix.

You refer to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131 as the code upon which the test is based, and I can't but acknowledge that it is so.

I do the following:
Go to https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java
On the page, click on the 'branch: master' button, click on 'tags', and choose 'release/lucene-solr/4.8.0'
The url switches to https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/test/org/apache/lucene/search/TestEarlyTermination.java 

"My" TestEarlyTermination.java is different from yours.

I give up. Stick a fork in me, I'm done.

Turn the lights off when you're leaving.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 6:58 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

Thanks for the explanation. However, you have linked to wrong versions of documents and code. If you look at the version 4.8.0 tag instead of master (which is now at version 6.4.2), you can see that the code was reproduced faithfully: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131

The documentation link you posted was from a previous version, but that comment didn't change in 4.8.0: https://lucene.apache.org/core/4_8_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

I tried what you said and it made the test pass reliably, but that isn't what the test in Java does. Also, I have been seeing random errors from other places, and in all cases the code was calling Util.PriorityQueue<T>.UpdateBottom() and then afterward got a null result from the heap. That said, it is still too early to tell if the entirety of that issue was due to the IndexSearcher issue that was just fixed. If I see another occurrence of that issue, I will send the details your way.

Anyway, it is tempting to just move the instantiation of the collectors inside the loop and be done with it, or, as I had previously discovered, to make the test pass by putting try-catch statements around the offending lines in TopDocsCollector and eating the null reference exception. But that isn't what is happening in Lucene, which indicates there is a bug. Sure, it is bad that in Java they didn't program defensively and check for a null before accessing its members (bad Java programmer!), and maybe these tests were not written according to the documentation, but the fact of the matter is Lucene is not supposed to throw this exception in this case.

You have given me more to consider, though. I previously theorized that it was the part of the code that is populating the queue that was failing, but it may be that there is a failure to reset the "full" state somewhere during the next search, which is in turn causing this failure to happen. It is strange that it only happens in the "different sorter" case, though, because the tests are nearly identical in every other sense.


> I'm getting something different from TestSearcherManager_Mem --> cannot close: there are still open files: _1.cfs ,_0.cfs

You know what they say, the definition of insanity is doing the same thing twice and expecting a different result. So, I guess that makes me insane because that is exactly what these randomized tests do.

I saw this issue before, and it boils down to not appropriately disposing something, or not disposing stuff in the right order. It especially happens in the test environment when an assert fails and the Dispose() method doesn't get called as a result, and then the next run or two attempts to recover. Sometimes changing the test to use using blocks works, but other times it screws up the test. Of course you would always use using blocks in production code, but you have to take into consideration that these tests were written by the insane.

Even if you don't fix the test with using blocks, running the test again usually bypasses this message and gets you to another one.


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:48 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

"King of the one-liners". Hmmm... maybe I'll order a T-shirt with that description.

Regarding TestEarlyTerminationDifferentSorter, I think the error is in the test, not in the code being tested.
Namely, the following:

                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);

What happens is the following: At most numHits results are collected in the collector. This is done with a priority queue of at most numHits size. That priority queue keeps track of the numHits (or fewer) "best matched" documents. If there are more than numHits total hits, and there's a better matching document than the lowest matching document in the priority queue, that lowest matching document in the queue is ousted and the better matching document is put in the queue instead (at its proper priority).
The fact that the queue is full is tracked with the TopFieldCollector.queueFull Boolean. Once it's true, the queue is full and the above process kicks in.
Now, the call to GetTopDocs() returns the at most numHits top documents. This is done by getting them from the queue in TopDocsCollector (TopFieldCollector's base class):

        protected virtual void PopulateResults(ScoreDoc[] results, int howMany)
        {
            for (int i = howMany - 1; i >= 0; i--)
            {
                results[i] = m_pq.Pop();
            }
        }

If the priority queue contains howMany or fewer documents, after that operation the queue is empty.
However, all the other state variables in the TopFieldCollector collector (totalHits, queueFull, numHits, docBase) are left as-is. So, even though the queue is empty, queueFull is true from the previous run. The system will try to replace a nonexistent element, causing the null reference.
The moral of the story is "after calling collector.GetTopDocs, the collector object cannot be reused".

In fact, this is confirmed by the documentation: https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

"NOTE: you cannot call this method more than once for each search execution. If you need to call it more than once, passing each time a different start, you should call topDocs() and work with the returned TopDocs object, which will contain all the results this search execution collected."

What happens in the test is wrong: the collector should be re-created at every iteration.
Interestingly, if you look at the java source https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTerminatingSortingCollector.java you will see that the testEarlyTermination does exactly that, and that testEarlyTerminationDifferentSorter is quite different.

So I think this should be rewritten as:

        public void TestEarlyTerminationDifferentSorter()
        {
            // test that the collector works correctly when the index was sorted by a
            // different sorter than the one specified in the ctor.
            CreateRandomIndexes(5);

            IndexSearcher searcher = NewSearcher(reader);
            int iters = AtLeast(5);
            for (int i = 0; i < iters; ++i)
            {
				int numHits = TestUtil.NextInt(Random(), 1, numDocs / 10);
				Sort sort = new Sort(new SortField("ndv2", SortFieldType.INT64, false));
				bool fillFields = Random().nextBoolean();
				bool trackDocScores = Random().nextBoolean();
				bool trackMaxScore = Random().nextBoolean();
				bool inOrder = Random().nextBoolean();
				TopFieldCollector collector1 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
				TopFieldCollector collector2 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
				TermQuery query = new TermQuery(new Term("s", RandomInts.RandomFrom(Random(), terms)));
                searcher.Search(query, collector1);
                Sort different = new Sort(new SortField("ndv2", SortFieldType.INT64));
                searcher.Search(query, new EarlyTerminatingSortingCollectorHelper(collector2, different, numHits));


                assertTrue(collector1.TotalHits >= collector2.TotalHits);
                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);
            }
        }

... and the same for the other test.

I'm getting something different from TestSearcherManager_Mem: 

Test Name:	TestSearcherManager_Mem
Test Outcome:	Failed
Result Message:	
System.Exception : MockDirectoryWrapper: cannot close: there are still open files: _1.cfs ,_0.cfs
  ----> System.Exception : unclosed IndexSlice: _5.cfs

... but perhaps Github is playing tricks on me again.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 2:00 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many failing tests. I am not sure how I missed that one - I think I was staring at the screen for too long. I moved the return true statement inside the try block, but I suppose it doesn't really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~ 85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter() and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T> somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread class and fixed some minor differences from Lucene, but none of them affected the result. I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager, fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made, and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274. There are several things that depend on Automaton to work, and since none of them have failing tests, I am not all that concerned about this "difference". But it would be nice to find an answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even when an element was available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
		return true; // *********** here  *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping it in for the WeakDictionary to see if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492.

This is one thing that made it tricky to work with, because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java).

I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the LimitedConcurrencyLevelTaskScheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing the result without waiting for the task to end (as is done now) will yield an incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you would essentially be fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
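[Editorial note: one sketch of that exercise, assuming the helper can get at its pending tasks as a collection. The names below are illustrative, not the actual ExecutorService<T> API; this is a hedged illustration of the completion-order idea, not the Lucene.NET implementation.]

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class CompletionOrder
{
    // Yields task results in the order the tasks finish, not the order
    // they were submitted. Blocks only while no pending task has completed.
    public static IEnumerable<T> InCompletionOrder<T>(IEnumerable<Task<T>> tasks)
    {
        var pending = new List<Task<T>>(tasks);
        while (pending.Count > 0)
        {
            // Task.WhenAny returns a task that completes when any pending task does.
            Task<T> done = Task.WhenAny(pending).GetAwaiter().GetResult();
            pending.Remove(done);
            // GetAwaiter().GetResult() rethrows the original exception
            // rather than wrapping it in an AggregateException.
            yield return done.GetAwaiter().GetResult();
        }
    }
}
```

Each Task.WhenAny call scans the pending list, so this is O(n^2) wake-ups overall; that is acceptable for the handful of leaf-slice tasks an IndexSearcher spawns, but it is still a long way from rewriting the executor/helper pair properly.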

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass, change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator (or a call to an enumerator) that is not thread-safe; see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap, which fails for me as well (in 100% of cases). Maybe I have something to contribute; I have three observations.

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second plus whatever time the first thread needed to end), and so on. Given the amount of work each test thread does, this is far too little time even on a fast machine; it takes 13 seconds for all threads to end here.
The corresponding Java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the Java behavior.

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
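[Editorial note: the guarantee being relied on here is easy to demonstrate in isolation. This standalone snippet (not part of the Lucene.NET code) shows a ConcurrentDictionary<K,V> tolerating mutation mid-enumeration, while the same pattern on a plain Dictionary<K,V> throws.]

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class EnumerationDemo
{
    public static void Main()
    {
        var concurrent = new ConcurrentDictionary<int, int>();
        for (int i = 0; i < 10; i++) concurrent[i] = i;

        // Mutating a ConcurrentDictionary mid-enumeration is allowed by design:
        // its enumerator does not take a snapshot and does not get invalidated.
        foreach (var pair in concurrent)
        {
            concurrent.TryAdd(pair.Key + 100, pair.Value);
        }

        var plain = new Dictionary<int, int> { { 1, 1 } };
        try
        {
            foreach (var pair in plain)
            {
                plain[2] = 2; // adding a new key invalidates the enumerator
            }
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("Dictionary enumerator invalidated, as expected");
        }
    }
}
```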

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient, for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>); if we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.
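[Editorial note: such a single-cast Equals might look like the following sketch. The class shape and the NULL sentinel are illustrative, patterned on the WeakIdentityMap port rather than copied from it.]

```csharp
using System;
using System.Runtime.CompilerServices;

internal sealed class IdentityWeakReference : WeakReference
{
    // Sentinel standing in for a null key, as in the WeakIdentityMap port.
    internal static readonly object NULL = new object();

    private readonly int hash;

    public IdentityWeakReference(object target)
        : base(target ?? NULL)
    {
        // Identity hash: stable for the object's lifetime, ignores Equals overrides.
        hash = RuntimeHelpers.GetHashCode(target ?? NULL);
    }

    public override int GetHashCode()
    {
        return hash;
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(this, obj))
        {
            return true;
        }
        // One cast via "as" instead of an "is" test followed by a second cast.
        var other = obj as IdentityWeakReference;
        return other != null && ReferenceEquals(Target, other.Target);
    }
}
```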


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that a future implementation of an enumerator (especially one over a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To make the failure deterministic, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer object; it doesn't matter which one. It's meaningless, but it has the side effect of initializing the static fields without problems.
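[Editorial note: although the correction was not posted, a guarded check-then-clean along the following lines would close this particular race. This is only a sketch (class and field names illustrative, Clean() body elided), not the actual Lucene.NET fix; the method is public here only so the sketch is easy to exercise.]

```csharp
using System;
using System.Threading;

public class WeakDictionarySketch
{
    private int _gcCollections;
    private readonly object _cleanLock = new object();

    public void CleanIfNeeded()
    {
        int currentColCount = GC.CollectionCount(0);
        // Cheap volatile read keeps the common no-op path lock-free.
        if (currentColCount > Volatile.Read(ref _gcCollections))
        {
            lock (_cleanLock)
            {
                // Re-check under the lock: another thread may have cleaned already.
                if (currentColCount > _gcCollections)
                {
                    Clean();
                    _gcCollections = currentColCount;
                }
            }
        }
    }

    private void Clean()
    {
        // Reap dead entries here. Calls are serialized by _cleanLock above,
        // so the enumeration can no longer race with a concurrent Clean().
    }
}
```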


Vincent


RE: Bug in Lucene static initialization with multiple threads.

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Vincent,

Thanks for the explanation. However, you have linked to the wrong versions of the documents and code. If you look at the version 4.8.0 tag instead of master (which is now at version 6.4.2), you can see that the code was reproduced faithfully: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/misc/src/test/org/apache/lucene/index/sorter/TestEarlyTermination.java#L131

The documentation link you posted was from a previous version, but that comment didn't change in 4.8.0: https://lucene.apache.org/core/4_8_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

I tried what you said and it made the test pass reliably, but that isn't what the test in Java does. Also, I have been seeing random errors from other places and in all cases the code was calling Util.PriorityQueue<T>.UpdateBottom() and then afterward gets a null result from the heap. That said, it is still too early to tell if the entirety of that issue was due to the IndexSearcher issue that was just fixed. If I see another occurrence of that issue, I will send the details your way.

Anyway, it is tempting to just move the instantiation of the collectors inside the loop and be done with it; or, as I had previously discovered, the test will pass if you just put try-catch statements around the offending lines in TopDocsCollector and eat the null reference exception. But that isn't what happens in Lucene, which indicates there is a bug. Sure, it is bad that in Java they didn't program defensively and check for null before accessing its members (bad Java programmer!), and maybe these tests were not written according to the documentation, but the fact of the matter is that Lucene is not supposed to throw this exception in this case.

You have given me more to consider, though. I previously theorized that it was the part of the code that populates the queue that was failing, but it may be that there is a failure to reset the "full" state somewhere during the next search, which is in turn causing this failure. It is strange that it only happens in the "different sorter" case, though, because the tests are nearly identical in every other sense.


> I'm getting something different from TestSearcherManager_Mem --> cannot close: there are still open files: _1.cfs ,_0.cfs

You know what they say, the definition of insanity is doing the same thing twice and expecting a different result. So, I guess that makes me insane because that is exactly what these randomized tests do.

I saw this issue before, and it boils down to not appropriately disposing something, or not disposing things in the right order. It especially happens in the test environment when an assert fails and the Dispose() method doesn't get called as a result, and then the next run or two attempt to recover. Sometimes changing the test to use using blocks works, but other times it screws up the test. Of course you would always use using blocks in production code, but you have to take into consideration that these tests were written by the insane.

Even if you don't fix the test with using blocks, running the test again usually bypasses this message and gets you to another one.


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:48 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

"King of the one-liners". Hmmm... maybe I'll order a T-shirt with that description.

Regarding TestEarlyTerminationDifferentSorter, I think the error is in the test, not in the code being tested.
Namely, the following:

                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);

What happens is the following: at most numHits results are collected in the collector. This is done with a priority queue of at most numHits size. That priority queue keeps track of the numHits (or fewer) "best matched" documents. If there are more than numHits total hits, and there's a better matching document than the lowest matching document in the priority queue, that lowest matching document is ousted and the better matching document is put in the queue instead (at its proper priority).
The fact that the queue is full is tracked with the TopFieldCollector.queueFull Boolean. Once it's true, the queue is full and the above process kicks in.
Now, the call to GetTopDocs() returns the at most numHits top documents. This is done by getting them from the queue in TopDocsCollector (TopFieldCollector's base class):

        protected virtual void PopulateResults(ScoreDoc[] results, int howMany)
        {
            for (int i = howMany - 1; i >= 0; i--)
            {
                results[i] = m_pq.Pop();
            }
        }

If the priority queue contains howMany or fewer documents, the queue is empty after that operation.
However, all the other state variables in the TopFieldCollector (totalHits, queueFull, numHits, docBase) are left as-is. So, even though the queue is empty, queueFull is still true from the previous run. The system will try to replace a nonexistent element, causing the null reference.
The moral of the story is "after calling collector.GetTopDocs, the collector object cannot be reused".

In fact, this is confirmed by the documentation: https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

"NOTE: you cannot call this method more than once for each search execution. If you need to call it more than once, passing each time a different start, you should call topDocs() and work with the returned TopDocs object, which will contain all the results this search execution collected."

What happens in the test is wrong: the collector should be re-created at every iteration.
Interestingly, if you look at the Java source https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTerminatingSortingCollector.java you will see that testEarlyTermination does exactly that, and that testEarlyTerminationDifferentSorter is quite different.

So I think this should be rewritten as:

        public void TestEarlyTerminationDifferentSorter()
        {
            // test that the collector works correctly when the index was sorted by a
            // different sorter than the one specified in the ctor.
            CreateRandomIndexes(5);

            IndexSearcher searcher = NewSearcher(reader);
            int iters = AtLeast(5);
            for (int i = 0; i < iters; ++i)
            {
				int numHits = TestUtil.NextInt(Random(), 1, numDocs / 10);
				Sort sort = new Sort(new SortField("ndv2", SortFieldType.INT64, false));
				bool fillFields = Random().nextBoolean();
				bool trackDocScores = Random().nextBoolean();
				bool trackMaxScore = Random().nextBoolean();
				bool inOrder = Random().nextBoolean();
				TopFieldCollector collector1 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
				TopFieldCollector collector2 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
				TermQuery query = new TermQuery(new Term("s", RandomInts.RandomFrom(Random(), terms)));
                searcher.Search(query, collector1);
                Sort different = new Sort(new SortField("ndv2", SortFieldType.INT64));
                searcher.Search(query, new EarlyTerminatingSortingCollectorHelper(collector2, different, numHits));


                assertTrue(collector1.TotalHits >= collector2.TotalHits);
                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);
            }
        }

... and the same for the other test.

I'm getting something different from TestSearcherManager_Mem: 

Test Name:	TestSearcherManager_Mem
Test Outcome:	Failed
Result Message:	
System.Exception : MockDirectoryWrapper: cannot close: there are still open files: _1.cfs ,_0.cfs
  ----> System.Exception : unclosed IndexSlice: _5.cfs

... but perhaps GitHub is playing tricks on me again.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 2:00 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many failing tests. I am not sure how I missed that one - I think I was staring at the screen for too long. I moved the return true statement inside the try block, but I suppose it doesn't really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~ 85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter() and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T> somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread class and fixed some minor differences from Lucene, but none of them affected the result. I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager, fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made, and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274. There are several things that depend on Automaton to work, and since none of them have failing tests, I am not all that concerned about this "difference". But it would be nice to find an answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even when an element was available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                    return true; // *********** here  *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping for the WeakDictionary to see if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492. This is one thing that made it tricky to work with because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java). I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the LimitedConcurrencyLevelTaskScheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing Result without waiting for the task to end (as is done now) will yield an incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you will be essentially fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
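(For illustration only: one .NET-idiomatic way to consume results in completion order rather than submission order is Task.WhenAny. The sketch below is not Lucene.NET code; CompletionOrderDemo and InCompletionOrder are invented names.)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class CompletionOrderDemo
{
    // Yields each task's result as soon as that task finishes,
    // regardless of the order in which the tasks were started.
    public static IEnumerable<T> InCompletionOrder<T>(IEnumerable<Task<T>> tasks)
    {
        var pending = new List<Task<T>>(tasks);
        while (pending.Count > 0)
        {
            Task<T> done = Task.WhenAny(pending).Result; // block until any one task completes
            pending.Remove(done);
            yield return done.Result;                    // safe: 'done' has already completed
        }
    }

    public static void Main()
    {
        var tasks = new List<Task<int>>
        {
            Task.Run(() => { Task.Delay(200).Wait(); return 1; }),
            Task.Run(() => { Task.Delay(50).Wait();  return 2; }),
            Task.Run(() => { Task.Delay(100).Wait(); return 3; }),
        };
        // Results arrive in completion order, not submission order.
        Console.WriteLine(string.Join(",", InCompletionOrder(tasks)));
    }
}
```

Note this still blocks the enumerating thread between results; it only removes the head-of-line blocking on the first-submitted task.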

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass, change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator/call to enumerator that is not thread-safe see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the java behavior.
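(A minimal sketch of the suggested pattern - spawn the workers, then Join() each one without a timeout so the caller waits until every worker has actually finished. JoinDemo and RunWorkers are illustrative names, not test-framework code.)

```csharp
using System;
using System.Threading;

public static class JoinDemo
{
    // Starts 'count' workers and waits for ALL of them with an untimed Join().
    public static int RunWorkers(int count)
    {
        int completed = 0;
        var workers = new Thread[count];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(() =>
            {
                Thread.Sleep(100);                      // simulate real work
                Interlocked.Increment(ref completed);
            });
            workers[i].Start();
        }
        foreach (var w in workers)
            w.Join();                                   // no timeout: blocks until the thread ends
        return completed;
    }

    public static void Main()
    {
        Console.WriteLine(RunWorkers(8));               // prints 8: every worker finished
    }
}
```

With Join(1000L) instead, any worker still running after the timeout is simply abandoned and the assertions that follow run against incomplete state.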

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
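(To illustrate the difference: the ConcurrentDictionary<K,V> enumerator is documented to tolerate concurrent modification, while mutating a plain Dictionary<K,V> during enumeration invalidates the enumerator. A small standalone sketch, not Lucene.NET code:)

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class EnumeratorSafetyDemo
{
    // Returns true: mutating while enumerating is allowed by design.
    public static bool MutateConcurrentDictionaryWhileEnumerating()
    {
        var cd = new ConcurrentDictionary<int, int>();
        for (int i = 0; i < 10; i++) cd[i] = i;
        foreach (var kv in cd)
            cd.TryRemove(kv.Key, out _);  // no exception; the enumerator handles concurrent changes
        return true;
    }

    // Returns false: adding a key during enumeration throws InvalidOperationException.
    public static bool MutateDictionaryWhileEnumerating()
    {
        var d = new Dictionary<int, int> { { 1, 1 }, { 2, 2 } };
        try
        {
            foreach (var kv in d)
                d[kv.Key + 100] = 0;      // "Collection was modified; enumeration may not execute"
            return true;
        }
        catch (InvalidOperationException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(MutateConcurrentDictionaryWhileEnumerating()); // True
        Console.WriteLine(MutateDictionaryWhileEnumerating());           // False
    }
}
```

This is exactly the "Collection was modified" failure mode from the original stack trace, just reduced to a minimal reproduction.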

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): if we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that future implementations of an enumerator (especially one on a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.
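(A small illustration of the point: foreach expands to a try/finally that calls Dispose(), which a bare GetEnumerator()/MoveNext() loop skips unless wrapped in a using block. TrackingEnumerable below is an invented toy that counts disposals, not Lucene.NET code.)

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// A toy enumerable whose enumerator records that it was disposed.
public class TrackingEnumerable : IEnumerable<int>
{
    public int Disposals;

    public IEnumerator<int> GetEnumerator() => new Enumerator(this);
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    private class Enumerator : IEnumerator<int>
    {
        private readonly TrackingEnumerable owner;
        private int i;
        public Enumerator(TrackingEnumerable owner) { this.owner = owner; }
        public int Current => i;
        object IEnumerator.Current => Current;
        public bool MoveNext() => ++i <= 3;
        public void Reset() => i = 0;
        public void Dispose() => owner.Disposals++;   // cleanup hook exercised (or not) by the caller
    }
}

public static class DisposeDemo
{
    public static void Main()
    {
        var source = new TrackingEnumerable();

        foreach (var x in source) { }                 // foreach disposes automatically
        Console.WriteLine(source.Disposals);          // prints 1

        using (var iter = source.GetEnumerator())     // manual loop: using restores disposal
        {
            while (iter.MoveNext()) { }
        }
        Console.WriteLine(source.Disposals);          // prints 2
    }
}
```

A manual loop without the using block would leave Disposals unchanged, which is exactly the leak the rewritten tests avoid.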

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To avoid the intermittence, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.
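(For what it's worth, the usual pattern for making this kind of lazy cleanup race-free is a double-checked test guarded by a lock, so two threads can never run Clean() concurrently. The sketch below shows the general shape only - GuardedCleaner and its members are illustrative names, not the actual Lucene.NET fix.)

```csharp
using System;
using System.Threading;

// Sketch: lock-protected, double-checked cleanup trigger.
public class GuardedCleaner
{
    private readonly object cleanLock = new object();
    private int _gcCollections = -1;
    public int CleanRuns;                               // how many times Clean actually executed

    public void CleanIfNeeded()
    {
        int currentColCount = GC.CollectionCount(0);
        if (currentColCount > _gcCollections)           // cheap unsynchronized check
        {
            lock (cleanLock)                            // only one thread may clean at a time
            {
                if (currentColCount > _gcCollections)   // re-check under the lock
                {
                    Clean();
                    _gcCollections = currentColCount;
                }
            }
        }
    }

    private void Clean()
    {
        CleanRuns++;                                    // real code would purge dead weak entries here
    }

    public static void Main()
    {
        var c = new GuardedCleaner();
        var threads = new Thread[8];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() => { for (int n = 0; n < 1000; n++) c.CleanIfNeeded(); });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        Console.WriteLine(c.CleanRuns >= 1);            // True: Clean ran, but never concurrently
    }
}
```

Without the lock, all eight threads can pass the outer check at once and enumerate the backing store simultaneously, which is the scenario in the thread list above.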

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer object, it doesn't matter which one. It's meaningless, but it has the side effect of initializing the static fields without problems.


Vincent

RE: Bug in Lucene static initialization with multiple threads.

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
"King of the one-liners". Hmmm... maybe I'll order a T-shirt with that description.

Regarding TestEarlyTerminationDifferentSorter, I think the error is in the test, not in the code being tested.
Namely, the following:

                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);

What happens is the following: At most numHits results are collected in the collector. This is done with a priority queue of at most numHits size. That priority queue keeps track of the numHits (or fewer) "best matched" documents. If there are more than numHits total hits, and there's a better matching document than the lowest matching document in the priority queue, that lowest matching document in the queue is ousted and the better matching document is put in the queue instead (at its proper priority).
The fact that the queue is full is tracked with the TopFieldCollector.queueFull Boolean. Once it's true, the queue is full and the above process kicks in.
Now, the call to GetTopDocs() returns the at most numHits top documents. This is done by getting them from the queue in TopDocsCollector (TopFieldCollector's base class):

        protected virtual void PopulateResults(ScoreDoc[] results, int howMany)
        {
            for (int i = howMany - 1; i >= 0; i--)
            {
                results[i] = m_pq.Pop();
            }
        }

If the priority queue contains howMany or less documents, after that operation the queue is empty.
However, all the other state variables in the TopFieldCollector collector (totalHits, queueFull, numHits, docBase) are left as-is. So, even though the queue is empty, queueFull is true from the previous run. The system will try to replace a nonexistent element, causing the null reference.
The moral of the story is "after calling collector.GetTopDocs, the collector object cannot be reused".

In fact, this is confirmed by the documentation: https://lucene.apache.org/core/4_2_0/core/org/apache/lucene/search/TopDocsCollector.html#topDocs()

"NOTE: you cannot call this method more than once for each search execution. If you need to call it more than once, passing each time a different start, you should call topDocs() and work with the returned TopDocs object, which will contain all the results this search execution collected."

What happens in the test is wrong: the collector should be re-created at every iteration.
Interestingly, if you look at the java source https://github.com/apache/lucene-solr/blob/master/lucene/core/src/test/org/apache/lucene/search/TestEarlyTerminatingSortingCollector.java you will see that the testEarlyTermination does exactly that, and that testEarlyTerminationDifferentSorter is quite different.

So I think this should be rewritten as:

        public void TestEarlyTerminationDifferentSorter()
        {
            // test that the collector works correctly when the index was sorted by a
            // different sorter than the one specified in the ctor.
            CreateRandomIndexes(5);

            IndexSearcher searcher = NewSearcher(reader);
            int iters = AtLeast(5);
            for (int i = 0; i < iters; ++i)
            {
                int numHits = TestUtil.NextInt(Random(), 1, numDocs / 10);
                Sort sort = new Sort(new SortField("ndv2", SortFieldType.INT64, false));
                bool fillFields = Random().NextBoolean();
                bool trackDocScores = Random().NextBoolean();
                bool trackMaxScore = Random().NextBoolean();
                bool inOrder = Random().NextBoolean();
                TopFieldCollector collector1 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TopFieldCollector collector2 = Search.TopFieldCollector.Create(sort, numHits, fillFields, trackDocScores, trackMaxScore, inOrder);
                TermQuery query = new TermQuery(new Term("s", RandomInts.RandomFrom(Random(), terms)));
                searcher.Search(query, collector1);
                Sort different = new Sort(new SortField("ndv2", SortFieldType.INT64));
                searcher.Search(query, new EarlyTerminatingSortingCollectorHelper(collector2, different, numHits));

                assertTrue(collector1.TotalHits >= collector2.TotalHits);
                AssertTopDocsEquals(collector1.GetTopDocs().ScoreDocs, collector2.GetTopDocs().ScoreDocs);
            }
        }

... and the same for the other test.

I'm getting something different from TestSearcherManager_Mem: 

Test Name:	TestSearcherManager_Mem
Test Outcome:	Failed
Result Message:	
System.Exception : MockDirectoryWrapper: cannot close: there are still open files: _1.cfs ,_0.cfs
  ----> System.Exception : unclosed IndexSlice: _5.cfs

... but perhaps Github is playing tricks on me again.

Vincent



-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Wednesday, March 22, 2017 2:00 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many failing tests. I am not sure how I missed that one - I think I was staring at the screen for too long. I moved the return true statement inside the try block, but I suppose it doesn't really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter() and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T> somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread class and fixed some minor differences from Lucene, but none of them affected the result. I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager, fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made, and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274. There are several things that depend on Automaton to work, and since none of them have failing tests, I am not all that concerned about this "difference". But it would be nice to find an answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even when an element was available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                    return true; // *********** here *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping it in for the WeakDictionary to see if that eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492. This is one thing that made it tricky to work with because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java). I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the LimitedConcurrencyLevelTaskScheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing the result without waiting for the task to end (as is done now) will yield an incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you will essentially be fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
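For reference, the "as soon as they become available" idea can be sketched with Task.WhenAny. This is only a sketch under assumptions: the Task<T> element type and the InCompletionOrder name are placeholders, not the actual IndexSearcher types.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

static class CompletionOrder
{
    // Yield each task's result in the order the tasks actually complete,
    // rather than in the order they were submitted.
    public static IEnumerable<T> InCompletionOrder<T>(IList<Task<T>> tasks)
    {
        var pending = new List<Task<T>>(tasks);
        while (pending.Count > 0)
        {
            // Task.WhenAny resolves to the first pending task that finishes.
            Task<T> done = Task.WhenAny(pending).GetAwaiter().GetResult();
            pending.Remove(done);
            yield return done.GetAwaiter().GetResult();
        }
    }
}
```

Note that each Task.WhenAny call scans the whole pending list, so this is quadratic in the number of tasks; a production version would chain continuations instead.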

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass, change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator/call to an enumerator that is not thread-safe; see https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500 and https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590.

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the java behavior.
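A self-contained sketch of the suggested change follows; the worker bodies are dummies (Thread.Sleep stands in for the real test workload):

```csharp
using System.Linq;
using System.Threading;

var workers = Enumerable.Range(0, 8)
    .Select(i => new Thread(() => Thread.Sleep(50))) // dummy workload
    .ToArray();
foreach (var w in workers) w.Start();

// Join() with no timeout blocks until each thread finishes, matching the
// intent of Java's "while (!exec.awaitTermination(...));" loop.
foreach (var w in workers) w.Join();
```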

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
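A small demonstration of that ConcurrentDictionary guarantee, assuming nothing beyond the BCL:

```csharp
using System.Collections.Concurrent;

var map = new ConcurrentDictionary<int, int>();
for (int i = 0; i < 100; i++) map[i] = i;

// Unlike Dictionary<K,V>, mutating a ConcurrentDictionary while enumerating
// it does not throw InvalidOperationException; the enumerator is documented
// to be safe to use concurrently with reads and writes, and may or may not
// reflect modifications made after enumeration began.
foreach (var kv in map)
{
    map.TryRemove(kv.Key, out _);
}
```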

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient, for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): if we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.
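In code, the "as" pattern looks something like the following. This is illustrative only: the real IdentityWeakReference is internal to Lucene.Net, and the Target comparison shown here is an assumption about its identity semantics.

```csharp
using System;
using System.Runtime.CompilerServices;

internal sealed class IdentityWeakReference : WeakReference
{
    public IdentityWeakReference(object target) : base(target) { }

    public override bool Equals(object obj)
    {
        // 'as' returns null on a type mismatch, so one cast replaces
        // the 'is' type test followed by a second cast.
        var other = obj as IdentityWeakReference;
        return other != null && ReferenceEquals(other.Target, Target);
    }

    // Identity hash, independent of any GetHashCode override on the target.
    public override int GetHashCode() => RuntimeHelpers.GetHashCode(Target);
}
```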


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that a future implementation of an enumerator (especially one on a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To make the failure deterministic, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code always executes. You'll get the exception every time.

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer; it doesn't matter which one. The call itself is meaningless, but it has the side effect of initializing the static fields without problems.
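Concretely, the warm-up might look like this; StandardAnalyzer and LuceneVersion.LUCENE_48 are just convenient choices here (any Analyzer works, and the version enum name is assumed from the 4.8 api-work branch):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Util;

// Run once on a single thread before spawning workers: the GetTokenStream
// call forces the static attribute-factory initialization shown in the
// stack trace above.
var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
new TextField("dummy", "dummyvalue", Field.Store.NO).GetTokenStream(analyzer);
```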


Vincent


RE: Bug in Lucene static initialization with multiple threads.

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Vincent,

I am now going to have to call you "King of the one-liners", for such a quick fix to so many failing tests. I am not sure how I missed that one - I think I was staring at the screen for too long. I moved the return true statement inside the try block, but I suppose it doesn't really make any difference in this case.

So, here is what is left that is failing:

Lucene.Net.Tests.Util.Automaton.TestAgainstBrzozowski() (100% of the time)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen() (~100% of the time)
Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter() (~75% of the time - down from ~90% of the time after the recent fix)
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem() (~66% of the time - down from ~95% of the time after the recent fix)
Lucene.Net.Tests.Search.TestSearcherManager.TestSearcherManager_Mem() (~40% of the time - down from ~ 85% of the time after the recent fix)
Lucene.Net.Tests.Analysis.Common.Analysis.In.TestIndicNormalizer.TestBasic() (only seems to happen during a full run)

You read that correctly - there are now only 2 tests failing all of the time. There does seem to be light at the end of the tunnel!

I suspect whatever is causing the CRT tests to fail is also what is causing TestEarlyTerminationDifferentSorter() and TestSearcherManager_Mem() to fail. Most of these failures seem to be centered around IndexSearcher/Util.PriorityQueue<T> somehow.

I have already reviewed the ThreadedIndexingAndSearchingTestCase as well as the TestControlledRealTimeReopenThread class and fixed some minor differences from Lucene, but none of them affected the result. I also reviewed ControlledRealTimeReopenThread, ReferenceManager<T> and SearcherManager, fixed some bugs, but still no change.

As for TestAgainstBrzozowski(), here are my debug notes as well as a non-random test I made, and the result of the test in Java from my last attempt: https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Util/Automaton/TestMinimize.cs#L64-L274. There are several things that depend on Automaton to work, and since none of them have failing tests, I am not all that concerned about this "difference". But it would be nice to find an answer once all of the real problems are resolved.

Anyway, I am going to keep poking at it to see if I can spot the cause(s), and would appreciate any (more) help you can provide.


Thanks again,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 3:46 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello again Shad,

I found another bug in ExecutionHelper<T>.MoveNext(): the method was always returning "false", even if an element is available.
The correction is to add "return true;" where indicated:


            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        var awaitable = service.Take();
                        awaitable.Wait();
                        current = awaitable.Result;

                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException e)
                    {
                        throw new System.Threading.ThreadInterruptedException(e.ToString(), e);
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
		return true; // *********** here  *******************
                }

                return false;
            }

The TestDuringAddIndexes test passes now. No files, just the one line.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 7:41 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping for the WeakDictionary to see if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492. This is one thing that made it tricky to work with because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java). I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the limited concurrency level task scheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing the result without waiting for the task to end (as is done now) will yield an incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return their values as soon as they become available. This is EXTREMELY hard to do, since you will be essentially fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.
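For the curious, a rough sketch of that "as they complete" idea using Task.WhenAny might look like the following. This is not Lucene.NET code; ConsumeAsCompleted and pending are hypothetical names used for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical sketch: drain a set of pending tasks in completion order.
// Task.WhenAny returns the first task to finish; its Result is then
// available without blocking on the remaining tasks.
static IEnumerable<T> ConsumeAsCompleted<T>(List<Task<T>> pending)
{
    while (pending.Count > 0)
    {
        Task<T> done = Task.WhenAny(pending).Result;
        pending.Remove(done);
        yield return done.Result;
    }
}

var tasks = new List<Task<int>>
{
    Task.Run(() => 1),
    Task.Run(() => 2),
    Task.Run(() => 3),
};
int total = ConsumeAsCompleted(tasks).Sum(); // 6, regardless of finish order
```

Note that results arrive in completion order, not submission order, which is part of what makes wiring this into the existing executor service/execution helper pair so tricky.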

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass; change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator/call to enumerator that is not thread-safe; see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).
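As a reminder of the mechanics involved, here is a generic TPL sketch (plain BCL code, not the test framework's): work started through a TaskFactory runs on whatever TaskScheduler the factory is bound to, which is roughly how a non-null scheduler changes where the searcher's work executes. TaskScheduler.Default stands in for the LimitedConcurrencyLevelTaskScheduler instance:

```csharp
using System.Threading.Tasks;

// Generic TPL sketch (not Lucene.NET code): work scheduled through a
// TaskFactory runs on the TaskScheduler the factory is bound to.
TaskScheduler scheduler = TaskScheduler.Default; // placeholder scheduler
var factory = new TaskFactory(scheduler);

Task<int> slice = factory.StartNew(() => 40 + 2);
int result = slice.Result; // blocks until the scheduled work completes
```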

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap, which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap, 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding Java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the Java behavior.
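A minimal stand-alone illustration of the suggestion (dummy worker bodies, not the actual test code):

```csharp
using System.Collections.Generic;
using System.Threading;

// Dummy workers standing in for the test threads: Join() with no timeout
// waits however long each worker actually needs, which is the effect of
// Java's awaitTermination loop.
int counter = 0;
var workers = new List<Thread>();
for (int i = 0; i < 8; i++)
{
    var t = new Thread(() => Interlocked.Increment(ref counter));
    t.Start();
    workers.Add(t);
}
foreach (var w in workers)
{
    w.Join(); // no timeout
}
// all 8 workers are guaranteed to have finished here
```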

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
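A quick self-contained check of that claim (plain BCL code, unrelated to Lucene):

```csharp
using System.Collections.Concurrent;

// ConcurrentDictionary's enumerator tolerates concurrent modification:
// removing entries mid-enumeration does not throw the
// "Collection was modified; enumeration operation may not execute"
// InvalidOperationException that Dictionary<K,V> would.
var map = new ConcurrentDictionary<int, string>();
for (int i = 0; i < 10; i++) map[i] = "value" + i;

int seen = 0;
foreach (var kv in map)
{
    map.TryRemove((kv.Key + 5) % 10, out _); // mutate while enumerating
    seen++;
}
// no exception; 'seen' reflects a moment-in-time view (at most 10 entries)
```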

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient, for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): if we had the actual object, no enumerator object would be created on the heap.
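To make the boxing point concrete (a BCL illustration, not WeakIdentityMap code):

```csharp
using System.Collections.Generic;

// Dictionary<K,V>.GetEnumerator() returns a struct enumerator; calling the
// same method through the IDictionary<K,V> interface boxes that struct,
// which is the per-call heap allocation mentioned above.
var concrete = new Dictionary<int, int> { [1] = 10 };
IDictionary<int, int> viaInterface = concrete;

Dictionary<int, int>.Enumerator structEnum = concrete.GetEnumerator();    // no allocation
IEnumerator<KeyValuePair<int, int>> boxed = viaInterface.GetEnumerator(); // heap-allocated box
bool isStruct = structEnum.GetType().IsValueType;
```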

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that future implementations of an enumerator (especially one on a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To remove the intermittence, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.
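The underlying enumerator behavior is easy to demonstrate in isolation (BCL-only code, unrelated to Lucene):

```csharp
using System;
using System.Collections.Generic;

// A plain Dictionary enumerator is invalidated when the dictionary grows:
// this is the same "Collection was modified; enumeration operation may not
// execute" error that the concurrent Clean() calls produce.
var map = new Dictionary<int, string> { [1] = "a", [2] = "b" };
bool threw = false;
try
{
    foreach (var kv in map)
    {
        map[map.Count + 1] = "x"; // adding while enumerating invalidates the enumerator
    }
}
catch (InvalidOperationException)
{
    threw = true;
}
// threw is now true
```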

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer; it doesn't matter which one. It's meaningless, but it has the side effect of initializing the static fields without problems.


Vincent


RE: Bug in Lucene static initialization with multiple threads.

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Thanks for this Vincent. It's a big help.

Unfortunately, these three lines didn't quite do the trick, but they did change the error message of TestDuringAddIndexes() to a null reference exception from the Util.PriorityQueue (which I have also been seeing randomly, most notably from Lucene.Net.Tests.Misc.Index.Sorter.TestEarlyTermination.TestEarlyTerminationDifferentSorter()).

Do note a much simpler test to work with is the Lucene.Net.Tests.Memory.Index.Memory.MemoryIndexTest.TestRandomQueries(), which (both before and after adding these 3 lines) is returning a 0 result when it should be 1 on the search for the term "Apache" (the first test) from IndexSearcher + TaskScheduler.

Since all of this is self-contained within the IndexSearcher, changing the design to be better is certainly not out of the question. I mean, there would be no public API change if we had to eliminate any or all of the private classes (including the ExecutorService<T>) in favor of a more .NET-friendly implementation. As long as the external behavior is the same, we don't much care how it is implemented internally. Would you be interested in providing a working solution?

The WeakIdentityTests are now passing. So, I will try swapping for the WeakDictionary to see if it eliminates your static initialization issue.

That said, there is one minor thing about the WeakIdentityMap that was overlooked. The enumerator was changed in Lucene 4.x to also "reap on read": https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java#L205, https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/CHANGES.txt#L2489-L2492. This is one thing that made it tricky to work with because the original implementation only allows a single read() and then the object is no longer available. This would be extremely unorthodox for a property (i.e. Current). It is also supposed to exit out of the loop if the last entry is unavailable to the current thread in the case where 2 of them call MoveNext() and one beats the other to read Current and reap in a race condition (it does this by throwing an exception in Java). I will attempt to get it working this way, but at least now we can settle for just passing instead of ideal behavior if we can't get it to work exactly this way.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 11:59 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the limited concurrencylevel task scheduler. The problem is the ExecutionHelper<T>'s method of obtaining the results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing result without waiting for the task to end (as is done now) will yield incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await al tasks in parallel (or at least as many as possible) and return their value as soon as they become available. This is EXTREMELY hard to do, since you will be essentially fighting a battle between your executor service and its execution helper, so I leave that as an exercise for the reader.

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass, change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator/call to enumerator that is not thread-safe see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call  Task.WaitAll. Since we're dealing with "real" threads here,  I would suggest just call w.Join() and be done with it.
This would align the test with the java behavior.

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).

2) calling Reap() create objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocation can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, but this is slightly more efficient for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): it we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.
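For illustration, here is a minimal sketch of the pattern point 3 refers to. This is not the actual Lucene.NET IdentityWeakReference (which is nested inside WeakIdentityMap); the member names below are assumptions:

```csharp
using System;
using System.Runtime.CompilerServices;

class IdentityWeakReference
{
    private readonly WeakReference reference;
    private readonly int hash;

    public IdentityWeakReference(object target)
    {
        reference = new WeakReference(target);
        // identity hash of the target, stable even after the target is collected
        hash = RuntimeHelpers.GetHashCode(target);
    }

    public object Target => reference.Target;

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(this, obj))
            return true;
        // a single "as" cast: it yields null for non-matching types, so no
        // separate "is" test followed by a second cast is needed
        var other = obj as IdentityWeakReference;
        return other != null && ReferenceEquals(Target, other.Target);
    }

    public override int GetHashCode() => hash;
}
```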


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is to call IEnumerator<T>.Dispose() after you're finished with the instance. Note that Lucene itself uses foreach, which does this correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that a future implementation of an enumerator (especially one over a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get:

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note that there are currently a few other concurrency tests failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

Getting to the bottom of these issues will probably fix most of the problems you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

Clean() replaces the collection being enumerated; if one thread performs a Clean() operation in the middle of another thread's Clean() on the same collection, you get the exception. Always.
To avoid the intermittence, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(someAnalyzer);

Replace "someAnalyzer" with an instance of an Analyzer; it doesn't matter which one. The call itself is meaningless, but it has the side effect of initializing the static fields without problems.


Vincent



RE: Bug in Lucene static initialization with multiple threads.

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
Hello Shad,

Regarding TestDuringAddIndexes, you found 99.9% of the problem. Allow me to contribute to the remaining 0.1%.
There is indeed nothing wrong with the LimitedConcurrencyLevelTaskScheduler. The problem is the way ExecutionHelper<T> obtains its results.
I see this:

            public bool MoveNext()
            {
                if (numTasks > 0)
                {
                    try
                    {
                        current = service.Take().Result;
                    }
#if !NETSTANDARD
                    catch (System.Threading.ThreadInterruptedException)
                    {
                        throw;
                    }
#endif
                    catch (Exception e)
                    {
                        // LUCENENET NOTE: We need to re-throw this as Exception to 
                        // ensure it is not caught in the wrong place
                        throw new Exception(e.ToString(), e);
                    }
                    finally
                    {
                        --numTasks;
                    }
                }

                return false;
            }

The call to service.Take() returns an awaitable task, which is nothing but a promise that the result will be delivered at some future time.
Alas, the enumerator is strictly synchronous and expects the result to be available at each iteration. 
Accessing Result without waiting for the task to end (as is done now) will yield an incorrect result.
The shortest solution is to replace:

	current = service.Take().Result;

By:

	var awaitable = service.Take();
	awaitable.Wait();
	current = awaitable.Result;

The best solution would be to await all tasks in parallel (or at least as many as possible) and return each value as soon as it becomes available. This is EXTREMELY hard to do here, since you would essentially be fighting a battle between the executor service and its execution helper, so I leave that as an exercise for the reader.
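For the curious, the general shape of such a solution, sketched in plain TPL with illustrative names and not tied to the executor service in question, is a Task.WhenAny consumption loop that yields results in completion order:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class AsCompleted
{
    // Collect each task's result in completion order rather than submission order.
    public static async Task<List<T>> ConsumeAsync<T>(IEnumerable<Task<T>> tasks)
    {
        var pending = new List<Task<T>>(tasks);
        var results = new List<T>();
        while (pending.Count > 0)
        {
            // wait for whichever task finishes first
            Task<T> finished = await Task.WhenAny(pending).ConfigureAwait(false);
            pending.Remove(finished);
            results.Add(finished.Result); // already completed: Result doesn't block
        }
        return results;
    }
}
```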

No files to send this time, it's just these 3 lines.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass; change it to always use a TaskScheduler and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue, in that there is an enumerator (or a call to an enumerator) that is not thread-safe; see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap , 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, threads would be tasks and we would just be able to call  Task.WaitAll. Since we're dealing with "real" threads here,  I would suggest just call w.Join() and be done with it.
This would align the test with the java behavior.

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).

2) calling Reap() create objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocation can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, but this is slightly more efficient for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>): it we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one case (using "as" instead of "is"), which is more efficient.


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is by calling IEnumerator<T>.Dispose() after you're finished with the instance. Note that in Lucene itself, foreach() is used which does it correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that future implementations of an enumerator (especially one on a concurrent collection) doesn't have a cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get :

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to being a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though, a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
--->System.InvalidOperationException: Collection was modified; enumeration operation  may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To avoid the intermittence, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(someAnalyzer);

Replace "someAnalyzer" with an instance of any Analyzer; it doesn't matter which one. The call itself is meaningless, but it has the side effect of initializing the static fields without problems.
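As for what a real fix could look like: one conventional repair for this kind of check-then-act race is to claim the GC generation atomically, so that at most one thread per generation runs Clean(). The Java sketch below illustrates the idea only; the names are invented and this is not the actual Lucene.NET code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CleanGate {
    private final AtomicInteger lastCleanedGen = new AtomicInteger(-1);

    // Returns true for at most one caller per GC generation: the thread that
    // wins the compareAndSet performs the clean; every concurrent caller that
    // observed the same generation loses the CAS and skips it.
    public boolean tryClaimClean(int currentGen) {
        int last = lastCleanedGen.get();
        return currentGen > last && lastCleanedGen.compareAndSet(last, currentGen);
    }

    public static void main(String[] args) {
        CleanGate gate = new CleanGate();
        System.out.println(gate.tryClaimClean(1)); // true: first claim wins
        System.out.println(gate.tryClaimClean(1)); // false: generation already cleaned
        System.out.println(gate.tryClaimClean(2)); // true: new GC generation
    }
}
```

Note this only prevents two Clean() calls from racing each other; the collection itself would still need safe enumeration against concurrent Add() calls.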


Vincent




RE: Bug in Lucene static initialization with multiple threads.

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
Hello Shad,

Please find the files attached.

Have fun with them and let me know what develops!


Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Tuesday, March 21, 2017 3:42 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Sure, you can email the files to me directly.

For a quick start on Git/GitHub, there is a fairly short book called Pragmatic Version Control Using Git (https://pragprog.com/book/tsgit/pragmatic-version-control-using-git) that gets you up and running quickly. I think you might be attempting to push to the main repo - and you won't have permission unless it is explicitly granted. What you need to do is to fork the repo to your own GitHub account, then you can read/write it as much as necessary. Once you get it to a point where you want to submit something, you can do a pull request (either through GitHub or just manually email a request) and someone else can then review and merge the changes.

Update

I found the source of the Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes() problem - it always occurs when you call an overload of the IndexSearcher constructor that takes a TaskScheduler as a parameter and pass a non-null value. This is built into the test framework (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.TestFramework/Util/LuceneTestCase.cs#L1778) to happen rarely, which explains many of the random failures we are seeing. If you change the random code to never use a TaskScheduler, the test will always pass; change it to always use a TaskScheduler, and it will always fail.

The implementation of TaskScheduler we are using for testing (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs) was copied directly from MSDN, so I doubt that is the issue. In fact, there is a good chance that the issue is similar to the WeakIdentityMap issue in that there is an enumerator, or a call to an enumerator, that is not thread-safe; see (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L474-L500) and (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Core/Search/IndexSearcher.cs#L569-L590).

Anyway, I know of at least 3 tests that are failing as a result of this, so fixing it would be a big prize.


Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, March 21, 2017 8:51 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hello Shad,

I had a little time on my hands and looked into this WeakIdentityMap issue, more specifically TestConcurrentHashMap, which fails for me as well (in 100% of the cases). Maybe I have something to contribute: I have 3 observations:

First, notice that in TestConcurrentHashMap, 8 threads are created and then all joined by doing the following:

            finally
            {
                foreach (var w in workers)
                {
                    w.Join(1000L);
                }
            }

This gives the first thread 1 second to end, the second one at most 2 seconds (1 second + whatever time the first thread needed to end) and so on.  Given the amount of work of each test thread, this is far too little time even on a fast machine. It takes 13 seconds for all threads to end here.
The corresponding Java test has the following:

   while (!exec.awaitTermination(1000L, TimeUnit.MILLISECONDS));

... which in effect just loops until the execution of each thread is finished, in units of 1 second.
In TPL, the threads would be tasks and we would just be able to call Task.WaitAll. Since we're dealing with "real" threads here, I would suggest just calling w.Join() and being done with it.
This would align the test with the Java behavior.
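A sketch of that suggestion (with invented worker bodies, not the real test code): joining each thread without a timeout waits unconditionally for completion, which is the effect the Java awaitTermination loop achieves:

```java
public class JoinDemo {
    // Joining without a timeout blocks until every worker has actually
    // finished, however long its work takes: the same effect as looping on
    // exec.awaitTermination(1000L, TimeUnit.MILLISECONDS) in the Java test.
    public static int runWorkers(int count) {
        Thread[] workers = new Thread[count];
        final int[] finished = new int[1];
        for (int i = 0; i < count; i++) {
            workers[i] = new Thread(() -> {
                synchronized (finished) { finished[0]++; }
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) {
                w.join(); // no timeout: wait as long as it takes
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // join() establishes happens-before, so this read is safe
        return finished[0];
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(8)); // 8
    }
}
```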

Second, there are various weaknesses in the WeakIdentityMap:
1) the implementation of the Keys enumerator (IteratorAnonymousInnerClassHelper) relies on the order of the elements in the keys collection (outerInstance.backingStore.Keys.ElementAt(position)). This is bad for two reasons:
- it is extremely slow (there is no indexed access on outerInstance.backingStore.Keys in any current implementation, so ElementAt needs to skip "position" elements to get at the correct one)
- it relies on the assumption that removing (or adding) a key in a dictionary doesn't change the previous relative key order, which is incorrect in any current .NET implementation I am aware of (dictionaries are hash tables with collision resolution through chaining, and reusing of slots through a free list: it's just asking for trouble).
It turns out that you can use the backing store enumerator to implement the keys enumerator directly. The main loop simply becomes:

			public bool MoveNext()
			{
				while (enumerator.MoveNext())
				{
					next = enumerator.Current.Key.Target;
					if (next != null)
					{
						// unfold "null" special value:
						if (next == NULL)
							next = null;
						return true;
					}
				}
				return false;
			}

This works in the non-concurrent case (because we don't touch the collection while the enumerator is running), and in the concurrent case as well (because the ConcurrentDictionary<K,V> enumerator works by design and handles concurrent modifications without a problem).
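For comparison, Java's ConcurrentHashMap (which the ported test is named after) documents the same guarantee: its iterators are weakly consistent, so structural changes during enumeration never throw. A standalone sketch:

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentHashMap;

public class WeaklyConsistentDemo {
    // ConcurrentHashMap's iterators are weakly consistent: removing entries
    // while an enumeration is in flight is legal and never throws
    // ConcurrentModificationException, which is what makes enumerating the
    // backing store directly safe in the concurrent case.
    public static int iterateWhileRemoving() {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 100; i++) {
            map.put("key" + i, i);
        }
        int visited = 0;
        Iterator<String> it = map.keySet().iterator();
        while (it.hasNext()) {
            map.remove(it.next()); // structural change mid-enumeration: fine
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(iterateWhileRemoving()); // 100
    }
}
```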

2) calling Reap() creates objects on the heap, even when there are no elements to be removed. Sadly, not all of these allocations can be eliminated, but you can delay the creation of the keysToRemove list until it's really needed:

			List<IdentityWeakReference> keysToRemove = null;
			foreach (IdentityWeakReference zombie in backingStore.Keys)
			{
				if (!zombie.IsAlive)
				{
					// create the list of keys to remove only if there are keys to remove.
					// this reduces heap pressure
					if (keysToRemove == null)
						keysToRemove = new List<IdentityWeakReference>();
					keysToRemove.Add(zombie);
				}
			}

			if (keysToRemove != null)
				foreach (var key in keysToRemove)
				{
					backingStore.Remove(key);
				}

Note that I don't iterate the Keys collection, but use the dictionary enumerator. Believe it or not, this is slightly more efficient for reasons I won't explain here since this e-mail is already long enough.
It's sad but inevitable that a heap object is created for the dictionary enumerator, because we call it through an interface (IDictionary<K,V>); if we had the actual object, no enumerator object would be created on the heap.

3) Equality of weak identity references can be done using only one cast (using "as" instead of "is"), which is more efficient.


Third, the test itself uses enumerators in a nonstandard manner. The two witness cases are:

	IEnumerator<string> iter = map.Keys.GetEnumerator();
	Assert.IsTrue(iter.MoveNext());
	Assert.IsNull(iter.Current);
	Assert.IsFalse(iter.MoveNext());
	Assert.IsFalse(iter.MoveNext());

And

            for (IEnumerator<string> iter = map.Keys.GetEnumerator(); iter.MoveNext();)
            {
                //Assert.IsTrue(iter.hasNext()); // try again, should return same result!
                string k = iter.Current;
	...
            }

All the other instances are variants of these witnesses.
The correct way of using IEnumerator<T> is to call IEnumerator<T>.Dispose() after you're finished with the instance. Note that Lucene itself uses foreach(), which does this correctly (ByteBufferIndexInput.cs):

                    foreach (ByteBufferIndexInput clone in clones.Keys)
                    {
                        clone.UnsetBuffers();
                    }

All usages of enumerators in TestWeakIdentityMap.cs must be rewritten accordingly. For example:

	using (IEnumerator<string> iter = map.Keys.GetEnumerator())
	{
		Assert.IsTrue(iter.MoveNext());
		Assert.IsNull(iter.Current);
		Assert.IsFalse(iter.MoveNext());
		Assert.IsFalse(iter.MoveNext());
	}
And

            foreach (object k in map.Keys)
            {
	...
            }

In case you are wondering why this is so important: you cannot guarantee that a future implementation of an enumerator (especially one over a concurrent collection) won't have cleanup to do to get rid of various synchronization objects. Right now this isn't the case, but you never know what the future will bring. And besides, nice guys Dispose() after their enumeration <g>.
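The Java analog of the using block is try-with-resources, and the guarantee is the same: cleanup runs even when the enumeration exits early. A minimal sketch (Stream is AutoCloseable, and onClose here stands in for whatever cleanup a future enumerator might need):

```java
import java.util.stream.Stream;

public class DisposeDemo {
    // try-with-resources calls close() on the resource even when we abandon
    // the enumeration early, so any cleanup (locks, handles, ...) an iterator
    // acquires is reliably released.
    public static boolean closesOnEarlyExit() {
        final boolean[] closed = new boolean[1];
        try (Stream<String> keys = Stream.of("a", "b", "c")
                .onClose(() -> closed[0] = true)) {
            keys.findFirst(); // stop after the first element
        }
        return closed[0];
    }

    public static void main(String[] args) {
        System.out.println(closesOnEarlyExit()); // true
    }
}
```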

The test passes now. Every time.

I've made the changes to api-work in my local repository, but when I tried to "push" or "sync" them, I get:

	Error encountered while pushing to the remote repository: Response status code does not indicate success: 403 (Forbidden).

I know next to nothing about GitHub. Can I e-mail the changed files to someone?

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Monday, March 20, 2017 9:19 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Bug in Lucene static initialization with multiple threads.

Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests to verify concurrency on WeakIdentityMap and it is used in a couple of other places in Lucene, it would be far better to get it working right than to try to fix this alternative version. I guess for the time being your workaround should suffice (though a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is to get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to even get it back into that state). 

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.

Also note there are also currently a few other concurrency tests that are failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

I am sure that getting to the bottom of these issues will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }

If one thread does a Clean() operation in the middle of another Clean() operation on the same collection that replaces the object being enumerated on, you get the exception. Always.
To avoid the intermittence, create a bunch of threads like this and eliminate the test "if (currentColCount > _gcCollections)" so that the Clean() code is always executed. You'll get the exception every time.

I will not post the correction, but there's a simple workaround: just make sure the static initializers are performed in a single thread.
I.e. before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(new (some Analyzer object));

Replace "some Analyzer object" with an instance of an Analyzer object, it doesn't matter which one. It's meaningless, but it has the side effect of initializing the static fields without problems.


Vincent




RE: Bug in Lucene static initialization with multiple threads.

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Hi Vincent,

Thanks for reporting this. In fact, thank you for all of your assistance tracking down bugs.

This issue boils down to a failed attempt to replace Lucene's WeakIdentityMap with a new data structure called WeakDictionary. Since there are already tests verifying concurrency on WeakIdentityMap, and it is used in a couple of other places in Lucene, it would be far better to get that class working correctly than to try to fix this alternative version. I guess for the time being your workaround should suffice (though a fix rather than a hack would be preferred).

I have spent quite a bit of time on this, but the best I have been able to do is get the Lucene.Net.Tests.Util.TestWeakIdentityMap.TestConcurrentHashMap() test to pass about 50% of the time (and I can't seem to get it back into even that state).

Here are a couple of attempts I have made:

https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-1 - using a port of the original Java backing classes
https://github.com/NightOwl888/lucenenet/commits/api-work-weak-identity-map-2 - using the .NET WeakIdentity class

And here is the original Java version: https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/util/WeakIdentityMap.java 

The complicated part is getting it to "reap" the elements in a thread-safe way so the counts are right on several concurrent enumerators. Any assistance you could provide to make WeakIdentityMap thread-safe would be much appreciated. Do note that the lead branch is now at https://github.com/apache/lucenenet/tree/api-work, so please do any work from that branch.
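The Java original linked above dodges this whole class of bug: it backs the map with a ConcurrentHashMap keyed by identity-based weak references and drains a ReferenceQueue in reap(), so eviction never has to enumerate the map while mutating it. A condensed Java sketch of that pattern (class and member names here are mine, not Lucene's):

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the reaping pattern used by Java Lucene's WeakIdentityMap:
// keys are held through identity-based WeakReferences registered with a
// ReferenceQueue; reap() drains the queue and evicts cleared entries, so
// no thread ever iterates the backing map while another rebuilds it.
final class WeakIdentityCache<K, V> {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private final Map<IdentityWeakReference, V> backing = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        reap();
        backing.put(new IdentityWeakReference(key, queue), value);
    }

    public V get(K key) {
        reap();
        return backing.get(new IdentityWeakReference(key, null));
    }

    public int size() {
        reap();
        return backing.size();
    }

    // Evict entries whose keys were garbage collected. ConcurrentHashMap
    // tolerates concurrent removals, unlike enumerating a plain HashMap
    // (or a .NET Dictionary) while it is being modified.
    private void reap() {
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            backing.remove(ref);
        }
    }

    private static final class IdentityWeakReference extends WeakReference<Object> {
        private final int hash; // cached: still needed after the referent is cleared

        IdentityWeakReference(Object referent, ReferenceQueue<Object> q) {
            super(referent, q);
            hash = System.identityHashCode(referent);
        }

        @Override public int hashCode() { return hash; }

        @Override public boolean equals(Object other) {
            if (this == other) return true; // same reference object (the reap case)
            if (!(other instanceof IdentityWeakReference)) return false;
            Object a = this.get(), b = ((IdentityWeakReference) other).get();
            return a != null && a == b; // identity, not equals()
        }
    }
}
```

Because ConcurrentHashMap tolerates removal during traversal, concurrent reap() calls can only race benignly; porting that property to the .NET side is the crux of making WeakIdentityMap thread-safe.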

Also note that a few other concurrency tests are currently failing:

Lucene.Net.Tests.Index.TestIndexReaderWriter.TestDuringAddIndexes()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestControlledRealTimeReopenThread_Mem()
Lucene.Net.Tests.Search.TestControlledRealTimeReopenThread.TestCRTReopen()

Getting to the bottom of these will probably fix most of the issues you are seeing. If you have any spare time, your help would be appreciated on these as well.

Thanks,
Shad Storhaug (NightOwl888)


-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Tuesday, February 7, 2017 6:00 PM
To: dev@lucenenet.apache.org
Subject: Bug in Lucene static initialization with multiple threads.

Hello,

Every once in a while, I get an error when using Lucene in a multithreaded scenario (meaning: using a single IndexWriter in multiple threads, or using a distinct IndexWriter in each thread: it doesn't matter).
The exception chain thrown is:

Unhandled Exception: System.ArgumentException: Could not instantiate implementing class for Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute
---> System.ArgumentException: Could not find implementing class for ICharTermAttribute
---> System.InvalidOperationException: Collection was modified; enumeration operation may not execute.

I could not understand what was going on, especially because it only occurred "sometimes". It took me a while to figure out, but I think it's a bug.

Here's the stack trace of the exception when it occurs:

                [External Code]
>             Lucene.Net.dll!Lucene.Net.Support.HashMap<Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.WeakKey<System.Type>, System.WeakReference>.GetEnumerator() Line 229           C#
               [External Code]
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean() Line 59           C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.CleanIfNeeded() Line 71         C#
               Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Add(System.Type key, System.WeakReference value) Line 134           C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.GetClassForInterface<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 90  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AttributeFactory.DefaultAttributeFactory.CreateAttributeInstance<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 70  C#
                Lucene.Net.dll!Lucene.Net.Util.AttributeSource.AddAttribute<Lucene.Net.Analysis.Tokenattributes.ICharTermAttribute>() Line 350                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.InitializeInstanceFields() Line 658         C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.StringTokenStream.StringTokenStream() Line 676                C#
               Lucene.Net.dll!Lucene.Net.Documents.Field.GetTokenStream(Lucene.Net.Analysis.Analyzer analyzer) Line 629         C#
               Lucene.Net.dll!Lucene.Net.Index.DocInverterPerField.ProcessFields(Lucene.Net.Index.IndexableField[] fields, int count) Line 105              C#
                Lucene.Net.dll!Lucene.Net.Index.DocFieldProcessor.ProcessDocument(Lucene.Net.Index.FieldInfos.Builder fieldInfos) Line 279          C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriterPerThread.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 287                C#
                Lucene.Net.dll!Lucene.Net.Index.DocumentsWriter.UpdateDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer, Lucene.Net.Index.Term delTerm) Line 574                C#
               Lucene.Net.dll!Lucene.Net.Index.IndexWriter.UpdateDocument(Lucene.Net.Index.Term term, System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1830          C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc, Lucene.Net.Analysis.Analyzer analyzer) Line 1455   C#
                Lucene.Net.dll!Lucene.Net.Index.IndexWriter.AddDocument(System.Collections.Generic.IEnumerable<Lucene.Net.Index.IndexableField> doc) Line 1436   C#

... and to wit, here are the threads just rushing in to do the same:

Not Flagged                        35428    17           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        35444    11           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        44124    12           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged        >             44140    13           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        47700    14           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        28168    15           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        30988    16           Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal
Not Flagged                        21828    6              Worker Thread <No Name>                Lucene.Net.dll!Lucene.Net.Support.WeakDictionary<System.Type, System.WeakReference>.Clean                Normal

The reason why it only reproduces "sometimes" is because of this little nugget of code:

        private void CleanIfNeeded()
        {
            int currentColCount = GC.CollectionCount(0);
            if (currentColCount > _gcCollections)
            {
                Clean();
                _gcCollections = currentColCount;
            }
        }
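The resulting failure is .NET's equivalent of Java's fail-fast ConcurrentModificationException, and it can be demonstrated deterministically without any threads at all: any structural change made while an enumeration is in progress trips the modification check on the next step. A minimal Java illustration (the class and values are hypothetical, not Lucene code):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

// The .NET "Collection was modified" error is the analogue of Java's
// ConcurrentModificationException. This single-threaded demo shows what
// two racing Clean() calls do to an unsynchronized map: one mutates it
// while the other is still enumerating.
public class FailFastDemo {
    public static boolean mutateWhileIterating() {
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < 8; i++) map.put(i, "v" + i);
        try {
            for (Integer key : map.keySet()) {
                // Simulates a second Clean() removing entries mid-enumeration.
                if (key == 3) map.remove(key);
            }
            return false; // reached only if the structural change went unnoticed
        } catch (ConcurrentModificationException e) {
            return true; // the fail-fast iterator detected the change
        }
    }
}
```

With real threads the same check only fires when the GC-count guard happens to let two Clean() calls overlap, which is why the bug looks intermittent.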

If one thread runs Clean() in the middle of another thread's Clean() on the same collection, and the second call replaces the object being enumerated, you get the exception. Always.
To make the failure deterministic instead of intermittent, create a bunch of threads like this and remove the "if (currentColCount > _gcCollections)" test so that Clean() always executes. You'll get the exception every time.

I will not post the correction here, but there is a simple workaround: make sure the static initializers run on a single thread.
That is, before creating your threads, do something like this:

new global::Lucene.Net.Documents.TextField("dummy", "dummyvalue", global::Lucene.Net.Documents.Field.Store.NO).GetTokenStream(someAnalyzer);

where someAnalyzer is any Analyzer instance; it doesn't matter which one. The call itself is meaningless, but it has the side effect of initializing the static fields without contention.
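The general shape of this warm-up workaround, sketched in Java with stand-in names (nothing below is Lucene API): populate the lazily initialized cache once on the calling thread, so the worker threads only ever perform reads. The ConcurrentMap used here is already thread-safe; the sketch only shows the ordering the warm-up imposes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WarmupDemo {
    // Stands in for the attribute-class cache that Lucene populates lazily.
    private static final ConcurrentMap<String, Integer> CACHE = new ConcurrentHashMap<>();
    private static final AtomicInteger LOADS = new AtomicInteger();

    static int lookup(String key) {
        // The "expensive" load runs only when the key is absent.
        return CACHE.computeIfAbsent(key, k -> {
            LOADS.incrementAndGet();
            return k.length();
        });
    }

    public static int runWorkers(int threads) {
        lookup("ICharTermAttribute"); // warm-up: populate the cache single-threaded
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> lookup("ICharTermAttribute")); // now a pure read
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return LOADS.get(); // the load ran exactly once, before any worker started
    }
}
```

The warm-up guarantees every worker sees a fully populated cache, which is exactly what the TextField/GetTokenStream call above achieves for Lucene.Net's static attribute caches.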


Vincent