Posted to dev@lucenenet.apache.org by Shad Storhaug <sh...@shadstorhaug.com> on 2017/08/04 18:33:27 UTC

RE: Debuggable and ICU

Vincent,

I have created a new icu-dotnet NuGet package based on all of the latest changes from the icu-dotnet repo + Connie's PR for .NET Standard and set up a MyGet feed for easy consumption: https://www.myget.org/F/lucene-icu-dotnet/api/v2. And I set up this branch (https://github.com/NightOwl888/lucenenet/tree/benchmark) to consume it. This is probably temporary, so I haven't pushed it to master yet. But I am hoping to keep an updated compilation containing each new PR on the icu-dotnet project so we can keep working on integrating everything even if those PRs are not approved right away.

Benchmark Bug

When I ported the benchmark functionality to .NET Standard, it revealed a really strange concurrency bug that only seems to happen on .NET Standard. The test that fails is Lucene.Net.Benchmarks.ByTask.Tasks.WriteLineDocTaskTest.TestMultiThreaded. It doesn't fail when run individually, only when it runs alongside other benchmark tests. It usually fails with a null reference exception, but sometimes with a string comparison mismatch.

Given that it only fails in conjunction with other tests, I figured it was probably because it uses the same file name as the other tests and NUnit's concurrency was causing them to collide. But changing the file name makes no difference.

I also ruled out the possibility that it is because we have no ConcurrentMergeScheduler on .NET Standard (TaskMergeScheduler is used instead) by checking whether the TaskMergeScheduler constructor is called during the test. It is not.

Could you have a look when you get a moment please?

ICU Bugs

I reported some ICU Collator bugs to Connie (https://github.com/sillsdev/icu-dotnet/pull/37#issuecomment-320259428), but I won't hold it against you if you beat her to them.

ICU Wrapper Generator

The C# ICU wrapper generator I mentioned before can be found here: http://site.icu-project.org/related. I tried to run it, but it depends on Cygwin cpp and all I can seem to find is mcpp. The command line parameters don't match and I can't find any documentation to work out how to translate them...if that is possible.

It seems like we should be able to use this tool to generate a starting point for porting the ICU functionality we need, although we will probably need to manually create object oriented wrappers with a similar API to icu4j (see the rough sketch below). WDYT?
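To make that idea concrete, here is roughly the shape I have in mind: a thin P/Invoke layer (whether generated or hand-written) plus an icu4j-style class on top. This is only a sketch - the native library name and the version suffixes on the exported symbols differ per platform and ICU build, so treat those details as placeholders:

    using System;
    using System.Runtime.InteropServices;

    // Raw P/Invoke layer over icu4c's C collation API (ucol_open / ucol_strcoll /
    // ucol_close). "icui18n" is a placeholder; the real binary name and symbol
    // version suffix depend on the platform and the ICU build.
    internal static class NativeCollation
    {
        private const string Lib = "icui18n";

        [DllImport(Lib, EntryPoint = "ucol_open")]
        internal static extern IntPtr ucol_open(string locale, ref int status);

        [DllImport(Lib, EntryPoint = "ucol_strcoll", CharSet = CharSet.Unicode)]
        internal static extern int ucol_strcoll(IntPtr collator,
            string source, int sourceLength, string target, int targetLength);

        [DllImport(Lib, EntryPoint = "ucol_close")]
        internal static extern void ucol_close(IntPtr collator);
    }

    // An icu4j-like object oriented facade on top of the raw layer.
    public sealed class Collator : IDisposable
    {
        private IntPtr handle;

        public Collator(string locale)
        {
            int status = 0; // U_ZERO_ERROR
            handle = NativeCollation.ucol_open(locale, ref status);
            if (status > 0 || handle == IntPtr.Zero)
                throw new InvalidOperationException("ICU error code " + status);
        }

        public int Compare(string source, string target)
        {
            // ucol_strcoll returns -1, 0 or 1 (UCollationResult).
            return NativeCollation.ucol_strcoll(handle, source, source.Length, target, target.Length);
        }

        public void Dispose()
        {
            if (handle != IntPtr.Zero)
            {
                NativeCollation.ucol_close(handle);
                handle = IntPtr.Zero;
            }
        }
    }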


Thanks,
Shad Storhaug (NightOwl888)

From: Shad Storhaug
Sent: Friday, July 28, 2017 7:23 PM
To: 'Van Den Berghe, Vincent'
Cc: dev@lucenenet.apache.org
Subject: RE: Debuggable and ICU

Vincent,


Ø  The Debuggable attribute in release mode is frustrating. I've removed it in my version since I don't see any tests that fail (but that may also be the wine, again), and the performance increase is nothing short of enormous. It seems your tests fail easier than mine (lucky you), so how about this piece of reasoning:

Ø  If tests fail at your end because methods don't appear in the stack frame, the only reason I can think of is that they are being inlined in release mode. Inlining constraints have loosened a bit in RyuJIT, which probably makes the matter worse.

Ø  For those methods that the tests expect to appear in the stack frames, wouldn't it be sufficient to add on the attribute:


Ø  [MethodImpl(MethodImplOptions.NoInlining)].


Ø  ... to make them appear in the stack frames regardless of the build mode? I think the impact would be minimal, since there are only a handful of methods tested using StackTraceHelper.DoesStackTraceContainMethod. Of course, identifying the exact methods is a bit of work, since sometimes only the method name is specified. We absolutely don't want to apply this to all "Dispose" or "Flush" methods in Lucene.Net just to make sure it works. But if this is something that could work, I think it should be done.
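For concreteness, applying your suggestion would look something like this (the class and method here are made up purely for illustration):

    using System.Runtime.CompilerServices;

    public class SomeDirectoryWrapper // hypothetical type, for illustration only
    {
        // Keep this method out of the inliner so it still shows up in stack
        // traces in Release builds, which is what
        // StackTraceHelper.DoesStackTraceContainMethod relies on.
        [MethodImpl(MethodImplOptions.NoInlining)]
        public void Flush()
        {
            // ... actual flush logic ...
        }
    }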

If I recall correctly (and I might not because I am getting old), the issue only reared its ugly head in .NET Core. The stack traces just weren't there. It may have only done this on the command line (and keep in mind we are using a very old preview of the CLI tool). This was just a cheap hack to get the release done, but I didn't exhaust all of the possibilities for fixing this right. I knew there would be a performance loss, but I didn't realize it would be so significant. There may be a much quicker solution than doing sweeping code changes; I just didn't want to hold up the release looking for it, given that I had several other hurdles to get over at the time.

In fact, maybe we should wait and see if this is still a problem when we get onto the full release SDK (new .csproj format) before doing a special fix for it - it may just be a problem in that preview release that is screwing up the testing.


Ø  I'll have a look at that ICU thing. I've been programming a lot longer in C++ than in C#, and regularly write in both worlds including those Managed C++ and COM/Interop monsters (much to my chagrin), so maybe I could identify the AccessViolationException which I do see as well. Even with the wine!

Ahh - be warned. We only ended up here because the original solution that we had (ICU4NET) didn't support .NET Core, and per Connie Yau, it is because it was using C++ and p/invoke (https://github.com/sillsdev/icu-dotnet/issues/14). It also wasn't 64 bit compatible. This solution works with .NET Core/.NET Standard and 32/64 bit, and Connie actually just finished her PR for official .NET Standard support and we are waiting for the powers that be to review it and potentially release it (https://github.com/sillsdev/icu-dotnet/pull/37). And waiting...

Wouldn't it be better if we had a pure C# solution instead of going the way of a "wrapper"? Sure, there is the disadvantage that on Linux (which comes with ICU pre-installed) you end up with 2 copies of it, but the advantages of not having to drag around special loader functionality to make it work and not taking on an extra platform-specific dependency seem worth it to me.

ICU has a lot of great stuff in it and icu-dotnet doesn't support very much of it. Collator seems to work just right, minus one deprecated feature it is missing (variableTop). That is, for the icu4j-compatible version in Lucene.Net.Analysis.ICU - the OpenJDK version in Analysis.Common is commented out, and we might as well just forget about it since the documentation says the other one performs better anyway. But there are some features that Lucene uses that icu-dotnet is missing:


1.       Normalizer2 (6 Lucene classes depend on it - see https://github.com/sillsdev/icu-dotnet/issues/48)

2.       Transliterator (2 Lucene classes depend on it, but it has many possible implementations that plug into these 2 classes)

3.       UScript (all of Segmentation depends on it - .NET's Regex class can identify different language scripts, but it doesn't assign a number to them or have a concept of "script inheritance", AFAIK)

4.       UCharacter (we can probably use Lucene.Net's Character class or .NET's char class for this, but UScript also depends on it)

5.       UTF16 (has differing functionality than .NET's Unicode class)

There is also a serious difference in behavior between the OpenJDK's java.text.BreakIterator and the ICU com.ibm.icu.text.BreakIterator - enough to make many of the Highlighter tests fail without the hacks I have put in place. Most of this is due to very subtle behaviors around how it deals with begin and end boundaries of a block of text (you can call move on the end element and it returns an "end" status, but it doesn't actually move). Connie has implemented a RuleBasedBreakIterator in icu.net, but I am told that the rules that are used in the OpenJDK don't correspond very much with the rules in icu4c, so making a default BreakIterator that works like the one that makes the PostingsHighlighter tests pass (not to mention all of the things that it does that are not tested in Lucene) might be a challenge. That said, it is a path not ventured.

All of the Lucene features that use BreakIterator (ThaiTokenizer, VectorHighlighter, and PostingsHighlighter) use the OpenJDK version except for Analysis.ICU, which uses the icu4j version. So, we need both types of behavior to get it all to work.

Technically, I created a bit of a hack for the ThaiTokenizer as well - the ThaiWordBreaker, which fixes the segmentation between Thai and non-Thai scripts in the icu.net BreakIterator static class. It might not perform as well as the original, but I think it might be good enough for production use. That said, I notice that icu4j has a dictionary-based ThaiBreakIterator that would probably work without any hacks, and would remove any doubts that it is working right. But it is not available in icu4c...


Ø  I spent a lot of time profiling Lucene.net, especially for indexing.  That produced some of the case folding and memory mapped file optimizations you so graciously integrated while immortalizing my name in the source code. Someone had to prick me to deflate my ego when I saw that.

Credit given to those who make a difference :). Actually, I would prefer if the number of commits here (https://github.com/apache/lucenenet/graphs/contributors) actually reflected the amount of work you have contributed, so others who are thinking of contributing can see the real numbers.


Ø  Another source of indexing inefficiency is the "writer-per-thread-pool" implementation, which I switched to a LIFO thread management similar to what was done in Lucene 4.8.1 (for exactly the same reasons). Hence, it would be nice to integrate https://github.com/vvdb/lucenenet/commit/18b8ef71fc08c1583c323ff00e2d1ef07e55c72c .

Noted. I would like to try to keep this as a product-by-product port of Lucene (like our goals state on the website) instead of a bunch of patchwork between different versions. A long time ago (in a galaxy not far from Jerusalem) Itamar announced that we would be porting to 4.8.0 and then patching to 4.8.1. It feels a bit odd doing a version upgrade before a final release. But we really should just do the upgrade of the whole patch if we are going that route.

There were 121 files that changed between 4.8.0 and 4.8.1, at least 15 were SOLR changes and I think we are already on 4.8.1 for all of Analysis.Common. So, there are roughly 80 files that would need to be upgraded to do the patch (most of which are just a few lines).

If we are going to do it prior to the 4.8.0 release, we should probably do it in conjunction with the breaking API changes it will take to make TermsEnum, DocsEnum, and DocsAndPositionsEnum into IEnumerable<T>. I was planning on doing those changes after the next beta release.

Thanks,
Shad Storhaug (NightOwl888)




From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Friday, July 28, 2017 2:14 PM
To: Shad Storhaug
Cc: dev@lucenenet.apache.org
Subject: RE: The curse of TestSearcherManager_Mem()

Hello Shad,

I don't drink beer, but I'll have a glass of wine instead. Thank you. You're also the first person ever to use the word "gallant" on something I did. You flatter me, sir.

I spent a lot of time profiling Lucene.net, especially for indexing.  That produced some of the case folding and memory mapped file optimizations you so graciously integrated while immortalizing my name in the source code. Someone had to prick me to deflate my ego when I saw that.
Another source of indexing inefficiency is the "writer-per-thread-pool" implementation, which I switched to a LIFO thread management similar to what was done in Lucene 4.8.1 (for exactly the same reasons). Hence, it would be nice to integrate https://github.com/vvdb/lucenenet/commit/18b8ef71fc08c1583c323ff00e2d1ef07e55c72c .
It would be interesting to see what Benchmark will turn up. I wasn't aware that such a thing existed. But then again, I'm not aware of a lot of things. It must be the wine.

The Debuggable attribute in release mode is frustrating. I've removed it in my version since I don't see any tests that fail (but that may also be the wine, again), and the performance increase is nothing short of enormous. It seems your tests fail easier than mine (lucky you), so how about this piece of reasoning:
If tests fail at your end because methods don't appear in the stack frame, the only reason I can think of is that they are being inlined in release mode. Inlining constraints have loosened a bit in RyuJIT, which probably makes the matter worse.
For those methods that the tests expect to appear in the stack frames, wouldn't it be sufficient to add on the attribute:

[MethodImpl(MethodImplOptions.NoInlining)].

... to make them appear in the stack frames regardless of the build mode? I think the impact would be minimal, since there are only a handful of methods tested using StackTraceHelper.DoesStackTraceContainMethod. Of course, identifying the exact methods is a bit of work, since sometimes only the method name is specified. We absolutely don't want to apply this to all "Dispose" or "Flush" methods in Lucene.Net just to make sure it works. But if this is something that could work, I think it should be done.

I'll have a look at that ICU thing. I've been programming a lot longer in C++ than in C#, and regularly write in both worlds including those Managed C++ and COM/Interop monsters (much to my chagrin), so maybe I could identify the AccessViolationException which I do see as well. Even with the wine!


Vincent

From: Shad Storhaug [mailto:shad@shadstorhaug.com]
Sent: Thursday, July 27, 2017 10:03 PM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: The curse of TestSearcherManager_Mem()

Vincent,

Thanks for the gallant effort. While the original issue remains, the fact of the matter is the journey did bear some fruit (other bug fixes) so it wasn't a total loss.

So yes, have a beer - you definitely deserve one.

For the moment, I am working on porting Benchmark over, which I am hoping will be able to identify parts of Lucene.Net that are not as well optimized as they should be. This and Lucene.Net.Analysis.ICU are really the only 2 pieces left to port over that are reasonable - the remaining 2 analysis packages are adapters for rather large libraries that currently don't exist in .NET (and I have no idea how useful/useless they might be).

ICU is the interesting one (http://site.icu-project.org/) because it is the most integrated into Lucene. .NET is missing about 3/4 of the functionality that is in there, much of which could be very useful. For example, who doesn't want to remove the accent marks from text when analyzing it (résumé vs resumé vs resume)? .NET's normalization features lack ICU's "case fold" option, which means people end up writing code like this: https://stackoverflow.com/a/780800 (which many people say doesn't work right).
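For reference, the workaround in that answer boils down to decomposing to FormD and stripping the combining marks - something like the sketch below, which misses cases that ICU's transforms and case folding handle properly:

    using System.Globalization;
    using System.Text;

    static class AccentStripper
    {
        // Decompose to FormD, drop the combining (non-spacing) marks, then
        // recompose. Works for "résumé" -> "resume", but is not a substitute
        // for ICU's case folding or transliteration.
        public static string RemoveDiacritics(string text)
        {
            var sb = new StringBuilder(text.Length);
            foreach (char c in text.Normalize(NormalizationForm.FormD))
            {
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    sb.Append(c);
            }
            return sb.ToString().Normalize(NormalizationForm.FormC);
        }
    }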

The solution we have now (wrapping the C library) is only as good as the quality of the wrapper. icu-dotnet is missing most of the features we need to complete Lucene.Net and most likely has thread-safety issues (the AccessViolationException I mentioned earlier). I am debating whether to just abandon the not-so-complete icu-dotnet project and do a fresh port of icu4j to C#, to try to fix up icu-dotnet to work with Lucene.NET, or to hack together a solution for the missing features that "almost works" using .NET platform features. I also found a project that promises to automatically generate an ICU wrapper in C#, which I need to investigate.

In terms of priorities, it is looking like the plan is:


1.       Finish up Benchmark

2.       Integrate Replicator

3.       Integrate documentation and website into the build

4.       Do another beta release

5.       Refactor public facing iterators to IEnumerator<T> (most notably the TermsEnum and its subclasses)

6.       Work on ICU features

7.       Optimize and fix bugs

8.       Do another beta release

9.       Release candidate

10.   Release 4.8.0

11.   Upgrade to a more recent version of Lucene

Once you have had your fill of alcohol, I hope you can help out with some of the effort (such as finding an alternative to using stack traces for testing so we can remove the Debuggable attribute). Your help is much appreciated.

Thanks,
Shad Storhaug (NightOwl888)


From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com]
Sent: Thursday, July 27, 2017 4:29 PM
To: dev@lucenenet.apache.org
Cc: Shad Storhaug
Subject: The curse of TestSearcherManager_Mem()

I give up.

I think I found at least one of the reasons why TestSearcherManager_Mem() sporadically fails, but I can't prove it's the only one. I'm giving up looking for others, since investing more time would seriously damage my mental health. So I'm going to explain my reasoning in the hope that others can validate it.

It's simple enough to know the answer to the immediate question of "why" the test fails: in MockDirectoryWrapper.Dispose(bool) an exception is thrown because the OpenFiles collection is not empty.
Calls to MockDirectoryWrapper.AddFileHandle add entries to OpenFiles or increment their reference count, and calls to MockDirectoryWrapper.RemoveOpenFile decrement the reference count and remove an entry when its reference count drops to 0. It's easy enough.
Therefore, there are 3 possibilities:

1.       calls to RemoveOpenFile are being missed.

2.       calls to RemoveOpenFile are not being missed, but the argument used to look up the entries (IDisposable c) is incorrect.

3.       Calls to AddFileHandle are done with an incorrect argument (IDisposable c) to add/update the entries.

I've identified one instance of possibility 3 (see a previous e-mail). This improved the situation somewhat, but didn't correct it.
I checked possibility 2 in the case of test failure, but found no instances. All calls have their correct argument.
I've checked possibility 1, and calls are indeed being missed. In one case of failure, there were 214 calls to AddFileHandle, but only 205 calls to RemoveOpenFile. This yielded 9 open files, which was exactly the number of items in the OpenFiles dictionary. These numbers are variable, since the parameters of the test are randomized.
There is another dictionary in MockDirectoryWrapper called OpenFilesDeleted: this holds the open files for which deletion has been requested but which are still open. In case of failure, this collection is always empty. So no files are being deleted while open.
So somewhere in Lucene.net, Inputs are not being closed. But where?
It's difficult to reproduce the exact same failure every time, since at (pseudo)random times the tests will create different decoders, doc values, etc., or reuse different searchers. In addition, it's difficult to find out why a call is never made.
Since the debugger can't break on something that doesn't happen, maybe we can trace where something happens: this could give us a clue on what doesn't happen. After dozens of runs, I found that the common stack trace seems to be this when a file is opened (and never closed):

   at Lucene.Net.Index.SegmentReader..ctor(SegmentCommitInfo si, Int32 termInfosIndexDivisor, IOContext context) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\SegmentReader.cs:line 112
   at Lucene.Net.Index.ReadersAndUpdates.GetReader(IOContext context) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\ReadersAndUpdates.cs:line 170
   at Lucene.Net.Index.ReadersAndUpdates.GetReadOnlyClone(IOContext context) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\ReadersAndUpdates.cs:line 301
   at Lucene.Net.Index.StandardDirectoryReader.Open(IndexWriter writer, SegmentInfos infos, Boolean applyAllDeletes) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 126
   at Lucene.Net.Index.IndexWriter.GetReader(Boolean applyAllDeletes) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\IndexWriter.cs:line 388
   at Lucene.Net.Index.StandardDirectoryReader.DoOpenFromWriter(IndexCommit commit) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 369
   at Lucene.Net.Index.StandardDirectoryReader.DoOpenIfChanged(IndexCommit commit) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 336
   at Lucene.Net.Index.StandardDirectoryReader.DoOpenIfChanged() in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\StandardDirectoryReader.cs:line 325
   at Lucene.Net.Index.DirectoryReader.OpenIfChanged(DirectoryReader oldReader) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Index\DirectoryReader.cs:line 176
   at Lucene.Net.Search.SearcherManager.RefreshIfNeeded(IndexSearcher referenceToRefresh) in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Search\SearcherManager.cs:line 126
   at Lucene.Net.Search.ReferenceManager`1.DoMaybeRefresh() in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Search\ReferenceManager.cs:line 203
   at Lucene.Net.Search.ReferenceManager`1.MaybeRefresh() in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net\Search\ReferenceManager.cs:line 263
   at Lucene.Net.Search.TestSearcherManager.get_CurrentSearcher() in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net.Tests\Search\TestSearcherManager.cs:line 198
   at Lucene.Net.Index.ThreadedIndexingAndSearchingTestCase.ThreadAnonymousInnerClassHelper2.Run() in D:\SOURCE\GitHub\lucenenet\src\Lucene.Net.TestFramework\Index\ThreadedIndexingAndSearchingTestCase.cs:line 455
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()


It looks like files are being opened in the SegmentReader constructor. Just to make our lives miserable, the matter is made more complex by actual close operations hiding behind a reference counting scheme.
But further down the stack, it's triggered in the TestSearcherManager.CurrentSearcher property by a call to:

                    if (mgr.MaybeRefresh())

In that same CurrentSearcher property, the searcher is being tracked by a SearcherLifetimeManager instance using the Record(s) method:

                        long token = lifetimeMGR.Record(s);

And in the implementation of that method is this:

            var tracker = _searchers.GetOrAdd(version, l => { factoryMethodCalled = true; return new SearcherTracker(searcher); });

But (as you boys and girls know by now) this is incorrect! With GetOrAdd on a ConcurrentDictionary, there is ABSOLUTELY NO GUARANTEE that the factory method is called exactly once, and only when the element is not present.
And since new SearcherTracker(searcher) does:

                searcher.IndexReader.IncRef();

... we *may* end up with incorrect counts. But not always. Sometimes. Every now and then. Occasionally. On and off. Enough to drive you nuts.
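If you want to see it for yourself, here is a tiny standalone repro (nothing Lucene-specific about it) showing the factory firing more than once for the same key - being a race, it isn't guaranteed, but on a multi-core machine it usually does:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    class GetOrAddRace
    {
        static void Main()
        {
            var map = new ConcurrentDictionary<int, int>();
            int factoryCalls = 0;

            // Many threads race on the same key. GetOrAdd may run the value
            // factory on several of them even though only one result is kept.
            Parallel.For(0, 64, _ =>
                map.GetOrAdd(42, key =>
                {
                    Interlocked.Increment(ref factoryCalls);
                    Thread.Sleep(1); // widen the race window
                    return key;
                }));

            // Frequently prints a value greater than 1.
            Console.WriteLine("Factory was called " + factoryCalls + " time(s).");
        }
    }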
The standard solution is to wrap the value as a Lazy<>, like this:

        private readonly ConcurrentDictionary<long, Lazy<SearcherTracker>> _searchers = new ConcurrentDictionary<long, Lazy<SearcherTracker>>();

and then change the code in the Record() as follows:

       var tracker = _searchers.GetOrAdd(version, l => new Lazy<SearcherTracker>(() => { factoryMethodCalled = true; return new SearcherTracker(searcher); })).Value;

This guarantees that the SearcherTracker constructor is called only once per version.
In the Acquire method, we change the if() statement as follows:

            Lazy<SearcherTracker> tracker;
            if (_searchers.TryGetValue(version, out tracker) && tracker.IsValueCreated && tracker.Value.Searcher.IndexReader.TryIncRef())
            {
                return tracker.Value.Searcher;
            }

Note that the test "tracker != null" has been replaced by tracker.IsValueCreated. In .NET, it is never the case that tracker is null if TryGetValue succeeds, since the values we add are never null, so the test tracker != null made no sense.
I'm not sure tracker.IsValueCreated makes much sense either: for it to be false, the call to Acquire() would have to happen between the return of the GetOrAdd method and the call to the Value property in the Record() method. The probability of this is quite small, but if we want to make sure all trackers are created in the Record() method, we need this test.

The rest of the changes are trivial:

In the Prune method, we need to collect the trackers a bit differently:

     var trackers = _searchers.Values.Select(item => item.Value).ToList();

... and we try to remove it using the lazy value:

Lazy<SearcherTracker> _;
       _searchers.TryRemove(tracker.Version, out _);

The Dispose() method constructs the list of items to close as follows:

                IList<SearcherTracker> toClose = new List<SearcherTracker>(_searchers.Values.Select(item => item.Value));

(thank you, Linq)
... and we try to remove it using the lazy value, just like in the Prune method:

Lazy<SearcherTracker> _;
       _searchers.TryRemove(tracker.Version, out _);


The alternative solution is to trade concurrency for a "locked dictionary" approach, which is not really as sexy and may create a contention point where none existed.

I've checked the remainder of the Lucene source for instances of .GetOrAdd and .AddOrUpdate with an add factory, and there are a few. But this, I think, was the one to catch. I'll leave the others as an exercise for the reader.

And now if you excuse me, I need a drink. Stick a fork in me, I'm done.

Vincent