You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucenenet.apache.org by Shad Storhaug <sh...@shadstorhaug.com> on 2016/10/02 19:01:02 UTC

Remaining Work/Priorities

Hello,

I just wanted to open this discussion to talk about the work remaining to be done on Lucene.Net version 4.8.0. We are nearly there, but that doesn't mean we don't still need help!


FAILING TESTS
-------------------

We now have over 5000 passing tests and as soon as pull request #188 (https://github.com/apache/lucenenet/pull/188) is merged, by my count we have only 20 (actual) failing tests. Here is the breakdown by project:

Lucene.Net (Core) - 15 failing / 1989 total
Lucene.Net.Analysis.Common - 0 failing / 1445 total
Lucene.Net.Classification - 0 failing / 9 total
Lucene.Net.Expressions - 0 failing / 94 total
Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total
Lucene.Net.Join - 0 failing / 27 total
Lucene.Net.Memory - 0 failing / 10 total
Lucene.Net.Misc - 2 failing / 42 total
Lucene.Net.Queries - 2 failing / 96 total
Lucene.Net.QueryParser - 1 failing / 203 total
Lucene.Net.Suggest - 0 failing / 142 total

The reason why I said ACTUAL tests above is because I recently discovered that many of the "failures" that are being reported are false negatives (in fact, the VS2015 NUnit test runner shows there are 135 failing tests total and 902 tests total that don't belong to any project). Most NUnit 2.6 test runners do not correctly run tests in shared abstract classes with the correct context (test setup) to make them pass. These out-of-context runs add several additional minutes to the test run.

As an experiment, I upgraded to NUnit 3.4.1 and it helped the situation somewhat - that is, it ran the tests in the correct context and I was able to determine that we have more tests than the numbers above and they are all succeeding. However, it also ran the tests in an invalid context (that is, the context of the abstract class without any setup) and some of them still showed as failures.

I know @conniey is currently working on porting the tests over to xUnit. Hopefully, swapping test frameworks alone (or using some of the new fancy test attributes) is enough to fix this issue. If not, we need to find another solution - preferably one that can be applied to all of the tests in abstract classes without too much effort or changing them so they are too different from their Java counterpart.

Remaining Pieces to Port
---------------------------------

I took an inventory of the remaining pieces left to port a few days ago and here is what that looks like (alphabetical order):

1. Analysis.ICU (Depends on ICU4j)
2. Analysis.Kuromoji
3. Analysis.Morfologik (Depends on Morfologik)
4. Analysis.Phonetic (Depends on Apache Commons)
5. Analysis.SmartCN
6. Analysis.Stempel (currently in progress)
7. Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer)
8. Benchmark (many dependencies)
9. Demo
10. Highlighter (Depends on Collator (which is still being ported) and BreakIterator (which we don't have a solution that works in .NET core yet))
11. Replicator (many dependencies)
12. Sandbox (Depends on Apache Jakarta)
13. Spatial (Already ported in #174 (https://github.com/apache/lucenenet/pull/174), needs a recent version of spatial4n)
14. QueryParser.Flexible

Itamar, it would be helpful if you would be so kind as to organize this list in terms of priority. It also couldn't hurt to update the contributing documents (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md, and https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status with the latest information so anyone who wants to help out knows the current status.

Of course, it is the known status of dependencies that we need clarification on. Which of these dependencies is known to be ported? Which of them are ported but are not up to date? Which of them are known not to be ported, and which of them are unknown?


Public API Inconsistencies
---------------------------------

One thing that I have had my eye on for a while now is the .NETification/consistency of the core API (that is, in the Lucene.Net project). There are several issues that I would like to address including:


1.       Method names that are still camelCase

2.       Properties that should be methods (because they do a lot of processing or because they are non-deterministic)

3.       Methods that should be properties

4.       .Size() vs .Size vs .Count - should generally all be .Count in .NET

5.       Interfaces should begin with "I"

6.       Classes should not begin with "I" followed by another capital letter (for some reason some of them were named that way)

7.       .CharAt() should probably be this[]

8.       Generic types nested within generic types (which cause Visual Studio to crash when Intellisense tries to read them)

... and so on. The only thing is these are all sweeping changes that will affect everyone helping out on Lucene.Net and anyone who is currently using the beta. So, I just wanted to gather some input on when the most appropriate time to begin working on these sweeping changes would be?


Thanks,
Shad Storhaug (NightOwl888)







Re: Remaining Work/Priorities

Posted by Itamar Syn-Hershko <it...@gmail.com>.
Hi, it's my plan to take some of those tasks but middle of holiday season
here so getting caught up with other stuff. Go ahead and take it if you
have spare cycles, I got enough to focus on the docs and demos side anyway.

On Sat, Oct 22, 2016 at 8:55 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Itamar,
>
> I just noticed that there were some line break issues in your email that
> made me think that you were working on QueryParser.Flexible, when in fact
> you mentioned you would be working on examples (to replace the Demo).
>
> http://apache.markmail.org/search/?q=lucenenet+list%
> 3Aorg.apache.incubator.lucene-net-dev+priorities#query:
> lucenenet%20list%3Aorg.apache.incubator.lucene-net-dev%
> 20priorities+page:1+mid:64xvjbi75oypbfxb+state:results
>
> But you didn't say anything after my replies (below) that mentioned you
> would be working on QueryParser.Flexible.
>
> > I'm on it QueryParser.Flexible
>
> Please clarify, are you working on QueryParser.Flexible, or not? If not, I
> would like to fix the context issues in the tests over there ASAP to
> eliminate the false negatives.
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
> -----Original Message-----
> From: Shad Storhaug
> Sent: Wednesday, October 12, 2016 2:10 AM
> To: 'dev@lucenenet.apache.org'
> Cc: 'Connie Yau'; 'cribs2@gmail.com'; 'itamar.synhershko@gmail.com'
> Subject: RE: Remaining Work/Priorities
>
> Update
> ======
>
> I have just pushed some commits that fix several bugs in the
> Lucene.Net.Codecs project (all 452 tests pass most of the time, a few
> random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.
>
>
> Fix for Test Context
> -------------------------
>
> For now, I have added method override stubs to each subclass in order to
> add the [Test] attribute, so NUnit will run them in the correct context. I
> did that on all of the superclass tests except for the ones in QueryParser
> (since Itamar mentioned he would be working in that area). Itamar, you will
> probably need to follow suit to get all of the QP tests to pass - namely
> with the QueryParserTestBase and TestQueryParser classes.
>
> I have carefully put all of these changes into a single commit so it can
> be reverted easily, if this solution doesn't happen to be compatible with
> xUnit: https://github.com/apache/lucenenet/commit/
> 2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c. Hopefully that makes life
> easier for @conniey.
>
> @Itamar, let me know when this is completed on your end so I can do a
> double revert and squash the test stubs from QueryParser into an
> all-inclusive revert-able commit.
>
> We can now correctly see how many tests we have in the core. Currently
> there are 2730 - it seems we are still missing 720 tests, assuming they all
> were for something port-able.
>
>
> Remaining Tests
> ---------------------
>
> Next I plan to work on locating any tests that we have missed (starting in
> the core). It seems these fall into several categories:
>
> 1. Tests that have not yet been ported.
> 2. Tests that have been partially ported that have not been added to the
> project.
> 3. Tests that have been ported, but are missing the [Test] attribute.
> 4. Tests in classes that have been ported that have been commented out
> (presumably because at the time they were ported the dependencies did not
> yet exist).
> 5. Tests that have been Ignored in .NET that were not in Java.
> 6. Tests that have NUnit Assume.That() logic that depends on some
> non-existant JRE condition, so they are not running in .NET.
>
> I'll make a quick effort to get them to pass, but the main goal will be to
> ensure they all can run and are included in the project. Just a heads up
> that the number of test failures is likely to increase on this pass (but
> the number of bugs will likely decrease).
>
>
> Failing Core Tests
> -----------------------
>
> I have looked into the remaining tests somewhat. There are 2 issues that I
> need some input on to solve.
>
>
> TestRamUsageEstimator.TestSanity()
>
> Java Lucene uses a JRE-specific API to determine how much header size to
> add on each field. This makes the estimates higher in Java. But more
> importantly, this test is failing because the estimate for a real string
> instance is coming back as the same size as its shallow size (16 bytes in
> this case) - it needs to be at least 1 byte more than that for the test to
> pass. In Java (at least in a 64 bit environment), there are an extra 4
> bytes being added for each field.
>
> Technically, there is a way to get these numbers from .NET, but it
> involves calling undocumented APIs using pointers and will likely be
> different from one .NET version to the next (a bad idea for a project that
> needs to support multiple .NET versions). The only solution I can think of
> is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for
> 32 bit) in order to make the numbers for the instances larger than their
> shallow size. I suppose the alternative would be to either comment out the
> string test or change it to >= make it pass. Thoughts? Alternatives?
>
>
> TestNumericDocValuesUpdates.TestUpdateOldSegments()
>
> I discovered what the issue is here (normally that is the hard part), but
> it seems that the proper solution is going to be a major task. The
> NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a
> service locator to load classes throughout the project. In the Codec
> abstract class, it is used to load up the codec for the context it is used
> in. However, our port of the NamedSPILoader simply loads all of the classes
> from the current AppDomain without any way to order them or override them.
>
> The problem is that in Lucene, this was meant to be an extension point.
> And this particular test (and probably many more of them) uses that
> extension point to change the codec to a Mock from the test framework. This
> line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the
> issue is:
>
> > Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have
> > tests-framework.jar before lucene-core.jar");
>
> Basically, it is using a configuration file to order the classes that are
> loaded so the test mocks take priority over the built-in codecs.
>
> Just fixing the test could be done by making the static NamedSPILoader
> variable in the Codec class internal and swapping in a test double.
> However, that doesn't solve the bigger issue that Lucene.Net is missing its
> extensibility for anyone who wants to write their own codec (or tap into
> one of the other extensibility points). I guess the bigger question is how
> important will it be for anyone to extend Lucene codecs or inject
> dependencies into Analyzer factories? There doesn’t appear to be any more
> extensibility than that in Lucene 4.8.0, but that could change in more
> recent or future versions of Lucene.
>
>
> CI Builds
> -----------
>
> Not working. Can someone look into that please?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
> -----Original Message-----
> From: Shad Storhaug
> Sent: Wednesday, October 5, 2016 8:23 PM
> To: dev@lucenenet.apache.org
> Cc: Connie Yau; 'cribs2@gmail.com'
> Subject: RE: Remaining Work/Priorities
>
> > Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs
> from the analysis.commons module?
>
> Just for clarification, these are two entirely different things in Java.
> Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of
> Java:
>
> import java.text.BreakIterator;
> import java.text.Collator;
> import java.text.ParseException;
> import java.text.RuleBasedCollator;
>
> Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also
> depend on parts of Java:
>
> import java.text.BreakIterator;
> import java.text.CharacterIterator;
>
> Analysis.ICU depends on a separate (icu4j) package:
>
> import com.ibm.icu.text.Normalizer;
> import com.ibm.icu.text.Normalizer2;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.Replaceable;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.UTF16;
> import com.ibm.icu.text.UnicodeSet;
> import com.ibm.icu.text.FilteredNormalizer2;
> import com.ibm.icu.text.Collator;
> import com.ibm.icu.text.RuleBasedCollator;
> import com.ibm.icu.util.ULocale;
> import com.ibm.icu.text.RawCollationKey;
>
> That said, icu4j DOES have Collator and RuleBasedCollator classes, but it
> DOES NOT have a BreakIterator or CharacterIterator class. It is unclear
> whether the Collator from icu4j would work as a replacement for the one in
> core Java.
>
> When I was digging through the JDK code, I noticed that BreakIterator and
> RuleBasedCollator have a lot of common ICU dependencies there, so even if
> the RuleBasedCollator from icu4j is compatible, it might make sense for us
> to port the one from Java anyway so we are dealing with the same shared
> dependencies in Analysis.Common.
>
> Once we port over the classes from the Java JDK, we will be able to
> eliminate our current ICU4NET dependency (and the platform issues that come
> with it). That said, porting over those pieces could take considerable
> work. In the interim it might make sense to make separate projects/NuGet
> packages to isolate the areas that depend on BreakIterator,
> CharacterIterator, and RuleBasedCollator so the rest can be released for
> wide/cross-platform use. Perhaps we can even make a basic (scaled down)
> BreakIterator for Highlighter that breaks on spaces between words and
> punctuation between sentences, which wouldn't work for Thai, but would work
> for most other languages.
>
> Porting the (icu4j) package is another complete ball of yarn, we should
> take a look at (https://github.com/sillsdev/icu-dotnet) to see if there
> is enough overlap there to power Analysis.ICU (offhand it looks as though
> some classes are missing, though). It is a wrapper around the C library -
> it may be that we just need to port more of it to get all of the pieces we
> need.
>
> Speaking of Collation, @ChristopherHaws have you made any more progress on
> Analysis.Collation? Were you able to determine if icu-dotnet's collator
> will make the tests pass?
>
> > I'm on it QueryParser.Flexible
>
> Great. The TimeZone probably just needs more research to work out how to
> utilize (in order to implement the failing test). Also, FYI MSDN's
> recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(
> v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed
> that several of the tests were recently modified to use TimeZone rather
> than TimeZoneInfo).
>
> As for the culture, in .NET I am pretty sure that we need to pass it as a
> parameter to another overload of `QueryParser.Parse` rather than making it
> a property of QueryParser. But we can deal with that in one step after you
> have finished porting.
>
> --
>
> Shad Storhaug (NightOwl888)
>
> -----Original Message-----
> From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On
> Behalf Of Itamar Syn-Hershko
> Sent: Wednesday, October 5, 2016 5:28 AM
> To: dev@lucenenet.apache.org
> Cc: Connie Yau
> Subject: Re: Remaining Work/Priorities
>
> Awesome, thanks for all the hard work Shad!
>
> Our first priority should be fixing all remaining tests - in particular
> the one in Core. We should be ready to release and stamp our builds as 100%
> stable. As you mentioned, this could be an infrastructure issue - hopefully
> *Connie *can give a status update on her effort on the switch to xUnit?
>
> With regards to Modules, here's an updated breakdown based on your email +
> forgotten pieces + my comments:
>
> *Ported:*
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0
> failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total
> Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including
> #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total
> Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42
> total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1
> failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total
>
> We should do a second pass on the pieces we marked as ported, just to make
> sure the port is full and we didn't leave anything behind :)
>
> *Need to be ported:*
> Highlighter (Depends on Collator (which is still being ported) and
> BreakIterator (which we don't have a solution that works in .NET core yet))
> Spatial (has 3rd party libraries that need to be updates) Spatial4n (
> https://github.com/synhershko/Spatial4N) needs to be brought up to speed
> with spatial4j, dependencies of which may cause some issues....
> Codecs
> Partially ported, mostly the tests weren't ported Grouping Not urgent, but
> provides nice functionality that users will probably like
>
> The only part with dependencies seems to be the spatial module - I will
> have a look there soon if you don't get to that before I do.
>
> *Can wait* - some modules are less frequently used, we should stabilize
> and release first and then work on them based on demand Analysis.ICU
> (Depends on ICU4j) hopefully we can remove the ICU DLLs from the
> analysis.commons module? I keep getting reports on some issues they are
> causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik)
> Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly
> helper libraries, so there's probably not real dependency just lots of
> replacement Analysis.SmartCN Analysis.Stempel (currently in progress)
> Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo
> while important because can help newbies, we can do better by providing
> docs and real world examples. I'm on it QueryParser.Flexible
>
> *No need to port* - neither are needed in our context Benchmark (many
> dependencies) Replicator (many dependencies) Sandbox (Depends on Apache
> Jakarta)
>
> Once all modules are ported and all tests are passing, I think we should
> get two more items fixed before an official release:
>
> 1. .NET Core support - I'm not clear on the status of it at the moment. We
> probably want to have it in for the release.
>
> 2. Public API Inconsistencies. We can discuss what should be done and what
> not when we get to that stage. Some are an obvious "fixme" but some will
> break code compatibility with Java I think we should avoid.
>
> One last note - *Wyatt*, do we know why there are no CI builds lately?
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant Lucene.NET committer and PMC member
>
> On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
> wrote:
>
> > Hello,
> >
> > I just wanted to open this discussion to talk about the work remaining
> > to be done on Lucene.Net version 4.8.0. We are nearly there, but that
> > doesn't mean we don't still need help!
> >
> >
> > FAILING TESTS
> > -------------------
> >
> > We now have over 5000 passing tests and as soon as pull request #188 (
> > https://github.com/apache/lucenenet/pull/188) is merged, by my count
> > we have only 20 (actual) failing tests. Here is the breakdown by project:
> >
> > Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> > - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9
> > total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet -
> > (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0
> > failing / 27 total Lucene.Net.Memory - 0 failing / 10 total
> > Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing
> > / 96 total Lucene.Net.QueryParser - 1 failing / 203 total
> > Lucene.Net.Suggest - 0 failing / 142 total
> >
> > The reason why I said ACTUAL tests above is because I recently
> > discovered that many of the "failures" that are being reported are
> > false negatives (in fact, the VS2015 NUnit test runner shows there are
> > 135 failing tests total and 902 tests total that don't belong to any
> > project). Most NUnit 2.6 test runners do not correctly run tests in
> > shared abstract classes with the correct context (test setup) to make
> > them pass. These out-of-context runs add several additional minutes to
> the test run.
> >
> > As an experiment, I upgraded to NUnit 3.4.1 and it helped the
> > situation somewhat - that is, it ran the tests in the correct context
> > and I was able to determine that we have more tests than the numbers
> > above and they are all succeeding. However, it also ran the tests in
> > an invalid context (that is, the context of the abstract class without
> > any setup) and some of them still showed as failures.
> >
> > I know @conniey is currently working on porting the tests over to xUnit.
> > Hopefully, swapping test frameworks alone (or using some of the new
> > fancy test attributes) is enough to fix this issue. If not, we need to
> > find another solution - preferably one that can be applied to all of
> > the tests in abstract classes without too much effort or changing them
> > so they are too different from their Java counterpart.
> >
> > Remaining Pieces to Port
> > ---------------------------------
> >
> > I took an inventory of the remaining pieces left to port a few days
> > ago and here is what that looks like (alphabetical order):
> >
> > 1. Analysis.ICU (Depends on ICU4j)
> > 2. Analysis.Kuromoji
> > 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic
> > (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel
> > (currently in progress) 7. Analysis.UIMA (Depends on Tagger,
> > uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> > Demo 10. Highlighter (Depends on Collator (which is still being
> > ported) and BreakIterator (which we don't have a solution that works
> > in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox
> > (Depends on Apache Jakarta) 13. Spatial (Already ported in #174
> > (https://github.com/apache/ lucenenet/pull/174), needs a recent
> > version of spatial4n) 14. QueryParser.Flexible
> >
> > Itamar, it would be helpful if you would be so kind as to organize
> > this list in terms of priority. It also couldn't hurt to update the
> > contributing documents
> > (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> > and
> > https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> > with the latest information so anyone who wants to help out knows the
> > current status.
> >
> > Of course, it is the known status of dependencies that we need
> > clarification on. Which of these dependencies is known to be ported?
> > Which of them are ported but are not up to date? Which of them are
> > known not to be ported, and which of them are unknown?
> >
> >
> > Public API Inconsistencies
> > ---------------------------------
> >
> > One thing that I have had my eye on for a while now is the
> > .NETification/consistency of the core API (that is, in the Lucene.Net
> > project). There are several issues that I would like to address
> including:
> >
> >
> > 1.       Method names that are still camelCase
> >
> > 2.       Properties that should be methods (because they do a lot of
> > processing or because they are non-deterministic)
> >
> > 3.       Methods that should be properties
> >
> > 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> > .NET
> >
> > 5.       Interfaces should begin with "I"
> >
> > 6.       Classes should not begin with "I" followed by another capital
> > letter (for some reason some of them were named that way)
> >
> > 7.       .CharAt() should probably be this[]
> >
> > 8.       Generic types nested within generic types (which cause Visual
> > Studio to crash when Intellisense tries to read them)
> >
> > ... and so on. The only thing is these are all sweeping changes that
> > will affect everyone helping out on Lucene.Net and anyone who is
> > currently using the beta. So, I just wanted to gather some input on
> > when the most appropriate time to begin working on these sweeping
> changes would be?
> >
> >
> > Thanks,
> > Shad Storhaug (NightOwl888)
> >
> >
> >
> >
> >
> >
> >
>

RE: Remaining Work/Priorities

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Itamar,

I just noticed that there were some line break issues in your email that made me think that you were working on QueryParser.Flexible, when in fact you mentioned you would be working on examples (to replace the Demo). 

http://apache.markmail.org/search/?q=lucenenet+list%3Aorg.apache.incubator.lucene-net-dev+priorities#query:lucenenet%20list%3Aorg.apache.incubator.lucene-net-dev%20priorities+page:1+mid:64xvjbi75oypbfxb+state:results

But you didn't say anything after my replies (below) that mentioned you would be working on QueryParser.Flexible.

> I'm on it QueryParser.Flexible

Please clarify, are you working on QueryParser.Flexible, or not? If not, I would like to fix the context issues in the tests over there ASAP to eliminate the false negatives.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Shad Storhaug 
Sent: Wednesday, October 12, 2016 2:10 AM
To: 'dev@lucenenet.apache.org'
Cc: 'Connie Yau'; 'cribs2@gmail.com'; 'itamar.synhershko@gmail.com'
Subject: RE: Remaining Work/Priorities

Update
======

I have just pushed some commits that fix several bugs in the Lucene.Net.Codecs project (all 452 tests pass most of the time, a few random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.


Fix for Test Context
-------------------------

For now, I have added method override stubs to each subclass in order to add the [Test] attribute, so NUnit will run them in the correct context. I did that on all of the superclass tests except for the ones in QueryParser (since Itamar mentioned he would be working in that area). Itamar, you will probably need to follow suit to get all of the QP tests to pass - namely with the QueryParserTestBase and TestQueryParser classes.

I have carefully put all of these changes into a single commit so it can be reverted easily, if this solution doesn't happen to be compatible with xUnit: https://github.com/apache/lucenenet/commit/2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c. Hopefully that makes life easier for @conniey.

@Itamar, let me know when this is completed on your end so I can do a double revert and squash the test stubs from QueryParser into an all-inclusive revert-able commit.

We can now correctly see how many tests we have in the core. Currently there are 2730 - it seems we are still missing 720 tests, assuming they all were for something port-able.


Remaining Tests
---------------------

Next I plan to work on locating any tests that we have missed (starting in the core). It seems these fall into several categories:

1. Tests that have not yet been ported.
2. Tests that have been partially ported that have not been added to the project.
3. Tests that have been ported, but are missing the [Test] attribute.
4. Tests in classes that have been ported that have been commented out (presumably because at the time they were ported the dependencies did not yet exist).
5. Tests that have been Ignored in .NET that were not in Java.
6. Tests that have NUnit Assume.That() logic that depends on some non-existant JRE condition, so they are not running in .NET.

I'll make a quick effort to get them to pass, but the main goal will be to ensure they all can run and are included in the project. Just a heads up that the number of test failures is likely to increase on this pass (but the number of bugs will likely decrease).


Failing Core Tests
-----------------------

I have looked into the remaining tests somewhat. There are 2 issues that I need some input on to solve.


TestRamUsageEstimator.TestSanity()

Java Lucene uses a JRE-specific API to determine how much header size to add on each field. This makes the estimates higher in Java. But more importantly, this test is failing because the estimate for a real string instance is coming back as the same size as its shallow size (16 bytes in this case) - it needs to be at least 1 byte more than that for the test to pass. In Java (at least in a 64 bit environment), there are an extra 4 bytes being added for each field.

Technically, there is a way to get these numbers from .NET, but it involves calling undocumented APIs using pointers and will likely be different from one .NET version to the next (a bad idea for a project that needs to support multiple .NET versions). The only solution I can think of is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for 32 bit) in order to make the numbers for the instances larger than their shallow size. I suppose the alternative would be to either comment out the string test or change it to >= make it pass. Thoughts? Alternatives?


TestNumericDocValuesUpdates.TestUpdateOldSegments()

I discovered what the issue is here (normally that is the hard part), but it seems that the proper solution is going to be a major task. The NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a service locator to load classes throughout the project. In the Codec abstract class, it is used to load up the codec for the context it is used in. However, our port of the NamedSPILoader simply loads all of the classes from the current AppDomain without any way to order them or override them.

The problem is that in Lucene, this was meant to be an extension point. And this particular test (and probably many more of them) uses that extension point to change the codec to a Mock from the test framework. This line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the issue is:

> Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have 
> tests-framework.jar before lucene-core.jar");

Basically, it is using a configuration file to order the classes that are loaded so the test mocks take priority over the built-in codecs.

Just fixing the test could be done by making the static NamedSPILoader variable in the Codec class internal and swapping in a test double. However, that doesn't solve the bigger issue that Lucene.Net is missing its extensibility for anyone who wants to write their own codec (or tap into one of the other extensibility points). I guess the bigger question is how important will it be for anyone to extend Lucene codecs or inject dependencies into Analyzer factories? There doesn’t appear to be any more extensibility than that in Lucene 4.8.0, but that could change in more recent or future versions of Lucene.


CI Builds
-----------

Not working. Can someone look into that please?


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Shad Storhaug
Sent: Wednesday, October 5, 2016 8:23 PM
To: dev@lucenenet.apache.org
Cc: Connie Yau; 'cribs2@gmail.com'
Subject: RE: Remaining Work/Priorities

> Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module?

Just for clarification, these are two entirely different things in Java. Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of Java:

import java.text.BreakIterator;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also depend on parts of Java:

import java.text.BreakIterator;
import java.text.CharacterIterator;

Analysis.ICU depends on a separate (icu4j) package:

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Replaceable;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.RawCollationKey;

That said, icu4j DOES have Collator and RuleBasedCollator classes, but it DOES NOT have a BreakIterator or CharacterIterator class. It is unclear whether the Collator from icu4j would work as a replacement for the one in core Java.

When I was digging through the JDK code, I noticed that BreakIterator and RuleBasedCollator have a lot of common ICU dependencies there, so even if the RuleBasedCollator from icu4j is compatible, it might make sense for us to port the one from Java anyway so we are dealing with the same shared dependencies in Analysis.Common.

Once we port over the classes from the Java JDK, we will be able to eliminate our current ICU4NET dependency (and the platform issues that come with it). That said, porting over those pieces could take considerable work. In the interim it might make sense to make separate projects/NuGet packages to isolate the areas that depend on BreakIterator, CharacterIterator, and RuleBasedCollator so the rest can be released for wide/cross-platform use. Perhaps we can even make a basic (scaled down) BreakIterator for Highlighter that breaks on spaces between words and punctuation between sentences, which wouldn't work for Thai, but would work for most other languages.

Porting the (icu4j) package is another complete ball of yarn, we should take a look at (https://github.com/sillsdev/icu-dotnet) to see if there is enough overlap there to power Analysis.ICU (offhand it looks as though some classes are missing, though). It is a wrapper around the C library - it may be that we just need to port more of it to get all of the pieces we need.

Speaking of Collation, @ChristopherHaws have you made any more progress on Analysis.Collation? Were you able to determine if icu-dotnet's collator will make the tests pass?

> I'm on it QueryParser.Flexible

Great. The TimeZone probably just needs more research to work out how to utilize (in order to implement the failing test). Also, FYI MSDN's recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed that several of the tests were recently modified to use TimeZone rather than TimeZoneInfo).

As for the culture, in .NET I am pretty sure that we need to pass it as a parameter to another overload of `QueryParser.Parse` rather than making it a property of QueryParser. But we can deal with that in one step after you have finished porting.

--

Shad Storhaug (NightOwl888)

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar Syn-Hershko
Sent: Wednesday, October 5, 2016 5:28 AM
To: dev@lucenenet.apache.org
Cc: Connie Yau
Subject: Re: Remaining Work/Priorities

Awesome, thanks for all the hard work Shad!

Our first priority should be fixing all remaining tests - in particular the one in Core. We should be ready to release and stamp our builds as 100% stable. As you mentioned, this could be an infrastructure issue - hopefully *Connie *can give a status update on her effort on the switch to xUnit?

With regards to Modules, here's an updated breakdown based on your email + forgotten pieces + my comments:

*Ported:*
Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1 failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total

We should do a second pass on the pieces we marked as ported, just to make sure the port is full and we didn't leave anything behind :)

*Need to be ported:*
Highlighter (Depends on Collator (which is still being ported) and BreakIterator (which we don't have a solution that works in .NET core yet)) Spatial (has 3rd party libraries that need to be updates) Spatial4n (https://github.com/synhershko/Spatial4N) needs to be brought up to speed with spatial4j, dependencies of which may cause some issues....
Codecs
Partially ported, mostly the tests weren't ported Grouping Not urgent, but provides nice functionality that users will probably like

The only part with dependencies seems to be the spatial module - I will have a look there soon if you don't get to that before I do.

*Can wait* - some modules are less frequently used, we should stabilize and release first and then work on them based on demand Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module? I keep getting reports on some issues they are causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik) Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly helper libraries, so there's probably not real dependency just lots of replacement Analysis.SmartCN Analysis.Stempel (currently in progress) Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo while important because can help newbies, we can do better by providing docs and real world examples. I'm on it QueryParser.Flexible

*No need to port* - neither are needed in our context Benchmark (many dependencies) Replicator (many dependencies) Sandbox (Depends on Apache Jakarta)

Once all modules are ported and all tests are passing, I think we should get two more items fixed before an official release:

1. .NET Core support - I'm not clear on the status of it at the moment. We probably want to have it in for the release.

2. Public API Inconsistencies. We can discuss what should be done and what not when we get to that stage. Some are an obvious "fixme" but some will break code compatibility with Java I think we should avoid.

One last note - *Wyatt*, do we know why there are no CI builds lately?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko> Freelance Developer & Consultant Lucene.NET committer and PMC member

On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining 
> to be done on Lucene.Net version 4.8.0. We are nearly there, but that 
> doesn't mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count 
> we have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 
> total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - 
> (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 
> failing / 27 total Lucene.Net.Memory - 0 failing / 10 total 
> Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing 
> / 96 total Lucene.Net.QueryParser - 1 failing / 203 total 
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently 
> discovered that many of the "failures" that are being reported are 
> false negatives (in fact, the VS2015 NUnit test runner shows there are
> 135 failing tests total and 902 tests total that don't belong to any 
> project). Most NUnit 2.6 test runners do not correctly run tests in 
> shared abstract classes with the correct context (test setup) to make 
> them pass. These out-of-context runs add several additional minutes to the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the 
> situation somewhat - that is, it ran the tests in the correct context 
> and I was able to determine that we have more tests than the numbers 
> above and they are all succeeding. However, it also ran the tests in 
> an invalid context (that is, the context of the abstract class without 
> any setup) and some of them still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new 
> fancy test attributes) is enough to fix this issue. If not, we need to 
> find another solution - preferably one that can be applied to all of 
> the tests in abstract classes without too much effort or changing them 
> so they are too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days 
> ago and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic 
> (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel 
> (currently in progress) 7. Analysis.UIMA (Depends on Tagger, 
> uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> Demo 10. Highlighter (Depends on Collator (which is still being
> ported) and BreakIterator (which we don't have a solution that works 
> in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox 
> (Depends on Apache Jakarta) 13. Spatial (Already ported in #174 
> (https://github.com/apache/ lucenenet/pull/174), needs a recent 
> version of spatial4n) 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize 
> this list in terms of priority. It also couldn't hurt to update the 
> contributing documents 
> (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and
> https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the 
> current status.
>
> Of course, it is the known status of dependencies that we need 
> clarification on. Which of these dependencies is known to be ported?
> Which of them are ported but are not up to date? Which of them are 
> known not to be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the 
> .NETification/consistency of the core API (that is, in the Lucene.Net 
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that 
> will affect everyone helping out on Lucene.Net and anyone who is 
> currently using the beta. So, I just wanted to gather some input on 
> when the most appropriate time to begin working on these sweeping changes would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>

RE: Remaining Work/Priorities

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Itamar,

I contacted my ISP and I think I now have it straightened out. Could you send me a test email to verify?


I just found a bug that is due to the "Old format impersonation is active" setting being made non-static. The class is loaded by the SPIClassIterator and in that class the setting is hard coded in the default constructor to true (even though in this case it is false in LuceneTestCase). Can I pick your brain to understand what the reasoning is for changing this to an instance variable? In this case we have a global setting combined with constrained construction so the only reasonable way for the class to read it is to make it static.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar Syn-Hershko
Sent: Thursday, October 13, 2016 12:09 AM
To: dev@lucenenet.apache.org
Subject: Re: Remaining Work/Priorities

While on that note, Shad - emails to you bounce with the following error
(still):

Delivery to the following recipient failed permanently:

     shad@shadstorhaug.com

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the server for the recipient domain shadstorhaug.com by mx1.hostmailserver.com.
[69.160.246.214].

The error that the other server returned was:
554 5.7.1 gmail.com is blacklisted.

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko> Freelance Developer & Consultant Lucene.NET committer and PMC member

On Wed, Oct 12, 2016 at 8:06 PM, Itamar Syn-Hershko <it...@code972.com>
wrote:

> CI failure seems to be worked on: 
> https://twitter.com/codebetterCI/status/
> 785854074713468932 (Thanks Wyatt for pointing that out)
>
> I will look into the rest in a little while
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko> 
> Freelance Developer & Consultant Lucene.NET committer and PMC member
>
> On Tue, Oct 11, 2016 at 10:10 PM, Shad Storhaug 
> <sh...@shadstorhaug.com>
> wrote:
>
>> Update
>> ======
>>
>> I have just pushed some commits that fix several bugs in the 
>> Lucene.Net.Codecs project (all 452 tests pass most of the time, a few 
>> random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.
>>
>>
>> Fix for Test Context
>> -------------------------
>>
>> For now, I have added method override stubs to each subclass in order 
>> to add the [Test] attribute, so NUnit will run them in the correct 
>> context. I did that on all of the superclass tests except for the 
>> ones in QueryParser (since Itamar mentioned he would be working in 
>> that area). Itamar, you will probably need to follow suit to get all 
>> of the QP tests to pass - namely with the QueryParserTestBase and TestQueryParser classes.
>>
>> I have carefully put all of these changes into a single commit so it 
>> can be reverted easily, if this solution doesn't happen to be 
>> compatible with
>> xUnit: https://github.com/apache/lucenenet/commit/2a79edea6359e1ee1
>> f83269cc7dc3ef2753ebf2c. Hopefully that makes life easier for @conniey.
>>
>> @Itamar, let me know when this is completed on your end so I can do a 
>> double revert and squash the test stubs from QueryParser into an 
>> all-inclusive revert-able commit.
>>
>> We can now correctly see how many tests we have in the core. 
>> Currently there are 2730 - it seems we are still missing 720 tests, 
>> assuming they all were for something port-able.
>>
>>
>> Remaining Tests
>> ---------------------
>>
>> Next I plan to work on locating any tests that we have missed 
>> (starting in the core). It seems these fall into several categories:
>>
>> 1. Tests that have not yet been ported.
>> 2. Tests that have been partially ported that have not been added to 
>> the project.
>> 3. Tests that have been ported, but are missing the [Test] attribute.
>> 4. Tests in classes that have been ported that have been commented 
>> out (presumably because at the time they were ported the dependencies 
>> did not yet exist).
>> 5. Tests that have been Ignored in .NET that were not in Java.
>> 6. Tests that have NUnit Assume.That() logic that depends on some 
>> non-existant JRE condition, so they are not running in .NET.
>>
>> I'll make a quick effort to get them to pass, but the main goal will 
>> be to ensure they all can run and are included in the project. Just a 
>> heads up that the number of test failures is likely to increase on 
>> this pass (but the number of bugs will likely decrease).
>>
>>
>> Failing Core Tests
>> -----------------------
>>
>> I have looked into the remaining tests somewhat. There are 2 issues 
>> that I need some input on to solve.
>>
>>
>> TestRamUsageEstimator.TestSanity()
>>
>> Java Lucene uses a JRE-specific API to determine how much header size 
>> to add on each field. This makes the estimates higher in Java. But 
>> more importantly, this test is failing because the estimate for a 
>> real string instance is coming back as the same size as its shallow 
>> size (16 bytes in this case) - it needs to be at least 1 byte more 
>> than that for the test to pass. In Java (at least in a 64 bit 
>> environment), there are an extra 4 bytes being added for each field.
>>
>> Technically, there is a way to get these numbers from .NET, but it 
>> involves calling undocumented APIs using pointers and will likely be 
>> different from one .NET version to the next (a bad idea for a project 
>> that needs to support multiple .NET versions). The only solution I 
>> can think of is to hard code in an extra 4 bytes for 64 bit (and most 
>> likely 2 bytes for
>> 32 bit) in order to make the numbers for the instances larger than 
>> their shallow size. I suppose the alternative would be to either 
>> comment out the string test or change it to >= make it pass. Thoughts? Alternatives?
>>
>>
>> TestNumericDocValuesUpdates.TestUpdateOldSegments()
>>
>> I discovered what the issue is here (normally that is the hard part), 
>> but it seems that the proper solution is going to be a major task. 
>> The NamedSPILoader (backed by SPIClassIterator) in Java Lucene is 
>> used as a service locator to load classes throughout the project. In 
>> the Codec abstract class, it is used to load up the codec for the 
>> context it is used in. However, our port of the NamedSPILoader simply 
>> loads all of the classes from the current AppDomain without any way to order them or override them.
>>
>> The problem is that in Lucene, this was meant to be an extension point.
>> And this particular test (and probably many more of them) uses that 
>> extension point to change the codec to a Mock from the test 
>> framework. This line from TestRuleSetupAndRestoreClassEnv pretty much 
>> sums up what the issue is:
>>
>> > Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have
>> tests-framework.jar before lucene-core.jar");
>>
>> Basically, it is using a configuration file to order the classes that 
>> are loaded so the test mocks take priority over the built-in codecs.
>>
>> Just fixing the test could be done by making the static 
>> NamedSPILoader variable in the Codec class internal and swapping in a test double.
>> However, that doesn't solve the bigger issue that Lucene.Net is 
>> missing its extensibility for anyone who wants to write their own 
>> codec (or tap into one of the other extensibility points). I guess 
>> the bigger question is how important will it be for anyone to extend 
>> Lucene codecs or inject dependencies into Analyzer factories? There 
>> doesn’t appear to be any more extensibility than that in Lucene 
>> 4.8.0, but that could change in more recent or future versions of Lucene.
>>
>>
>> CI Builds
>> -----------
>>
>> Not working. Can someone look into that please?
>>
>>
>> Thanks,
>> Shad Storhaug (NightOwl888)
>>
>>
>>
>> -----Original Message-----
>> From: Shad Storhaug
>> Sent: Wednesday, October 5, 2016 8:23 PM
>> To: dev@lucenenet.apache.org
>> Cc: Connie Yau; 'cribs2@gmail.com'
>> Subject: RE: Remaining Work/Priorities
>>
>> > Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU 
>> > DLLs
>> from the analysis.commons module?
>>
>> Just for clarification, these are two entirely different things in Java.
>> Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts 
>> of
>> Java:
>>
>> import java.text.BreakIterator;
>> import java.text.Collator;
>> import java.text.ParseException;
>> import java.text.RuleBasedCollator;
>>
>> Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also 
>> depend on parts of Java:
>>
>> import java.text.BreakIterator;
>> import java.text.CharacterIterator;
>>
>> Analysis.ICU depends on a separate (icu4j) package:
>>
>> import com.ibm.icu.text.Normalizer;
>> import com.ibm.icu.text.Normalizer2;
>> import com.ibm.icu.text.Transliterator; import 
>> com.ibm.icu.text.Replaceable; import com.ibm.icu.text.Transliterator; 
>> import com.ibm.icu.text.UTF16; import com.ibm.icu.text.UnicodeSet; 
>> import com.ibm.icu.text.FilteredNormalizer2;
>> import com.ibm.icu.text.Collator;
>> import com.ibm.icu.text.RuleBasedCollator;
>> import com.ibm.icu.util.ULocale;
>> import com.ibm.icu.text.RawCollationKey;
>>
>> That said, icu4j DOES have Collator and RuleBasedCollator classes, 
>> but it DOES NOT have a BreakIterator or CharacterIterator class. It 
>> is unclear whether the Collator from icu4j would work as a 
>> replacement for the one in core Java.
>>
>> When I was digging through the JDK code, I noticed that BreakIterator 
>> and RuleBasedCollator have a lot of common ICU dependencies there, so 
>> even if the RuleBasedCollator from icu4j is compatible, it might make 
>> sense for us to port the one from Java anyway so we are dealing with 
>> the same shared dependencies in Analysis.Common.
>>
>> Once we port over the classes from the Java JDK, we will be able to 
>> eliminate our current ICU4NET dependency (and the platform issues 
>> that come with it). That said, porting over those pieces could take 
>> considerable work. In the interim it might make sense to make 
>> separate projects/NuGet packages to isolate the areas that depend on 
>> BreakIterator, CharacterIterator, and RuleBasedCollator so the rest 
>> can be released for wide/cross-platform use. Perhaps we can even make 
>> a basic (scaled down) BreakIterator for Highlighter that breaks on 
>> spaces between words and punctuation between sentences, which 
>> wouldn't work for Thai, but would work for most other languages.
>>
>> Porting the (icu4j) package is another complete ball of yarn, we 
>> should take a look at (https://github.com/sillsdev/icu-dotnet) to see 
>> if there is enough overlap there to power Analysis.ICU (offhand it 
>> looks as though some classes are missing, though). It is a wrapper 
>> around the C library - it may be that we just need to port more of it 
>> to get all of the pieces we need.
>>
>> Speaking of Collation, @ChristopherHaws have you made any more 
>> progress on Analysis.Collation? Were you able to determine if 
>> icu-dotnet's collator will make the tests pass?
>>
>> > I'm on it QueryParser.Flexible
>>
>> Great. The TimeZone probably just needs more research to work out how 
>> to utilize (in order to implement the failing test). Also, FYI MSDN's 
>> recommendation (https://msdn.microsoft.com/en
>> -us/library/system.timezone(v=vs.110).aspx) is to use TimeZoneInfo 
>> rather than TimeZone (I noticed that several of the tests were 
>> recently modified to use TimeZone rather than TimeZoneInfo).
>>
>> As for the culture, in .NET I am pretty sure that we need to pass it 
>> as a parameter to another overload of `QueryParser.Parse` rather than 
>> making it a property of QueryParser. But we can deal with that in one 
>> step after you have finished porting.
>>
>> --
>>
>> Shad Storhaug (NightOwl888)
>>
>> -----Original Message-----
>> From: itamar.synhershko@gmail.com 
>> [mailto:itamar.synhershko@gmail.com]
>> On Behalf Of Itamar Syn-Hershko
>> Sent: Wednesday, October 5, 2016 5:28 AM
>> To: dev@lucenenet.apache.org
>> Cc: Connie Yau
>> Subject: Re: Remaining Work/Priorities
>>
>> Awesome, thanks for all the hard work Shad!
>>
>> Our first priority should be fixing all remaining tests - in 
>> particular the one in Core. We should be ready to release and stamp 
>> our builds as 100% stable. As you mentioned, this could be an 
>> infrastructure issue - hopefully *Connie *can give a status update on her effort on the switch to xUnit?
>>
>> With regards to Modules, here's an updated breakdown based on your 
>> email
>> + forgotten pieces + my comments:
>>
>> *Ported:*
>> Lucene.Net (Core) - 15 failing / 1989 total 
>> Lucene.Net.Analysis.Common -
>> 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 
>> total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet 
>> - (including
>> #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 
>> total Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 
>> failing / 42 total Lucene.Net.Queries - 2 failing / 96 total 
>> Lucene.Net.QueryParser - 1 failing / 203 total Lucene.Net.Suggest - 0 
>> failing / 142 total
>>
>> We should do a second pass on the pieces we marked as ported, just to 
>> make sure the port is full and we didn't leave anything behind :)
>>
>> *Need to be ported:*
>> Highlighter (Depends on Collator (which is still being ported) and 
>> BreakIterator (which we don't have a solution that works in .NET core 
>> yet)) Spatial (has 3rd party libraries that need to be updates) 
>> Spatial4n (
>> https://github.com/synhershko/Spatial4N) needs to be brought up to 
>> speed with spatial4j, dependencies of which may cause some issues....
>> Codecs
>> Partially ported, mostly the tests weren't ported Grouping Not 
>> urgent, but provides nice functionality that users will probably like
>>
>> The only part with dependencies seems to be the spatial module - I 
>> will have a look there soon if you don't get to that before I do.
>>
>> *Can wait* - some modules are less frequently used, we should 
>> stabilize and release first and then work on them based on demand 
>> Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs 
>> from the analysis.commons module? I keep getting reports on some 
>> issues they are causing Analysis.Kuromoji Analysis.Morfologik 
>> (Depends on Morfologik) Analysis.Phonetic (Depends on Apache Commons) 
>> Apache commons is mostly helper libraries, so there's probably not 
>> real dependency just lots of replacement Analysis.SmartCN 
>> Analysis.Stempel (currently in progress) Analysis.UIMA (Depends on 
>> Tagger, uimaj-core, WhiteSpaceTokenizer) Demo while important because 
>> can help newbies, we can do better by providing docs and real world 
>> examples. I'm on it QueryParser.Flexible
>>
>> *No need to port* - neither are needed in our context Benchmark (many
>> dependencies) Replicator (many dependencies) Sandbox (Depends on 
>> Apache
>> Jakarta)
>>
>> Once all modules are ported and all tests are passing, I think we 
>> should get two more items fixed before an official release:
>>
>> 1. .NET Core support - I'm not clear on the status of it at the moment.
>> We probably want to have it in for the release.
>>
>> 2. Public API Inconsistencies. We can discuss what should be done and 
>> what not when we get to that stage. Some are an obvious "fixme" but 
>> some will break code compatibility with Java I think we should avoid.
>>
>> One last note - *Wyatt*, do we know why there are no CI builds lately?
>>
>> --
>>
>> Itamar Syn-Hershko
>> http://code972.com | @synhershko <https://twitter.com/synhershko> 
>> Freelance Developer & Consultant Lucene.NET committer and PMC member
>>
>> On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug 
>> <sh...@shadstorhaug.com>
>> wrote:
>>
>> > Hello,
>> >
>> > I just wanted to open this discussion to talk about the work 
>> > remaining to be done on Lucene.Net version 4.8.0. We are nearly 
>> > there, but that doesn't mean we don't still need help!
>> >
>> >
>> > FAILING TESTS
>> > -------------------
>> >
>> > We now have over 5000 passing tests and as soon as pull request 
>> > #188 (
>> > https://github.com/apache/lucenenet/pull/188) is merged, by my 
>> > count we have only 20 (actual) failing tests. Here is the breakdown 
>> > by
>> project:
>> >
>> > Lucene.Net (Core) - 15 failing / 1989 total 
>> > Lucene.Net.Analysis.Common
>> > - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 
>> > total Lucene.Net.Expressions - 0 failing / 94 total 
>> > Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total 
>> > Lucene.Net.Join - 0 failing / 27 total Lucene.Net.Memory - 0 
>> > failing / 10 total Lucene.Net.Misc - 2 failing / 42 total 
>> > Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 
>> > 1 failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total
>> >
>> > The reason why I said ACTUAL tests above is because I recently 
>> > discovered that many of the "failures" that are being reported are 
>> > false negatives (in fact, the VS2015 NUnit test runner shows there 
>> > are
>> > 135 failing tests total and 902 tests total that don't belong to 
>> > any project). Most NUnit 2.6 test runners do not correctly run 
>> > tests in shared abstract classes with the correct context (test 
>> > setup) to make them pass. These out-of-context runs add several 
>> > additional minutes to
>> the test run.
>> >
>> > As an experiment, I upgraded to NUnit 3.4.1 and it helped the 
>> > situation somewhat - that is, it ran the tests in the correct 
>> > context and I was able to determine that we have more tests than 
>> > the numbers above and they are all succeeding. However, it also ran 
>> > the tests in an invalid context (that is, the context of the 
>> > abstract class without any setup) and some of them still showed as failures.
>> >
>> > I know @conniey is currently working on porting the tests over to xUnit.
>> > Hopefully, swapping test frameworks alone (or using some of the new 
>> > fancy test attributes) is enough to fix this issue. If not, we need 
>> > to find another solution - preferably one that can be applied to 
>> > all of the tests in abstract classes without too much effort or 
>> > changing them so they are too different from their Java counterpart.
>> >
>> > Remaining Pieces to Port
>> > ---------------------------------
>> >
>> > I took an inventory of the remaining pieces left to port a few days 
>> > ago and here is what that looks like (alphabetical order):
>> >
>> > 1. Analysis.ICU (Depends on ICU4j)
>> > 2. Analysis.Kuromoji
>> > 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic 
>> > (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel 
>> > (currently in progress) 7. Analysis.UIMA (Depends on Tagger, 
>> > uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
>> > Demo 10. Highlighter (Depends on Collator (which is still being
>> > ported) and BreakIterator (which we don't have a solution that 
>> > works in .NET core yet)) 11. Replicator (many dependencies) 12. 
>> > Sandbox (Depends on Apache Jakarta) 13. Spatial (Already ported in 
>> > #174 (https://github.com/apache/ lucenenet/pull/174), needs a 
>> > recent version of spatial4n) 14. QueryParser.Flexible
>> >
>> > Itamar, it would be helpful if you would be so kind as to organize 
>> > this list in terms of priority. It also couldn't hurt to update the 
>> > contributing documents 
>> > (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
>> > and
>> > https://cwiki.apache.org/confluence/display/LUCENENET/Current+Statu
>> > s with the latest information so anyone who wants to help out knows 
>> > the current status.
>> >
>> > Of course, it is the known status of dependencies that we need 
>> > clarification on. Which of these dependencies is known to be ported?
>> > Which of them are ported but are not up to date? Which of them are 
>> > known not to be ported, and which of them are unknown?
>> >
>> >
>> > Public API Inconsistencies
>> > ---------------------------------
>> >
>> > One thing that I have had my eye on for a while now is the 
>> > .NETification/consistency of the core API (that is, in the 
>> > Lucene.Net project). There are several issues that I would like to 
>> > address
>> including:
>> >
>> >
>> > 1.       Method names that are still camelCase
>> >
>> > 2.       Properties that should be methods (because they do a lot of
>> > processing or because they are non-deterministic)
>> >
>> > 3.       Methods that should be properties
>> >
>> > 4.       .Size() vs .Size vs .Count - should generally all be .Count in
>> > .NET
>> >
>> > 5.       Interfaces should begin with "I"
>> >
>> > 6.       Classes should not begin with "I" followed by another capital
>> > letter (for some reason some of them were named that way)
>> >
>> > 7.       .CharAt() should probably be this[]
>> >
>> > 8.       Generic types nested within generic types (which cause Visual
>> > Studio to crash when Intellisense tries to read them)
>> >
>> > ... and so on. The only thing is these are all sweeping changes 
>> > that will affect everyone helping out on Lucene.Net and anyone who 
>> > is currently using the beta. So, I just wanted to gather some input 
>> > on when the most appropriate time to begin working on these 
>> > sweeping
>> changes would be?
>> >
>> >
>> > Thanks,
>> > Shad Storhaug (NightOwl888)
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>

Re: Remaining Work/Priorities

Posted by Itamar Syn-Hershko <it...@code972.com>.
While on that note, Shad - emails to you bounce with the following error
(still):

Delivery to the following recipient failed permanently:

     shad@shadstorhaug.com

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the server for
the recipient domain shadstorhaug.com by mx1.hostmailserver.com.
[69.160.246.214].

The error that the other server returned was:
554 5.7.1 gmail.com is blacklisted.

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Wed, Oct 12, 2016 at 8:06 PM, Itamar Syn-Hershko <it...@code972.com>
wrote:

> CI failure seems to be worked on: https://twitter.com/codebetterCI/status/
> 785854074713468932 (Thanks Wyatt for pointing that out)
>
> I will look into the rest in a little while
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant
> Lucene.NET committer and PMC member
>
> On Tue, Oct 11, 2016 at 10:10 PM, Shad Storhaug <sh...@shadstorhaug.com>
> wrote:
>
>> Update
>> ======
>>
>> I have just pushed some commits that fix several bugs in the
>> Lucene.Net.Codecs project (all 452 tests pass most of the time, a few
>> random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.
>>
>>
>> Fix for Test Context
>> -------------------------
>>
>> For now, I have added method override stubs to each subclass in order to
>> add the [Test] attribute, so NUnit will run them in the correct context. I
>> did that on all of the superclass tests except for the ones in QueryParser
>> (since Itamar mentioned he would be working in that area). Itamar, you will
>> probably need to follow suit to get all of the QP tests to pass - namely
>> with the QueryParserTestBase and TestQueryParser classes.
>>
>> I have carefully put all of these changes into a single commit so it can
>> be reverted easily, if this solution doesn't happen to be compatible with
>> xUnit: https://github.com/apache/lucenenet/commit/2a79edea6359e1ee1
>> f83269cc7dc3ef2753ebf2c. Hopefully that makes life easier for @conniey.
>>
>> @Itamar, let me know when this is completed on your end so I can do a
>> double revert and squash the test stubs from QueryParser into an
>> all-inclusive revert-able commit.
>>
>> We can now correctly see how many tests we have in the core. Currently
>> there are 2730 - it seems we are still missing 720 tests, assuming they all
>> were for something port-able.
>>
>>
>> Remaining Tests
>> ---------------------
>>
>> Next I plan to work on locating any tests that we have missed (starting
>> in the core). It seems these fall into several categories:
>>
>> 1. Tests that have not yet been ported.
>> 2. Tests that have been partially ported that have not been added to the
>> project.
>> 3. Tests that have been ported, but are missing the [Test] attribute.
>> 4. Tests in classes that have been ported that have been commented out
>> (presumably because at the time they were ported the dependencies did not
>> yet exist).
>> 5. Tests that have been Ignored in .NET that were not in Java.
>> 6. Tests that have NUnit Assume.That() logic that depends on some
>> non-existant JRE condition, so they are not running in .NET.
>>
>> I'll make a quick effort to get them to pass, but the main goal will be
>> to ensure they all can run and are included in the project. Just a heads up
>> that the number of test failures is likely to increase on this pass (but
>> the number of bugs will likely decrease).
>>
>>
>> Failing Core Tests
>> -----------------------
>>
>> I have looked into the remaining tests somewhat. There are 2 issues that
>> I need some input on to solve.
>>
>>
>> TestRamUsageEstimator.TestSanity()
>>
>> Java Lucene uses a JRE-specific API to determine how much header size to
>> add on each field. This makes the estimates higher in Java. But more
>> importantly, this test is failing because the estimate for a real string
>> instance is coming back as the same size as its shallow size (16 bytes in
>> this case) - it needs to be at least 1 byte more than that for the test to
>> pass. In Java (at least in a 64 bit environment), there are an extra 4
>> bytes being added for each field.
>>
>> Technically, there is a way to get these numbers from .NET, but it
>> involves calling undocumented APIs using pointers and will likely be
>> different from one .NET version to the next (a bad idea for a project that
>> needs to support multiple .NET versions). The only solution I can think of
>> is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for
>> 32 bit) in order to make the numbers for the instances larger than their
>> shallow size. I suppose the alternative would be to either comment out the
>> string test or change it to >= make it pass. Thoughts? Alternatives?
>>
>>
>> TestNumericDocValuesUpdates.TestUpdateOldSegments()
>>
>> I discovered what the issue is here (normally that is the hard part), but
>> it seems that the proper solution is going to be a major task. The
>> NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a
>> service locator to load classes throughout the project. In the Codec
>> abstract class, it is used to load up the codec for the context it is used
>> in. However, our port of the NamedSPILoader simply loads all of the classes
>> from the current AppDomain without any way to order them or override them.
>>
>> The problem is that in Lucene, this was meant to be an extension point.
>> And this particular test (and probably many more of them) uses that
>> extension point to change the codec to a Mock from the test framework. This
>> line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the
>> issue is:
>>
>> > Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have
>> tests-framework.jar before lucene-core.jar");
>>
>> Basically, it is using a configuration file to order the classes that are
>> loaded so the test mocks take priority over the built-in codecs.
>>
>> Just fixing the test could be done by making the static NamedSPILoader
>> variable in the Codec class internal and swapping in a test double.
>> However, that doesn't solve the bigger issue that Lucene.Net is missing its
>> extensibility for anyone who wants to write their own codec (or tap into
>> one of the other extensibility points). I guess the bigger question is how
>> important will it be for anyone to extend Lucene codecs or inject
>> dependencies into Analyzer factories? There doesn’t appear to be any more
>> extensibility than that in Lucene 4.8.0, but that could change in more
>> recent or future versions of Lucene.
>>
>>
>> CI Builds
>> -----------
>>
>> Not working. Can someone look into that please?
>>
>>
>> Thanks,
>> Shad Storhaug (NightOwl888)
>>
>>
>>
>> -----Original Message-----
>> From: Shad Storhaug
>> Sent: Wednesday, October 5, 2016 8:23 PM
>> To: dev@lucenenet.apache.org
>> Cc: Connie Yau; 'cribs2@gmail.com'
>> Subject: RE: Remaining Work/Priorities
>>
>> > Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs
>> from the analysis.commons module?
>>
>> Just for clarification, these are two entirely different things in Java.
>> Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of
>> Java:
>>
>> import java.text.BreakIterator;
>> import java.text.Collator;
>> import java.text.ParseException;
>> import java.text.RuleBasedCollator;
>>
>> Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also
>> depend on parts of Java:
>>
>> import java.text.BreakIterator;
>> import java.text.CharacterIterator;
>>
>> Analysis.ICU depends on a separate (icu4j) package:
>>
>> import com.ibm.icu.text.Normalizer;
>> import com.ibm.icu.text.Normalizer2;
>> import com.ibm.icu.text.Transliterator;
>> import com.ibm.icu.text.Replaceable;
>> import com.ibm.icu.text.Transliterator;
>> import com.ibm.icu.text.UTF16;
>> import com.ibm.icu.text.UnicodeSet;
>> import com.ibm.icu.text.FilteredNormalizer2;
>> import com.ibm.icu.text.Collator;
>> import com.ibm.icu.text.RuleBasedCollator;
>> import com.ibm.icu.util.ULocale;
>> import com.ibm.icu.text.RawCollationKey;
>>
>> That said, icu4j DOES have Collator and RuleBasedCollator classes, but it
>> DOES NOT have a BreakIterator or CharacterIterator class. It is unclear
>> whether the Collator from icu4j would work as a replacement for the one in
>> core Java.
>>
>> When I was digging through the JDK code, I noticed that BreakIterator and
>> RuleBasedCollator have a lot of common ICU dependencies there, so even if
>> the RuleBasedCollator from icu4j is compatible, it might make sense for us
>> to port the one from Java anyway so we are dealing with the same shared
>> dependencies in Analysis.Common.
>>
>> Once we port over the classes from the Java JDK, we will be able to
>> eliminate our current ICU4NET dependency (and the platform issues that come
>> with it). That said, porting over those pieces could take considerable
>> work. In the interim it might make sense to make separate projects/NuGet
>> packages to isolate the areas that depend on BreakIterator,
>> CharacterIterator, and RuleBasedCollator so the rest can be released for
>> wide/cross-platform use. Perhaps we can even make a basic (scaled down)
>> BreakIterator for Highlighter that breaks on spaces between words and
>> punctuation between sentences, which wouldn't work for Thai, but would work
>> for most other languages.
>>
>> Porting the (icu4j) package is another complete ball of yarn, we should
>> take a look at (https://github.com/sillsdev/icu-dotnet) to see if there
>> is enough overlap there to power Analysis.ICU (offhand it looks as though
>> some classes are missing, though). It is a wrapper around the C library -
>> it may be that we just need to port more of it to get all of the pieces we
>> need.
>>
>> Speaking of Collation, @ChristopherHaws have you made any more progress
>> on Analysis.Collation? Were you able to determine if icu-dotnet's collator
>> will make the tests pass?
>>
>> > I'm on it QueryParser.Flexible
>>
>> Great. The TimeZone probably just needs more research to work out how to
>> utilize (in order to implement the failing test). Also, FYI MSDN's
>> recommendation (https://msdn.microsoft.com/en
>> -us/library/system.timezone(v=vs.110).aspx) is to use TimeZoneInfo
>> rather than TimeZone (I noticed that several of the tests were recently
>> modified to use TimeZone rather than TimeZoneInfo).
>>
>> As for the culture, in .NET I am pretty sure that we need to pass it as a
>> parameter to another overload of `QueryParser.Parse` rather than making it
>> a property of QueryParser. But we can deal with that in one step after you
>> have finished porting.
>>
>> --
>>
>> Shad Storhaug (NightOwl888)
>>
>> -----Original Message-----
>> From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com]
>> On Behalf Of Itamar Syn-Hershko
>> Sent: Wednesday, October 5, 2016 5:28 AM
>> To: dev@lucenenet.apache.org
>> Cc: Connie Yau
>> Subject: Re: Remaining Work/Priorities
>>
>> Awesome, thanks for all the hard work Shad!
>>
>> Our first priority should be fixing all remaining tests - in particular
>> the one in Core. We should be ready to release and stamp our builds as 100%
>> stable. As you mentioned, this could be an infrastructure issue - hopefully
>> *Connie *can give a status update on her effort on the switch to xUnit?
>>
>> With regards to Modules, here's an updated breakdown based on your email
>> + forgotten pieces + my comments:
>>
>> *Ported:*
>> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common -
>> 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total
>> Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including
>> #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total
>> Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42
>> total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1
>> failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total
>>
>> We should do a second pass on the pieces we marked as ported, just to
>> make sure the port is full and we didn't leave anything behind :)
>>
>> *Need to be ported:*
>> Highlighter (Depends on Collator (which is still being ported) and
>> BreakIterator (which we don't have a solution that works in .NET core yet))
>> Spatial (has 3rd party libraries that need to be updates) Spatial4n (
>> https://github.com/synhershko/Spatial4N) needs to be brought up to speed
>> with spatial4j, dependencies of which may cause some issues....
>> Codecs
>> Partially ported, mostly the tests weren't ported Grouping Not urgent,
>> but provides nice functionality that users will probably like
>>
>> The only part with dependencies seems to be the spatial module - I will
>> have a look there soon if you don't get to that before I do.
>>
>> *Can wait* - some modules are less frequently used, we should stabilize
>> and release first and then work on them based on demand Analysis.ICU
>> (Depends on ICU4j) hopefully we can remove the ICU DLLs from the
>> analysis.commons module? I keep getting reports on some issues they are
>> causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik)
>> Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly
>> helper libraries, so there's probably not real dependency just lots of
>> replacement Analysis.SmartCN Analysis.Stempel (currently in progress)
>> Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo
>> while important because can help newbies, we can do better by providing
>> docs and real world examples. I'm on it QueryParser.Flexible
>>
>> *No need to port* - neither are needed in our context Benchmark (many
>> dependencies) Replicator (many dependencies) Sandbox (Depends on Apache
>> Jakarta)
>>
>> Once all modules are ported and all tests are passing, I think we should
>> get two more items fixed before an official release:
>>
>> 1. .NET Core support - I'm not clear on the status of it at the moment.
>> We probably want to have it in for the release.
>>
>> 2. Public API Inconsistencies. We can discuss what should be done and
>> what not when we get to that stage. Some are an obvious "fixme" but some
>> will break code compatibility with Java I think we should avoid.
>>
>> One last note - *Wyatt*, do we know why there are no CI builds lately?
>>
>> --
>>
>> Itamar Syn-Hershko
>> http://code972.com | @synhershko <https://twitter.com/synhershko>
>> Freelance Developer & Consultant Lucene.NET committer and PMC member
>>
>> On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
>> wrote:
>>
>> > Hello,
>> >
>> > I just wanted to open this discussion to talk about the work remaining
>> > to be done on Lucene.Net version 4.8.0. We are nearly there, but that
>> > doesn't mean we don't still need help!
>> >
>> >
>> > FAILING TESTS
>> > -------------------
>> >
>> > We now have over 5000 passing tests and as soon as pull request #188 (
>> > https://github.com/apache/lucenenet/pull/188) is merged, by my count
>> > we have only 20 (actual) failing tests. Here is the breakdown by
>> project:
>> >
>> > Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
>> > - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9
>> > total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet -
>> > (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0
>> > failing / 27 total Lucene.Net.Memory - 0 failing / 10 total
>> > Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing
>> > / 96 total Lucene.Net.QueryParser - 1 failing / 203 total
>> > Lucene.Net.Suggest - 0 failing / 142 total
>> >
>> > The reason why I said ACTUAL tests above is because I recently
>> > discovered that many of the "failures" that are being reported are
>> > false negatives (in fact, the VS2015 NUnit test runner shows there are
>> > 135 failing tests total and 902 tests total that don't belong to any
>> > project). Most NUnit 2.6 test runners do not correctly run tests in
>> > shared abstract classes with the correct context (test setup) to make
>> > them pass. These out-of-context runs add several additional minutes to
>> the test run.
>> >
>> > As an experiment, I upgraded to NUnit 3.4.1 and it helped the
>> > situation somewhat - that is, it ran the tests in the correct context
>> > and I was able to determine that we have more tests than the numbers
>> > above and they are all succeeding. However, it also ran the tests in
>> > an invalid context (that is, the context of the abstract class without
>> > any setup) and some of them still showed as failures.
>> >
>> > I know @conniey is currently working on porting the tests over to xUnit.
>> > Hopefully, swapping test frameworks alone (or using some of the new
>> > fancy test attributes) is enough to fix this issue. If not, we need to
>> > find another solution - preferably one that can be applied to all of
>> > the tests in abstract classes without too much effort or changing them
>> > so they are too different from their Java counterpart.
>> >
>> > Remaining Pieces to Port
>> > ---------------------------------
>> >
>> > I took an inventory of the remaining pieces left to port a few days
>> > ago and here is what that looks like (alphabetical order):
>> >
>> > 1. Analysis.ICU (Depends on ICU4j)
>> > 2. Analysis.Kuromoji
>> > 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic
>> > (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel
>> > (currently in progress) 7. Analysis.UIMA (Depends on Tagger,
>> > uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
>> > Demo 10. Highlighter (Depends on Collator (which is still being
>> > ported) and BreakIterator (which we don't have a solution that works
>> > in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox
>> > (Depends on Apache Jakarta) 13. Spatial (Already ported in #174
>> > (https://github.com/apache/ lucenenet/pull/174), needs a recent
>> > version of spatial4n) 14. QueryParser.Flexible
>> >
>> > Itamar, it would be helpful if you would be so kind as to organize
>> > this list in terms of priority. It also couldn't hurt to update the
>> > contributing documents
>> > (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
>> > and
>> > https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
>> > with the latest information so anyone who wants to help out knows the
>> > current status.
>> >
>> > Of course, it is the known status of dependencies that we need
>> > clarification on. Which of these dependencies is known to be ported?
>> > Which of them are ported but are not up to date? Which of them are
>> > known not to be ported, and which of them are unknown?
>> >
>> >
>> > Public API Inconsistencies
>> > ---------------------------------
>> >
>> > One thing that I have had my eye on for a while now is the
>> > .NETification/consistency of the core API (that is, in the Lucene.Net
>> > project). There are several issues that I would like to address
>> including:
>> >
>> >
>> > 1.       Method names that are still camelCase
>> >
>> > 2.       Properties that should be methods (because they do a lot of
>> > processing or because they are non-deterministic)
>> >
>> > 3.       Methods that should be properties
>> >
>> > 4.       .Size() vs .Size vs .Count - should generally all be .Count in
>> > .NET
>> >
>> > 5.       Interfaces should begin with "I"
>> >
>> > 6.       Classes should not begin with "I" followed by another capital
>> > letter (for some reason some of them were named that way)
>> >
>> > 7.       .CharAt() should probably be this[]
>> >
>> > 8.       Generic types nested within generic types (which cause Visual
>> > Studio to crash when Intellisense tries to read them)
>> >
>> > ... and so on. The only thing is these are all sweeping changes that
>> > will affect everyone helping out on Lucene.Net and anyone who is
>> > currently using the beta. So, I just wanted to gather some input on
>> > when the most appropriate time to begin working on these sweeping
>> changes would be?
>> >
>> >
>> > Thanks,
>> > Shad Storhaug (NightOwl888)
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>

Re: Remaining Work/Priorities

Posted by Itamar Syn-Hershko <it...@code972.com>.
CI failure seems to be worked on:
https://twitter.com/codebetterCI/status/785854074713468932 (Thanks Wyatt
for pointing that out)

I will look into the rest in a little while

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Tue, Oct 11, 2016 at 10:10 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Update
> ======
>
> I have just pushed some commits that fix several bugs in the
> Lucene.Net.Codecs project (all 452 tests pass most of the time, a few
> random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.
>
>
> Fix for Test Context
> -------------------------
>
> For now, I have added method override stubs to each subclass in order to
> add the [Test] attribute, so NUnit will run them in the correct context. I
> did that on all of the superclass tests except for the ones in QueryParser
> (since Itamar mentioned he would be working in that area). Itamar, you will
> probably need to follow suit to get all of the QP tests to pass - namely
> with the QueryParserTestBase and TestQueryParser classes.
>
> I have carefully put all of these changes into a single commit so it can
> be reverted easily, if this solution doesn't happen to be compatible with
> xUnit: https://github.com/apache/lucenenet/commit/
> 2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c. Hopefully that makes life
> easier for @conniey.
>
> @Itamar, let me know when this is completed on your end so I can do a
> double revert and squash the test stubs from QueryParser into an
> all-inclusive revert-able commit.
>
> We can now correctly see how many tests we have in the core. Currently
> there are 2730 - it seems we are still missing 720 tests, assuming they all
> were for something port-able.
>
>
> Remaining Tests
> ---------------------
>
> Next I plan to work on locating any tests that we have missed (starting in
> the core). It seems these fall into several categories:
>
> 1. Tests that have not yet been ported.
> 2. Tests that have been partially ported that have not been added to the
> project.
> 3. Tests that have been ported, but are missing the [Test] attribute.
> 4. Tests in classes that have been ported that have been commented out
> (presumably because at the time they were ported the dependencies did not
> yet exist).
> 5. Tests that have been Ignored in .NET that were not in Java.
> 6. Tests that have NUnit Assume.That() logic that depends on some
> non-existant JRE condition, so they are not running in .NET.
>
> I'll make a quick effort to get them to pass, but the main goal will be to
> ensure they all can run and are included in the project. Just a heads up
> that the number of test failures is likely to increase on this pass (but
> the number of bugs will likely decrease).
>
>
> Failing Core Tests
> -----------------------
>
> I have looked into the remaining tests somewhat. There are 2 issues that I
> need some input on to solve.
>
>
> TestRamUsageEstimator.TestSanity()
>
> Java Lucene uses a JRE-specific API to determine how much header size to
> add on each field. This makes the estimates higher in Java. But more
> importantly, this test is failing because the estimate for a real string
> instance is coming back as the same size as its shallow size (16 bytes in
> this case) - it needs to be at least 1 byte more than that for the test to
> pass. In Java (at least in a 64 bit environment), there are an extra 4
> bytes being added for each field.
>
> Technically, there is a way to get these numbers from .NET, but it
> involves calling undocumented APIs using pointers and will likely be
> different from one .NET version to the next (a bad idea for a project that
> needs to support multiple .NET versions). The only solution I can think of
> is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for
> 32 bit) in order to make the numbers for the instances larger than their
> shallow size. I suppose the alternative would be to either comment out the
> string test or change it to >= make it pass. Thoughts? Alternatives?
>
>
> TestNumericDocValuesUpdates.TestUpdateOldSegments()
>
> I discovered what the issue is here (normally that is the hard part), but
> it seems that the proper solution is going to be a major task. The
> NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a
> service locator to load classes throughout the project. In the Codec
> abstract class, it is used to load up the codec for the context it is used
> in. However, our port of the NamedSPILoader simply loads all of the classes
> from the current AppDomain without any way to order them or override them.
>
> The problem is that in Lucene, this was meant to be an extension point.
> And this particular test (and probably many more of them) uses that
> extension point to change the codec to a Mock from the test framework. This
> line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the
> issue is:
>
> > Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have
> tests-framework.jar before lucene-core.jar");
>
> Basically, it is using a configuration file to order the classes that are
> loaded so the test mocks take priority over the built-in codecs.
>
> Just fixing the test could be done by making the static NamedSPILoader
> variable in the Codec class internal and swapping in a test double.
> However, that doesn't solve the bigger issue that Lucene.Net is missing its
> extensibility for anyone who wants to write their own codec (or tap into
> one of the other extensibility points). I guess the bigger question is how
> important will it be for anyone to extend Lucene codecs or inject
> dependencies into Analyzer factories? There doesn’t appear to be any more
> extensibility than that in Lucene 4.8.0, but that could change in more
> recent or future versions of Lucene.
>
>
> CI Builds
> -----------
>
> Not working. Can someone look into that please?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
> -----Original Message-----
> From: Shad Storhaug
> Sent: Wednesday, October 5, 2016 8:23 PM
> To: dev@lucenenet.apache.org
> Cc: Connie Yau; 'cribs2@gmail.com'
> Subject: RE: Remaining Work/Priorities
>
> > Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs
> from the analysis.commons module?
>
> Just for clarification, these are two entirely different things in Java.
> Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of
> Java:
>
> import java.text.BreakIterator;
> import java.text.Collator;
> import java.text.ParseException;
> import java.text.RuleBasedCollator;
>
> Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also
> depend on parts of Java:
>
> import java.text.BreakIterator;
> import java.text.CharacterIterator;
>
> Analysis.ICU depends on a separate (icu4j) package:
>
> import com.ibm.icu.text.Normalizer;
> import com.ibm.icu.text.Normalizer2;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.Replaceable;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.UTF16;
> import com.ibm.icu.text.UnicodeSet;
> import com.ibm.icu.text.FilteredNormalizer2;
> import com.ibm.icu.text.Collator;
> import com.ibm.icu.text.RuleBasedCollator;
> import com.ibm.icu.util.ULocale;
> import com.ibm.icu.text.RawCollationKey;
>
> That said, icu4j DOES have Collator and RuleBasedCollator classes, but it
> DOES NOT have a BreakIterator or CharacterIterator class. It is unclear
> whether the Collator from icu4j would work as a replacement for the one in
> core Java.
>
> When I was digging through the JDK code, I noticed that BreakIterator and
> RuleBasedCollator have a lot of common ICU dependencies there, so even if
> the RuleBasedCollator from icu4j is compatible, it might make sense for us
> to port the one from Java anyway so we are dealing with the same shared
> dependencies in Analysis.Common.
>
> Once we port over the classes from the Java JDK, we will be able to
> eliminate our current ICU4NET dependency (and the platform issues that come
> with it). That said, porting over those pieces could take considerable
> work. In the interim it might make sense to make separate projects/NuGet
> packages to isolate the areas that depend on BreakIterator,
> CharacterIterator, and RuleBasedCollator so the rest can be released for
> wide/cross-platform use. Perhaps we can even make a basic (scaled down)
> BreakIterator for Highlighter that breaks on spaces between words and
> punctuation between sentences, which wouldn't work for Thai, but would work
> for most other languages.
>
> Porting the (icu4j) package is another complete ball of yarn, we should
> take a look at (https://github.com/sillsdev/icu-dotnet) to see if there
> is enough overlap there to power Analysis.ICU (offhand it looks as though
> some classes are missing, though). It is a wrapper around the C library -
> it may be that we just need to port more of it to get all of the pieces we
> need.
>
> Speaking of Collation, @ChristopherHaws have you made any more progress on
> Analysis.Collation? Were you able to determine if icu-dotnet's collator
> will make the tests pass?
>
> > I'm on it QueryParser.Flexible
>
> Great. The TimeZone probably just needs more research to work out how to
> utilize (in order to implement the failing test). Also, FYI MSDN's
> recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(
> v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed
> that several of the tests were recently modified to use TimeZone rather
> than TimeZoneInfo).
>
> As for the culture, in .NET I am pretty sure that we need to pass it as a
> parameter to another overload of `QueryParser.Parse` rather than making it
> a property of QueryParser. But we can deal with that in one step after you
> have finished porting.
>
> --
>
> Shad Storhaug (NightOwl888)
>
> -----Original Message-----
> From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On
> Behalf Of Itamar Syn-Hershko
> Sent: Wednesday, October 5, 2016 5:28 AM
> To: dev@lucenenet.apache.org
> Cc: Connie Yau
> Subject: Re: Remaining Work/Priorities
>
> Awesome, thanks for all the hard work Shad!
>
> Our first priority should be fixing all remaining tests - in particular
> the one in Core. We should be ready to release and stamp our builds as 100%
> stable. As you mentioned, this could be an infrastructure issue - hopefully
> *Connie *can give a status update on her effort on the switch to xUnit?
>
> With regards to Modules, here's an updated breakdown based on your email +
> forgotten pieces + my comments:
>
> *Ported:*
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0
> failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total
> Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including
> #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total
> Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42
> total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1
> failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total
>
> We should do a second pass on the pieces we marked as ported, just to make
> sure the port is full and we didn't leave anything behind :)
>
> *Need to be ported:*
> Highlighter (Depends on Collator (which is still being ported) and
> BreakIterator (which we don't have a solution that works in .NET core yet))
> Spatial (has 3rd party libraries that need to be updates) Spatial4n (
> https://github.com/synhershko/Spatial4N) needs to be brought up to speed
> with spatial4j, dependencies of which may cause some issues....
> Codecs
> Partially ported, mostly the tests weren't ported Grouping Not urgent, but
> provides nice functionality that users will probably like
>
> The only part with dependencies seems to be the spatial module - I will
> have a look there soon if you don't get to that before I do.
>
> *Can wait* - some modules are less frequently used, we should stabilize
> and release first and then work on them based on demand Analysis.ICU
> (Depends on ICU4j) hopefully we can remove the ICU DLLs from the
> analysis.commons module? I keep getting reports on some issues they are
> causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik)
> Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly
> helper libraries, so there's probably not real dependency just lots of
> replacement Analysis.SmartCN Analysis.Stempel (currently in progress)
> Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo
> while important because can help newbies, we can do better by providing
> docs and real world examples. I'm on it QueryParser.Flexible
>
> *No need to port* - neither are needed in our context Benchmark (many
> dependencies) Replicator (many dependencies) Sandbox (Depends on Apache
> Jakarta)
>
> Once all modules are ported and all tests are passing, I think we should
> get two more items fixed before an official release:
>
> 1. .NET Core support - I'm not clear on the status of it at the moment. We
> probably want to have it in for the release.
>
> 2. Public API Inconsistencies. We can discuss what should be done and what
> not when we get to that stage. Some are an obvious "fixme" but some will
> break code compatibility with Java I think we should avoid.
>
> One last note - *Wyatt*, do we know why there are no CI builds lately?
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant Lucene.NET committer and PMC member
>
> On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
> wrote:
>
> > Hello,
> >
> > I just wanted to open this discussion to talk about the work remaining
> > to be done on Lucene.Net version 4.8.0. We are nearly there, but that
> > doesn't mean we don't still need help!
> >
> >
> > FAILING TESTS
> > -------------------
> >
> > We now have over 5000 passing tests and as soon as pull request #188 (
> > https://github.com/apache/lucenenet/pull/188) is merged, by my count
> > we have only 20 (actual) failing tests. Here is the breakdown by project:
> >
> > Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> > - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9
> > total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet -
> > (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0
> > failing / 27 total Lucene.Net.Memory - 0 failing / 10 total
> > Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing
> > / 96 total Lucene.Net.QueryParser - 1 failing / 203 total
> > Lucene.Net.Suggest - 0 failing / 142 total
> >
> > The reason why I said ACTUAL tests above is because I recently
> > discovered that many of the "failures" that are being reported are
> > false negatives (in fact, the VS2015 NUnit test runner shows there are
> > 135 failing tests total and 902 tests total that don't belong to any
> > project). Most NUnit 2.6 test runners do not correctly run tests in
> > shared abstract classes with the correct context (test setup) to make
> > them pass. These out-of-context runs add several additional minutes to
> the test run.
> >
> > As an experiment, I upgraded to NUnit 3.4.1 and it helped the
> > situation somewhat - that is, it ran the tests in the correct context
> > and I was able to determine that we have more tests than the numbers
> > above and they are all succeeding. However, it also ran the tests in
> > an invalid context (that is, the context of the abstract class without
> > any setup) and some of them still showed as failures.
> >
> > I know @conniey is currently working on porting the tests over to xUnit.
> > Hopefully, swapping test frameworks alone (or using some of the new
> > fancy test attributes) is enough to fix this issue. If not, we need to
> > find another solution - preferably one that can be applied to all of
> > the tests in abstract classes without too much effort or changing them
> > so they are too different from their Java counterpart.
> >
> > Remaining Pieces to Port
> > ---------------------------------
> >
> > I took an inventory of the remaining pieces left to port a few days
> > ago and here is what that looks like (alphabetical order):
> >
> > 1. Analysis.ICU (Depends on ICU4j)
> > 2. Analysis.Kuromoji
> > 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic
> > (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel
> > (currently in progress) 7. Analysis.UIMA (Depends on Tagger,
> > uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> > Demo 10. Highlighter (Depends on Collator (which is still being
> > ported) and BreakIterator (which we don't have a solution that works
> > in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox
> > (Depends on Apache Jakarta) 13. Spatial (Already ported in #174
> > (https://github.com/apache/ lucenenet/pull/174), needs a recent
> > version of spatial4n) 14. QueryParser.Flexible
> >
> > Itamar, it would be helpful if you would be so kind as to organize
> > this list in terms of priority. It also couldn't hurt to update the
> > contributing documents
> > (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> > and
> > https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> > with the latest information so anyone who wants to help out knows the
> > current status.
> >
> > Of course, it is the known status of dependencies that we need
> > clarification on. Which of these dependencies is known to be ported?
> > Which of them are ported but are not up to date? Which of them are
> > known not to be ported, and which of them are unknown?
> >
> >
> > Public API Inconsistencies
> > ---------------------------------
> >
> > One thing that I have had my eye on for a while now is the
> > .NETification/consistency of the core API (that is, in the Lucene.Net
> > project). There are several issues that I would like to address
> including:
> >
> >
> > 1.       Method names that are still camelCase
> >
> > 2.       Properties that should be methods (because they do a lot of
> > processing or because they are non-deterministic)
> >
> > 3.       Methods that should be properties
> >
> > 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> > .NET
> >
> > 5.       Interfaces should begin with "I"
> >
> > 6.       Classes should not begin with "I" followed by another capital
> > letter (for some reason some of them were named that way)
> >
> > 7.       .CharAt() should probably be this[]
> >
> > 8.       Generic types nested within generic types (which cause Visual
> > Studio to crash when Intellisense tries to read them)
> >
> > ... and so on. The only thing is these are all sweeping changes that
> > will affect everyone helping out on Lucene.Net and anyone who is
> > currently using the beta. So, I just wanted to gather some input on
> > when the most appropriate time to begin working on these sweeping
> changes would be?
> >
> >
> > Thanks,
> > Shad Storhaug (NightOwl888)
> >
> >
> >
> >
> >
> >
> >
>

RE: Remaining Work/Priorities

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Update
======

I have just pushed some commits that fix several bugs in the Lucene.Net.Codecs project (all 452 tests pass most of the time, a few random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.


Fix for Test Context
-------------------------

For now, I have added method override stubs to each subclass in order to add the [Test] attribute, so NUnit will run them in the correct context. I did that on all of the superclass tests except for the ones in QueryParser (since Itamar mentioned he would be working in that area). Itamar, you will probably need to follow suit to get all of the QP tests to pass - namely with the QueryParserTestBase and TestQueryParser classes.

I have carefully put all of these changes into a single commit so it can be reverted easily, if this solution doesn't happen to be compatible with xUnit: https://github.com/apache/lucenenet/commit/2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c. Hopefully that makes life easier for @conniey.

@Itamar, let me know when this is completed on your end so I can do a double revert and squash the test stubs from QueryParser into an all-inclusive revert-able commit.

We can now correctly see how many tests we have in the core. Currently there are 2730 - it seems we are still missing 720 tests, assuming they all were for something port-able.


Remaining Tests
---------------------

Next I plan to work on locating any tests that we have missed (starting in the core). It seems these fall into several categories:

1. Tests that have not yet been ported.
2. Tests that have been partially ported that have not been added to the project.
3. Tests that have been ported, but are missing the [Test] attribute.
4. Tests in classes that have been ported that have been commented out (presumably because at the time they were ported the dependencies did not yet exist).
5. Tests that have been Ignored in .NET that were not in Java.
6. Tests that have NUnit Assume.That() logic that depends on some non-existant JRE condition, so they are not running in .NET.

I'll make a quick effort to get them to pass, but the main goal will be to ensure they all can run and are included in the project. Just a heads up that the number of test failures is likely to increase on this pass (but the number of bugs will likely decrease).


Failing Core Tests
-----------------------

I have looked into the remaining tests somewhat. There are 2 issues that I need some input on to solve.


TestRamUsageEstimator.TestSanity()

Java Lucene uses a JRE-specific API to determine how much header size to add on each field. This makes the estimates higher in Java. But more importantly, this test is failing because the estimate for a real string instance is coming back as the same size as its shallow size (16 bytes in this case) - it needs to be at least 1 byte more than that for the test to pass. In Java (at least in a 64 bit environment), there are an extra 4 bytes being added for each field.

Technically, there is a way to get these numbers from .NET, but it involves calling undocumented APIs using pointers and will likely be different from one .NET version to the next (a bad idea for a project that needs to support multiple .NET versions). The only solution I can think of is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for 32 bit) in order to make the numbers for the instances larger than their shallow size. I suppose the alternative would be to either comment out the string test or change it to >= make it pass. Thoughts? Alternatives?


TestNumericDocValuesUpdates.TestUpdateOldSegments()

I discovered what the issue is here (normally that is the hard part), but it seems that the proper solution is going to be a major task. The NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a service locator to load classes throughout the project. In the Codec abstract class, it is used to load up the codec for the context it is used in. However, our port of the NamedSPILoader simply loads all of the classes from the current AppDomain without any way to order them or override them.

The problem is that in Lucene, this was meant to be an extension point. And this particular test (and probably many more of them) uses that extension point to change the codec to a Mock from the test framework. This line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the issue is:

> Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have tests-framework.jar before lucene-core.jar");

Basically, it is using a configuration file to order the classes that are loaded so the test mocks take priority over the built-in codecs.

Just fixing the test could be done by making the static NamedSPILoader variable in the Codec class internal and swapping in a test double. However, that doesn't solve the bigger issue that Lucene.Net is missing its extensibility for anyone who wants to write their own codec (or tap into one of the other extensibility points). I guess the bigger question is how important will it be for anyone to extend Lucene codecs or inject dependencies into Analyzer factories? There doesn’t appear to be any more extensibility than that in Lucene 4.8.0, but that could change in more recent or future versions of Lucene.


CI Builds
-----------

Not working. Can someone look into that please?


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Shad Storhaug 
Sent: Wednesday, October 5, 2016 8:23 PM
To: dev@lucenenet.apache.org
Cc: Connie Yau; 'cribs2@gmail.com'
Subject: RE: Remaining Work/Priorities

> Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module?

Just for clarification, these are two entirely different things in Java. Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of Java:

import java.text.BreakIterator;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also depend on parts of Java:

import java.text.BreakIterator;
import java.text.CharacterIterator;

Analysis.ICU depends on a separate (icu4j) package:

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Replaceable;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.RawCollationKey;

That said, icu4j DOES have Collator and RuleBasedCollator classes, but it DOES NOT have a BreakIterator or CharacterIterator class. It is unclear whether the Collator from icu4j would work as a replacement for the one in core Java.

When I was digging through the JDK code, I noticed that BreakIterator and RuleBasedCollator have a lot of common ICU dependencies there, so even if the RuleBasedCollator from icu4j is compatible, it might make sense for us to port the one from Java anyway so we are dealing with the same shared dependencies in Analysis.Common.

Once we port over the classes from the Java JDK, we will be able to eliminate our current ICU4NET dependency (and the platform issues that come with it). That said, porting over those pieces could take considerable work. In the interim it might make sense to make separate projects/NuGet packages to isolate the areas that depend on BreakIterator, CharacterIterator, and RuleBasedCollator so the rest can be released for wide/cross-platform use. Perhaps we can even make a basic (scaled down) BreakIterator for Highlighter that breaks on spaces between words and punctuation between sentences, which wouldn't work for Thai, but would work for most other languages.

Porting the (icu4j) package is another complete ball of yarn, we should take a look at (https://github.com/sillsdev/icu-dotnet) to see if there is enough overlap there to power Analysis.ICU (offhand it looks as though some classes are missing, though). It is a wrapper around the C library - it may be that we just need to port more of it to get all of the pieces we need.

Speaking of Collation, @ChristopherHaws have you made any more progress on Analysis.Collation? Were you able to determine if icu-dotnet's collator will make the tests pass?

> I'm on it QueryParser.Flexible

Great. The TimeZone probably just needs more research to work out how to utilize (in order to implement the failing test). Also, FYI MSDN's recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed that several of the tests were recently modified to use TimeZone rather than TimeZoneInfo).

As for the culture, in .NET I am pretty sure that we need to pass it as a parameter to another overload of `QueryParser.Parse` rather than making it a property of QueryParser. But we can deal with that in one step after you have finished porting.

--

Shad Storhaug (NightOwl888)

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar Syn-Hershko
Sent: Wednesday, October 5, 2016 5:28 AM
To: dev@lucenenet.apache.org
Cc: Connie Yau
Subject: Re: Remaining Work/Priorities

Awesome, thanks for all the hard work Shad!

Our first priority should be fixing all remaining tests - in particular the one in Core. We should be ready to release and stamp our builds as 100% stable. As you mentioned, this could be an infrastructure issue - hopefully *Connie *can give a status update on her effort on the switch to xUnit?

With regards to Modules, here's an updated breakdown based on your email + forgotten pieces + my comments:

*Ported:*
Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1 failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total

We should do a second pass on the pieces we marked as ported, just to make sure the port is full and we didn't leave anything behind :)

*Need to be ported:*
Highlighter (Depends on Collator (which is still being ported) and BreakIterator (which we don't have a solution that works in .NET core yet)) Spatial (has 3rd party libraries that need to be updates) Spatial4n (https://github.com/synhershko/Spatial4N) needs to be brought up to speed with spatial4j, dependencies of which may cause some issues....
Codecs
Partially ported, mostly the tests weren't ported Grouping Not urgent, but provides nice functionality that users will probably like

The only part with dependencies seems to be the spatial module - I will have a look there soon if you don't get to that before I do.

*Can wait* - some modules are less frequently used, we should stabilize and release first and then work on them based on demand Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module? I keep getting reports on some issues they are causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik) Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly helper libraries, so there's probably not real dependency just lots of replacement Analysis.SmartCN Analysis.Stempel (currently in progress) Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo while important because can help newbies, we can do better by providing docs and real world examples. I'm on it QueryParser.Flexible

*No need to port* - neither are needed in our context Benchmark (many dependencies) Replicator (many dependencies) Sandbox (Depends on Apache Jakarta)

Once all modules are ported and all tests are passing, I think we should get two more items fixed before an official release:

1. .NET Core support - I'm not clear on the status of it at the moment. We probably want to have it in for the release.

2. Public API Inconsistencies. We can discuss what should be done and what not when we get to that stage. Some are an obvious "fixme" but some will break code compatibility with Java I think we should avoid.

One last note - *Wyatt*, do we know why there are no CI builds lately?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko> Freelance Developer & Consultant Lucene.NET committer and PMC member

On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining 
> to be done on Lucene.Net version 4.8.0. We are nearly there, but that 
> doesn't mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count 
> we have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 
> total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - 
> (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 
> failing / 27 total Lucene.Net.Memory - 0 failing / 10 total 
> Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing 
> / 96 total Lucene.Net.QueryParser - 1 failing / 203 total 
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently 
> discovered that many of the "failures" that are being reported are 
> false negatives (in fact, the VS2015 NUnit test runner shows there are
> 135 failing tests total and 902 tests total that don't belong to any 
> project). Most NUnit 2.6 test runners do not correctly run tests in 
> shared abstract classes with the correct context (test setup) to make 
> them pass. These out-of-context runs add several additional minutes to the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the 
> situation somewhat - that is, it ran the tests in the correct context 
> and I was able to determine that we have more tests than the numbers 
> above and they are all succeeding. However, it also ran the tests in 
> an invalid context (that is, the context of the abstract class without 
> any setup) and some of them still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new 
> fancy test attributes) is enough to fix this issue. If not, we need to 
> find another solution - preferably one that can be applied to all of 
> the tests in abstract classes without too much effort or changing them 
> so they are too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days 
> ago and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic 
> (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel 
> (currently in progress) 7. Analysis.UIMA (Depends on Tagger, 
> uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> Demo 10. Highlighter (Depends on Collator (which is still being
> ported) and BreakIterator (which we don't have a solution that works 
> in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox 
> (Depends on Apache Jakarta) 13. Spatial (Already ported in #174 
> (https://github.com/apache/ lucenenet/pull/174), needs a recent 
> version of spatial4n) 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize 
> this list in terms of priority. It also couldn't hurt to update the 
> contributing documents 
> (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and
> https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the 
> current status.
>
> Of course, it is the known status of dependencies that we need 
> clarification on. Which of these dependencies is known to be ported?
> Which of them are ported but are not up to date? Which of them are 
> known not to be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the 
> .NETification/consistency of the core API (that is, in the Lucene.Net 
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that 
> will affect everyone helping out on Lucene.Net and anyone who is 
> currently using the beta. So, I just wanted to gather some input on 
> when the most appropriate time to begin working on these sweeping changes would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>

Re: Remaining Work/Priorities

Posted by Wyatt Barnett <wy...@gmail.com>.
Thanks for pulling all of that together Shad. I can chime in a bit about
the build stuff:

1. Are there any remaining issues around the build/deployment we still need
to resolve (versioning, integration with TeamCity, etc)?

Build has been fun with this one -- I haven't reported in much because I
haven't had much success to report. We are now successfully moved to
https://teamcity.jetbrains.com/project.html?projectId=LuceneNet
<https://teamcity.jetbrains.com/project.html?projectId=LuceneNet&tab=projectOverview>.
The new platform is much more reliable in some ways but it appears to me
that the build machines appear to be at least intermittently much slower
than the old ones.

For example -- I've got the portable build running at
https://teamcity.jetbrains.com/viewType.html?buildTypeId=LuceneNet_Vs2015LuceneNetPortable
once
I wrapped my head around build.ps1 and teamcity but I can't get the tests
to run -- they timeout after 4 hours which is a lot slower than they run on
this i5. My suspicion is it is because the slow disk i/o on some cloud
agents but I don't have a lot of visibility into it.

Speed issues aside we've got a bit of knitting to do to get this ready to
release. We need tweak the builds to do things like drop xml comments as
well as perhaps dropping multiple artifacts for different .NET versions. We
will need to create .nuspec files presuming this package becomes
non-trivial.

Regarding .NET Core -- my presumption is that is the way we are going to
go. Should we try and rejigger the build process to work with build.ps1
from that project?

2. Are we able to utilize our current versioning scheme
(MAJOR.MINOR.BUILD.REVISION-PRE)? I have verified that NuGet behaves
correctly with this scheme, and IMO it makes sense to use this scheme on a
port such as this one so we have a way to patch without incrementing beyond
the semantic version of Lucene we are emulating. It looks like this
versioning issue has been a roadblock for fixing bugs in previous
Lucene.Net ports.

I cooked this up so I'm OK with it :). FWIW my thinking was exactly what
you are talking about -- we can "pin" to lucene version numbers while
keeping some uniqueness that works with nuget rather seamlessly.

One angle we will need to think through is how we want to handle branching
strategy here. The trick I'm using to run all the pull requests does not
work as well when we have multiple concurrent versions which we'd probably
want to have running as we do bug fixes in many cases.

I've got a little time over the holidays to try and iron out some of these
issues. Let me know how I can be of more assistance.

On Thu, Dec 15, 2016 at 2:21 PM Roethinger, Alexander <
aroethinger@affili.net> wrote:

Hi Shad,

thanks for this update and the insights on the current status.

I've been working with Lucene.Net 4.8 ever since you've pointed me to your
project in October.
My project is basically a generic WCF-Service for Lucene, providing a
stand-alone search-engine for any kind of .net object. It was originally
written using Lucene.Net 3.0 but is now fully ported to 4.8 and running for
an inhouse project. It includes built in AutoSuggest, HighFrequencyTerms,
Searcher-Warming during WCF initialization and some other nice features.
Apart from the ICU issue (which doesn't affect me because I don't need the
dependency) I have so far not encountered any serious issues.
Unfortunately, I don't have enough time and in-depth knowledge of Lucene to
help you guys with the actual porting.

But picking up on what you mentioned under " API Phase 2 - .NETify", I
would be happy to contribute from a "consumer" point of view based on the
stuff we have been developing so far, including testing my application
against current releases of Lucene.Net or helping to make the code more
.net like.

Kind regards
Alexander


-----Ursprüngliche Nachricht-----
Von: Shad Storhaug [mailto:shad@shadstorhaug.com]
Gesendet: Donnerstag, 15. Dezember 2016 19:34
An: dev@lucenenet.apache.org
Cc: Connie Yau <co...@microsoft.com>; itamar.synhershko@gmail.com
Betreff: RE: Remaining Work/Priorities

Update
======

It has been a while since I have communicated the current status of the
Lucene.Net codebase to the team, and I am getting concerned that claims
that we are "close to release" are being exaggerated a bit. We are almost
ready to put a pre-release on NuGet so the masses can start consuming it,
but there are some bases we still need to cover to stabilize for release
and ready Lucene.Net 4.8.0 for enterprise-level quality expectations.

We have now successfully ported more than 380,000 executable lines of code
from Java to .NET, and have ported every Lucene sub-project that Itamar has
earmarked as "important". We also have support for .NET Core (at least on a
branch: https://github.com/apache/lucenenet/pull/191) and have over 6000
passing tests.

The following sub-projects (and their tests) that Itamar has earmarked as
"optional" can still be ported if any interested party wants to make a
contribution. If not, they won't be in the initial release.

1. Analysis.ICU
2. Analysis.Kuromoji (note only 3-4 days of work here, I think) 3.
Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic (Depends
on Apache Commons) Apache commons is mostly helper libraries, so there's
probably not real dependency just lots of replacement 5. Analysis.SmartCN
(note only 2-3 days of work here, I think) 6. Analysis.UIMA (Depends on
Tagger, uimaj-core, WhiteSpaceTokenizer) 7. Demo (might be a good learning
exercise)

--------------------------------------------------------------------------------------------------------
Per Itamar: We should be ready to release and stamp our builds as 100%
stable.
--------------------------------------------------------------------------------------------------------

Agreed. But we are not there, yet. There is still quite a bit of work to do
on that front.

1. There are many issues with the public API of Lucene.Net.Core and a few
other places (such as Lucene.Net.Grouping). Most of these issues will
require breaking API changes to fix . Although most of these changes are
fairly minor, since most will just be changing methods to properties and
vice versa, these changes will sweep through every consuming project. See
API Phase 1 section below.
2. There are ~35 tests that are failing, and some others failing randomly,
not to mention some differences in failure counts between (
https://github.com/apache/lucenenet/pull/191) and master.
3. Some of the test framework is not yet complete, which explains some of
the test failures. The culture, time zone, and culture are not being
randomized, we don't yet have the SuppressCodecs functionality in place,
nor has the Lucene 3.x backward compatibility been tested. The fact that we
are not randomizing culture means we are not testing the complete picture.
We know at present that there are ~35 test failures in en-US, but it is not
currently known how many we will have if we try other cultures. I suspect
there are many issues around casing, date and number formatting that will
need to be addressed.
4. Bugs are still relatively easy to find in the codebase. For example, I
recently discovered that Atomaton doesn't fully support Unicode:
https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/Automaton.cs#L648.
I also recently found and fixed an issue in the Test Framework where we
were generating random strings of numbers instead of random Unicode
characters to create test strings, and discovered some of the other random
functions have bugs that make them not test the whole range. I could go on,
but the point is there are still many bugs that may negatively affect
quality, most which aren't causing obvious test failures.
5. We haven't had much feedback from end users, most likely because we
haven't had many downloads on MyGet, and because we haven't yet made the
beta available on NuGet.

IMO we should do at least the following before we release (in order, bare
minimum):

1. Fix the breaking API issues ASAP so we don't burden users with breaking
changes later. We can make all of these changes on #191 (or a branch of it)
before it is merged so there is only 1 build that breaks the API instead of
doing it incrementally and having several successive API-breaking packages.
2. Update documentation on GitHub home page (and possibly the website) so
people can easily report bugs, find the information about how to manually
build, see that we have support for NuGet and MyGet and where to get them,
see current status, find documentation, contribute, etc. Can we setup a
WIKI on GitHub so we can add .NET specific Lucene documentation?
3. Ensure we use a version scheme we use allows for patching bugs so we
aren't locked into a single release of this port. Our current version
scheme of MAJOR.MINOR.BUILD.REVISION-PRE will work, but it is unclear
whether the new build process supports this.
4. Get a wider beta release on MyGet (with icu-dotnet and .NET Core
support) so we can start getting user feedback on any remaining issues. We
can then rely on the user feedback (or lack thereof) as one factor to
determine when we are ready for release.
5. Do a line-by-line sweep of Lucene.Net.Core and Lucene.Net.TestFramework
(and possibly Lucene.Net.Classifications, Lucene.Net.Expressions,
Lucene.Net.Join) to ensure we have everything implemented and plugged
together correctly. The rest I am fairly confident is implemented correctly
(we can rely on user feedback for any potential issues).
6. Finish the Test Framework implementation so it randomizes culture, time
zone, and codec (and suppresses codecs correctly) so we get a true measure
of test failures. And also so we can determine if the Lucene.Net 3.x
backward compatibility works.
7. Fix (at least the high priority) remaining tests.

Other tasks that are remaining to complete:

1. Create a RuleBasedBreakIterator based on icu-dotnet that breaks text
similar to Java's RuleBasedBreakIterator (for Highlighter and possibly
Analysis.Th).
2. Fix the issues with Collator and RuleBasedCollator that are causing some
test failures (for Collation namespace in Analysis.Common).
3. Clean up the Support namespace, remove unused types, organize into
sub-namespaces.
4. Make/port tests for types in the Support namespace to verify stability.
5. (recommended) Try to get others involved in the project to make
high-level integration APIs for certain target frameworks (see API Phase 2
below). This could be done after release, but we might have more
possibilities if we finish this part while it is a pre-release.
6. (optional) Performance tuning.
7. Suppress insignificant compiler warnings to see if there are any
important ones left to deal with (Lucene.Net.Core and a few others) 8.
Finish XML documentation comments (Lucene.Net.Core,
Lucene.Net.Classification, Lucene.Net.Queries) 9. Fix any directory casing
issues in the codebase that can potentially cause problems on some
platforms (see https://github.com/apache/lucenenet/pull/196).

@Connie

1. Are there any remaining issues around the build/deployment we still need
to resolve (versioning, integration with TeamCity, etc)?
2. Are we able to utilize our current versioning scheme
(MAJOR.MINOR.BUILD.REVISION-PRE)? I have verified that NuGet behaves
correctly with this scheme, and IMO it makes sense to use this scheme on a
port such as this one so we have a way to patch without incrementing beyond
the semantic version of Lucene we are emulating. It looks like this
versioning issue has been a roadblock for fixing bugs in previous
Lucene.Net ports.

API
===

--------------------------------------------------------------------------------------------------------
API Phase 1 - Stabilization
--------------------------------------------------------------------------------------------------------

Phase 1 takes care of finishing the breaking API changes that are necessary
to get from where we are to where we need to be for release. This is so the
naming and other conventions are consistent with .NET and/or Lucene, to
eliminate casts that are currently required to use certain functionality
(such as Grouping), and to identify other parts of the API that could be
improved (either to make it more similar to Lucene or to make it more
usable/intuitive in .NET).

--------------------------------------------------------------------------------------------------------
Per Itamar: Public API Inconsistencies. We can discuss what should be done
and what not when we get to that stage. Some are an obvious "fixme" but
some will break code compatibility with Java I think we should avoid.
--------------------------------------------------------------------------------------------------------

We are now at the stage where we should make this a top priority. Mind
sharing your thoughts on what "needs to be compatible with Java"? It seems
that MSDN has clear information on how to differentiate between a property
and a method:
https://msdn.microsoft.com/en-us/library/ms229054(v=vs.100).aspx, but which
of the items below are you concerned about? Shouldn't we be more concerned
with making it compatible with .NET than with Java?

Here is that list of API issues again:

1.      Method names that are still camelCase
2.      Properties that should be methods (because they do a lot of
processing or because they are non-deterministic)
3.      Methods that should be properties
4.      .Size() vs .Size vs .Count – should generally all be .Count (or
.Length) in .NET
5.      Interfaces should begin with “I”
6.      Classes should not begin with “I” followed by another singular
capital letter (for some reason some of them were named that way)
7.      .CharAt() should probably be this[]
8.      Generic types nested within generic types (which cause Visual
Studio to crash when Intellisense tries to read them)

We should add to that list:

1. Fix member accessibility to match that in Java (virtual by default,
non-virtual if "sealed" specified, etc.), so the intent of the original
design can be realized.
2. Rename Tokenattributes namespace to TokenAttributes (and any other
namespaces that don't follow .NET conventions).
3. Rename enumerations that were named with a "_e" suffix back to their
original name (we can do this by de-nesting them from the class they are in
so the name doesn't collide with a property).
4. Find any public APIs that are using nullable enumerations and try to
find another solution (such as making an overload that doesn't accept the
parameter and/or making a NOT_SET state (with the enum default value of 0)
in the enumeration).
5. Try to find a better replacement for Number than object (possibly by
using different overloads that accept different numeric types and keeping
track of the type that was passed).
6. Fields should generally not be public in .NET - we have several that
were named using Pascal Case, but in general they should be made into
properties that are Pascal Case that are either auto-implemented or backed
by fields that are camelCase.
7. Change the Collector abstract class to an interface so it will support
covariance (required by Grouping). We could alternatively back the
Collector abstract class by an interface, but I think that would be
confusing since every place where the abstract class is used now will need
to be replaced with the interface anyway.

There are probably some additional issues that will cause API breaks to
fix, but this most likely makes up the bulk of them.

--------------------------------------------------------------------------------------------------------
API Phase 2 - .NETify
--------------------------------------------------------------------------------------------------------

Phase 2 is to make the API more .NET-friendly.

There are several places where we can add overloads and extension methods
to make features of Lucene.Net act in concert with .NET better. For
example, we could add extra overloads on FSDirectory that take a path as a
string so consumers don't have to new up a (probably pointless)
DirectoryInfo instance, which would make it act more like the FileStream
object in .NET. Also, as I previously mentioned to Itamar, there are
several parts of Lucene's design that were specifically aimed at using
anonymous classes. We can probably find ways to simulate this using some
helper extension methods and/or fluent builders.

IMO Lucene.NET is useful, but its API is very low-level which makes using
it challenging to learn and integrate with modern .NET projects. Like many
other packages that are available on NuGet, it could use some integration
packages with various frameworks within .NET to make it easier to use. For
example, StructureMap has integrations with MVC 5 and WebAPI:
http://structuremap.github.io/integrations/. Here are a few ideas for
integrations we could make for Lucene.Net:

1. Lucene.Net.AspNet - integration with ASP.NET/MVC core (for plugging
common Lucene.Net features into the startup pipeline, etc).
2. Lucene.Net.AspNet.Suggest - UI integration with ASP.NET/MVC core, making
Suggest functionality into an HTML helper and/or view component that can be
consumed/customized easily.
3. Lucene.Net.Linq - This already exists, but perhaps we should contact the
author about bringing in as part of the main repo, or perhaps helping out
with our API effort: https://github.com/themotleyfool/Lucene.Net.Linq
4. Lucene.Net.EntityFramework - ?
5. Lucene.Net.MVC5 - integration with ASP.NET MVC 5.
6. Lucene.Net.WebApi - integration with WebApi.
7. Lucene.Net.Wpf - integration with WPF.
8. Lucene.Net.AspNet.Facet - UI integration with ASP.NET/MVC core, making
faceted search functionality into an HTML helper and/or view component, as
well as any support needed to setup the index for faceted search.

This is just a short list to get the ideas flowing. But we should really
aim to make Lucene.Net support easy to use with a wide range of frameworks
across both the .NET Framework and .NET Core stacks.

Just to be clear, the idea here is that we keep the Phase 1 API in place -
that is, we have an API that looks pretty much the same as Lucene, but
build high-level APIs on top of it to integrate many common use
cases/configurations for integrating with these other frameworks.

For example, in ASP.NET Core, it is recommended to use a singleton instance
for an IndexWriter rather than opening and closing it all of the time -
ideally, this could be done in a way that is familiar to the ASP.NET Core
startup configuration API.

        // This method gets called by the runtime. Use this method to add
services to the container.
        public void ConfigureServices(IServiceCollection services)
        {
            // Add framework services.
            services.AddApplicationInsightsTelemetry(Configuration);

            services.AddMvc();

            // Add a Lucene IndexWriter to the container (as a singleton)
            services.AddIndexWriter("~/the_index/", <other lucene-specific
options>);
        }

That one line of code would potentially save everyone who uses a Lucene.Net
IndexWriter in combination with ASP.NET Core several hours of research and
testing.

In the past, no such integrations existed with Lucene.Net, and as a result
the project's success has been limited and the project has always teetered
on the edge of oblivion. IMO, bringing the API to the users instead of
making them come and find it would make Lucene.Net a much more useful tool
that is accessible to many more people, and make recruiting help for future
porting efforts easier. Furthermore, these integration packages could act
as an adapter API that doesn't need to change much from one Lucene.Net port
to the next which will ease upgrading.

I am not alone in thinking that the API of Lucene.Net falls short of where
it should be:

https://simplelucene.codeplex.com/documentation
https://ayende.com/blog/158914/lucene-net-is-ugly

So let's not let Lucene.Net fall short of expectations again. Instead,
let's aim for making Lucene.Net into the de-facto standard full-text search
engine that is (mysteriously) missing from the .NET framework.

Thoughts? Ideas?


Thanks,
Shad Storhaug (NightOwl888)


P. S. Itamar - can we get an update as to the status of the new website?




-----Original Message-----
From: Shad Storhaug
Sent: Wednesday, October 12, 2016 2:10 AM
To: 'dev@lucenenet.apache.org'
Cc: 'Connie Yau'; 'cribs2@gmail.com'; 'itamar.synhershko@gmail.com'
Subject: RE: Remaining Work/Priorities

Update
======

I have just pushed some commits that fix several bugs in the
Lucene.Net.Codecs project (all 452 tests pass most of the time, a few
random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.


Fix for Test Context
-------------------------

For now, I have added method override stubs to each subclass in order to
add the [Test] attribute, so NUnit will run them in the correct context. I
did that on all of the superclass tests except for the ones in QueryParser
(since Itamar mentioned he would be working in that area). Itamar, you will
probably need to follow suit to get all of the QP tests to pass - namely
with the QueryParserTestBase and TestQueryParser classes.

I have carefully put all of these changes into a single commit so it can be
reverted easily, if this solution doesn't happen to be compatible with
xUnit:
https://github.com/apache/lucenenet/commit/2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c.
Hopefully that makes life easier for @conniey.

@Itamar, let me know when this is completed on your end so I can do a
double revert and squash the test stubs from QueryParser into an
all-inclusive revert-able commit.

We can now correctly see how many tests we have in the core. Currently
there are 2730 - it seems we are still missing 720 tests, assuming they all
were for something port-able.


Remaining Tests
---------------------

Next I plan to work on locating any tests that we have missed (starting in
the core). It seems these fall into several categories:

1. Tests that have not yet been ported.
2. Tests that have been partially ported that have not been added to the
project.
3. Tests that have been ported, but are missing the [Test] attribute.
4. Tests in classes that have been ported that have been commented out
(presumably because at the time they were ported the dependencies did not
yet exist).
5. Tests that have been Ignored in .NET that were not in Java.
6. Tests that have NUnit Assume.That() logic that depends on some
non-existant JRE condition, so they are not running in .NET.

I'll make a quick effort to get them to pass, but the main goal will be to
ensure they all can run and are included in the project. Just a heads up
that the number of test failures is likely to increase on this pass (but
the number of bugs will likely decrease).


Failing Core Tests
-----------------------

I have looked into the remaining tests somewhat. There are 2 issues that I
need some input on to solve.


TestRamUsageEstimator.TestSanity()

Java Lucene uses a JRE-specific API to determine how much header size to
add on each field. This makes the estimates higher in Java. But more
importantly, this test is failing because the estimate for a real string
instance is coming back as the same size as its shallow size (16 bytes in
this case) - it needs to be at least 1 byte more than that for the test to
pass. In Java (at least in a 64 bit environment), there are an extra 4
bytes being added for each field.

Technically, there is a way to get these numbers from .NET, but it involves
calling undocumented APIs using pointers and will likely be different from
one .NET version to the next (a bad idea for a project that needs to
support multiple .NET versions). The only solution I can think of is to
hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for 32
bit) in order to make the numbers for the instances larger than their
shallow size. I suppose the alternative would be to either comment out the
string test or change it to >= make it pass. Thoughts? Alternatives?


TestNumericDocValuesUpdates.TestUpdateOldSegments()

I discovered what the issue is here (normally that is the hard part), but
it seems that the proper solution is going to be a major task. The
NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a
service locator to load classes throughout the project. In the Codec
abstract class, it is used to load up the codec for the context it is used
in. However, our port of the NamedSPILoader simply loads all of the classes
from the current AppDomain without any way to order them or override them.

The problem is that in Lucene, this was meant to be an extension point. And
this particular test (and probably many more of them) uses that extension
point to change the codec to a Mock from the test framework. This line from
TestRuleSetupAndRestoreClassEnv pretty much sums up what the issue is:

> Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have
> tests-framework.jar before lucene-core.jar");

Basically, it is using a configuration file to order the classes that are
loaded so the test mocks take priority over the built-in codecs.

Just fixing the test could be done by making the static NamedSPILoader
variable in the Codec class internal and swapping in a test double.
However, that doesn't solve the bigger issue that Lucene.Net is missing its
extensibility for anyone who wants to write their own codec (or tap into
one of the other extensibility points). I guess the bigger question is how
important will it be for anyone to extend Lucene codecs or inject
dependencies into Analyzer factories? There doesn’t appear to be any more
extensibility than that in Lucene 4.8.0, but that could change in more
recent or future versions of Lucene.


CI Builds
-----------

Not working. Can someone look into that please?


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Shad Storhaug
Sent: Wednesday, October 5, 2016 8:23 PM
To: dev@lucenenet.apache.org
Cc: Connie Yau; 'cribs2@gmail.com'
Subject: RE: Remaining Work/Priorities

> Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from
the analysis.commons module?

Just for clarification, these are two entirely different things in Java.
Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of
Java:

import java.text.BreakIterator;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also depend
on parts of Java:

import java.text.BreakIterator;
import java.text.CharacterIterator;

Analysis.ICU depends on a separate (icu4j) package:

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Replaceable;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.RawCollationKey;

That said, icu4j DOES have Collator and RuleBasedCollator classes, but it
DOES NOT have a BreakIterator or CharacterIterator class. It is unclear
whether the Collator from icu4j would work as a replacement for the one in
core Java.

When I was digging through the JDK code, I noticed that BreakIterator and
RuleBasedCollator have a lot of common ICU dependencies there, so even if
the RuleBasedCollator from icu4j is compatible, it might make sense for us
to port the one from Java anyway so we are dealing with the same shared
dependencies in Analysis.Common.

Once we port over the classes from the Java JDK, we will be able to
eliminate our current ICU4NET dependency (and the platform issues that come
with it). That said, porting over those pieces could take considerable
work. In the interim it might make sense to make separate projects/NuGet
packages to isolate the areas that depend on BreakIterator,
CharacterIterator, and RuleBasedCollator so the rest can be released for
wide/cross-platform use. Perhaps we can even make a basic (scaled down)
BreakIterator for Highlighter that breaks on spaces between words and
punctuation between sentences, which wouldn't work for Thai, but would work
for most other languages.

Porting the (icu4j) package is another complete ball of yarn, we should
take a look at (https://github.com/sillsdev/icu-dotnet) to see if there is
enough overlap there to power Analysis.ICU (offhand it looks as though some
classes are missing, though). It is a wrapper around the C library - it may
be that we just need to port more of it to get all of the pieces we need.

Speaking of Collation, @ChristopherHaws have you made any more progress on
Analysis.Collation? Were you able to determine if icu-dotnet's collator
will make the tests pass?

> I'm on it QueryParser.Flexible

Great. The TimeZone probably just needs more research to work out how to
utilize (in order to implement the failing test). Also, FYI MSDN's
recommendation (
https://msdn.microsoft.com/en-us/library/system.timezone(v=vs.110).aspx) is
to use TimeZoneInfo rather than TimeZone (I noticed that several of the
tests were recently modified to use TimeZone rather than TimeZoneInfo).

As for the culture, in .NET I am pretty sure that we need to pass it as a
parameter to another overload of `QueryParser.Parse` rather than making it
a property of QueryParser. But we can deal with that in one step after you
have finished porting.

--

Shad Storhaug (NightOwl888)

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On
Behalf Of Itamar Syn-Hershko
Sent: Wednesday, October 5, 2016 5:28 AM
To: dev@lucenenet.apache.org
Cc: Connie Yau
Subject: Re: Remaining Work/Priorities

Awesome, thanks for all the hard work Shad!

Our first priority should be fixing all remaining tests - in particular the
one in Core. We should be ready to release and stamp our builds as 100%
stable. As you mentioned, this could be an infrastructure issue - hopefully
*Connie *can give a status update on her effort on the switch to xUnit?

With regards to Modules, here's an updated breakdown based on your email +
forgotten pieces + my comments:

*Ported:*
Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0
failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total
Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including
#188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total
Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42
total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1
failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total

We should do a second pass on the pieces we marked as ported, just to make
sure the port is full and we didn't leave anything behind :)

*Need to be ported:*
Highlighter (Depends on Collator (which is still being ported) and
BreakIterator (which we don't have a solution that works in .NET core yet))
Spatial (has 3rd party libraries that need to be updates) Spatial4n (
https://github.com/synhershko/Spatial4N) needs to be brought up to speed
with spatial4j, dependencies of which may cause some issues....
Codecs
Partially ported, mostly the tests weren't ported Grouping Not urgent, but
provides nice functionality that users will probably like

The only part with dependencies seems to be the spatial module - I will
have a look there soon if you don't get to that before I do.

*Can wait* - some modules are less frequently used, we should stabilize and
release first and then work on them based on demand Analysis.ICU (Depends
on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons
module? I keep getting reports on some issues they are causing
Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik)
Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly
helper libraries, so there's probably not real dependency just lots of
replacement Analysis.SmartCN Analysis.Stempel (currently in progress)
Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo
while important because can help newbies, we can do better by providing
docs and real world examples. I'm on it QueryParser.Flexible

*No need to port* - neither are needed in our context Benchmark (many
dependencies) Replicator (many dependencies) Sandbox (Depends on Apache
Jakarta)

Once all modules are ported and all tests are passing, I think we should
get two more items fixed before an official release:

1. .NET Core support - I'm not clear on the status of it at the moment. We
probably want to have it in for the release.

2. Public API Inconsistencies. We can discuss what should be done and what
not when we get to that stage. Some are an obvious "fixme" but some will
break code compatibility with Java I think we should avoid.

One last note - *Wyatt*, do we know why there are no CI builds lately?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko> Freelance
Developer & Consultant Lucene.NET committer and PMC member

On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining
> to be done on Lucene.Net version 4.8.0. We are nearly there, but that
> doesn't mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count
> we have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9
> total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet -
> (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0
> failing / 27 total Lucene.Net.Memory - 0 failing / 10 total
> Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing
> / 96 total Lucene.Net.QueryParser - 1 failing / 203 total
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently
> discovered that many of the "failures" that are being reported are
> false negatives (in fact, the VS2015 NUnit test runner shows there are
> 135 failing tests total and 902 tests total that don't belong to any
> project). Most NUnit 2.6 test runners do not correctly run tests in
> shared abstract classes with the correct context (test setup) to make
> them pass. These out-of-context runs add several additional minutes to
the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the
> situation somewhat - that is, it ran the tests in the correct context
> and I was able to determine that we have more tests than the numbers
> above and they are all succeeding. However, it also ran the tests in
> an invalid context (that is, the context of the abstract class without
> any setup) and some of them still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new
> fancy test attributes) is enough to fix this issue. If not, we need to
> find another solution - preferably one that can be applied to all of
> the tests in abstract classes without too much effort or changing them
> so they are too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days
> ago and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic
> (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel
> (currently in progress) 7. Analysis.UIMA (Depends on Tagger,
> uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> Demo 10. Highlighter (Depends on Collator (which is still being
> ported) and BreakIterator (which we don't have a solution that works
> in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox
> (Depends on Apache Jakarta) 13. Spatial (Already ported in #174
> (https://github.com/apache/ lucenenet/pull/174), needs a recent
> version of spatial4n) 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize
> this list in terms of priority. It also couldn't hurt to update the
> contributing documents
> (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and
> https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the
> current status.
>
> Of course, it is the known status of dependencies that we need
> clarification on. Which of these dependencies is known to be ported?
> Which of them are ported but are not up to date? Which of them are
> known not to be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the
> .NETification/consistency of the core API (that is, in the Lucene.Net
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that
> will affect everyone helping out on Lucene.Net and anyone who is
> currently using the beta. So, I just wanted to gather some input on
> when the most appropriate time to begin working on these sweeping changes
would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>

AW: Remaining Work/Priorities

Posted by "Roethinger, Alexander" <ar...@affili.net>.
Hi Shad,

thanks for this update and the insights on the current status.

I've been working with Lucene.Net 4.8 ever since you've pointed me to your project in October. 
My project is basically a generic WCF-Service for Lucene, providing a stand-alone search-engine for any kind of .net object. It was originally written using Lucene.Net 3.0 but is now fully ported to 4.8 and running for an inhouse project. It includes built in AutoSuggest, HighFrequencyTerms, Searcher-Warming during WCF initialization and some other nice features.
Apart from the ICU issue (which doesn't affect me because I don't need the dependency) I have so far not encountered any serious issues.
Unfortunately, I don't have enough time and in-depth knowledge of Lucene to help you guys with the actual porting. 

But picking up on what you mentioned under " API Phase 2 - .NETify", I would be happy to contribute from a "consumer" point of view based on the stuff we have been developing so far, including testing my application against current releases of Lucene.Net or helping to make the code more .net like.

Kind regards
Alexander
 

-----Ursprüngliche Nachricht-----
Von: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Gesendet: Donnerstag, 15. Dezember 2016 19:34
An: dev@lucenenet.apache.org
Cc: Connie Yau <co...@microsoft.com>; itamar.synhershko@gmail.com
Betreff: RE: Remaining Work/Priorities

Update
======

It has been a while since I have communicated the current status of the Lucene.Net codebase to the team, and I am getting concerned that claims that we are "close to release" are being exaggerated a bit. We are almost ready to put a pre-release on NuGet so the masses can start consuming it, but there are some bases we still need to cover to stabilize for release and ready Lucene.Net 4.8.0 for enterprise-level quality expectations.

We have now successfully ported more than 380,000 executable lines of code from Java to .NET, and have ported every Lucene sub-project that Itamar has earmarked as "important". We also have support for .NET Core (at least on a branch: https://github.com/apache/lucenenet/pull/191) and have over 6000 passing tests.

The following sub-projects (and their tests) that Itamar has earmarked as "optional" can still be ported if any interested party wants to make a contribution. If not, they won't be in the initial release.

1. Analysis.ICU
2. Analysis.Kuromoji (note only 3-4 days of work here, I think) 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly helper libraries, so there's probably not real dependency just lots of replacement 5. Analysis.SmartCN (note only 2-3 days of work here, I think) 6. Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) 7. Demo (might be a good learning exercise) 

--------------------------------------------------------------------------------------------------------
Per Itamar: We should be ready to release and stamp our builds as 100% stable.
--------------------------------------------------------------------------------------------------------

Agreed. But we are not there, yet. There is still quite a bit of work to do on that front.

1. There are many issues with the public API of Lucene.Net.Core and a few other places (such as Lucene.Net.Grouping). Most of these issues will require breaking API changes to fix . Although most of these changes are fairly minor, since most will just be changing methods to properties and vice versa, these changes will sweep through every consuming project. See API Phase 1 section below.
2. There are ~35 tests that are failing, and some others failing randomly, not to mention some differences in failure counts between (https://github.com/apache/lucenenet/pull/191) and master.
3. Some of the test framework is not yet complete, which explains some of the test failures. The culture, time zone, and culture are not being randomized, we don't yet have the SuppressCodecs functionality in place, nor has the Lucene 3.x backward compatibility been tested. The fact that we are not randomizing culture means we are not testing the complete picture. We know at present that there are ~35 test failures in en-US, but it is not currently known how many we will have if we try other cultures. I suspect there are many issues around casing, date and number formatting that will need to be addressed.
4. Bugs are still relatively easy to find in the codebase. For example, I recently discovered that Atomaton doesn't fully support Unicode: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/Automaton.cs#L648. I also recently found and fixed an issue in the Test Framework where we were generating random strings of numbers instead of random Unicode characters to create test strings, and discovered some of the other random functions have bugs that make them not test the whole range. I could go on, but the point is there are still many bugs that may negatively affect quality, most which aren't causing obvious test failures.
5. We haven't had much feedback from end users, most likely because we haven't had many downloads on MyGet, and because we haven't yet made the beta available on NuGet.

IMO we should do at least the following before we release (in order, bare minimum):

1. Fix the breaking API issues ASAP so we don't burden users with breaking changes later. We can make all of these changes on #191 (or a branch of it) before it is merged so there is only 1 build that breaks the API instead of doing it incrementally and having several successive API-breaking packages.
2. Update documentation on GitHub home page (and possibly the website) so people can easily report bugs, find the information about how to manually build, see that we have support for NuGet and MyGet and where to get them, see current status, find documentation, contribute, etc. Can we setup a WIKI on GitHub so we can add .NET specific Lucene documentation?
3. Ensure we use a version scheme we use allows for patching bugs so we aren't locked into a single release of this port. Our current version scheme of MAJOR.MINOR.BUILD.REVISION-PRE will work, but it is unclear whether the new build process supports this.
4. Get a wider beta release on MyGet (with icu-dotnet and .NET Core support) so we can start getting user feedback on any remaining issues. We can then rely on the user feedback (or lack thereof) as one factor to determine when we are ready for release.
5. Do a line-by-line sweep of Lucene.Net.Core and Lucene.Net.TestFramework (and possibly Lucene.Net.Classifications, Lucene.Net.Expressions, Lucene.Net.Join) to ensure we have everything implemented and plugged together correctly. The rest I am fairly confident is implemented correctly (we can rely on user feedback for any potential issues).
6. Finish the Test Framework implementation so it randomizes culture, time zone, and codec (and suppresses codecs correctly) so we get a true measure of test failures. And also so we can determine if the Lucene.Net 3.x backward compatibility works.
7. Fix (at least the high priority) remaining tests.

Other tasks that are remaining to complete:

1. Create a RuleBasedBreakIterator based on icu-dotnet that breaks text similar to Java's RuleBasedBreakIterator (for Highlighter and possibly Analysis.Th).
2. Fix the issues with Collator and RuleBasedCollator that are causing some test failures (for Collation namespace in Analysis.Common).
3. Clean up the Support namespace, remove unused types, organize into sub-namespaces.
4. Make/port tests for types in the Support namespace to verify stability.
5. (recommended) Try to get others involved in the project to make high-level integration APIs for certain target frameworks (see API Phase 2 below). This could be done after release, but we might have more possibilities if we finish this part while it is a pre-release.
6. (optional) Performance tuning.
7. Suppress insignificant compiler warnings to see if there are any important ones left to deal with (Lucene.Net.Core and a few others) 8. Finish XML documentation comments (Lucene.Net.Core, Lucene.Net.Classification, Lucene.Net.Queries) 9. Fix any directory casing issues in the codebase that can potentially cause problems on some platforms (see https://github.com/apache/lucenenet/pull/196).

@Connie

1. Are there any remaining issues around the build/deployment we still need to resolve (versioning, integration with TeamCity, etc)? 
2. Are we able to utilize our current versioning scheme (MAJOR.MINOR.BUILD.REVISION-PRE)? I have verified that NuGet behaves correctly with this scheme, and IMO it makes sense to use this scheme on a port such as this one so we have a way to patch without incrementing beyond the semantic version of Lucene we are emulating. It looks like this versioning issue has been a roadblock for fixing bugs in previous Lucene.Net ports.

API
===

--------------------------------------------------------------------------------------------------------
API Phase 1 - Stabilization
--------------------------------------------------------------------------------------------------------

Phase 1 takes care of finishing the breaking API changes that are necessary to get from where we are to where we need to be for release. This is so the naming and other conventions are consistent with .NET and/or Lucene, to eliminate casts that are currently required to use certain functionality (such as Grouping), and to identify other parts of the API that could be improved (either to make it more similar to Lucene or to make it more usable/intuitive in .NET).

--------------------------------------------------------------------------------------------------------
Per Itamar: Public API Inconsistencies. We can discuss what should be done and what not when we get to that stage. Some are an obvious "fixme" but some will break code compatibility with Java I think we should avoid.
--------------------------------------------------------------------------------------------------------

We are now at the stage where we should make this a top priority. Mind sharing your thoughts on what "needs to be compatible with Java"? It seems that MSDN has clear information on how to differentiate between a property and a method: https://msdn.microsoft.com/en-us/library/ms229054(v=vs.100).aspx, but which of the items below are you concerned about? Shouldn't we be more concerned with making it compatible with .NET than with Java?

Here is that list of API issues again:

1.	Method names that are still camelCase
2.	Properties that should be methods (because they do a lot of processing or because they are non-deterministic)
3.	Methods that should be properties
4.	.Size() vs .Size vs .Count – should generally all be .Count (or .Length) in .NET
5.	Interfaces should begin with “I”
6.	Classes should not begin with “I” followed by another singular capital letter (for some reason some of them were named that way)
7.	.CharAt() should probably be this[]
8.	Generic types nested within generic types (which cause Visual Studio to crash when Intellisense tries to read them)

We should add to that list:

1. Fix member accessibility to match that in Java (virtual by default, non-virtual if "sealed" specified, etc.), so the intent of the original design can be realized.
2. Rename Tokenattributes namespace to TokenAttributes (and any other namespaces that don't follow .NET conventions).
3. Rename enumerations that were named with a "_e" suffix back to their original name (we can do this by de-nesting them from the class they are in so the name doesn't collide with a property).
4. Find any public APIs that are using nullable enumerations and try to find another solution (such as making an overload that doesn't accept the parameter and/or making a NOT_SET state (with the enum default value of 0) in the enumeration).
5. Try to find a better replacement for Number than object (possibly by using different overloads that accept different numeric types and keeping track of the type that was passed).
6. Fields should generally not be public in .NET - we have several that were named using Pascal Case, but in general they should be made into properties that are Pascal Case that are either auto-implemented or backed by fields that are camelCase.
7. Change the Collector abstract class to an interface so it will support covariance (required by Grouping). We could alternatively back the Collector abstract class by an interface, but I think that would be confusing since every place where the abstract class is used now will need to be replaced with the interface anyway. 

There are probably some additional issues that will cause API breaks to fix, but this most likely makes up the bulk of them.

--------------------------------------------------------------------------------------------------------
API Phase 2 - .NETify
--------------------------------------------------------------------------------------------------------

Phase 2 is to make the API more .NET-friendly.

There are several places where we can add overloads and extension methods to make features of Lucene.Net act in concert with .NET better. For example, we could add extra overloads on FSDirectory that take a path as a string so consumers don't have to new up a (probably pointless) DirectoryInfo instance, which would make it act more like the FileStream object in .NET. Also, as I previously mentioned to Itamar, there are several parts of Lucene's design that were specifically aimed at using anonymous classes. We can probably find ways to simulate this using some helper extension methods and/or fluent builders.

IMO Lucene.NET is useful, but its API is very low-level which makes using it challenging to learn and integrate with modern .NET projects. Like many other packages that are available on NuGet, it could use some integration packages with various frameworks within .NET to make it easier to use. For example, StructureMap has integrations with MVC 5 and WebAPI: http://structuremap.github.io/integrations/. Here are a few ideas for integrations we could make for Lucene.Net:

1. Lucene.Net.AspNet - integration with ASP.NET/MVC core (for plugging common Lucene.Net features into the startup pipeline, etc).
2. Lucene.Net.AspNet.Suggest - UI integration with ASP.NET/MVC core, making Suggest functionality into an HTML helper and/or view component that can be consumed/customized easily.
3. Lucene.Net.Linq - This already exists, but perhaps we should contact the author about bringing in as part of the main repo, or perhaps helping out with our API effort: https://github.com/themotleyfool/Lucene.Net.Linq
4. Lucene.Net.EntityFramework - ?
5. Lucene.Net.MVC5 - integration with ASP.NET MVC 5.
6. Lucene.Net.WebApi - integration with WebApi.
7. Lucene.Net.Wpf - integration with WPF.
8. Lucene.Net.AspNet.Facet - UI integration with ASP.NET/MVC core, making faceted search functionality into an HTML helper and/or view component, as well as any support needed to setup the index for faceted search.

This is just a short list to get the ideas flowing. But we should really aim to make Lucene.Net support easy to use with a wide range of frameworks across both the .NET Framework and .NET Core stacks.

Just to be clear, the idea here is that we keep the Phase 1 API in place - that is, we have an API that looks pretty much the same as Lucene, but build high-level APIs on top of it to integrate many common use cases/configurations for integrating with these other frameworks.

For example, in ASP.NET Core, it is recommended to use a singleton instance for an IndexWriter rather than opening and closing it all of the time - ideally, this could be done in a way that is familiar to the ASP.NET Core startup configuration API.

        // This method gets called by the runtime. Use this method to add services to the container.
        public void ConfigureServices(IServiceCollection services)
        {
            // Add framework services.
            services.AddApplicationInsightsTelemetry(Configuration);

            services.AddMvc();

            // Add a Lucene IndexWriter to the container (as a singleton)
            services.AddIndexWriter("~/the_index/", <other lucene-specific options>);
        }

That one line of code would potentially save everyone who uses a Lucene.Net IndexWriter in combination with ASP.NET Core several hours of research and testing.

In the past, no such integrations existed with Lucene.Net, and as a result the project's success has been limited and the project has always teetered on the edge of oblivion. IMO, bringing the API to the users instead of making them come and find it would make Lucene.Net a much more useful tool that is accessible to many more people, and make recruiting help for future porting efforts easier. Furthermore, these integration packages could act as an adapter API that doesn't need to change much from one Lucene.Net port to the next which will ease upgrading.

I am not alone in thinking that the API of Lucene.Net falls short of where it should be:

https://simplelucene.codeplex.com/documentation
https://ayende.com/blog/158914/lucene-net-is-ugly

So let's not let Lucene.Net fall short of expectations again. Instead, let's aim for making Lucene.Net into the de-facto standard full-text search engine that is (mysteriously) missing from the .NET framework.

Thoughts? Ideas?


Thanks,
Shad Storhaug (NightOwl888)


P. S. Itamar - can we get an update as to the status of the new website?




-----Original Message-----
From: Shad Storhaug
Sent: Wednesday, October 12, 2016 2:10 AM
To: 'dev@lucenenet.apache.org'
Cc: 'Connie Yau'; 'cribs2@gmail.com'; 'itamar.synhershko@gmail.com'
Subject: RE: Remaining Work/Priorities

Update
======

I have just pushed some commits that fix several bugs in the Lucene.Net.Codecs project (all 452 tests pass most of the time, a few random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.


Fix for Test Context
-------------------------

For now, I have added method override stubs to each subclass in order to add the [Test] attribute, so NUnit will run them in the correct context. I did that on all of the superclass tests except for the ones in QueryParser (since Itamar mentioned he would be working in that area). Itamar, you will probably need to follow suit to get all of the QP tests to pass - namely with the QueryParserTestBase and TestQueryParser classes.

I have carefully put all of these changes into a single commit so it can be reverted easily, if this solution doesn't happen to be compatible with xUnit: https://github.com/apache/lucenenet/commit/2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c. Hopefully that makes life easier for @conniey.

@Itamar, let me know when this is completed on your end so I can do a double revert and squash the test stubs from QueryParser into an all-inclusive revert-able commit.

We can now correctly see how many tests we have in the core. Currently there are 2730 - it seems we are still missing 720 tests, assuming they all were for something port-able.


Remaining Tests
---------------------

Next I plan to work on locating any tests that we have missed (starting in the core). It seems these fall into several categories:

1. Tests that have not yet been ported.
2. Tests that have been partially ported that have not been added to the project.
3. Tests that have been ported, but are missing the [Test] attribute.
4. Tests in classes that have been ported that have been commented out (presumably because at the time they were ported the dependencies did not yet exist).
5. Tests that have been Ignored in .NET that were not in Java.
6. Tests that have NUnit Assume.That() logic that depends on some non-existant JRE condition, so they are not running in .NET.

I'll make a quick effort to get them to pass, but the main goal will be to ensure they all can run and are included in the project. Just a heads up that the number of test failures is likely to increase on this pass (but the number of bugs will likely decrease).


Failing Core Tests
-----------------------

I have looked into the remaining tests somewhat. There are 2 issues that I need some input on to solve.


TestRamUsageEstimator.TestSanity()

Java Lucene uses a JRE-specific API to determine how much header size to add on each field. This makes the estimates higher in Java. But more importantly, this test is failing because the estimate for a real string instance is coming back as the same size as its shallow size (16 bytes in this case) - it needs to be at least 1 byte more than that for the test to pass. In Java (at least in a 64 bit environment), there are an extra 4 bytes being added for each field.

Technically, there is a way to get these numbers from .NET, but it involves calling undocumented APIs using pointers and will likely be different from one .NET version to the next (a bad idea for a project that needs to support multiple .NET versions). The only solution I can think of is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for 32 bit) in order to make the numbers for the instances larger than their shallow size. I suppose the alternative would be to either comment out the string test or change it to >= make it pass. Thoughts? Alternatives?


TestNumericDocValuesUpdates.TestUpdateOldSegments()

I discovered what the issue is here (normally that is the hard part), but it seems that the proper solution is going to be a major task. The NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a service locator to load classes throughout the project. In the Codec abstract class, it is used to load up the codec for the context it is used in. However, our port of the NamedSPILoader simply loads all of the classes from the current AppDomain without any way to order them or override them.

The problem is that in Lucene, this was meant to be an extension point. And this particular test (and probably many more of them) uses that extension point to change the codec to a Mock from the test framework. This line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the issue is:

> Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have 
> tests-framework.jar before lucene-core.jar");

Basically, it is using a configuration file to order the classes that are loaded so the test mocks take priority over the built-in codecs.

Just fixing the test could be done by making the static NamedSPILoader variable in the Codec class internal and swapping in a test double. However, that doesn't solve the bigger issue that Lucene.Net is missing its extensibility for anyone who wants to write their own codec (or tap into one of the other extensibility points). I guess the bigger question is how important will it be for anyone to extend Lucene codecs or inject dependencies into Analyzer factories? There doesn’t appear to be any more extensibility than that in Lucene 4.8.0, but that could change in more recent or future versions of Lucene.


CI Builds
-----------

Not working. Can someone look into that please?


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Shad Storhaug
Sent: Wednesday, October 5, 2016 8:23 PM
To: dev@lucenenet.apache.org
Cc: Connie Yau; 'cribs2@gmail.com'
Subject: RE: Remaining Work/Priorities

> Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module?

Just for clarification, these are two entirely different things in Java. Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of Java:

import java.text.BreakIterator;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also depend on parts of Java:

import java.text.BreakIterator;
import java.text.CharacterIterator;

Analysis.ICU depends on a separate (icu4j) package:

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Replaceable;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.RawCollationKey;

That said, icu4j DOES have Collator and RuleBasedCollator classes, but it DOES NOT have a BreakIterator or CharacterIterator class. It is unclear whether the Collator from icu4j would work as a replacement for the one in core Java.

When I was digging through the JDK code, I noticed that BreakIterator and RuleBasedCollator have a lot of common ICU dependencies there, so even if the RuleBasedCollator from icu4j is compatible, it might make sense for us to port the one from Java anyway so we are dealing with the same shared dependencies in Analysis.Common.

Once we port over the classes from the Java JDK, we will be able to eliminate our current ICU4NET dependency (and the platform issues that come with it). That said, porting over those pieces could take considerable work. In the interim it might make sense to make separate projects/NuGet packages to isolate the areas that depend on BreakIterator, CharacterIterator, and RuleBasedCollator so the rest can be released for wide/cross-platform use. Perhaps we can even make a basic (scaled down) BreakIterator for Highlighter that breaks on spaces between words and punctuation between sentences, which wouldn't work for Thai, but would work for most other languages.

Porting the (icu4j) package is another complete ball of yarn, we should take a look at (https://github.com/sillsdev/icu-dotnet) to see if there is enough overlap there to power Analysis.ICU (offhand it looks as though some classes are missing, though). It is a wrapper around the C library - it may be that we just need to port more of it to get all of the pieces we need.

Speaking of Collation, @ChristopherHaws have you made any more progress on Analysis.Collation? Were you able to determine if icu-dotnet's collator will make the tests pass?

> I'm on it QueryParser.Flexible

Great. The TimeZone probably just needs more research to work out how to utilize (in order to implement the failing test). Also, FYI MSDN's recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed that several of the tests were recently modified to use TimeZone rather than TimeZoneInfo).

As for the culture, in .NET I am pretty sure that we need to pass it as a parameter to another overload of `QueryParser.Parse` rather than making it a property of QueryParser. But we can deal with that in one step after you have finished porting.

--

Shad Storhaug (NightOwl888)

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar Syn-Hershko
Sent: Wednesday, October 5, 2016 5:28 AM
To: dev@lucenenet.apache.org
Cc: Connie Yau
Subject: Re: Remaining Work/Priorities

Awesome, thanks for all the hard work Shad!

Our first priority should be fixing all remaining tests - in particular the one in Core. We should be ready to release and stamp our builds as 100% stable. As you mentioned, this could be an infrastructure issue - hopefully *Connie *can give a status update on her effort on the switch to xUnit?

With regards to Modules, here's an updated breakdown based on your email + forgotten pieces + my comments:

*Ported:*
Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1 failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total

We should do a second pass on the pieces we marked as ported, just to make sure the port is full and we didn't leave anything behind :)

*Need to be ported:*
Highlighter (Depends on Collator (which is still being ported) and BreakIterator (which we don't have a solution that works in .NET core yet)) Spatial (has 3rd party libraries that need to be updates) Spatial4n (https://github.com/synhershko/Spatial4N) needs to be brought up to speed with spatial4j, dependencies of which may cause some issues....
Codecs
Partially ported, mostly the tests weren't ported Grouping Not urgent, but provides nice functionality that users will probably like

The only part with dependencies seems to be the spatial module - I will have a look there soon if you don't get to that before I do.

*Can wait* - some modules are less frequently used, we should stabilize and release first and then work on them based on demand Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module? I keep getting reports on some issues they are causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik) Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly helper libraries, so there's probably not real dependency just lots of replacement Analysis.SmartCN Analysis.Stempel (currently in progress) Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo while important because can help newbies, we can do better by providing docs and real world examples. I'm on it QueryParser.Flexible

*No need to port* - neither are needed in our context Benchmark (many dependencies) Replicator (many dependencies) Sandbox (Depends on Apache Jakarta)

Once all modules are ported and all tests are passing, I think we should get two more items fixed before an official release:

1. .NET Core support - I'm not clear on the status of it at the moment. We probably want to have it in for the release.

2. Public API Inconsistencies. We can discuss what should be done and what not when we get to that stage. Some are an obvious "fixme" but some will break code compatibility with Java I think we should avoid.

One last note - *Wyatt*, do we know why there are no CI builds lately?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko> Freelance Developer & Consultant Lucene.NET committer and PMC member

On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining 
> to be done on Lucene.Net version 4.8.0. We are nearly there, but that 
> doesn't mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count 
> we have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 
> total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - 
> (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 
> failing / 27 total Lucene.Net.Memory - 0 failing / 10 total 
> Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing 
> / 96 total Lucene.Net.QueryParser - 1 failing / 203 total 
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently 
> discovered that many of the "failures" that are being reported are 
> false negatives (in fact, the VS2015 NUnit test runner shows there are
> 135 failing tests total and 902 tests total that don't belong to any 
> project). Most NUnit 2.6 test runners do not correctly run tests in 
> shared abstract classes with the correct context (test setup) to make 
> them pass. These out-of-context runs add several additional minutes to the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the 
> situation somewhat - that is, it ran the tests in the correct context 
> and I was able to determine that we have more tests than the numbers 
> above and they are all succeeding. However, it also ran the tests in 
> an invalid context (that is, the context of the abstract class without 
> any setup) and some of them still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new 
> fancy test attributes) is enough to fix this issue. If not, we need to 
> find another solution - preferably one that can be applied to all of 
> the tests in abstract classes without too much effort or changing them 
> so they are too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days 
> ago and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic 
> (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel 
> (currently in progress) 7. Analysis.UIMA (Depends on Tagger, 
> uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> Demo 10. Highlighter (Depends on Collator (which is still being
> ported) and BreakIterator (which we don't have a solution that works 
> in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox 
> (Depends on Apache Jakarta) 13. Spatial (Already ported in #174 
> (https://github.com/apache/ lucenenet/pull/174), needs a recent 
> version of spatial4n) 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize 
> this list in terms of priority. It also couldn't hurt to update the 
> contributing documents 
> (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and
> https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the 
> current status.
>
> Of course, it is the known status of dependencies that we need 
> clarification on. Which of these dependencies is known to be ported?
> Which of them are ported but are not up to date? Which of them are 
> known not to be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the 
> .NETification/consistency of the core API (that is, in the Lucene.Net 
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that 
> will affect everyone helping out on Lucene.Net and anyone who is 
> currently using the beta. So, I just wanted to gather some input on 
> when the most appropriate time to begin working on these sweeping changes would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>

RE: Remaining Work/Priorities

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Update
======

It has been a while since I have communicated the current status of the Lucene.Net codebase to the team, and I am getting concerned that claims that we are "close to release" are being exaggerated a bit. We are almost ready to put a pre-release on NuGet so the masses can start consuming it, but there are some bases we still need to cover to stabilize for release and ready Lucene.Net 4.8.0 for enterprise-level quality expectations.

We have now successfully ported more than 380,000 executable lines of code from Java to .NET, and have ported every Lucene sub-project that Itamar has earmarked as "important". We also have support for .NET Core (at least on a branch: https://github.com/apache/lucenenet/pull/191) and have over 6000 passing tests.

The following sub-projects (and their tests) that Itamar has earmarked as "optional" can still be ported if any interested party wants to make a contribution. If not, they won't be in the initial release.

1. Analysis.ICU
2. Analysis.Kuromoji (note only 3-4 days of work here, I think)
3. Analysis.Morfologik (Depends on Morfologik) 
4. Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly helper libraries, so there's probably not real dependency just lots of replacement 
5. Analysis.SmartCN (note only 2-3 days of work here, I think)
6. Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer)
7. Demo (might be a good learning exercise) 

--------------------------------------------------------------------------------------------------------
Per Itamar: We should be ready to release and stamp our builds as 100% stable.
--------------------------------------------------------------------------------------------------------

Agreed. But we are not there, yet. There is still quite a bit of work to do on that front.

1. There are many issues with the public API of Lucene.Net.Core and a few other places (such as Lucene.Net.Grouping). Most of these issues will require breaking API changes to fix . Although most of these changes are fairly minor, since most will just be changing methods to properties and vice versa, these changes will sweep through every consuming project. See API Phase 1 section below.
2. There are ~35 tests that are failing, and some others failing randomly, not to mention some differences in failure counts between (https://github.com/apache/lucenenet/pull/191) and master.
3. Some of the test framework is not yet complete, which explains some of the test failures. The culture, time zone, and culture are not being randomized, we don't yet have the SuppressCodecs functionality in place, nor has the Lucene 3.x backward compatibility been tested. The fact that we are not randomizing culture means we are not testing the complete picture. We know at present that there are ~35 test failures in en-US, but it is not currently known how many we will have if we try other cultures. I suspect there are many issues around casing, date and number formatting that will need to be addressed.
4. Bugs are still relatively easy to find in the codebase. For example, I recently discovered that Atomaton doesn't fully support Unicode: https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Util/Automaton/Automaton.cs#L648. I also recently found and fixed an issue in the Test Framework where we were generating random strings of numbers instead of random Unicode characters to create test strings, and discovered some of the other random functions have bugs that make them not test the whole range. I could go on, but the point is there are still many bugs that may negatively affect quality, most which aren't causing obvious test failures.
5. We haven't had much feedback from end users, most likely because we haven't had many downloads on MyGet, and because we haven't yet made the beta available on NuGet.

IMO we should do at least the following before we release (in order, bare minimum):

1. Fix the breaking API issues ASAP so we don't burden users with breaking changes later. We can make all of these changes on #191 (or a branch of it) before it is merged so there is only 1 build that breaks the API instead of doing it incrementally and having several successive API-breaking packages.
2. Update documentation on GitHub home page (and possibly the website) so people can easily report bugs, find the information about how to manually build, see that we have support for NuGet and MyGet and where to get them, see current status, find documentation, contribute, etc. Can we setup a WIKI on GitHub so we can add .NET specific Lucene documentation?
3. Ensure we use a version scheme we use allows for patching bugs so we aren't locked into a single release of this port. Our current version scheme of MAJOR.MINOR.BUILD.REVISION-PRE will work, but it is unclear whether the new build process supports this.
4. Get a wider beta release on MyGet (with icu-dotnet and .NET Core support) so we can start getting user feedback on any remaining issues. We can then rely on the user feedback (or lack thereof) as one factor to determine when we are ready for release.
5. Do a line-by-line sweep of Lucene.Net.Core and Lucene.Net.TestFramework (and possibly Lucene.Net.Classifications, Lucene.Net.Expressions, Lucene.Net.Join) to ensure we have everything implemented and plugged together correctly. The rest I am fairly confident is implemented correctly (we can rely on user feedback for any potential issues).
6. Finish the Test Framework implementation so it randomizes culture, time zone, and codec (and suppresses codecs correctly) so we get a true measure of test failures. And also so we can determine if the Lucene.Net 3.x backward compatibility works.
7. Fix (at least the high priority) remaining tests.

Other tasks that are remaining to complete:

1. Create a RuleBasedBreakIterator based on icu-dotnet that breaks text similar to Java's RuleBasedBreakIterator (for Highlighter and possibly Analysis.Th).
2. Fix the issues with Collator and RuleBasedCollator that are causing some test failures (for Collation namespace in Analysis.Common).
3. Clean up the Support namespace, remove unused types, organize into sub-namespaces.
4. Make/port tests for types in the Support namespace to verify stability.
5. (recommended) Try to get others involved in the project to make high-level integration APIs for certain target frameworks (see API Phase 2 below). This could be done after release, but we might have more possibilities if we finish this part while it is a pre-release.
6. (optional) Performance tuning.
7. Suppress insignificant compiler warnings to see if there are any important ones left to deal with (Lucene.Net.Core and a few others)
8. Finish XML documentation comments (Lucene.Net.Core, Lucene.Net.Classification, Lucene.Net.Queries)
9. Fix any directory casing issues in the codebase that can potentially cause problems on some platforms (see https://github.com/apache/lucenenet/pull/196).

@Connie

1. Are there any remaining issues around the build/deployment we still need to resolve (versioning, integration with TeamCity, etc)? 
2. Are we able to utilize our current versioning scheme (MAJOR.MINOR.BUILD.REVISION-PRE)? I have verified that NuGet behaves correctly with this scheme, and IMO it makes sense to use this scheme on a port such as this one so we have a way to patch without incrementing beyond the semantic version of Lucene we are emulating. It looks like this versioning issue has been a roadblock for fixing bugs in previous Lucene.Net ports.

API
===

--------------------------------------------------------------------------------------------------------
API Phase 1 - Stabilization
--------------------------------------------------------------------------------------------------------

Phase 1 takes care of finishing the breaking API changes that are necessary to get from where we are to where we need to be for release. This is so the naming and other conventions are consistent with .NET and/or Lucene, to eliminate casts that are currently required to use certain functionality (such as Grouping), and to identify other parts of the API that could be improved (either to make it more similar to Lucene or to make it more usable/intuitive in .NET).

--------------------------------------------------------------------------------------------------------
Per Itamar: Public API Inconsistencies. We can discuss what should be done and what not when we get to that stage. Some are an obvious "fixme" but some will break code compatibility with Java I think we should avoid.
--------------------------------------------------------------------------------------------------------

We are now at the stage where we should make this a top priority. Mind sharing your thoughts on what "needs to be compatible with Java"? It seems that MSDN has clear information on how to differentiate between a property and a method: https://msdn.microsoft.com/en-us/library/ms229054(v=vs.100).aspx, but which of the items below are you concerned about? Shouldn't we be more concerned with making it compatible with .NET than with Java?

Here is that list of API issues again:

1.	Method names that are still camelCase
2.	Properties that should be methods (because they do a lot of processing or because they are non-deterministic)
3.	Methods that should be properties
4.	.Size() vs .Size vs .Count – should generally all be .Count (or .Length) in .NET
5.	Interfaces should begin with “I”
6.	Classes should not begin with “I” followed by another singular capital letter (for some reason some of them were named that way)
7.	.CharAt() should probably be this[]
8.	Generic types nested within generic types (which cause Visual Studio to crash when Intellisense tries to read them)

We should add to that list:

1. Fix member accessibility to match that in Java (virtual by default, non-virtual if "sealed" specified, etc.), so the intent of the original design can be realized.
2. Rename Tokenattributes namespace to TokenAttributes (and any other namespaces that don't follow .NET conventions).
3. Rename enumerations that were named with a "_e" suffix back to their original name (we can do this by de-nesting them from the class they are in so the name doesn't collide with a property).
4. Find any public APIs that are using nullable enumerations and try to find another solution (such as making an overload that doesn't accept the parameter and/or making a NOT_SET state (with the enum default value of 0) in the enumeration).
5. Try to find a better replacement for Number than object (possibly by using different overloads that accept different numeric types and keeping track of the type that was passed).
6. Fields should generally not be public in .NET - we have several that were named using Pascal Case, but in general they should be made into properties that are Pascal Case that are either auto-implemented or backed by fields that are camelCase.
7. Change the Collector abstract class to an interface so it will support covariance (required by Grouping). We could alternatively back the Collector abstract class by an interface, but I think that would be confusing since every place where the abstract class is used now will need to be replaced with the interface anyway. 

There are probably some additional issues that will cause API breaks to fix, but this most likely makes up the bulk of them.

--------------------------------------------------------------------------------------------------------
API Phase 2 - .NETify
--------------------------------------------------------------------------------------------------------

Phase 2 is to make the API more .NET-friendly.

There are several places where we can add overloads and extension methods to make features of Lucene.Net act in concert with .NET better. For example, we could add extra overloads on FSDirectory that take a path as a string so consumers don't have to new up a (probably pointless) DirectoryInfo instance, which would make it act more like the FileStream object in .NET. Also, as I previously mentioned to Itamar, there are several parts of Lucene's design that were specifically aimed at using anonymous classes. We can probably find ways to simulate this using some helper extension methods and/or fluent builders.

IMO Lucene.NET is useful, but its API is very low-level which makes using it challenging to learn and integrate with modern .NET projects. Like many other packages that are available on NuGet, it could use some integration packages with various frameworks within .NET to make it easier to use. For example, StructureMap has integrations with MVC 5 and WebAPI: http://structuremap.github.io/integrations/. Here are a few ideas for integrations we could make for Lucene.Net:

1. Lucene.Net.AspNet - integration with ASP.NET/MVC core (for plugging common Lucene.Net features into the startup pipeline, etc).
2. Lucene.Net.AspNet.Suggest - UI integration with ASP.NET/MVC core, making Suggest functionality into an HTML helper and/or view component that can be consumed/customized easily.
3. Lucene.Net.Linq - This already exists, but perhaps we should contact the author about bringing in as part of the main repo, or perhaps helping out with our API effort: https://github.com/themotleyfool/Lucene.Net.Linq
4. Lucene.Net.EntityFramework - ?
5. Lucene.Net.MVC5 - integration with ASP.NET MVC 5.
6. Lucene.Net.WebApi - integration with WebApi.
7. Lucene.Net.Wpf - integration with WPF.
8. Lucene.Net.AspNet.Facet - UI integration with ASP.NET/MVC core, making faceted search functionality into an HTML helper and/or view component, as well as any support needed to setup the index for faceted search.

This is just a short list to get the ideas flowing. But we should really aim to make Lucene.Net support easy to use with a wide range of frameworks across both the .NET Framework and .NET Core stacks.

Just to be clear, the idea here is that we keep the Phase 1 API in place - that is, we have an API that looks pretty much the same as Lucene, but build high-level APIs on top of it to integrate many common use cases/configurations for integrating with these other frameworks.

For example, in ASP.NET Core, it is recommended to use a singleton instance for an IndexWriter rather than opening and closing it all of the time - ideally, this could be done in a way that is familiar to the ASP.NET Core startup configuration API.

        // This method gets called by the runtime. Use this method to add services to the container.
        public void ConfigureServices(IServiceCollection services)
        {
            // Add framework services.
            services.AddApplicationInsightsTelemetry(Configuration);

            services.AddMvc();

            // Add a Lucene IndexWriter to the container (as a singleton)
            services.AddIndexWriter("~/the_index/", <other lucene-specific options>);
        }

That one line of code would potentially save everyone who uses a Lucene.Net IndexWriter in combination with ASP.NET Core several hours of research and testing.

In the past, no such integrations existed with Lucene.Net, and as a result the project's success has been limited and the project has always teetered on the edge of oblivion. IMO, bringing the API to the users instead of making them come and find it would make Lucene.Net a much more useful tool that is accessible to many more people, and make recruiting help for future porting efforts easier. Furthermore, these integration packages could act as an adapter API that doesn't need to change much from one Lucene.Net port to the next which will ease upgrading.

I am not alone in thinking that the API of Lucene.Net falls short of where it should be:

https://simplelucene.codeplex.com/documentation
https://ayende.com/blog/158914/lucene-net-is-ugly

So let's not let Lucene.Net fall short of expectations again. Instead, let's aim for making Lucene.Net into the de-facto standard full-text search engine that is (mysteriously) missing from the .NET framework.

Thoughts? Ideas?


Thanks,
Shad Storhaug (NightOwl888)


P. S. Itamar - can we get an update as to the status of the new website?




-----Original Message-----
From: Shad Storhaug 
Sent: Wednesday, October 12, 2016 2:10 AM
To: 'dev@lucenenet.apache.org'
Cc: 'Connie Yau'; 'cribs2@gmail.com'; 'itamar.synhershko@gmail.com'
Subject: RE: Remaining Work/Priorities

Update
======

I have just pushed some commits that fix several bugs in the Lucene.Net.Codecs project (all 452 tests pass most of the time, a few random failures) and fix all but 4 of the failing tests in Lucene.Net.Core.


Fix for Test Context
-------------------------

For now, I have added method override stubs to each subclass in order to add the [Test] attribute, so NUnit will run them in the correct context. I did that on all of the superclass tests except for the ones in QueryParser (since Itamar mentioned he would be working in that area). Itamar, you will probably need to follow suit to get all of the QP tests to pass - namely with the QueryParserTestBase and TestQueryParser classes.

I have carefully put all of these changes into a single commit so it can be reverted easily, if this solution doesn't happen to be compatible with xUnit: https://github.com/apache/lucenenet/commit/2a79edea6359e1ee1f83269cc7dc3ef2753ebf2c. Hopefully that makes life easier for @conniey.

@Itamar, let me know when this is completed on your end so I can do a double revert and squash the test stubs from QueryParser into an all-inclusive revert-able commit.

We can now correctly see how many tests we have in the core. Currently there are 2730 - it seems we are still missing 720 tests, assuming they all were for something port-able.


Remaining Tests
---------------------

Next I plan to work on locating any tests that we have missed (starting in the core). It seems these fall into several categories:

1. Tests that have not yet been ported.
2. Tests that have been partially ported that have not been added to the project.
3. Tests that have been ported, but are missing the [Test] attribute.
4. Tests in classes that have been ported that have been commented out (presumably because at the time they were ported the dependencies did not yet exist).
5. Tests that have been Ignored in .NET that were not in Java.
6. Tests that have NUnit Assume.That() logic that depends on some non-existant JRE condition, so they are not running in .NET.

I'll make a quick effort to get them to pass, but the main goal will be to ensure they all can run and are included in the project. Just a heads up that the number of test failures is likely to increase on this pass (but the number of bugs will likely decrease).


Failing Core Tests
-----------------------

I have looked into the remaining tests somewhat. There are 2 issues that I need some input on to solve.


TestRamUsageEstimator.TestSanity()

Java Lucene uses a JRE-specific API to determine how much header size to add on each field. This makes the estimates higher in Java. But more importantly, this test is failing because the estimate for a real string instance is coming back as the same size as its shallow size (16 bytes in this case) - it needs to be at least 1 byte more than that for the test to pass. In Java (at least in a 64 bit environment), there are an extra 4 bytes being added for each field.

Technically, there is a way to get these numbers from .NET, but it involves calling undocumented APIs using pointers and will likely be different from one .NET version to the next (a bad idea for a project that needs to support multiple .NET versions). The only solution I can think of is to hard code in an extra 4 bytes for 64 bit (and most likely 2 bytes for 32 bit) in order to make the numbers for the instances larger than their shallow size. I suppose the alternative would be to either comment out the string test or change it to >= make it pass. Thoughts? Alternatives?


TestNumericDocValuesUpdates.TestUpdateOldSegments()

I discovered what the issue is here (normally that is the hard part), but it seems that the proper solution is going to be a major task. The NamedSPILoader (backed by SPIClassIterator) in Java Lucene is used as a service locator to load classes throughout the project. In the Codec abstract class, it is used to load up the codec for the context it is used in. However, our port of the NamedSPILoader simply loads all of the classes from the current AppDomain without any way to order them or override them.

The problem is that in Lucene, this was meant to be an extension point. And this particular test (and probably many more of them) uses that extension point to change the codec to a Mock from the test framework. This line from TestRuleSetupAndRestoreClassEnv pretty much sums up what the issue is:

> Debug.Assert(Codec is Lucene42RWCodec, "fix your classpath to have 
> tests-framework.jar before lucene-core.jar");

Basically, it is using a configuration file to order the classes that are loaded so the test mocks take priority over the built-in codecs.

Just fixing the test could be done by making the static NamedSPILoader variable in the Codec class internal and swapping in a test double. However, that doesn't solve the bigger issue that Lucene.Net is missing its extensibility for anyone who wants to write their own codec (or tap into one of the other extensibility points). I guess the bigger question is how important will it be for anyone to extend Lucene codecs or inject dependencies into Analyzer factories? There doesn’t appear to be any more extensibility than that in Lucene 4.8.0, but that could change in more recent or future versions of Lucene.


CI Builds
-----------

Not working. Can someone look into that please?


Thanks,
Shad Storhaug (NightOwl888)



-----Original Message-----
From: Shad Storhaug
Sent: Wednesday, October 5, 2016 8:23 PM
To: dev@lucenenet.apache.org
Cc: Connie Yau; 'cribs2@gmail.com'
Subject: RE: Remaining Work/Priorities

> Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module?

Just for clarification, these are two entirely different things in Java. Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of Java:

import java.text.BreakIterator;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also depend on parts of Java:

import java.text.BreakIterator;
import java.text.CharacterIterator;

Analysis.ICU depends on a separate (icu4j) package:

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Replaceable;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.RawCollationKey;

That said, icu4j DOES have Collator and RuleBasedCollator classes, but it DOES NOT have a BreakIterator or CharacterIterator class. It is unclear whether the Collator from icu4j would work as a replacement for the one in core Java.

When I was digging through the JDK code, I noticed that BreakIterator and RuleBasedCollator have a lot of common ICU dependencies there, so even if the RuleBasedCollator from icu4j is compatible, it might make sense for us to port the one from Java anyway so we are dealing with the same shared dependencies in Analysis.Common.

Once we port over the classes from the Java JDK, we will be able to eliminate our current ICU4NET dependency (and the platform issues that come with it). That said, porting over those pieces could take considerable work. In the interim it might make sense to make separate projects/NuGet packages to isolate the areas that depend on BreakIterator, CharacterIterator, and RuleBasedCollator so the rest can be released for wide/cross-platform use. Perhaps we can even make a basic (scaled down) BreakIterator for Highlighter that breaks on spaces between words and punctuation between sentences, which wouldn't work for Thai, but would work for most other languages.

Porting the (icu4j) package is another complete ball of yarn, we should take a look at (https://github.com/sillsdev/icu-dotnet) to see if there is enough overlap there to power Analysis.ICU (offhand it looks as though some classes are missing, though). It is a wrapper around the C library - it may be that we just need to port more of it to get all of the pieces we need.

Speaking of Collation, @ChristopherHaws have you made any more progress on Analysis.Collation? Were you able to determine if icu-dotnet's collator will make the tests pass?

> I'm on it QueryParser.Flexible

Great. The TimeZone probably just needs more research to work out how to utilize (in order to implement the failing test). Also, FYI MSDN's recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed that several of the tests were recently modified to use TimeZone rather than TimeZoneInfo).

As for the culture, in .NET I am pretty sure that we need to pass it as a parameter to another overload of `QueryParser.Parse` rather than making it a property of QueryParser. But we can deal with that in one step after you have finished porting.

--

Shad Storhaug (NightOwl888)

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar Syn-Hershko
Sent: Wednesday, October 5, 2016 5:28 AM
To: dev@lucenenet.apache.org
Cc: Connie Yau
Subject: Re: Remaining Work/Priorities

Awesome, thanks for all the hard work Shad!

Our first priority should be fixing all remaining tests - in particular the one in Core. We should be ready to release and stamp our builds as 100% stable. As you mentioned, this could be an infrastructure issue - hopefully *Connie *can give a status update on her effort on the switch to xUnit?

With regards to Modules, here's an updated breakdown based on your email + forgotten pieces + my comments:

*Ported:*
Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1 failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total

We should do a second pass on the pieces we marked as ported, just to make sure the port is full and we didn't leave anything behind :)

*Need to be ported:*
Highlighter (Depends on Collator (which is still being ported) and BreakIterator (which we don't have a solution that works in .NET core yet)) Spatial (has 3rd party libraries that need to be updates) Spatial4n (https://github.com/synhershko/Spatial4N) needs to be brought up to speed with spatial4j, dependencies of which may cause some issues....
Codecs
Partially ported, mostly the tests weren't ported Grouping Not urgent, but provides nice functionality that users will probably like

The only part with dependencies seems to be the spatial module - I will have a look there soon if you don't get to that before I do.

*Can wait* - some modules are less frequently used, we should stabilize and release first and then work on them based on demand Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module? I keep getting reports on some issues they are causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik) Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly helper libraries, so there's probably not real dependency just lots of replacement Analysis.SmartCN Analysis.Stempel (currently in progress) Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo while important because can help newbies, we can do better by providing docs and real world examples. I'm on it QueryParser.Flexible

*No need to port* - neither are needed in our context Benchmark (many dependencies) Replicator (many dependencies) Sandbox (Depends on Apache Jakarta)

Once all modules are ported and all tests are passing, I think we should get two more items fixed before an official release:

1. .NET Core support - I'm not clear on the status of it at the moment. We probably want to have it in for the release.

2. Public API Inconsistencies. We can discuss what should be done and what not when we get to that stage. Some are an obvious "fixme" but some will break code compatibility with Java I think we should avoid.

One last note - *Wyatt*, do we know why there are no CI builds lately?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko> Freelance Developer & Consultant Lucene.NET committer and PMC member

On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining 
> to be done on Lucene.Net version 4.8.0. We are nearly there, but that 
> doesn't mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count 
> we have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 
> total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - 
> (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 
> failing / 27 total Lucene.Net.Memory - 0 failing / 10 total 
> Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing 
> / 96 total Lucene.Net.QueryParser - 1 failing / 203 total 
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently 
> discovered that many of the "failures" that are being reported are 
> false negatives (in fact, the VS2015 NUnit test runner shows there are
> 135 failing tests total and 902 tests total that don't belong to any 
> project). Most NUnit 2.6 test runners do not correctly run tests in 
> shared abstract classes with the correct context (test setup) to make 
> them pass. These out-of-context runs add several additional minutes to the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the 
> situation somewhat - that is, it ran the tests in the correct context 
> and I was able to determine that we have more tests than the numbers 
> above and they are all succeeding. However, it also ran the tests in 
> an invalid context (that is, the context of the abstract class without 
> any setup) and some of them still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new 
> fancy test attributes) is enough to fix this issue. If not, we need to 
> find another solution - preferably one that can be applied to all of 
> the tests in abstract classes without too much effort or changing them 
> so they are too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days 
> ago and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic 
> (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel 
> (currently in progress) 7. Analysis.UIMA (Depends on Tagger, 
> uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> Demo 10. Highlighter (Depends on Collator (which is still being
> ported) and BreakIterator (which we don't have a solution that works 
> in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox 
> (Depends on Apache Jakarta) 13. Spatial (Already ported in #174 
> (https://github.com/apache/ lucenenet/pull/174), needs a recent 
> version of spatial4n) 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize 
> this list in terms of priority. It also couldn't hurt to update the 
> contributing documents 
> (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and
> https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the 
> current status.
>
> Of course, it is the known status of dependencies that we need 
> clarification on. Which of these dependencies is known to be ported?
> Which of them are ported but are not up to date? Which of them are 
> known not to be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the 
> .NETification/consistency of the core API (that is, in the Lucene.Net 
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that 
> will affect everyone helping out on Lucene.Net and anyone who is 
> currently using the beta. So, I just wanted to gather some input on 
> when the most appropriate time to begin working on these sweeping changes would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>

Re: Remaining Work/Priorities

Posted by Christopher Haws <cr...@gmail.com>.
Hey Shad,

Sorry for not responding sooner. The last few weeks have been pretty crazy
for me (vacation and now a production release at work) so I haven't had a
chance to look into it further. I am hoping that within the next week or so
things will calm down so that I can take another look.


As a side question, related to something I am working on at work, do you
know if there is a reason why classes like IndexSearcher, IndexWriter, and
IndexReader don't implement interfaces? I was trying to build a wrapper
around some of these classes that log queries and performance details but
had to resort to some pretty hacky code since there are no interfaces for
these classes.

On Wed, Oct 5, 2016 at 6:23 AM Shad Storhaug <sh...@shadstorhaug.com> wrote:

> > Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs
> from the analysis.commons module?
>
> Just for clarification, these are two entirely different things in Java.
> Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of
> Java:
>
> import java.text.BreakIterator;
> import java.text.Collator;
> import java.text.ParseException;
> import java.text.RuleBasedCollator;
>
> Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also
> depend on parts of Java:
>
> import java.text.BreakIterator;
> import java.text.CharacterIterator;
>
> Analysis.ICU depends on a separate (icu4j) package:
>
> import com.ibm.icu.text.Normalizer;
> import com.ibm.icu.text.Normalizer2;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.Replaceable;
> import com.ibm.icu.text.Transliterator;
> import com.ibm.icu.text.UTF16;
> import com.ibm.icu.text.UnicodeSet;
> import com.ibm.icu.text.FilteredNormalizer2;
> import com.ibm.icu.text.Collator;
> import com.ibm.icu.text.RuleBasedCollator;
> import com.ibm.icu.util.ULocale;
> import com.ibm.icu.text.RawCollationKey;
>
> That said, icu4j DOES have Collator and RuleBasedCollator classes, but it
> DOES NOT have a BreakIterator or CharacterIterator class. It is unclear
> whether the Collator from icu4j would work as a replacement for the one in
> core Java.
>
> When I was digging through the JDK code, I noticed that BreakIterator and
> RuleBasedCollator have a lot of common ICU dependencies there, so even if
> the RuleBasedCollator from icu4j is compatible, it might make sense for us
> to port the one from Java anyway so we are dealing with the same shared
> dependencies in Analysis.Common.
>
> Once we port over the classes from the Java JDK, we will be able to
> eliminate our current ICU4NET dependency (and the platform issues that come
> with it). That said, porting over those pieces could take considerable
> work. In the interim it might make sense to make separate projects/NuGet
> packages to isolate the areas that depend on BreakIterator,
> CharacterIterator, and RuleBasedCollator so the rest can be released for
> wide/cross-platform use. Perhaps we can even make a basic (scaled down)
> BreakIterator for Highlighter that breaks on spaces between words and
> punctuation between sentences, which wouldn't work for Thai, but would work
> for most other languages.
>
> Porting the (icu4j) package is another complete ball of yarn, we should
> take a look at (https://github.com/sillsdev/icu-dotnet) to see if there
> is enough overlap there to power Analysis.ICU (offhand it looks as though
> some classes are missing, though). It is a wrapper around the C library -
> it may be that we just need to port more of it to get all of the pieces we
> need.
>
> Speaking of Collation, @ChristopherHaws have you made any more progress on
> Analysis.Collation? Were you able to determine if icu-dotnet's collator
> will make the tests pass?
>
> > I'm on it QueryParser.Flexible
>
> Great. The TimeZone probably just needs more research to work out how to
> utilize (in order to implement the failing test). Also, FYI MSDN's
> recommendation (
> https://msdn.microsoft.com/en-us/library/system.timezone(v=vs.110).aspx)
> is to use TimeZoneInfo rather than TimeZone (I noticed that several of the
> tests were recently modified to use TimeZone rather than TimeZoneInfo).
>
> As for the culture, in .NET I am pretty sure that we need to pass it as a
> parameter to another overload of `QueryParser.Parse` rather than making it
> a property of QueryParser. But we can deal with that in one step after you
> have finished porting.
>
> --
>
> Shad Storhaug (NightOwl888)
>
> -----Original Message-----
> From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On
> Behalf Of Itamar Syn-Hershko
> Sent: Wednesday, October 5, 2016 5:28 AM
> To: dev@lucenenet.apache.org
> Cc: Connie Yau
> Subject: Re: Remaining Work/Priorities
>
> Awesome, thanks for all the hard work Shad!
>
> Our first priority should be fixing all remaining tests - in particular
> the one in Core. We should be ready to release and stamp our builds as 100%
> stable. As you mentioned, this could be an infrastructure issue - hopefully
> *Connie *can give a status update on her effort on the switch to xUnit?
>
> With regards to Modules, here's an updated breakdown based on your email +
> forgotten pieces + my comments:
>
> *Ported:*
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0
> failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total
> Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including
> #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total
> Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42
> total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1
> failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total
>
> We should do a second pass on the pieces we marked as ported, just to make
> sure the port is full and we didn't leave anything behind :)
>
> *Need to be ported:*
> Highlighter (Depends on Collator (which is still being ported) and
> BreakIterator (which we don't have a solution that works in .NET core yet))
> Spatial (has 3rd party libraries that need to be updates) Spatial4n (
> https://github.com/synhershko/Spatial4N) needs to be brought up to speed
> with spatial4j, dependencies of which may cause some issues....
> Codecs
> Partially ported, mostly the tests weren't ported Grouping Not urgent, but
> provides nice functionality that users will probably like
>
> The only part with dependencies seems to be the spatial module - I will
> have a look there soon if you don't get to that before I do.
>
> *Can wait* - some modules are less frequently used, we should stabilize
> and release first and then work on them based on demand Analysis.ICU
> (Depends on ICU4j) hopefully we can remove the ICU DLLs from the
> analysis.commons module? I keep getting reports on some issues they are
> causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik)
> Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly
> helper libraries, so there's probably not real dependency just lots of
> replacement Analysis.SmartCN Analysis.Stempel (currently in progress)
> Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo
> while important because can help newbies, we can do better by providing
> docs and real world examples. I'm on it QueryParser.Flexible
>
> *No need to port* - neither are needed in our context Benchmark (many
> dependencies) Replicator (many dependencies) Sandbox (Depends on Apache
> Jakarta)
>
> Once all modules are ported and all tests are passing, I think we should
> get two more items fixed before an official release:
>
> 1. .NET Core support - I'm not clear on the status of it at the moment. We
> probably want to have it in for the release.
>
> 2. Public API Inconsistencies. We can discuss what should be done and what
> not when we get to that stage. Some are an obvious "fixme" but some will
> break code compatibility with Java I think we should avoid.
>
> One last note - *Wyatt*, do we know why there are no CI builds lately?
>
> --
>
> Itamar Syn-Hershko
> http://code972.com | @synhershko <https://twitter.com/synhershko>
> Freelance Developer & Consultant Lucene.NET committer and PMC member
>
> On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
> wrote:
>
> > Hello,
> >
> > I just wanted to open this discussion to talk about the work remaining
> > to be done on Lucene.Net version 4.8.0. We are nearly there, but that
> > doesn't mean we don't still need help!
> >
> >
> > FAILING TESTS
> > -------------------
> >
> > We now have over 5000 passing tests and as soon as pull request #188 (
> > https://github.com/apache/lucenenet/pull/188) is merged, by my count
> > we have only 20 (actual) failing tests. Here is the breakdown by project:
> >
> > Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common
> > - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9
> > total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet -
> > (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0
> > failing / 27 total Lucene.Net.Memory - 0 failing / 10 total
> > Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing
> > / 96 total Lucene.Net.QueryParser - 1 failing / 203 total
> > Lucene.Net.Suggest - 0 failing / 142 total
> >
> > The reason why I said ACTUAL tests above is because I recently
> > discovered that many of the "failures" that are being reported are
> > false negatives (in fact, the VS2015 NUnit test runner shows there are
> > 135 failing tests total and 902 tests total that don't belong to any
> > project). Most NUnit 2.6 test runners do not correctly run tests in
> > shared abstract classes with the correct context (test setup) to make
> > them pass. These out-of-context runs add several additional minutes to
> the test run.
> >
> > As an experiment, I upgraded to NUnit 3.4.1 and it helped the
> > situation somewhat - that is, it ran the tests in the correct context
> > and I was able to determine that we have more tests than the numbers
> > above and they are all succeeding. However, it also ran the tests in
> > an invalid context (that is, the context of the abstract class without
> > any setup) and some of them still showed as failures.
> >
> > I know @conniey is currently working on porting the tests over to xUnit.
> > Hopefully, swapping test frameworks alone (or using some of the new
> > fancy test attributes) is enough to fix this issue. If not, we need to
> > find another solution - preferably one that can be applied to all of
> > the tests in abstract classes without too much effort or changing them
> > so they are too different from their Java counterpart.
> >
> > Remaining Pieces to Port
> > ---------------------------------
> >
> > I took an inventory of the remaining pieces left to port a few days
> > ago and here is what that looks like (alphabetical order):
> >
> > 1. Analysis.ICU (Depends on ICU4j)
> > 2. Analysis.Kuromoji
> > 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic
> > (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel
> > (currently in progress) 7. Analysis.UIMA (Depends on Tagger,
> > uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9.
> > Demo 10. Highlighter (Depends on Collator (which is still being
> > ported) and BreakIterator (which we don't have a solution that works
> > in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox
> > (Depends on Apache Jakarta) 13. Spatial (Already ported in #174
> > (https://github.com/apache/ lucenenet/pull/174), needs a recent
> > version of spatial4n) 14. QueryParser.Flexible
> >
> > Itamar, it would be helpful if you would be so kind as to organize
> > this list in terms of priority. It also couldn't hurt to update the
> > contributing documents
> > (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> > and
> > https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> > with the latest information so anyone who wants to help out knows the
> > current status.
> >
> > Of course, it is the known status of dependencies that we need
> > clarification on. Which of these dependencies is known to be ported?
> > Which of them are ported but are not up to date? Which of them are
> > known not to be ported, and which of them are unknown?
> >
> >
> > Public API Inconsistencies
> > ---------------------------------
> >
> > One thing that I have had my eye on for a while now is the
> > .NETification/consistency of the core API (that is, in the Lucene.Net
> > project). There are several issues that I would like to address
> including:
> >
> >
> > 1.       Method names that are still camelCase
> >
> > 2.       Properties that should be methods (because they do a lot of
> > processing or because they are non-deterministic)
> >
> > 3.       Methods that should be properties
> >
> > 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> > .NET
> >
> > 5.       Interfaces should begin with "I"
> >
> > 6.       Classes should not begin with "I" followed by another capital
> > letter (for some reason some of them were named that way)
> >
> > 7.       .CharAt() should probably be this[]
> >
> > 8.       Generic types nested within generic types (which cause Visual
> > Studio to crash when Intellisense tries to read them)
> >
> > ... and so on. The only thing is these are all sweeping changes that
> > will affect everyone helping out on Lucene.Net and anyone who is
> > currently using the beta. So, I just wanted to gather some input on
> > when the most appropriate time to begin working on these sweeping
> changes would be?
> >
> >
> > Thanks,
> > Shad Storhaug (NightOwl888)
> >
> >
> >
> >
> >
> >
> >
>

RE: Remaining Work/Priorities

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
> Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module?

Just for clarification, these are two entirely different things in Java. Analysis.Common (Analysis.Collator and Analysis.Th) depends on parts of Java:

import java.text.BreakIterator;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

Highlighter.PostingsHighlighter and Highlighter.VectorHighlight also depend on parts of Java:

import java.text.BreakIterator;
import java.text.CharacterIterator;

Analysis.ICU depends on a separate (icu4j) package:

import com.ibm.icu.text.Normalizer;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Replaceable;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.UTF16;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.RawCollationKey;

That said, icu4j DOES have Collator and RuleBasedCollator classes, but it DOES NOT have a BreakIterator or CharacterIterator class. It is unclear whether the Collator from icu4j would work as a replacement for the one in core Java.

When I was digging through the JDK code, I noticed that BreakIterator and RuleBasedCollator have a lot of common ICU dependencies there, so even if the RuleBasedCollator from icu4j is compatible, it might make sense for us to port the one from Java anyway so we are dealing with the same shared dependencies in Analysis.Common.

Once we port over the classes from the Java JDK, we will be able to eliminate our current ICU4NET dependency (and the platform issues that come with it). That said, porting over those pieces could take considerable work. In the interim it might make sense to make separate projects/NuGet packages to isolate the areas that depend on BreakIterator, CharacterIterator, and RuleBasedCollator so the rest can be released for wide/cross-platform use. Perhaps we can even make a basic (scaled down) BreakIterator for Highlighter that breaks on spaces between words and punctuation between sentences, which wouldn't work for Thai, but would work for most other languages.

Porting the (icu4j) package is another complete ball of yarn, we should take a look at (https://github.com/sillsdev/icu-dotnet) to see if there is enough overlap there to power Analysis.ICU (offhand it looks as though some classes are missing, though). It is a wrapper around the C library - it may be that we just need to port more of it to get all of the pieces we need.

Speaking of Collation, @ChristopherHaws have you made any more progress on Analysis.Collation? Were you able to determine if icu-dotnet's collator will make the tests pass?

> I'm on it QueryParser.Flexible

Great. The TimeZone probably just needs more research to work out how to utilize (in order to implement the failing test). Also, FYI MSDN's recommendation (https://msdn.microsoft.com/en-us/library/system.timezone(v=vs.110).aspx) is to use TimeZoneInfo rather than TimeZone (I noticed that several of the tests were recently modified to use TimeZone rather than TimeZoneInfo).

As for the culture, in .NET I am pretty sure that we need to pass it as a parameter to another overload of `QueryParser.Parse` rather than making it a property of QueryParser. But we can deal with that in one step after you have finished porting.

--

Shad Storhaug (NightOwl888)

-----Original Message-----
From: itamar.synhershko@gmail.com [mailto:itamar.synhershko@gmail.com] On Behalf Of Itamar Syn-Hershko
Sent: Wednesday, October 5, 2016 5:28 AM
To: dev@lucenenet.apache.org
Cc: Connie Yau
Subject: Re: Remaining Work/Priorities

Awesome, thanks for all the hard work Shad!

Our first priority should be fixing all remaining tests - in particular the one in Core. We should be ready to release and stamp our builds as 100% stable. As you mentioned, this could be an infrastructure issue - hopefully *Connie *can give a status update on her effort on the switch to xUnit?

With regards to Modules, here's an updated breakdown based on your email + forgotten pieces + my comments:

*Ported:*
Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 failing / 27 total Lucene.Net.Memory - 0 failing / 10 total Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing / 96 total Lucene.Net.QueryParser - 1 failing / 203 total Lucene.Net.Suggest - 0 failing / 142 total

We should do a second pass on the pieces we marked as ported, just to make sure the port is full and we didn't leave anything behind :)

*Need to be ported:*
Highlighter (Depends on Collator (which is still being ported) and BreakIterator (which we don't have a solution that works in .NET core yet)) Spatial (has 3rd party libraries that need to be updates) Spatial4n (https://github.com/synhershko/Spatial4N) needs to be brought up to speed with spatial4j, dependencies of which may cause some issues....
Codecs
Partially ported, mostly the tests weren't ported Grouping Not urgent, but provides nice functionality that users will probably like

The only part with dependencies seems to be the spatial module - I will have a look there soon if you don't get to that before I do.

*Can wait* - some modules are less frequently used, we should stabilize and release first and then work on them based on demand Analysis.ICU (Depends on ICU4j) hopefully we can remove the ICU DLLs from the analysis.commons module? I keep getting reports on some issues they are causing Analysis.Kuromoji Analysis.Morfologik (Depends on Morfologik) Analysis.Phonetic (Depends on Apache Commons) Apache commons is mostly helper libraries, so there's probably not real dependency just lots of replacement Analysis.SmartCN Analysis.Stempel (currently in progress) Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer) Demo while important because can help newbies, we can do better by providing docs and real world examples. I'm on it QueryParser.Flexible

*No need to port* - neither are needed in our context Benchmark (many dependencies) Replicator (many dependencies) Sandbox (Depends on Apache Jakarta)

Once all modules are ported and all tests are passing, I think we should get two more items fixed before an official release:

1. .NET Core support - I'm not clear on the status of it at the moment. We probably want to have it in for the release.

2. Public API Inconsistencies. We can discuss what should be done and what not when we get to that stage. Some are an obvious "fixme" but some will break code compatibility with Java I think we should avoid.

One last note - *Wyatt*, do we know why there are no CI builds lately?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko> Freelance Developer & Consultant Lucene.NET committer and PMC member

On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining 
> to be done on Lucene.Net version 4.8.0. We are nearly there, but that 
> doesn't mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count 
> we have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total Lucene.Net.Analysis.Common 
> - 0 failing / 1445 total Lucene.Net.Classification - 0 failing / 9 
> total Lucene.Net.Expressions - 0 failing / 94 total Lucene.Net.Facet - 
> (including #188 will be) 0 failing / 152 total Lucene.Net.Join - 0 
> failing / 27 total Lucene.Net.Memory - 0 failing / 10 total 
> Lucene.Net.Misc - 2 failing / 42 total Lucene.Net.Queries - 2 failing 
> / 96 total Lucene.Net.QueryParser - 1 failing / 203 total 
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently 
> discovered that many of the "failures" that are being reported are 
> false negatives (in fact, the VS2015 NUnit test runner shows there are 
> 135 failing tests total and 902 tests total that don't belong to any 
> project). Most NUnit 2.6 test runners do not correctly run tests in 
> shared abstract classes with the correct context (test setup) to make 
> them pass. These out-of-context runs add several additional minutes to the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the 
> situation somewhat - that is, it ran the tests in the correct context 
> and I was able to determine that we have more tests than the numbers 
> above and they are all succeeding. However, it also ran the tests in 
> an invalid context (that is, the context of the abstract class without 
> any setup) and some of them still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new 
> fancy test attributes) is enough to fix this issue. If not, we need to 
> find another solution - preferably one that can be applied to all of 
> the tests in abstract classes without too much effort or changing them 
> so they are too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days 
> ago and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik) 4. Analysis.Phonetic 
> (Depends on Apache Commons) 5. Analysis.SmartCN 6. Analysis.Stempel 
> (currently in progress) 7. Analysis.UIMA (Depends on Tagger, 
> uimaj-core, WhiteSpaceTokenizer) 8. Benchmark (many dependencies) 9. 
> Demo 10. Highlighter (Depends on Collator (which is still being 
> ported) and BreakIterator (which we don't have a solution that works 
> in .NET core yet)) 11. Replicator (many dependencies) 12. Sandbox 
> (Depends on Apache Jakarta) 13. Spatial (Already ported in #174 
> (https://github.com/apache/ lucenenet/pull/174), needs a recent 
> version of spatial4n) 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize 
> this list in terms of priority. It also couldn't hurt to update the 
> contributing documents 
> (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and 
> https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the 
> current status.
>
> Of course, it is the known status of dependencies that we need 
> clarification on. Which of these dependencies is known to be ported? 
> Which of them are ported but are not up to date? Which of them are 
> known not to be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the 
> .NETification/consistency of the core API (that is, in the Lucene.Net 
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that 
> will affect everyone helping out on Lucene.Net and anyone who is 
> currently using the beta. So, I just wanted to gather some input on 
> when the most appropriate time to begin working on these sweeping changes would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>

Re: Remaining Work/Priorities

Posted by Itamar Syn-Hershko <it...@code972.com>.
Awesome, thanks for all the hard work Shad!

Our first priority should be fixing all remaining tests - in particular the
one in Core. We should be ready to release and stamp our builds as 100%
stable. As you mentioned, this could be an infrastructure issue -
hopefully *Connie
*can give a status update on her effort on the switch to xUnit?

With regards to Modules, here's an updated breakdown based on your email +
forgotten pieces + my comments:

*Ported:*
Lucene.Net (Core) - 15 failing / 1989 total
Lucene.Net.Analysis.Common - 0 failing / 1445 total
Lucene.Net.Classification - 0 failing / 9 total
Lucene.Net.Expressions - 0 failing / 94 total
Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total
Lucene.Net.Join - 0 failing / 27 total
Lucene.Net.Memory - 0 failing / 10 total
Lucene.Net.Misc - 2 failing / 42 total
Lucene.Net.Queries - 2 failing / 96 total
Lucene.Net.QueryParser - 1 failing / 203 total
Lucene.Net.Suggest - 0 failing / 142 total

We should do a second pass on the pieces we marked as ported, just to make
sure the port is full and we didn't leave anything behind :)

*Need to be ported:*
Highlighter (Depends on Collator (which is still being ported) and
BreakIterator (which we don't have a solution that works in .NET core yet))
Spatial (has 3rd party libraries that need to be updates)
Spatial4n (https://github.com/synhershko/Spatial4N) needs to be brought up
to speed with spatial4j, dependencies of which may cause some issues....
Codecs
Partially ported, mostly the tests weren't ported
Grouping
Not urgent, but provides nice functionality that users will probably like

The only part with dependencies seems to be the spatial module - I will
have a look there soon if you don't get to that before I do.

*Can wait* - some modules are less frequently used, we should stabilize and
release first and then work on them based on demand
Analysis.ICU (Depends on ICU4j)
hopefully we can remove the ICU DLLs from the analysis.commons module? I
keep getting reports on some issues they are causing
Analysis.Kuromoji
Analysis.Morfologik (Depends on Morfologik)
Analysis.Phonetic (Depends on Apache Commons)
Apache commons is mostly helper libraries, so there's probably not real
dependency just lots of replacement
Analysis.SmartCN
Analysis.Stempel (currently in progress)
Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer)
Demo
while important because can help newbies, we can do better by providing
docs and real world examples. I'm on it
QueryParser.Flexible

*No need to port* - neither are needed in our context
Benchmark (many dependencies)
Replicator (many dependencies)
Sandbox (Depends on Apache Jakarta)

Once all modules are ported and all tests are passing, I think we should
get two more items fixed before an official release:

1. .NET Core support - I'm not clear on the status of it at the moment. We
probably want to have it in for the release.

2. Public API Inconsistencies. We can discuss what should be done and what
not when we get to that stage. Some are an obvious "fixme" but some will
break code compatibility with Java I think we should avoid.

One last note - *Wyatt*, do we know why there are no CI builds lately?

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining to
> be done on Lucene.Net version 4.8.0. We are nearly there, but that doesn't
> mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count we
> have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total
> Lucene.Net.Analysis.Common - 0 failing / 1445 total
> Lucene.Net.Classification - 0 failing / 9 total
> Lucene.Net.Expressions - 0 failing / 94 total
> Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total
> Lucene.Net.Join - 0 failing / 27 total
> Lucene.Net.Memory - 0 failing / 10 total
> Lucene.Net.Misc - 2 failing / 42 total
> Lucene.Net.Queries - 2 failing / 96 total
> Lucene.Net.QueryParser - 1 failing / 203 total
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently discovered
> that many of the "failures" that are being reported are false negatives (in
> fact, the VS2015 NUnit test runner shows there are 135 failing tests total
> and 902 tests total that don't belong to any project). Most NUnit 2.6 test
> runners do not correctly run tests in shared abstract classes with the
> correct context (test setup) to make them pass. These out-of-context runs
> add several additional minutes to the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the situation
> somewhat - that is, it ran the tests in the correct context and I was able
> to determine that we have more tests than the numbers above and they are
> all succeeding. However, it also ran the tests in an invalid context (that
> is, the context of the abstract class without any setup) and some of them
> still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new fancy
> test attributes) is enough to fix this issue. If not, we need to find
> another solution - preferably one that can be applied to all of the tests
> in abstract classes without too much effort or changing them so they are
> too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days ago
> and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik)
> 4. Analysis.Phonetic (Depends on Apache Commons)
> 5. Analysis.SmartCN
> 6. Analysis.Stempel (currently in progress)
> 7. Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer)
> 8. Benchmark (many dependencies)
> 9. Demo
> 10. Highlighter (Depends on Collator (which is still being ported) and
> BreakIterator (which we don't have a solution that works in .NET core yet))
> 11. Replicator (many dependencies)
> 12. Sandbox (Depends on Apache Jakarta)
> 13. Spatial (Already ported in #174 (https://github.com/apache/
> lucenenet/pull/174), needs a recent version of spatial4n)
> 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize this
> list in terms of priority. It also couldn't hurt to update the contributing
> documents (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the
> current status.
>
> Of course, it is the known status of dependencies that we need
> clarification on. Which of these dependencies is known to be ported? Which
> of them are ported but are not up to date? Which of them are known not to
> be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the
> .NETification/consistency of the core API (that is, in the Lucene.Net
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that will
> affect everyone helping out on Lucene.Net and anyone who is currently using
> the beta. So, I just wanted to gather some input on when the most
> appropriate time to begin working on these sweeping changes would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>

Re: Remaining Work/Priorities

Posted by Itamar Syn-Hershko <it...@code972.com>.
Elad, casting is only a compiler hack. Keeping the code as close to it's
original Java form is important for many reasons, and as such I think we
should avoid making such changes.

--

Itamar Syn-Hershko
http://code972.com | @synhershko <https://twitter.com/synhershko>
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Mon, Oct 3, 2016 at 9:15 PM, Elad Margalit <el...@gmail.com> wrote:

> Hi,
>
> first of all let me thank Shad for a really great job. we're lucky to have
> you and your contribution.
>
> I hope @conniey will manage the port to xUnit, I truly believe this will
> solve the context issues.
>
> I would like to do a major replace with the whole solution to avoid
> unnecessary castings (sbyte)
> for instance:
>
> from:
> if ((sbyte)b >= 0)
> to
> if (b <= 127)
>
> same all tests are pass, but this is a big change with many files,
>
> when do you think we should do it? after the pr's done or now?
>
> Thanks,
>
>
> On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
> wrote:
>
> > Hello,
> >
> > I just wanted to open this discussion to talk about the work remaining to
> > be done on Lucene.Net version 4.8.0. We are nearly there, but that
> doesn't
> > mean we don't still need help!
> >
> >
> > FAILING TESTS
> > -------------------
> >
> > We now have over 5000 passing tests and as soon as pull request #188 (
> > https://github.com/apache/lucenenet/pull/188) is merged, by my count we
> > have only 20 (actual) failing tests. Here is the breakdown by project:
> >
> > Lucene.Net (Core) - 15 failing / 1989 total
> > Lucene.Net.Analysis.Common - 0 failing / 1445 total
> > Lucene.Net.Classification - 0 failing / 9 total
> > Lucene.Net.Expressions - 0 failing / 94 total
> > Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total
> > Lucene.Net.Join - 0 failing / 27 total
> > Lucene.Net.Memory - 0 failing / 10 total
> > Lucene.Net.Misc - 2 failing / 42 total
> > Lucene.Net.Queries - 2 failing / 96 total
> > Lucene.Net.QueryParser - 1 failing / 203 total
> > Lucene.Net.Suggest - 0 failing / 142 total
> >
> > The reason why I said ACTUAL tests above is because I recently discovered
> > that many of the "failures" that are being reported are false negatives
> (in
> > fact, the VS2015 NUnit test runner shows there are 135 failing tests
> total
> > and 902 tests total that don't belong to any project). Most NUnit 2.6
> test
> > runners do not correctly run tests in shared abstract classes with the
> > correct context (test setup) to make them pass. These out-of-context runs
> > add several additional minutes to the test run.
> >
> > As an experiment, I upgraded to NUnit 3.4.1 and it helped the situation
> > somewhat - that is, it ran the tests in the correct context and I was
> able
> > to determine that we have more tests than the numbers above and they are
> > all succeeding. However, it also ran the tests in an invalid context
> (that
> > is, the context of the abstract class without any setup) and some of them
> > still showed as failures.
> >
> > I know @conniey is currently working on porting the tests over to xUnit.
> > Hopefully, swapping test frameworks alone (or using some of the new fancy
> > test attributes) is enough to fix this issue. If not, we need to find
> > another solution - preferably one that can be applied to all of the tests
> > in abstract classes without too much effort or changing them so they are
> > too different from their Java counterpart.
> >
> > Remaining Pieces to Port
> > ---------------------------------
> >
> > I took an inventory of the remaining pieces left to port a few days ago
> > and here is what that looks like (alphabetical order):
> >
> > 1. Analysis.ICU (Depends on ICU4j)
> > 2. Analysis.Kuromoji
> > 3. Analysis.Morfologik (Depends on Morfologik)
> > 4. Analysis.Phonetic (Depends on Apache Commons)
> > 5. Analysis.SmartCN
> > 6. Analysis.Stempel (currently in progress)
> > 7. Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer)
> > 8. Benchmark (many dependencies)
> > 9. Demo
> > 10. Highlighter (Depends on Collator (which is still being ported) and
> > BreakIterator (which we don't have a solution that works in .NET core
> yet))
> > 11. Replicator (many dependencies)
> > 12. Sandbox (Depends on Apache Jakarta)
> > 13. Spatial (Already ported in #174 (https://github.com/apache/
> > lucenenet/pull/174), needs a recent version of spatial4n)
> > 14. QueryParser.Flexible
> >
> > Itamar, it would be helpful if you would be so kind as to organize this
> > list in terms of priority. It also couldn't hurt to update the
> contributing
> > documents (https://github.com/apache/lucenenet/blob/master/
> CONTRIBUTING.md,
> > and https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> > with the latest information so anyone who wants to help out knows the
> > current status.
> >
> > Of course, it is the known status of dependencies that we need
> > clarification on. Which of these dependencies is known to be ported?
> Which
> > of them are ported but are not up to date? Which of them are known not to
> > be ported, and which of them are unknown?
> >
> >
> > Public API Inconsistencies
> > ---------------------------------
> >
> > One thing that I have had my eye on for a while now is the
> > .NETification/consistency of the core API (that is, in the Lucene.Net
> > project). There are several issues that I would like to address
> including:
> >
> >
> > 1.       Method names that are still camelCase
> >
> > 2.       Properties that should be methods (because they do a lot of
> > processing or because they are non-deterministic)
> >
> > 3.       Methods that should be properties
> >
> > 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> > .NET
> >
> > 5.       Interfaces should begin with "I"
> >
> > 6.       Classes should not begin with "I" followed by another capital
> > letter (for some reason some of them were named that way)
> >
> > 7.       .CharAt() should probably be this[]
> >
> > 8.       Generic types nested within generic types (which cause Visual
> > Studio to crash when Intellisense tries to read them)
> >
> > ... and so on. The only thing is these are all sweeping changes that will
> > affect everyone helping out on Lucene.Net and anyone who is currently
> using
> > the beta. So, I just wanted to gather some input on when the most
> > appropriate time to begin working on these sweeping changes would be?
> >
> >
> > Thanks,
> > Shad Storhaug (NightOwl888)
> >
> >
> >
> >
> >
> >
> >
>

Re: Remaining Work/Priorities

Posted by Elad Margalit <el...@gmail.com>.
Hi,

first of all let me thank Shad for a really great job. we're lucky to have
you and your contribution.

I hope @conniey will manage the port to xUnit, I truly believe this will
solve the context issues.

I would like to do a major replace with the whole solution to avoid
unnecessary castings (sbyte)
for instance:

from:
if ((sbyte)b >= 0)
to
if (b <= 127)

same all tests are pass, but this is a big change with many files,

when do you think we should do it? after the pr's done or now?

Thanks,


On Sun, Oct 2, 2016 at 10:01 PM, Shad Storhaug <sh...@shadstorhaug.com>
wrote:

> Hello,
>
> I just wanted to open this discussion to talk about the work remaining to
> be done on Lucene.Net version 4.8.0. We are nearly there, but that doesn't
> mean we don't still need help!
>
>
> FAILING TESTS
> -------------------
>
> We now have over 5000 passing tests and as soon as pull request #188 (
> https://github.com/apache/lucenenet/pull/188) is merged, by my count we
> have only 20 (actual) failing tests. Here is the breakdown by project:
>
> Lucene.Net (Core) - 15 failing / 1989 total
> Lucene.Net.Analysis.Common - 0 failing / 1445 total
> Lucene.Net.Classification - 0 failing / 9 total
> Lucene.Net.Expressions - 0 failing / 94 total
> Lucene.Net.Facet - (including #188 will be) 0 failing / 152 total
> Lucene.Net.Join - 0 failing / 27 total
> Lucene.Net.Memory - 0 failing / 10 total
> Lucene.Net.Misc - 2 failing / 42 total
> Lucene.Net.Queries - 2 failing / 96 total
> Lucene.Net.QueryParser - 1 failing / 203 total
> Lucene.Net.Suggest - 0 failing / 142 total
>
> The reason why I said ACTUAL tests above is because I recently discovered
> that many of the "failures" that are being reported are false negatives (in
> fact, the VS2015 NUnit test runner shows there are 135 failing tests total
> and 902 tests total that don't belong to any project). Most NUnit 2.6 test
> runners do not correctly run tests in shared abstract classes with the
> correct context (test setup) to make them pass. These out-of-context runs
> add several additional minutes to the test run.
>
> As an experiment, I upgraded to NUnit 3.4.1 and it helped the situation
> somewhat - that is, it ran the tests in the correct context and I was able
> to determine that we have more tests than the numbers above and they are
> all succeeding. However, it also ran the tests in an invalid context (that
> is, the context of the abstract class without any setup) and some of them
> still showed as failures.
>
> I know @conniey is currently working on porting the tests over to xUnit.
> Hopefully, swapping test frameworks alone (or using some of the new fancy
> test attributes) is enough to fix this issue. If not, we need to find
> another solution - preferably one that can be applied to all of the tests
> in abstract classes without too much effort or changing them so they are
> too different from their Java counterpart.
>
> Remaining Pieces to Port
> ---------------------------------
>
> I took an inventory of the remaining pieces left to port a few days ago
> and here is what that looks like (alphabetical order):
>
> 1. Analysis.ICU (Depends on ICU4j)
> 2. Analysis.Kuromoji
> 3. Analysis.Morfologik (Depends on Morfologik)
> 4. Analysis.Phonetic (Depends on Apache Commons)
> 5. Analysis.SmartCN
> 6. Analysis.Stempel (currently in progress)
> 7. Analysis.UIMA (Depends on Tagger, uimaj-core, WhiteSpaceTokenizer)
> 8. Benchmark (many dependencies)
> 9. Demo
> 10. Highlighter (Depends on Collator (which is still being ported) and
> BreakIterator (which we don't have a solution that works in .NET core yet))
> 11. Replicator (many dependencies)
> 12. Sandbox (Depends on Apache Jakarta)
> 13. Spatial (Already ported in #174 (https://github.com/apache/
> lucenenet/pull/174), needs a recent version of spatial4n)
> 14. QueryParser.Flexible
>
> Itamar, it would be helpful if you would be so kind as to organize this
> list in terms of priority. It also couldn't hurt to update the contributing
> documents (https://github.com/apache/lucenenet/blob/master/CONTRIBUTING.md,
> and https://cwiki.apache.org/confluence/display/LUCENENET/Current+Status
> with the latest information so anyone who wants to help out knows the
> current status.
>
> Of course, it is the known status of dependencies that we need
> clarification on. Which of these dependencies is known to be ported? Which
> of them are ported but are not up to date? Which of them are known not to
> be ported, and which of them are unknown?
>
>
> Public API Inconsistencies
> ---------------------------------
>
> One thing that I have had my eye on for a while now is the
> .NETification/consistency of the core API (that is, in the Lucene.Net
> project). There are several issues that I would like to address including:
>
>
> 1.       Method names that are still camelCase
>
> 2.       Properties that should be methods (because they do a lot of
> processing or because they are non-deterministic)
>
> 3.       Methods that should be properties
>
> 4.       .Size() vs .Size vs .Count - should generally all be .Count in
> .NET
>
> 5.       Interfaces should begin with "I"
>
> 6.       Classes should not begin with "I" followed by another capital
> letter (for some reason some of them were named that way)
>
> 7.       .CharAt() should probably be this[]
>
> 8.       Generic types nested within generic types (which cause Visual
> Studio to crash when Intellisense tries to read them)
>
> ... and so on. The only thing is these are all sweeping changes that will
> affect everyone helping out on Lucene.Net and anyone who is currently using
> the beta. So, I just wanted to gather some input on when the most
> appropriate time to begin working on these sweeping changes would be?
>
>
> Thanks,
> Shad Storhaug (NightOwl888)
>
>
>
>
>
>
>