You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@lucenenet.apache.org by Shad Storhaug <sh...@shadstorhaug.com> on 2016/08/24 09:57:13 UTC

Lucene.Net.Analysis.Common Work

Itamar,

Just updating you regarding the work on Analysis on the mailing list as per your request.

To the rest, if anyone wants to help close out Analysis.Common, please coordinate with Itamar.

Scope of Remaining 43 Failing Tests

17 of the failing tests are in the Analysis.Synonym namespace (out of 51). I am not sure exactly what is happening here, but I am convinced that this issue is local to the namespace (or the tests) and it can be overcome with a little diligence. I plan to have another look at this today. Anyone who beats me to a solution gets an "atta-boy" (or "atta-girl") from me.

14 of the failing tests are in the Analysis.Th namespace (out of 16), and is subsequently causing the Analysis.Core.TestFactories.Test() method to fail (15 failing tests total). That is, Thai support is completely broken. Using the Clone() method of the BreakIterator subsequently results in null reference exceptions when SetText() is called. It is unclear why cloning is even necessary, though, and it appears this issue can be overcome by creating the iterators using the static methods instead of cloning. However the Thai words are still not breaking correctly, and I am not sure what is happening there. Looks like the documentation needs to be consulted on ICU to figure out how to use it properly. I haven't tried all of the available locales (only US English), but I do find it odd that there is no Locale for Thai in the package though since that appears to be the only language it supports (well, perhaps Lao, Khmer and Burmese will need this, but we don't have support for them anyway). I came up with another thought today that I didn't try as well - perhaps the culture of the current thread needs to be changed to Thai in order to make it work. Worth another look.

5 of the failing tests are in the Analysis.Compound namespace (out of 17). I had to take a slightly different approach than the Java implementation because it uses a SAX XML parser of which there is no direct equivalent in .NET, so made do with the XmlReader. It appears to be loading the data correctly, but it is difficult to verify this without cross-checking it against its Java counterpart. So, it is unknown whether the bugs are in the XML parsing, the data structure, the code that uses the data structure, some dependency of the code, or all of the above.

2 of the failing tests are in the HtmlStripCharFilterTest class. For some reason, there are null characters being added to the random tests which are not parsing correctly. The data being printed out in the log is not helpful to determine what the input is that is causing the failures. I attempted to pass null characters in using a couple of different forms and the parser seems to handle them fine. My gut feeling tells me that this is a problem with the random strings that are being generated for the test, not a problem with the parser.

1 of the failing tests is in the Analysis.Tr namespace (Turkish). The cause of this failure appears to be in the core. The Character.CodePointAt() method is returning character 304 rather than the expected LATIN_CAPITAL_LETTER_I + non-spacing mark character. I didn't look into it further than that.

1 of the failing tests, Analysis.Core.TestBugInSomething.TestWrapping() appears to be an issue with porting the test, and I will address that today.

1 of the failing tests, Analysis.Pattern.TestCaptureGroupTokenFilter.TestRandomString() is (I suspect) another issue with the random string generation.

1 of the failing tests, Analysis.Util.TestCharArrayMap.TestToString() is failing because we are not passing an object reference of CharArrayMap into the CharArraySet when it is instantiated in the CharArrayMap.KeySet() method. Without the direct object reference, calling ToString() on the CharArrayMap.Keys property returns an incorrect result because the CharArraySet and CharArrayMap are not referencing the same array. As far as I can tell, this is not causing an issue other than this one test.

It's a tricky issue because CharArrayMap is generic and CharArraySet is not. There are at least 3 ways we could solve this:


1.       Wrap the array (and other shared dependencies) into an internal reference type so both classes are referring to the same instance.

2.       Create a non-generic ICharArrayMap interface that references the few generic members as type object, and use the interface rather than the concrete type as the type to pass to CharArraySet.

3.       Make CharArraySet generic so its type remains in sync with the CharArrayMap.

I took a stab at #3 on this branch: https://github.com/NightOwl888/lucenenet/tree/analysis-chararray-generic, and it appears like it can compile that way, but it caused some other problems and I didn't get a chance to see if they could all be resolved.

It also occurred to me that rather than making an abstract CharArraySet and generic CharArraySet<V>, it might be worth considering making an ICharArraySet interface to pass the concrete type around without knowing its type, and making the CharArraySet a superclass of CharArraySet<string>, since more than 95% of the time we are using it with a string anyway (how many words do you know that are not strings?). It might be confusing, though - most people would expect an untyped CharArraySet class to be type object rather than string.

Not Yet Ported

There are just a few (known) remaining pieces missing out of Analysis.Common.


1.       Collation - Looks like we need to port over the Java Collator class. I took a stab at it in [this WIP commit](https://github.com/NightOwl888/lucenenet/commit/e6a037b2bb971b1015b2fc96eca663267df08273). AFAIK, ChristopherHaws is porting it (https://github.com/apache/lucenenet/pull/181#issuecomment-241481735), check with him if you want to give it a try as there doesn't seem to be much to it but will require some finesse to make it work with .NET.

2.       Analysis.Core.UpperCaseFilterFactory - Not sure how I missed it, but I will get on this today.

3.       Analysis.Miscellaneous.PatternAnalyzer - Obsolete and superseded by the Analysis.Pattern namespace. Do we really need this?

4.       Analysis.Core.TestAllAnalyzersHaveFactories

5.       Analysis.Core.TestRandomChains

6.       Analysis.Miscellaneous.PatternAnalyzerTest

7.       Analysis.Miscellaneous.TestSingleTokenFilter

8.       Analysis.Util.TestAnalysisSPILoader


Testing Issues

As for putting all of the Hunspell language dictionary (in compatible versions) into the repository to enable automated testing, I am not sure how well that would be received. The binaries are together an additional 108 MB, which means a lot of extra stuff to download if cloning Lucene.Net. In addition, the tests take quite a while to finish - at least 20-30 minutes. I think the Java dev team made the right choice by making this a manual task.

In any case, I have created a repo here: https://github.com/NightOwl888/lucenenet-hunspell that you can clone/copy/rehost to share these known working dictionary files with users of Lucene.Net. That is not to say that other versions of dictionaries won't work, but there can be issues when .NET doesn't recognize the encoding string or there are tags that the Hunspell parser doesn't recognize.

It also occurred to me that there are probably other hidden issues that won't rear their ugly head unless we change the culture of the current thread while testing. We should modify the test framework to test another culture on demand or perhaps randomly to weed out any localization issues while parsing strings.


Merging to Master/Prerelease

The only real concern here might be that the solution ultimately used to fix (or not fix) CharArraySet might end up changing the public API of CharArraySet. Other than that, I don't see any real issues that are holding up the release here. Thoughts?


Shad Storhaug/NightOwl888