Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2022/10/31 14:32:33 UTC

[GitHub] [lucenenet] krinsang opened a new issue, #732: ICUTokenizer discrepancies

krinsang opened a new issue, #732:
URL: https://github.com/apache/lucenenet/issues/732

   I've noticed that the ICUTokenizer for Thai does not generate the same tokens as the Java variant. I tested the latest beta version of the .NET project against Apache Lucene v4.8.0. Even making sure that both implementations use the same `.brk` files did not yield consistent results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1302465382

   Thanks. Still no luck getting the results you are seeing. I tried both `net5.0` and `net48` and even checked out the beta 14 tag to see if there was a difference, but no dice.
   
   One thing of note: Thai has no concept of uppercase/lowercase, so if you are not analyzing text with other scripts in it, the lowercasing step can be omitted. It will also improve performance.
   
   So, for a repro we need to know more about your environment:
   
   1. Which runtime are you using? .NET Framework, .NET Core, Unity, Xamarin, Mono, etc? Which specific version?
   2. Which OS are you trying this on? Which specific version?
   3. What is the default culture set to on the OS?
   4. What is the culture of the current thread when you run this?
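
   The four questions above concern the .NET side. For the Java half of the comparison, roughly equivalent facts can be collected from standard JDK properties. This is only a sketch: the class and key names are hypothetical, and Java has no per-thread culture, so the process-wide default locale is the closest analogue.

   ```java
   import java.util.LinkedHashMap;
   import java.util.Locale;
   import java.util.Map;

   final class ReproEnvironment {
       // Collects runtime, OS and locale details worth attaching to a repro report.
       static Map<String, String> describe() {
           Map<String, String> info = new LinkedHashMap<>();
           info.put("runtime", System.getProperty("java.vendor") + " " + System.getProperty("java.version"));
           info.put("os", System.getProperty("os.name") + " " + System.getProperty("os.version"));
           // Default locale of the JVM (usually inherited from the OS settings).
           info.put("defaultLocale", Locale.getDefault().toLanguageTag());
           // Locale used for formatting; it can differ from the display locale.
           info.put("formatLocale", Locale.getDefault(Locale.Category.FORMAT).toLanguageTag());
           return info;
       }

       public static void main(String[] args) {
           describe().forEach((key, value) -> System.out.println(key + ": " + value));
       }
   }
   ```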
   
   It would also be helpful if you could set up a basic test that fails in an environment such as GitHub Actions or Azure DevOps so we can rule out anything with your specific environment.




[GitHub] [lucenenet] krinsang commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
krinsang commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1302612314

   I've narrowed the problem down to a difference in the way the normalizers for Java and .NET handle accented characters. While .NET's `System.String.Normalize` detected and decomposed the accented characters, the `Normalizer` class in `java.text` did not apply the same modifications to the original query.




[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1300996655

   That looks like it could be a side effect of the `ICUNormalizer2Filter`.
   
   - เทา (grey)
   - เท้า (foot)
   - รองเท้า (shoes)
   
   
   Removing the diacritics changes the meaning in Thai.
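
   To illustrate the point with the JDK alone: a common accent-folding idiom (decompose to NFD, then delete nonspacing marks) treats Thai tone marks as removable accents, which turns เท้า (foot) into เทา (grey). The class and method names below are hypothetical, and it is only an assumption that a step like this runs somewhere in the reporter's pipeline, since the thread does not show that code.

   ```java
   import java.text.Normalizer;

   final class ThaiMarkFolding {
       // Decompose to NFD, then delete every nonspacing mark (Unicode category Mn).
       // Thai tone marks such as MAI EK and MAI THO fall into that category.
       static String stripMarks(String text) {
           String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
           return decomposed.replaceAll("\\p{Mn}", "");
       }

       public static void main(String[] args) {
           // "foot": SARA E, THO THAHAN, MAI THO (tone mark), SARA AA
           String foot = "\u0e40\u0e17\u0e49\u0e32";
           // "grey": the same letters without the tone mark
           String grey = "\u0e40\u0e17\u0e32";
           System.out.println(stripMarks(foot).equals(grey)); // prints true
       }
   }
   ```

   The strings are written as Unicode escapes so the file compiles regardless of source encoding.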




[GitHub] [lucenenet] krinsang commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
krinsang commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1300978918

   Yes, you're correct about the `StopFilter`. The `StopFilter` behaves identically in .NET and Java. I've reviewed the results again without the `StopFilter`, but I'm still running into issues. A specific example:
   .NET:
   ```
   keyword: กล่องใส่รองเท้า
   tokenized components: "กลอง", "ใส", "รอง", "เทา"
   ```
   Java:
   ```
   keyword: กล่องใส่รองเท้า
   tokenized components: "กล่อง", "ใส่", "รองเท้า"
   ```
   
   It seems that the ICUTokenizer in .NET tokenizes the keywords into smaller units, which is not observed in Java.




[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1298866011

   I just re-read your report and realized you were referring to `ICUTokenizer` rather than the `ThaiTokenizer`.
   
   The behavior differs from Lucene 4.8.0 because we ported the [`ICUTokenizer` from Lucene 8.6.1](https://github.com/apache/lucenenet/blob/ebcd9ea985001580cdd9f6f90801c605fe19a250/src/Lucene.Net.Analysis.ICU/Analysis/Icu/Segmentation/ICUTokenizer.cs#L1). Can you confirm that the behavior is the same as Lucene 8.6.1?




[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1298948890

   Could you put together a test or console app we can run in both environments so we can take a look?




[GitHub] [lucenenet] krinsang commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
krinsang commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1302254848

   [Analyzers.zip](https://github.com/apache/lucenenet/files/9930442/Analyzers.zip)
   




[GitHub] [lucenenet] krinsang commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
krinsang commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1301013659

   I am not using the `ICUNormalizer2Filter` anywhere in my code, and I'm not seeing that it's being referenced in the `ICUTokenizer` class.
   




[GitHub] [lucenenet] krinsang commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
krinsang commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1298973680

   https://github.com/krinsang/lucenenet/tree/bug/icutokenizer
   please use my forked branch as a reference.
   




[GitHub] [lucenenet] krinsang commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
krinsang commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1298944253

   Yes, I just compared version 8.6.1 to 4.8.0-beta00014 and the two tokenizers yield different results.




[GitHub] [lucenenet] krinsang closed issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
krinsang closed issue #732: ICUTokenizer discrepancies
URL: https://github.com/apache/lucenenet/issues/732




[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1301635515

   Could you post the code for how you are constructing the analyzer, including how you are setting up the `StopFilter`? Something in your token stream is filtering out diacritics. We are most likely looking at some sort of gap between how .NET and Java handle localization or normalization, but this doesn't appear to be directly related to `ICUTokenizer` or `CharArraySet`.




[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1301788992

   Oh, and be sure to include the culture name (e.g. `en-US`) of the thread you are running this on.




[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1300962409

   Thanks.
   
   I am having trouble getting my VM up and running for Java debugging, but since I can read Thai, I reviewed the test and came up with a theory.
   
   The original test is:
   
   ```c#
   AssertAnalyzesTo(a, "กล่องใส่รองเท้า ใส่ของอเนกประสงค์เปิดฝาด้านหน้า เนื้อพลาสติกแข็งชนิดเดียวกับแฟ้มหูหิ้ว",
                   new string[] { "กล่อง", "ใส่", "รองเท้า", "ใส่", "อเนกประสงค์", "ฝา", "หน้า", "เนื้อ", "พลาสติก", "แข็ง", "ชนิด", "แฟ้ม", "หู", "หิ้ว" });
   ```
   
   The test passes if you put all of the words in it that are in the input:
   
   ```c#
   AssertAnalyzesTo(a, "กล่องใส่รองเท้า ใส่ของอเนกประสงค์เปิดฝาด้านหน้า เนื้อพลาสติกแข็งชนิดเดียวกับแฟ้มหูหิ้ว",
                   new string[] { "กล่อง", "ใส่", "รองเท้า", "ใส่", "ของ", "อเนกประสงค์", "เปิด", "ฝา", "ด้าน", "หน้า", "เนื้อ", "พลาสติก", "แข็ง", "ชนิด", "เดียว", "กับ", "แฟ้ม", "หู", "หิ้ว" });
   ```
   
   The words that are being excluded are:
   
   - ของ (things/items)
   - เปิด (open)
   - ด้าน (side/area)
   - เดียว (single/same)
   - กับ (with)
   
   These appear to be common stop words. One thing to note: a tokenizer is only 1 component of an analyzer. I suspect you have a `StopFilter` in the analyzer you are using that does not exist in the [analyzer for the test](https://github.com/apache/lucenenet/blob/c076e40b14d4c20e6fdfee4e28d0b3332cf6d0ce/src/Lucene.Net.Tests.Analysis.ICU/Analysis/Icu/Segmentation/TestICUTokenizer.cs#L76-L81).
   
   > **SIDE NOTE:** There does appear to be a discrepancy in that the tests indicate they are ported from 7.1.0 but the production code indicates it is ported from 8.6.1. I need to check, but it is entirely possible that this was just because we reviewed the production code and it hadn't changed between 7.1.0 and 8.6.1, but we should have done the tests as well. However, the [8.6.1 analyzer](https://github.com/apache/lucene/blob/releases/lucene-solr/8.6.1/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java#L73-L79) is different from the [7.1.0 analyzer](https://github.com/apache/lucene/blob/releases/lucene-solr/8.6.1/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java#L73-L79), which is what we are currently testing with. There is an extra `ICUNormalizer2Filter` in the 7.1.0 version of the test.
   
   In any case, make sure your analyzer is built from the same components in both environments.




[GitHub] [lucenenet] NightOwl888 commented on issue #732: ICUTokenizer discrepancies

Posted by GitBox <gi...@apache.org>.
NightOwl888 commented on issue #732:
URL: https://github.com/apache/lucenenet/issues/732#issuecomment-1297294020

   Thanks for the report. Yeah, the Thai Tokenizer is a tough nut. .NET has nothing like a `BreakIterator`, and we can't port the one from the JDK due to licensing restrictions. Apache Harmony (which is a JRE that has an Apache license) used the ICU4J `BreakIterator`. We tried the same thing, but it behaves differently than the JDK.
   
   In fact, it isn't just the `ThaiTokenizer` that is affected; the highlighters are affected as well.
   
   Unfortunately, `.brk` files don't apply to dictionary-based break iterators; for those you have to use `.dict` files. They certainly cannot be loaded using the `BreakIterator` API, but I don't recall whether there is another API that can be used to load them or whether it will actually take a custom compile of ICU4N to pull it off. See the [ICU4J Resource Information](https://unicode-org.github.io/icu/userguide/icu4j/#icu4j-resource-information). I think there might have been a way to use class paths to load custom ones, but I am still trying to learn how these things are done "normally" in Java.
   
   Trying to come up with a reasonable way to work with resource files in ICU4N is still a work in progress, since the architecture for loading them in .NET is completely different. There is an attempt [here](https://github.com/NightOwl888/ICU4N/commits/feature/resource-automation) to migrate to using satellite assemblies for the localized resources. However, that still means the `.dict` files would be embedded in the `ICU4N.dll` assembly.
   
   There is also an attempt, on another branch that hasn't been pushed yet, to replace system properties with `Microsoft.Extensions.Configuration`, similarly to how we did it in Lucene.NET. It may be relevant to how to inject resources into ICU4N, but I don't recall. It is still a mess: I went down a rabbit hole chasing a `ThreadAbortException` being thrown from the .NET `StringBuilder` class for seemingly no reason, before giving up 8 months ago. However, I haven't yet tried to go back to find the commit where it started.
   
   We were able to get the Lucene tests to pass by adding a [ThaiWordBreaker](https://github.com/apache/lucenenet/blob/e72315a75009854483c979462eb2406f41311796/src/Lucene.Net.Analysis.Common/Analysis/Th/ThaiTokenizer.cs#L212-L309) class that separates Thai letters from Thai numbers (and other text). This brings us a bit closer to JDK behavior. However, we don't have a thorough set of tests or documentation (that we could find) to determine all of the differences between the two implementations to close the gap. Of course, if all of it can be done using a different `.dict` file, that would be preferable to adding classes to mimic JDK behavior.
   
   If you could share any information you can find - more tests to show differences in behavior, info on how custom `.dict` files are loaded in ICU4J by users (without recompiling), info about how to create `.dict` files, etc., it would certainly help move us forward.
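
   As one way to gather such comparison data, the sketch below dumps the word boundaries the JDK's own `BreakIterator` reports for Thai text, so the output can be set side by side with what ICU4N produces. It uses only `java.text`; the class and method names are hypothetical, and the actual segments depend on the JDK version and its bundled locale data, so no expected output is given.

   ```java
   import java.text.BreakIterator;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Locale;

   final class JdkThaiBreaks {
       // Splits text at every boundary reported by the JDK word BreakIterator for Thai.
       static List<String> segments(String text) {
           BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
           words.setText(text);
           List<String> out = new ArrayList<>();
           int start = words.first();
           for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
               out.add(text.substring(start, end));
           }
           return out;
       }

       public static void main(String[] args) {
           // The Thai phrase discussed in this issue, written as Unicode escapes.
           String input = "\u0e01\u0e25\u0e48\u0e2d\u0e07\u0e43\u0e2a\u0e48\u0e23\u0e2d\u0e07\u0e40\u0e17\u0e49\u0e32";
           for (String segment : segments(input)) {
               System.out.println(segment);
           }
       }
   }
   ```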
   
   
   

