You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2022/09/30 02:48:09 UTC

[GitHub] [lucenenet] NightOwl888 commented on pull request #645: WIP Port Lucene.Net.Analysis.Ko

NightOwl888 commented on PR #645:
URL: https://github.com/apache/lucenenet/pull/645#issuecomment-1263040758

   Thanks for the contribution.
   
   There was a prior attempt at porting the nori analyzer on [the feature/analysis-nori branch](https://github.com/NightOwl888/lucenenet/tree/feature/analysis-nori). However, there were 2 issues preventing it from functioning on Lucene.Net 4.8.0:
   
   1. The FST implementation was redesigned between 4.8.0 and the version that nori was added in Lucene. This incompatibility made it impossible to debug the tests.
   2. BigDecimal doesn't exist in .NET.
   
   We are more concerned with getting past the first issue than the second, since it would be trivial to exclude `KoreanNumberFilter` and `KoreanNumberFilterFactory` from the project (commented out, with notes explaining that we are missing the `BigDecimal` dependency at the top of the file), but I will provide some info about how to proceed if you are interested in porting it.
   
   ## Issues to Fix
   
   1. Move this into a new project named `Lucene.Net.Analysis.Nori`, using `Lucene.Net.Analysis.Kuromoji` as a project template. We don't want to add large dependencies to `Lucene.Net.Analysis.Common`, since most people are using it.
   2. Add a comment at the top of each code file indicating the exact Lucene version it is ported from. Example: `// Lucene version compatibility level 9.3.0`
   3. Bring over the tests from [the feature/analysis-nori branch](https://github.com/NightOwl888/lucenenet/tree/feature/analysis-nori) and add them to a new project named `Lucene.Net.Tests.Analysis.Nori` (or alternatively create a fresh port of the tests). Either way, please ensure the tests are up to date with the exact Lucene version that you are porting the nori analyzer from.
   4. The nori project is very similar to [Lucene.Net.Analysis.Kuromoji](https://github.com/apache/lucenenet/tree/b5ea527c5bd125dd1db34d8b914e1a5d72e08ffa/src/Lucene.Net.Analysis.Kuromoji) and [Lucene.Net.Analysis.SmartCn](https://github.com/apache/lucenenet/tree/b5ea527c5bd125dd1db34d8b914e1a5d72e08ffa/src/Lucene.Net.Analysis.SmartCn) projects. Please use similar coding styles (constants instead of static fields, properties instead of methods where appropriate, naming conventions, folder detection for custom dictionaries, etc.)
   5. After addressing the porting issues below, please ensure all of the tests are passing. The best way to do this is to [setup Java debugging](https://lucenenet.apache.org/contributing/how-to-setup-java-lucene-debugging.html) and step through the code. Note that you will need to use the version of Java Lucene you are doing the port from in order to get the same results. Lucene has made some changes to their setup procedure, so it may take some research/experimentation to do. Do note that we have [these versions](https://github.com/NightOwl888/lucene/branches) successfully running. Alternatively, you may compare the code files line by line, but it may not be possible to figure out how to make the tests pass if you use this approach.
   
   ## FST
   
   At the time I attempted [the feature/analysis-nori branch](https://github.com/NightOwl888/lucenenet/tree/feature/analysis-nori), the FST API seemed to fit, however, due to some design changes it produced completely different results than the version I had ported it from (unfortunately, I don't recall what version it is based upon). At the time I thought that FST was tied deeply into other Lucene components and having multiple incompatible versions in the project wouldn't work. However, I have since learned that FST is only used in specific scenarios that end users won't need to plug together, so having a copy of the latest FST in the `Lucene.Net.Analysis.Nori` project should work fine (I think).
   
   To be able to debug, we need to be able to step through the code and get FST to return the exact results from the Lucene version this is based upon. So, we need a fresh port of FST from the same version of Lucene that nori is based upon. The convention we are following is to put "extra" components such as this into a folder named `Support`, followed by the regular folder convention in Lucene.
   
   ```
   src
     |
     -- Lucene.Net.Analysis.Nori
       |
       -- (All nori code files/folders)
       |
       -- Support
         |
         -- Util
           |
           -- Fst
             |
             -- (All FST code files)
     |
     -- Lucene.Net.Tests.Analysis.Nori
       |
       -- (All nori test code files/folders)
       |
       -- Support
         |
         -- Util
           |
           -- Fst
             |
             -- (All FST test code files)
   ```
   
   Please be mindful that we will be using similar namespace conventions as we are currently (the namespace may not necessarily match the name of the project it belongs to). For now, please put the new FST port into the `Lucene.Net.Support.Util.Fst` namespace.
   
   ## BigDecimal
   
   We definitely don't want to take on a dependency to IKVM, both because it is large and because it doesn't support most of the .NET runtimes that we do. Please try the following (in order of preference):
   
   1. I suspect the primary reason `BigDecimal` was the goto class in Java is because Java doesn't have a `decimal` data type like .NET does. It has limited precision compared to `BigDecimal`, but would work for most use cases. As in the [the feature/analysis-nori branch](https://github.com/NightOwl888/lucenenet/tree/feature/analysis-nori), we can use a nullable `decimal?` to account for gaps between Java and .NET reference vs value types.
   2. We could attempt to wrap `BigInteger` into a `BigDecimal`. There is an implementation [here](https://gist.github.com/JcBernack/0b4eef59ca97ee931a2f45542b9ff06d), which was found [on StackOverflow](https://stackoverflow.com/a/13813535). Please include it in the `Support/Numerics` folder (`Lucene.Net.Numerics` namespace) and be mindful of comments where people have suggested improvements to the implementation. We can use a nullable `BigDecimal?` to account for gaps between Java and .NET reference vs value types.
   3. If neither of the above options works, I suggest including the code files for `KoreanNumberFilter` and `KoreanNumberFilterFactory` (and their tests) in the project, but commenting them out. We will simply not include an implementation in Lucene.Net.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org