You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucenenet.apache.org by ChristopherHaws <gi...@git.apache.org> on 2016/08/22 04:52:57 UTC

[GitHub] lucenenet pull request #181: Analysis work

GitHub user ChristopherHaws opened a pull request:

    https://github.com/apache/lucenenet/pull/181

    Analysis work

    Converted StopwordAnalyzerBase, WordlistLoader, StopAnalyzer, StopFilter and all of their associated tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ChristopherHaws/lucenenet analysis-work

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucenenet/pull/181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #181
    
----
commit ac63b34e7a72888cd17a2c4f28988255875a0e33
Author: ChristopherHaws <cr...@gmail.com>
Date:   2016-08-22T00:04:24Z

    Implemented WordlistLoader and associated tests

commit 957e27867ebc9b2d440d3da4e3a6a6e704ef8fa1
Author: ChristopherHaws <cr...@gmail.com>
Date:   2016-08-22T01:26:47Z

    Converted StopwordAnalyzerBase

commit 3b5ce9ed167ac58f5ddefb0d7999fcc968149b39
Author: ChristopherHaws <cr...@gmail.com>
Date:   2016-08-22T03:20:56Z

    Converted StopAnalyzer and StopFilter along with their unit tests

commit 4397e210fe00ca3779a69fa11b19b50c46c08c17
Author: ChristopherHaws <cr...@gmail.com>
Date:   2016-08-22T04:21:00Z

    Fixed a bug in the EntryIterator class

commit e0a48e578bff0c9cd69f623e97f42852c024ef4c
Author: ChristopherHaws <cr...@gmail.com>
Date:   2016-08-22T04:42:22Z

    Marked some long running tests

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by NightOwl888 <gi...@git.apache.org>.
Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    @ChristopherHaws - Hey, I am thinking about taking another shot at porting over Collation. Did you get anything done and/or are you working on it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by NightOwl888 <gi...@git.apache.org>.
Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    Sure. I am currently working on https://github.com/NightOwl888/lucenenet/commits/analysis-work-2, which I plan to merge into #179 (probably within the next few hours).
    
    About the only thing left to port is the [collation namespace](https://github.com/apache/lucene-solr/tree/8fdf89690404c0e65784b2c5477552b9dec58591/lucene/analysis/common/src/java/org/apache/lucene/collation), a few missing tests and some obsolete functionality (I wasn't planning on doing these, so you are welcome to them). You might want to also check to make sure I didn't miss anything else.
    
    There are also currently 45 failing tests out of about 1400. Most of the failing tests are in the Synonym and Th namespaces if you want to have a go at them.
    
    Additionally, I haven't gone through the API to ensure the accessibility of fields and members is the same as it was in Java. Certain functionality, such as the Pattern namespace could use some more overloads to make them more .NET friendly (for example, passing regex's in as strings/regex options in addition to passing in pre-made Regex objects). I am sure there must be other tweaks to the API that need to be done, I was just giving this as an example.
    
    FYI - I am currently working on the Hunspell namespace - the rest is fair game. Just make your intentions clear so we don't duplicate work.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by synhershko <gi...@git.apache.org>.
Github user synhershko commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    Awesome, thanks! Can you please check https://github.com/apache/lucenenet/pull/179 to see you and @NightOwl888 aren't working on the same stuff?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by NightOwl888 <gi...@git.apache.org>.
Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    I am pulling down your work on Collation so I can work with it. I'll see if I can merge these worlds together - most likely we just need to add some overloads that accept culture to some of the main actions and read from the default culture on the existing actions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by ChristopherHaws <gi...@git.apache.org>.
Github user ChristopherHaws commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    @NightOwl888 I didnt see your message until just now. I was working on converting from Locale to CultureInfo. I made a stub class for `Collator`, but this should probably be changed to an abstract class with a default implementation or removed entirely.
    
    Anyways, 9 out of 14 of the unit tests are succeeding. I am off to bed for now.
    
    https://github.com/ChristopherHaws/lucenenet/commit/c11205d75f070788940464ac1b8c3c6d915ecb92
    
    BTW, are you on the lucenenet-dev mailing list? We should probably move this conversation to there.
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by ChristopherHaws <gi...@git.apache.org>.
Github user ChristopherHaws commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    That's a shame that I didn't notice that the work was already done. I would love to help with the conversion in your fork if you have no objections.
    
    As for the EntryIterator bug, I didnt get it completely working, however the change that I made to `if (lastPos == outerInstance.keys.Length) return false;` made the iterator actually return the next value instead of it skipping them. I did this because at that point in the execution, `pos` is pointing to the next value, whereas `lastPos` is the current value, so we need to see if the current value (`lastPos`) is the last value in the array. This piece of code isn't in the java version and looks like it was added because .NET requires a bool return value here. I was going to keep looking into it today, however if you already have it fixed, perhaps I can pick something up off of your branch that you aren't working on?
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by ChristopherHaws <gi...@git.apache.org>.
Github user ChristopherHaws commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    Yes, I think we can close this pull request. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by synhershko <gi...@git.apache.org>.
Github user synhershko commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    @ChristopherHaws @NightOwl888 I merged that other branch - can we close this one?
    
    On a side-note, I'll be happy for you guys to take this great convo on the dev mailing list, as more people would get exposed to it and hopefully jump in. There are quite a few nasty challenges to tackle here..


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by NightOwl888 <gi...@git.apache.org>.
Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    Collator is an abstract Java class and there isn't that much to it. The most complicated part is working out how to use the culture in a .NET-centric way. I took a stab at it in [this commit](https://github.com/NightOwl888/lucenenet/commit/e6a037b2bb971b1015b2fc96eca663267df08273), which you can grab if you want a starting point.
    
    BTW - I took a look at your stuff and noticed you still have the Locale class in there. I have basically been using the System.Globalization.CultureInfo class as a direct replacement for Locale, although in .NET we normally set the locale on the current thread and then use it from there, so it might make more sense to just pick it up from the current context rather than affixing it to the class.
    
    AFAIK, ICU4NET is *only* for supporting Thai. And I just got that part working in #182 (although I had to wrap the BreakIterator and add some functionality because it doesn't break on Thai/non-Thai combinations of characters like it does in Java). This seems like it will work for now.
    
    Frankly, I live in Thailand and can read and write Thai, but I am not even sure I will be able to take advantage of the Thai support in Lucene.Net. I read through some of those posts as well and from what I gather, this is supposed to be the interim solution to get Lucene.Net off the ground and at some later point we can port ICU. I don't really see any reason why a .NET core release couldn't just exclude Thai and the other classes that touch BreakIterator.
    
    But I understand where you are coming from. It would be nice not to have any unmanaged code references, if possible. And if that is a priority for you, then go for it.
    
    Do note I have added pretty much all of the other missing Analysis.Common pieces/tests (other than collation and Analysis.Miscellaneous.PatternAnalyzer and its tests) in #182. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by NightOwl888 <gi...@git.apache.org>.
Github user NightOwl888 commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    Actually, all of this is already done in #179.
    
    I am intrigued about the EntryIterator bug fix, though. The solution I came up with was to maintain an internal dictionary and return the enumerator on that set rather than the real one. But if there is a way to do it without using up all of that extra RAM it would certainly be better.
    
    I noticed that you also marked several tests as "LongRunning" that aren't. The CharArrayMap was returning the EntryIterator for the current set, which was causing infinite recursion. I gather that you marked these as long running tests because you assumed that they would eventually finish if you let them? If so, then maybe this fix won't solve the infinite recursion issue.
    
    Could you elaborate on what the fix actually fixed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by ChristopherHaws <gi...@git.apache.org>.
Github user ChristopherHaws commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    Yes, I pretty much have them all cleaned up, but the Collator class is specific to Java. I noticed that there is a reference to ICU4NET in the project, however that project only has the BreakIterator class ported. After looking through the dev mailing list, I found that it is still up in the air as to whether the project should use ICU4NET or not.
    
    I was currently looking into the possibility of porting just a few of the classes from the ICU4J project. Doing a manual port of the classes needed for Lucene would allow there to be no C++ dependencies. Also, if there are plans to compile against .NET Core (which I have read that there is), then using ICU4NET is not going to be feasible as it is using C++ CLI, which only works on Windows against full .NET.
    
    Here is what I have done so far. If you want, I can create a pull request to your repo, but I am not sure what else can be done without the Collator class.
    
    https://github.com/ChristopherHaws/lucenenet/commit/5213a5c36819de9a81db8873e16230667cd761b7


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #181: Analysis work

Posted by ChristopherHaws <gi...@git.apache.org>.
Github user ChristopherHaws commented on the issue:

    https://github.com/apache/lucenenet/pull/181
  
    Sounds good. I will take a look at the collation namespace tonight when I get home from work.
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet pull request #181: Analysis work

Posted by ChristopherHaws <gi...@git.apache.org>.
Github user ChristopherHaws closed the pull request at:

    https://github.com/apache/lucenenet/pull/181


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---