Posted to dev@lucenenet.apache.org by laimis <gi...@git.apache.org> on 2015/12/17 02:42:38 UTC

[GitHub] lucenenet pull request: Port CharArrayIterator

GitHub user laimis opened a pull request:

    https://github.com/apache/lucenenet/pull/157

    Port CharArrayIterator

    This class relies on the BreakIterator concept, which has no equivalent in .NET. However, the ICU project is active and has a .NET bindings library available via NuGet:
    
    http://site.icu-project.org/
    
    .NET library:
    https://github.com/niaher/icu4net
    
    I was able to use it to port CharArrayIterator and its tests, which are all passing. BreakIterator also appears to be used in other places in Analysis, so this PR gives us a solution for those as well.
    
    I did have to set the platform to 32-bit for the analysis project in order for the tests to pass. The ICU4NET and ICU DLLs appear to be platform-specific, and only 32-bit bindings are available for .NET. I was able to download the icu4net code and compile it with Visual C++, so down the road I can see whether I can get a 64-bit version packaged and submitted as a PR to the icu4net project, or package it as our own.
    
    Anyway, do let me know if you think we should use something else for BreakIterator implementations or go with this. The ICU project also seems to implement other components that Analysis uses and that have no equivalent in .NET (e.g. the Collations concept). I am completely new to the ICU project, but it appears to be very active, with regular meetings and releases.
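
As a rough illustration of the BreakIterator concept described above, here is a minimal sketch in Java, the language the class is being ported from, where BreakIterator and CharacterIterator ship with the JDK in java.text. The class name SliceIterator and the helper printSentences are hypothetical and are not part of Lucene or of this PR; the sketch only mirrors the start/index/limit bookkeeping of a character iterator over a slice of a char[] buffer.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

// Hypothetical, simplified stand-in for CharArrayIterator: walks a slice of a char[].
final class SliceIterator implements CharacterIterator {
    private final char[] array;
    private final int start; // offset of the slice within the buffer
    private final int limit; // one past the last char of the slice
    private int index;       // absolute position within the buffer

    SliceIterator(char[] array, int start, int length) {
        this.array = array;
        this.start = start;
        this.limit = start + length;
        this.index = start;
    }

    @Override public char current() { return index == limit ? DONE : array[index]; }
    @Override public char first() { index = start; return current(); }
    @Override public char last() { index = (limit == start) ? limit : limit - 1; return current(); }

    @Override public char next() {
        if (++index >= limit) { index = limit; return DONE; }
        return current();
    }

    @Override public char previous() {
        if (--index < start) { index = start; return DONE; }
        return current();
    }

    @Override public char setIndex(int position) {
        if (position < getBeginIndex() || position > getEndIndex()) {
            throw new IllegalArgumentException("Illegal Position: " + position);
        }
        index = start + position;
        return current();
    }

    // Positions reported to the caller are relative to the slice, not the buffer.
    @Override public int getBeginIndex() { return 0; }
    @Override public int getEndIndex() { return limit - start; }
    @Override public int getIndex() { return index - start; }

    @Override public Object clone() {
        SliceIterator copy = new SliceIterator(array, start, limit - start);
        copy.index = index;
        return copy;
    }

    // Prints each sentence the JDK BreakIterator finds inside the slice.
    static void printSentences(char[] buffer, int start, int length) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(new SliceIterator(buffer, start, length));
        int from = sentences.first();
        for (int to = sentences.next(); to != BreakIterator.DONE; from = to, to = sentences.next()) {
            System.out.println(new String(buffer, start + from, to - from));
        }
    }

    public static void main(String[] args) {
        char[] buffer = "Hello world. How are you?".toCharArray();
        printSentences(buffer, 0, buffer.length);
    }
}
```

The point of iterating over a slice is that a tokenizer can reuse one buffer and hand the break iterator a window into it without copying. A .NET port needs some equivalent of both the iterator contract and the break iterator itself, which is the gap ICU4NET is proposed to fill.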


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/laimis/lucenenet analysis_chararrayiterator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucenenet/pull/157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #157
    
----
commit 03d0950fac054c2e709e6a30aecca25baa2c34a2
Author: Laimonas Simutis <la...@gmail.com>
Date:   2015-12-17T01:30:34Z

    port CharArrayIterator

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by laimis <gi...@git.apache.org>.
Github user laimis commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-166791249
  
    @synhershko I removed the "buggy breakiterator" code, and the tests still look good.



[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/lucenenet/pull/157



[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by jpsullivan <gi...@git.apache.org>.
Github user jpsullivan commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-165325991
  
    Great job on this so far. I had a lot of problems getting ICU4NET properly integrated and running, so this alone is a fine start. Not that I have any say, but I'd vote for continuing with this plan rather than coming up with something else for a BreakIterator implementation. 



[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by synhershko <gi...@git.apache.org>.
Github user synhershko commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-165615989
  
    This is a bit tricky. icu4net doesn't seem to be that active (~40 commits in 4 years, last one >6 months ago), and the 32-bit stuff is quite problematic. And I'd rather avoid maintaining yet another project.
    
    On the other hand, there's no point in reinventing the wheel, so long as we can confirm the validity and quality of that project.
    
    Take a look at icu4j - that's a Java implementation Lucene itself uses quite a bit. Maybe there are just a few classes that we can port over from there and from the icu4net project?



[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by laimis <gi...@git.apache.org>.
Github user laimis commented on a diff in the pull request:

    https://github.com/apache/lucenenet/pull/157#discussion_r48024844
  
    --- Diff: src/Lucene.Net.Analysis.Common/Analysis/Util/CharArrayIterator.cs ---
    @@ -1,276 +1,268 @@
     using System;
    +using ICU4NET;
     
    -namespace org.apache.lucene.analysis.util
    +namespace Lucene.Net.Analysis.Util
     {
     
    -	/*
    -	 * Licensed to the Apache Software Foundation (ASF) under one or more
    -	 * contributor license agreements.  See the NOTICE file distributed with
    -	 * this work for additional information regarding copyright ownership.
    -	 * The ASF licenses this file to You under the Apache License, Version 2.0
    -	 * (the "License"); you may not use this file except in compliance with
    -	 * the License.  You may obtain a copy of the License at
    -	 *
    -	 *     http://www.apache.org/licenses/LICENSE-2.0
    -	 *
    -	 * Unless required by applicable law or agreed to in writing, software
    -	 * distributed under the License is distributed on an "AS IS" BASIS,
    -	 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    -	 * See the License for the specific language governing permissions and
    -	 * limitations under the License.
    -	 */
    +    /*
    +     * Licensed to the Apache Software Foundation (ASF) under one or more
    +     * contributor license agreements.  See the NOTICE file distributed with
    +     * this work for additional information regarding copyright ownership.
    +     * The ASF licenses this file to You under the Apache License, Version 2.0
    +     * (the "License"); you may not use this file except in compliance with
    +     * the License.  You may obtain a copy of the License at
    +     *
    +     *     http://www.apache.org/licenses/LICENSE-2.0
    +     *
    +     * Unless required by applicable law or agreed to in writing, software
    +     * distributed under the License is distributed on an "AS IS" BASIS,
    +     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +     * See the License for the specific language governing permissions and
    +     * limitations under the License.
    +     */
     
     
    -	/// <summary>
    -	/// A CharacterIterator used internally for use with <seealso cref="BreakIterator"/>
    -	/// @lucene.internal
    -	/// </summary>
    -	public abstract class CharArrayIterator //: CharacterIterator
    -	{
    -	  private char[] array;
    -	  private int start;
    -	  private int index;
    -	  private int length;
    -	  private int limit;
    +    /// <summary>
    +    /// A CharacterIterator used internally for use with <seealso cref="BreakIterator"/>
    +    /// @lucene.internal
    +    /// </summary>
    +    public abstract class CharArrayIterator : CharacterIterator
    +    {
    +        private char[] array;
    +        private int start;
    +        private int index;
    +        private int length;
    +        private int limit;
     
    -	  public virtual char [] Text
    -	  {
    -		  get
    -		  {
    -			return array;
    -		  }
    -	  }
    +        public virtual char[] Text
    +        {
    +            get
    +            {
    +                return array;
    +            }
    +        }
     
    -	  public virtual int Start
    -	  {
    -		  get
    -		  {
    -			return start;
    -		  }
    -	  }
    +        public virtual int Start
    +        {
    +            get
    +            {
    +                return start;
    +            }
    +        }
     
    -	  public virtual int Length
    -	  {
    -		  get
    -		  {
    -			return length;
    -		  }
    -	  }
    +        public virtual int Length
    +        {
    +            get
    +            {
    +                return length;
    +            }
    +        }
     
    -	  /// <summary>
    -	  /// Set a new region of text to be examined by this iterator
    -	  /// </summary>
    -	  /// <param name="array"> text buffer to examine </param>
    -	  /// <param name="start"> offset into buffer </param>
    -	  /// <param name="length"> maximum length to examine </param>
    -	  public virtual void setText(char[] array, int start, int length)
    -	  {
    -		this.array = array;
    -		this.start = start;
    -		this.index = start;
    -		this.length = length;
    -		this.limit = start + length;
    -	  }
    +        /// <summary>
    +        /// Set a new region of text to be examined by this iterator
    +        /// </summary>
    +        /// <param name="array"> text buffer to examine </param>
    +        /// <param name="start"> offset into buffer </param>
    +        /// <param name="length"> maximum length to examine </param>
    +        public virtual void SetText(char[] array, int start, int length)
    +        {
    +            this.array = array;
    +            this.start = start;
    +            this.index = start;
    +            this.length = length;
    +            this.limit = start + length;
    +        }
     
    -	  public override char Current()
    -	  {
    -		return (index == limit) ? DONE : jreBugWorkaround(array[index]);
    -	  }
    +        public override char Current()
    +        {
    +            return (index == limit) ? DONE : JreBugWorkaround(array[index]);
    +        }
     
    -	  protected internal abstract char jreBugWorkaround(char ch);
    +        protected internal abstract char JreBugWorkaround(char ch);
     
    -	  public override char First()
    -	  {
    -		index = start;
    -		return Current();
    -	  }
    +        public override char First()
    +        {
    +            index = start;
    +            return Current();
    +        }
     
    -	  public override int BeginIndex
    -	  {
    -		  get
    -		  {
    -			return 0;
    -		  }
    -	  }
    +        public int BeginIndex
    +        {
    +            get
    +            {
    +                return 0;
    +            }
    +        }
     
    -	  public override int EndIndex
    -	  {
    -		  get
    -		  {
    -			return length;
    -		  }
    -	  }
    +        public int EndIndex
    +        {
    +            get
    +            {
    +                return length;
    +            }
    +        }
     
    -	  public override int Index
    -	  {
    -		  get
    -		  {
    -			return index - start;
    -		  }
    -	  }
    +        public int Index
    +        {
    +            get
    +            {
    +                return index - start;
    +            }
    +        }
     
    -	  public override char Last()
    -	  {
    -		index = (limit == start) ? limit : limit - 1;
    -		return current();
    -	  }
    +        public override int GetBeginIndex()
    +        {
    +            return 0;
    +        }
     
    -	  public override char Next()
    -	  {
    -		if (++index >= limit)
    -		{
    -		  index = limit;
    -		  return DONE;
    -		}
    -		else
    -		{
    -		  return current();
    -		}
    -	  }
    +        public override int GetEndIndex()
    +        {
    +            return length;
    +        }
     
    -	  public override char Previous()
    -	  {
    -		if (--index < start)
    -		{
    -		  index = start;
    -		  return DONE;
    -		}
    -		else
    -		{
    -		  return current();
    -		}
    -	  }
    +        public override int GetIndex()
    +        {
    +            return index - start;
    +        }
     
    -	  public override char SetIndex(int position)
    -	  {
    -		if (position < BeginIndex || position > EndIndex)
    -		{
    -		  throw new System.ArgumentException("Illegal Position: " + position);
    -		}
    -		index = start + position;
    -		return current();
    -	  }
     
    -	  public override CharArrayIterator Clone()
    -	  {
    -		try
    -		{
    -		  return (CharArrayIterator)base.clone();
    -		}
    -		catch (CloneNotSupportedException e)
    -		{
    -		  // CharacterIterator does not allow you to throw CloneNotSupported
    -		  throw new Exception(e);
    -		}
    -	  }
    +        public override char Last()
    +        {
    +            index = (limit == start) ? limit : limit - 1;
    +            return Current();
    +        }
     
    -	  /// <summary>
    -	  /// Create a new CharArrayIterator that works around JRE bugs
    -	  /// in a manner suitable for <seealso cref="BreakIterator#getSentenceInstance()"/>
    -	  /// </summary>
    -	  public static CharArrayIterator newSentenceInstance()
    -	  {
    -		if (HAS_BUGGY_BREAKITERATORS)
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper();
    -		}
    -		else
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper2();
    -		}
    -	  }
    +        public override char Next()
    +        {
    +            if (++index >= limit)
    +            {
    +                index = limit;
    +                return DONE;
    +            }
    +            else
    +            {
    +                return Current();
    +            }
    +        }
     
    -	  private class CharArrayIteratorAnonymousInnerClassHelper : CharArrayIterator
    -	  {
    -		  public CharArrayIteratorAnonymousInnerClassHelper()
    -		  {
    -		  }
    +        public override char Previous()
    +        {
    +            if (--index < start)
    +            {
    +                index = start;
    +                return DONE;
    +            }
    +            else
    +            {
    +                return Current();
    +            }
    +        }
     
    -			  // work around this for now by lying about all surrogates to 
    -			  // the sentence tokenizer, instead we treat them all as 
    -			  // SContinue so we won't break around them.
    -		  protected internal override char jreBugWorkaround(char ch)
    -		  {
    -			return ch >= 0xD800 && ch <= 0xDFFF ? 0x002C : ch;
    -		  }
    -	  }
    +        public override char SetIndex(int position)
    +        {
    +            if (position < BeginIndex || position > EndIndex)
    +            {
    +                throw new ArgumentException("Illegal Position: " + position);
    +            }
    +            index = start + position;
    +            return Current();
    +        }
     
    -	  private class CharArrayIteratorAnonymousInnerClassHelper2 : CharArrayIterator
    -	  {
    -		  public CharArrayIteratorAnonymousInnerClassHelper2()
    -		  {
    -		  }
    +        public override object Clone()
    +        {
    +            return this.MemberwiseClone();
    +        }
     
    -			  // no bugs
    -		  protected internal override char jreBugWorkaround(char ch)
    -		  {
    -			return ch;
    -		  }
    -	  }
    +        /// <summary>
    +        /// Create a new CharArrayIterator that works around JRE bugs
    +        /// in a manner suitable for <seealso cref="BreakIterator#getSentenceInstance()"/>
    +        /// </summary>
    +        public static CharArrayIterator NewSentenceInstance()
    +        {
    +            if (HAS_BUGGY_BREAKITERATORS)
    +            {
    +                return new CharArrayIteratorAnonymousInnerClassHelper();
    +            }
    +            else
    +            {
    +                return new CharArrayIteratorAnonymousInnerClassHelper2();
    +            }
    +        }
     
    -	  /// <summary>
    -	  /// Create a new CharArrayIterator that works around JRE bugs
    -	  /// in a manner suitable for <seealso cref="BreakIterator#getWordInstance()"/>
    -	  /// </summary>
    -	  public static CharArrayIterator newWordInstance()
    -	  {
    -		if (HAS_BUGGY_BREAKITERATORS)
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper3();
    -		}
    -		else
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper4();
    -		}
    -	  }
    +        private class CharArrayIteratorAnonymousInnerClassHelper : CharArrayIterator
    +        {
    +            // work around this for now by lying about all surrogates to 
    +            // the sentence tokenizer, instead we treat them all as 
    +            // SContinue so we won't break around them.
    +            protected internal override char JreBugWorkaround(char ch)
    +            {
    +                return (char)(ch >= 0xD800 && ch <= 0xDFFF ? 0x002C : ch);
    +            }
    +        }
    --- End diff --
    
    I don't think so, but I tried not to change the code too much when porting. There are also structures in there for using different classes based on Java 4 vs Java 5 that seemed irrelevant to us, but I kept them for the sake of keeping the code similar to the original. I will yank it and see what happens with the tests. Thanks!
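
For readers following the diff above, the surrogate workaround under discussion can be sketched in isolation. This is a minimal Java sketch (Java being the language the class is ported from); the class and method names are hypothetical, and only the surrogate range check and the U+002C replacement are taken from the diff.

```java
// Hypothetical holder class; only the range check and the replacement char come from the diff.
final class SurrogateWorkaround {
    // "Buggy" variant: report every UTF-16 surrogate to the break iterator as ','
    // so that it never places a boundary around half of a surrogate pair.
    static char lieAboutSurrogates(char ch) {
        return (ch >= 0xD800 && ch <= 0xDFFF) ? (char) 0x002C : ch;
    }

    // "No bugs" variant: hand the character through unchanged.
    static char passThrough(char ch) {
        return ch;
    }

    public static void main(String[] args) {
        // One BMP letter, one surrogate pair (U+1F600), one BMP letter.
        String text = "a\uD83D\uDE00b";
        StringBuilder seen = new StringBuilder();
        for (char ch : text.toCharArray()) {
            seen.append(lieAboutSurrogates(ch));
        }
        System.out.println(seen); // prints "a,,b"
    }
}
```

Only the characters reported to the break iterator change; the underlying buffer is untouched. If the break iterator in use handles surrogates correctly, the pass-through variant is enough, which is the case being tested by removing the workaround.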



Re: [GitHub] lucenenet pull request: Port CharArrayIterator

Posted by Laimonas Simutis <la...@gmail.com>.
FYI,  someone else's experience with using ICU from C#:

http://blog.phil-ritchie.net/idiomatic-prose/2013/08/04/using-icu-from-c/

On Fri, Dec 18, 2015 at 9:05 AM, eladmarg <gi...@git.apache.org> wrote:

> Github user eladmarg commented on the pull request:
>
>     https://github.com/apache/lucenenet/pull/157#issuecomment-165785182
>
>     @laimis - totally agree,
>     this release is already far behind the latest Java, and there are
> enough challenges to make it live (facets, analysis, tests and probably
> other hidden bugs we haven't discovered yet).
>     Done is better than perfect.

[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by eladmarg <gi...@git.apache.org>.
Github user eladmarg commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-165785182
  
    @laimis - totally agree, 
    this release is already far behind the latest Java, and there are enough challenges to make it live (facets, analysis, tests and probably other hidden bugs we haven't discovered yet).
    Done is better than perfect.



[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by laimis <gi...@git.apache.org>.
Github user laimis commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-165783222
  
    I had similar concerns to yours, @synhershko, yet at the same time felt like it was the best option for us. Some loose thoughts:
    
    - We all want to see some progress with the Lucene.Net port, and porting icu4j feels like a roadblock to that initiative. I would rather use icu4net and make headway with the Lucene port than take on another porting project at this stage. Perhaps we live with icu4net for as long as we can, but keep thinking about what will replace it down the road?
    
    - Just from a very rough look at icu4j, and BreakIterator for instance, it does not feel like a straightforward port. I could be wrong though. I am also not sure we can do that from a licensing perspective: all of their classes carry a comment with an IBM copyright. I have no clue about those types of things (see example here: http://source.icu-project.org/repos/icu/icu4j/trunk/main/classes/core/src/com/ibm/icu/text/BreakIterator.java)
    
    - icu4net is not active, true, but it is just a wrapper around the very active ICU4C library. I was able to take ICU4NET, compile it, etc., so we could fork that project and create our own NuGet packages and wrapper classes if we wanted to. That would also include building a 64-bit version of ICU4NET by packaging the 64-bit ICU4C libs. I have no experience with any of this, so it is a bit of an unknown.
    
    - The best part of ICU4C is that the classes it exposes have, so far at least, been a perfect match for Analysis at the API level. Porting CharArrayIterator was straightforward. I started porting SegmentingTokenizerBase in Utils just to see if I would run into any issues, and again, from the API perspective, I did not: it was straightforward to get to the point where it compiles (tests are failing, will figure out why :) ).
    
    It just feels like for the sake of making progress ICU4NET is the way to go. It allows us to port Lucene code and then we can yank ICU4NET out once we feel like we have a good alternative for it.
    
    What do you think?
    
    While you consider, I am continuing to figure out what to do with ICU4NET on the CI machine. It depends on the init.ps1 script being run, but nuget restore has a bug where it does not run it (http://jeffhandley.com/archive/2013/12/09/nuget-package-restore-misconceptions.aspx, search for the "init.ps1" paragraph)
    
    And @jpsullivan thanks for your input, appreciated. We don't have a lot of people around this so it is good to hear the opinion of others and have someone review the code, etc.



Re: [GitHub] lucenenet pull request: Port CharArrayIterator

Posted by Laimonas Simutis <la...@gmail.com>.
The issue with CI failing is that the icu4net nuget package depends on its
init.ps1 script being run, but nuget restore does not run that script (seems
like a bug with nuget). That script pulls the dependent ICU DLLs from the
ICU project site. Thinking about what to do here... I will probably hold off
on any solution until we finish discussing whether we want to use icu4net.

On Wed, Dec 16, 2015 at 10:07 PM, Laimonas Simutis <la...@gmail.com> wrote:

> Seems like build on CI is failing, taking a look...
>
> On Wed, Dec 16, 2015 at 8:42 PM, laimis <gi...@git.apache.org> wrote:
>
>> GitHub user laimis opened a pull request:
>>
>>     https://github.com/apache/lucenenet/pull/157
>>
>>     Port CharArrayIterator

Re: [GitHub] lucenenet pull request: Port CharArrayIterator

Posted by Laimonas Simutis <la...@gmail.com>.
Seems like build on CI is failing, taking a look...

On Wed, Dec 16, 2015 at 8:42 PM, laimis <gi...@git.apache.org> wrote:

> GitHub user laimis opened a pull request:
>
>     https://github.com/apache/lucenenet/pull/157
>
>     Port CharArrayIterator

[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by laimis <gi...@git.apache.org>.
Github user laimis commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-171626312
  
    @synhershko yeah, I went ahead and merged it to continue with the porting, as this was a building block for some of the next ports I was doing. I waited a bit longer, but after not hearing back I just went with it.
    
    Right now looking at the CharArrayMap / CharArraySet classes that appear to be used frequently in analysis.



Re: [GitHub] lucenenet pull request: Port CharArrayIterator

Posted by Laimonas Simutis <la...@gmail.com>.
Itamar,

I will go ahead and merge this one into master tomorrow evening since I
haven't heard from you in a while. I have some more converted / fixed tests
to add that are dependent on this PR, so it would be good to continue with
the progress. If we end up not wanting to use ICU4NET, we can yank it out
without much damage done - the code that was ported was kept pretty much
the same as the Java version.

Let me know if I should hold off.


On Sat, Dec 26, 2015 at 5:51 PM, laimis <gi...@git.apache.org> wrote:

> Github user laimis commented on the pull request:
>
>     https://github.com/apache/lucenenet/pull/157#issuecomment-167370019
>
>     I had some time to port some more code and got through
> SegmentingTokenizerBase and its tests. Since it builds on top of the
> CharArrayIterator stuff I went ahead and added it to this PR.

[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by laimis <gi...@git.apache.org>.
Github user laimis commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-167370019
  
    I had some time to port some more code and got through SegmentingTokenizerBase and its tests. Since it builds on top of the CharArrayIterator stuff, I went ahead and added it to this PR.



[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by synhershko <gi...@git.apache.org>.
Github user synhershko commented on the pull request:

    https://github.com/apache/lucenenet/pull/157#issuecomment-171588962
  
    Sorry, got caught up in other stuff. So was this merged? I'm not quite sure what happened here :)



[GitHub] lucenenet pull request: Port CharArrayIterator

Posted by synhershko <gi...@git.apache.org>.
Github user synhershko commented on a diff in the pull request:

    https://github.com/apache/lucenenet/pull/157#discussion_r47976547
  
    --- Diff: src/Lucene.Net.Analysis.Common/Analysis/Util/CharArrayIterator.cs ---
    @@ -1,276 +1,268 @@
     using System;
    +using ICU4NET;
     
    -namespace org.apache.lucene.analysis.util
    +namespace Lucene.Net.Analysis.Util
     {
     
    -	/*
    -	 * Licensed to the Apache Software Foundation (ASF) under one or more
    -	 * contributor license agreements.  See the NOTICE file distributed with
    -	 * this work for additional information regarding copyright ownership.
    -	 * The ASF licenses this file to You under the Apache License, Version 2.0
    -	 * (the "License"); you may not use this file except in compliance with
    -	 * the License.  You may obtain a copy of the License at
    -	 *
    -	 *     http://www.apache.org/licenses/LICENSE-2.0
    -	 *
    -	 * Unless required by applicable law or agreed to in writing, software
    -	 * distributed under the License is distributed on an "AS IS" BASIS,
    -	 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    -	 * See the License for the specific language governing permissions and
    -	 * limitations under the License.
    -	 */
    +    /*
    +     * Licensed to the Apache Software Foundation (ASF) under one or more
    +     * contributor license agreements.  See the NOTICE file distributed with
    +     * this work for additional information regarding copyright ownership.
    +     * The ASF licenses this file to You under the Apache License, Version 2.0
    +     * (the "License"); you may not use this file except in compliance with
    +     * the License.  You may obtain a copy of the License at
    +     *
    +     *     http://www.apache.org/licenses/LICENSE-2.0
    +     *
    +     * Unless required by applicable law or agreed to in writing, software
    +     * distributed under the License is distributed on an "AS IS" BASIS,
    +     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +     * See the License for the specific language governing permissions and
    +     * limitations under the License.
    +     */
     
     
    -	/// <summary>
    -	/// A CharacterIterator used internally for use with <seealso cref="BreakIterator"/>
    -	/// @lucene.internal
    -	/// </summary>
    -	public abstract class CharArrayIterator //: CharacterIterator
    -	{
    -	  private char[] array;
    -	  private int start;
    -	  private int index;
    -	  private int length;
    -	  private int limit;
    +    /// <summary>
    +    /// A CharacterIterator used internally for use with <seealso cref="BreakIterator"/>
    +    /// @lucene.internal
    +    /// </summary>
    +    public abstract class CharArrayIterator : CharacterIterator
    +    {
    +        private char[] array;
    +        private int start;
    +        private int index;
    +        private int length;
    +        private int limit;
     
    -	  public virtual char [] Text
    -	  {
    -		  get
    -		  {
    -			return array;
    -		  }
    -	  }
    +        public virtual char[] Text
    +        {
    +            get
    +            {
    +                return array;
    +            }
    +        }
     
    -	  public virtual int Start
    -	  {
    -		  get
    -		  {
    -			return start;
    -		  }
    -	  }
    +        public virtual int Start
    +        {
    +            get
    +            {
    +                return start;
    +            }
    +        }
     
    -	  public virtual int Length
    -	  {
    -		  get
    -		  {
    -			return length;
    -		  }
    -	  }
    +        public virtual int Length
    +        {
    +            get
    +            {
    +                return length;
    +            }
    +        }
     
    -	  /// <summary>
    -	  /// Set a new region of text to be examined by this iterator
    -	  /// </summary>
    -	  /// <param name="array"> text buffer to examine </param>
    -	  /// <param name="start"> offset into buffer </param>
    -	  /// <param name="length"> maximum length to examine </param>
    -	  public virtual void setText(char[] array, int start, int length)
    -	  {
    -		this.array = array;
    -		this.start = start;
    -		this.index = start;
    -		this.length = length;
    -		this.limit = start + length;
    -	  }
    +        /// <summary>
    +        /// Set a new region of text to be examined by this iterator
    +        /// </summary>
    +        /// <param name="array"> text buffer to examine </param>
    +        /// <param name="start"> offset into buffer </param>
    +        /// <param name="length"> maximum length to examine </param>
    +        public virtual void SetText(char[] array, int start, int length)
    +        {
    +            this.array = array;
    +            this.start = start;
    +            this.index = start;
    +            this.length = length;
    +            this.limit = start + length;
    +        }
     
    -	  public override char Current()
    -	  {
    -		return (index == limit) ? DONE : jreBugWorkaround(array[index]);
    -	  }
    +        public override char Current()
    +        {
    +            return (index == limit) ? DONE : JreBugWorkaround(array[index]);
    +        }
     
    -	  protected internal abstract char jreBugWorkaround(char ch);
    +        protected internal abstract char JreBugWorkaround(char ch);
     
    -	  public override char First()
    -	  {
    -		index = start;
    -		return Current();
    -	  }
    +        public override char First()
    +        {
    +            index = start;
    +            return Current();
    +        }
     
    -	  public override int BeginIndex
    -	  {
    -		  get
    -		  {
    -			return 0;
    -		  }
    -	  }
    +        public int BeginIndex
    +        {
    +            get
    +            {
    +                return 0;
    +            }
    +        }
     
    -	  public override int EndIndex
    -	  {
    -		  get
    -		  {
    -			return length;
    -		  }
    -	  }
    +        public int EndIndex
    +        {
    +            get
    +            {
    +                return length;
    +            }
    +        }
     
    -	  public override int Index
    -	  {
    -		  get
    -		  {
    -			return index - start;
    -		  }
    -	  }
    +        public int Index
    +        {
    +            get
    +            {
    +                return index - start;
    +            }
    +        }
     
    -	  public override char Last()
    -	  {
    -		index = (limit == start) ? limit : limit - 1;
    -		return current();
    -	  }
    +        public override int GetBeginIndex()
    +        {
    +            return 0;
    +        }
     
    -	  public override char Next()
    -	  {
    -		if (++index >= limit)
    -		{
    -		  index = limit;
    -		  return DONE;
    -		}
    -		else
    -		{
    -		  return current();
    -		}
    -	  }
    +        public override int GetEndIndex()
    +        {
    +            return length;
    +        }
     
    -	  public override char Previous()
    -	  {
    -		if (--index < start)
    -		{
    -		  index = start;
    -		  return DONE;
    -		}
    -		else
    -		{
    -		  return current();
    -		}
    -	  }
    +        public override int GetIndex()
    +        {
    +            return index - start;
    +        }
     
    -	  public override char SetIndex(int position)
    -	  {
    -		if (position < BeginIndex || position > EndIndex)
    -		{
    -		  throw new System.ArgumentException("Illegal Position: " + position);
    -		}
    -		index = start + position;
    -		return current();
    -	  }
     
    -	  public override CharArrayIterator Clone()
    -	  {
    -		try
    -		{
    -		  return (CharArrayIterator)base.clone();
    -		}
    -		catch (CloneNotSupportedException e)
    -		{
    -		  // CharacterIterator does not allow you to throw CloneNotSupported
    -		  throw new Exception(e);
    -		}
    -	  }
    +        public override char Last()
    +        {
    +            index = (limit == start) ? limit : limit - 1;
    +            return Current();
    +        }
     
    -	  /// <summary>
    -	  /// Create a new CharArrayIterator that works around JRE bugs
    -	  /// in a manner suitable for <seealso cref="BreakIterator#getSentenceInstance()"/>
    -	  /// </summary>
    -	  public static CharArrayIterator newSentenceInstance()
    -	  {
    -		if (HAS_BUGGY_BREAKITERATORS)
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper();
    -		}
    -		else
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper2();
    -		}
    -	  }
    +        public override char Next()
    +        {
    +            if (++index >= limit)
    +            {
    +                index = limit;
    +                return DONE;
    +            }
    +            else
    +            {
    +                return Current();
    +            }
    +        }
     
    -	  private class CharArrayIteratorAnonymousInnerClassHelper : CharArrayIterator
    -	  {
    -		  public CharArrayIteratorAnonymousInnerClassHelper()
    -		  {
    -		  }
    +        public override char Previous()
    +        {
    +            if (--index < start)
    +            {
    +                index = start;
    +                return DONE;
    +            }
    +            else
    +            {
    +                return Current();
    +            }
    +        }
     
    -			  // work around this for now by lying about all surrogates to 
    -			  // the sentence tokenizer, instead we treat them all as 
    -			  // SContinue so we won't break around them.
    -		  protected internal override char jreBugWorkaround(char ch)
    -		  {
    -			return ch >= 0xD800 && ch <= 0xDFFF ? 0x002C : ch;
    -		  }
    -	  }
    +        public override char SetIndex(int position)
    +        {
    +            if (position < BeginIndex || position > EndIndex)
    +            {
    +                throw new ArgumentException("Illegal Position: " + position);
    +            }
    +            index = start + position;
    +            return Current();
    +        }
     
    -	  private class CharArrayIteratorAnonymousInnerClassHelper2 : CharArrayIterator
    -	  {
    -		  public CharArrayIteratorAnonymousInnerClassHelper2()
    -		  {
    -		  }
    +        public override object Clone()
    +        {
    +            return this.MemberwiseClone();
    +        }
     
    -			  // no bugs
    -		  protected internal override char jreBugWorkaround(char ch)
    -		  {
    -			return ch;
    -		  }
    -	  }
    +        /// <summary>
    +        /// Create a new CharArrayIterator that works around JRE bugs
    +        /// in a manner suitable for <seealso cref="BreakIterator#getSentenceInstance()"/>
    +        /// </summary>
    +        public static CharArrayIterator NewSentenceInstance()
    +        {
    +            if (HAS_BUGGY_BREAKITERATORS)
    +            {
    +                return new CharArrayIteratorAnonymousInnerClassHelper();
    +            }
    +            else
    +            {
    +                return new CharArrayIteratorAnonymousInnerClassHelper2();
    +            }
    +        }
     
    -	  /// <summary>
    -	  /// Create a new CharArrayIterator that works around JRE bugs
    -	  /// in a manner suitable for <seealso cref="BreakIterator#getWordInstance()"/>
    -	  /// </summary>
    -	  public static CharArrayIterator newWordInstance()
    -	  {
    -		if (HAS_BUGGY_BREAKITERATORS)
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper3();
    -		}
    -		else
    -		{
    -		  return new CharArrayIteratorAnonymousInnerClassHelper4();
    -		}
    -	  }
    +        private class CharArrayIteratorAnonymousInnerClassHelper : CharArrayIterator
    +        {
    +            // work around this for now by lying about all surrogates to 
    +            // the sentence tokenizer, instead we treat them all as 
    +            // SContinue so we won't break around them.
    +            protected internal override char JreBugWorkaround(char ch)
    +            {
    +                return (char)(ch >= 0xD800 && ch <= 0xDFFF ? 0x002C : ch);
    +            }
    +        }
    --- End diff --
    
    Do we really need this JRE bug workaround on the CLR?

