You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucenenet.apache.org by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com> on 2017/03/22 14:58:48 UTC

Proposal: UnicodeUtils.ToCharArray implementation

When the dust settles, I propose to rewrite UnicodeUtils.ToCharArray as follows:

        public static char[] ToCharArray(int[] codePoints, int offset, int count)
        {
            if (count < 0)
            {
                throw new System.ArgumentException();
            }
                           // as a first approximation, assume each codepoint is 1 character
            char[] chars = new char[count];
            int w = 0;
            for (int r = offset, e = offset + count; r < e; ++r)
            {
                int cp = codePoints[r];
                if (cp < 0 || cp > 0x10ffff)
                {
                    throw new System.ArgumentException();
                }
                                  if (cp < 0x010000)
                  {
                        chars[w++] = (char)cp;
                  }
                  else
                  {
                        chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_));
                                                       // if we need more room, add enough to hold 2-character codepoints for the rest of
                                                       // the string. Normally, this resize operation will happen at most once.
                                                       if (w >= chars.Length)
                                                              Array.Resize(ref chars, chars.Length + (e - r) * 2 - 1);
                                                chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_));
                  }
            }
                           // resize to the exact length: it's slightly faster to check if the resize is needed
                           if (w != chars.Length)
                                  Array.Resize(ref chars, w);
            return chars;
        }

This avoids exception overhead and at least one heap allocation.  If no code points generate 2 characters, the method doesn't allocate anything extra than the result array.


Vincent


RE: Proposal: UnicodeUtils.ToCharArray implementation

Posted by "Van Den Berghe, Vincent" <Vi...@bvdinfo.com>.
Hello Shad,

Yes, your solution is better than mine. It certainly gets rid of these annoying exceptions.
I'm still new to the "run lucene tests", so I admit I've not run the entire suite every time I change something. Part of the reason is that I'm using my machine for other work in the meantime. But a significant part of the reason is pure laziness and egoism, I'll admit it.

I'll better my life, I promise.

Vincent

-----Original Message-----
From: Shad Storhaug [mailto:shad@shadstorhaug.com] 
Sent: Friday, March 24, 2017 3:21 AM
To: Van Den Berghe, Vincent <Vi...@bvdinfo.com>
Cc: dev@lucenenet.apache.org
Subject: RE: Proposal: UnicodeUtils.ToCharArray implementation

Vincent,

I put this in, but was getting exceptions because the array overflowed in the Lucene.Net.Misc.Util.Fst.TestFSTsMisc.TestRandomWords() test.

I ended up changing it since it was highly optimized for non-surrogate pairs, but if using surrogate pairs it was copying arrays over and over to resize the array on every iteration in a tight loop. So, to compromise, we just double the count in cases where it is 1024 chars or less and do a loop to count the actual chars to set the array size exactly if it is over that. Running all of the tests in Analysis.Common used to take 14-16 minutes, now it takes 10-11 minutes. Now it looks like this:

        public static char[] ToCharArray(int[] codePoints, int offset, int count)
        {
            if (count < 0)
            {
                throw new System.ArgumentException();
            }
            int countThreashhold = 1024; // If the number of chars exceeds this, we count them instead of allocating count * 2
            // LUCENENET: as a first approximation, assume each codepoint 
            // is 2 characters (since it cannot be longer than this)
            int arrayLength = count * 2;
            // LUCENENET: if we go over the threashhold, count the number of 
            // chars we will need so we can allocate the precise amount of memory
            if (count > countThreashhold)
            {
                arrayLength = 0;
                for (int r = offset, e = offset + count; r < e; ++r)
                {
                    arrayLength += codePoints[r] < 0x010000 ? 1 : 2;
                }
                if (arrayLength < 1)
                {
                    arrayLength = count * 2;
                }
            }
            // Initialize our array to our exact or oversized length.
            // It is now safe to assume we have enough space for all of the characters.
            char[] chars = new char[arrayLength];
            int w = 0;
            for (int r = offset, e = offset + count; r < e; ++r)
            {
                int cp = codePoints[r];
                if (cp < 0 || cp > 0x10ffff)
                {
                    throw new System.ArgumentException();
                }
                if (cp < 0x010000)
                {
                    chars[w++] = (char)cp;
                }
                else
                {
                    chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_));
                    chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_));
                }
            }

            var result = new char[w];
            Array.Copy(chars, result, w);
            return result;
        }

Anyway, if you think we need to adjust the threshold or if you have any other suggestions, let me know.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:59 PM
To: dev@lucenenet.apache.org
Subject: Proposal: UnicodeUtils.ToCharArray implementation

When the dust settles, I propose to rewrite UnicodeUtils.ToCharArray as follows:

        public static char[] ToCharArray(int[] codePoints, int offset, int count)
        {
            if (count < 0)
            {
                throw new System.ArgumentException();
            }
                           // as a first approximation, assume each codepoint is 1 character
            char[] chars = new char[count];
            int w = 0;
            for (int r = offset, e = offset + count; r < e; ++r)
            {
                int cp = codePoints[r];
                if (cp < 0 || cp > 0x10ffff)
                {
                    throw new System.ArgumentException();
                }
                                  if (cp < 0x010000)
                  {
                        chars[w++] = (char)cp;
                  }
                  else
                  {
                        chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_));
                                                       // if we need more room, add enough to hold 2-character codepoints for the rest of
                                                       // the string. Normally, this resize operation will happen at most once.
                                                       if (w >= chars.Length)
                                                              Array.Resize(ref chars, chars.Length + (e - r) * 2 - 1);
                                                chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_));
                  }
            }
                           // resize to the exact length: it's slightly faster to check if the resize is needed
                           if (w != chars.Length)
                                  Array.Resize(ref chars, w);
            return chars;
        }

This avoids exception overhead and at least one heap allocation.  If no code points generate 2 characters, the method doesn't allocate anything extra than the result array.


Vincent


RE: Proposal: UnicodeUtils.ToCharArray implementation

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Vincent,

I put this in, but was getting exceptions because the array overflowed in the Lucene.Net.Misc.Util.Fst.TestFSTsMisc.TestRandomWords() test.

I ended up changing it since it was highly optimized for non-surrogate pairs, but if using surrogate pairs it was copying arrays over and over to resize the array on every iteration in a tight loop. So, to compromise, we just double the count in cases where it is 1024 chars or less and do a loop to count the actual chars to set the array size exactly if it is over that. Running all of the tests in Analysis.Common used to take 14-16 minutes, now it takes 10-11 minutes. Now it looks like this:

        public static char[] ToCharArray(int[] codePoints, int offset, int count)
        {
            if (count < 0)
            {
                throw new System.ArgumentException();
            }
            int countThreashhold = 1024; // If the number of chars exceeds this, we count them instead of allocating count * 2
            // LUCENENET: as a first approximation, assume each codepoint 
            // is 2 characters (since it cannot be longer than this)
            int arrayLength = count * 2;
            // LUCENENET: if we go over the threashhold, count the number of 
            // chars we will need so we can allocate the precise amount of memory
            if (count > countThreashhold)
            {
                arrayLength = 0;
                for (int r = offset, e = offset + count; r < e; ++r)
                {
                    arrayLength += codePoints[r] < 0x010000 ? 1 : 2;
                }
                if (arrayLength < 1)
                {
                    arrayLength = count * 2;
                }
            }
            // Initialize our array to our exact or oversized length.
            // It is now safe to assume we have enough space for all of the characters.
            char[] chars = new char[arrayLength];
            int w = 0;
            for (int r = offset, e = offset + count; r < e; ++r)
            {
                int cp = codePoints[r];
                if (cp < 0 || cp > 0x10ffff)
                {
                    throw new System.ArgumentException();
                }
                if (cp < 0x010000)
                {
                    chars[w++] = (char)cp;
                }
                else
                {
                    chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_));
                    chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_));
                }
            }

            var result = new char[w];
            Array.Copy(chars, result, w);
            return result;
        }

Anyway, if you think we need to adjust the threshold or if you have any other suggestions, let me know.

Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: Van Den Berghe, Vincent [mailto:Vincent.VanDenBerghe@bvdinfo.com] 
Sent: Wednesday, March 22, 2017 9:59 PM
To: dev@lucenenet.apache.org
Subject: Proposal: UnicodeUtils.ToCharArray implementation

When the dust settles, I propose to rewrite UnicodeUtils.ToCharArray as follows:

        public static char[] ToCharArray(int[] codePoints, int offset, int count)
        {
            if (count < 0)
            {
                throw new System.ArgumentException();
            }
                           // as a first approximation, assume each codepoint is 1 character
            char[] chars = new char[count];
            int w = 0;
            for (int r = offset, e = offset + count; r < e; ++r)
            {
                int cp = codePoints[r];
                if (cp < 0 || cp > 0x10ffff)
                {
                    throw new System.ArgumentException();
                }
                                  if (cp < 0x010000)
                  {
                        chars[w++] = (char)cp;
                  }
                  else
                  {
                        chars[w++] = (char)(LEAD_SURROGATE_OFFSET_ + (cp >> LEAD_SURROGATE_SHIFT_));
                                                       // if we need more room, add enough to hold 2-character codepoints for the rest of
                                                       // the string. Normally, this resize operation will happen at most once.
                                                       if (w >= chars.Length)
                                                              Array.Resize(ref chars, chars.Length + (e - r) * 2 - 1);
                                                chars[w++] = (char)(TRAIL_SURROGATE_MIN_VALUE + (cp & TRAIL_SURROGATE_MASK_));
                  }
            }
                           // resize to the exact length: it's slightly faster to check if the resize is needed
                           if (w != chars.Length)
                                  Array.Resize(ref chars, w);
            return chars;
        }

This avoids exception overhead and at least one heap allocation.  If no code points generate 2 characters, the method doesn't allocate anything extra than the result array.


Vincent