Posted to dev@lucenenet.apache.org by GitBox <gi...@apache.org> on 2022/10/27 08:10:31 UTC

[GitHub] [lucenenet] NightOwl888 opened a new pull request, #725: PERFORMANCE: Reduced char[] and string allocations (mostly in analysis)

NightOwl888 opened a new pull request, #725:
URL: https://github.com/apache/lucenenet/pull/725

- `Lucene.Net.Document.CompressionTools::CompressString()`: Eliminated unnecessary `ToCharArray()` allocation
- `Lucene.Net.Codecs.SimpleText.SimpleTextUtil::Write()`: Removed unnecessary `ToCharArray()` allocation
- `Lucene.Net.Analysis.CharFilters.HTMLStripCharFilter`: Removed allocation during parse of hexadecimal integers by using `J2N.Numerics.Int32.TryParse()` overloads to specify index, length and radix.
- Added a `CharArrayFormatter` struct that defers allocating the message string for `Debugging.Assert()` until an assertion actually fails.
- Added a `maxStackByteLimit` system property for raising or lowering the byte threshold above which buffers are allocated on the heap instead of the stack.
- `Lucene.Net.Analysis.Miscellaneous.StemmerOverrideFilter.Builder`: Added overloads of `Add()` for `char[]` and `ICharSequence`. Added guard clauses. Modified to use a stack-allocated `Span<char>` when the required size is under the `maxStackByteLimit` setting.
- `Lucene.Net.Util.UnicodeUtil`: Added an overload of `UTF16toUTF8()` for `Span<T>` source to `BytesRef` destination. Added documentation and guard clauses. Renamed `s` parameter to `source` to be consistent across all overloads.
- `Lucene.Net.Analysis.Util.CharacterUtils`: Use spans and stackalloc to reduce heap allocations when lowercasing.
- `Lucene.Net.Util.TestUnicodeUtil::TestUTF8toUTF32()`: Added tests for `ICharSequence` and `char[]` overloads, changed the original test to test `string` instead of `char[]`.
- `Lucene.Net.Analysis.Util.SegmentingTokenizerBase`: Removed unnecessary string allocations that were added during the port due to missing APIs that are now available.
- `Lucene.Net.Analysis.Ja.GraphvizFormatter`: Removed unnecessary `surfaceForm` variable string allocation.
- `Lucene.Net.Analysis.In.IndicNormalizer`: Replaced the static constructor with an inline `LoadScripts()` method. Moved the `scripts` field so that `decompositions` is initialized first.
- `Lucene.Net.Analysis.In.IndicNormalizer`: Refactored `ScriptData` from using `Dictionary<Regex, ScriptData>` to using `List<ScriptData>`, which eliminates an unnecessary hashtable lookup. Uses static fields for `unknownScript` and `[ThreadStatic] previousScriptData` to cache the last script seen, which speeds up character script matching.
- `Lucene.Net.Analysis.Th.ThaiWordBreaker`: Removed unnecessary string allocations and concatenation. Use `CharsRef` to reuse the same memory. Removed `Regex` and replaced with `UnicodeSet` to detect Thai code points, since the latter doesn't require converting to a string to detect a match.
- `Lucene.Net.Analysis.Ga.IrishLowerCaseFilter`: Use stack and spans to reduce allocations and improve throughput.
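
Several of the items above share one pattern: use a stack-allocated `Span<char>` for small inputs and fall back to the heap above a byte threshold. A minimal sketch of that pattern follows. It is not the code from this PR: the `StackLimitSketch` class, the `ToLowerInvariantCopy` method, and the hard-coded 2048-byte limit are hypothetical names and values chosen for illustration, whereas the PR reads its limit from the `maxStackByteLimit` system property.

```csharp
using System;

public static class StackLimitSketch
{
    // Hypothetical hard-coded limit for illustration only; the PR describes
    // a configurable `maxStackByteLimit` system property instead.
    private const int MaxStackByteLimit = 2048;

    public static string ToLowerInvariantCopy(ReadOnlySpan<char> source)
    {
        // Inputs at or under the limit use stack memory (no GC allocation);
        // larger inputs fall back to a heap-allocated array.
        Span<char> buffer = source.Length * sizeof(char) <= MaxStackByteLimit
            ? stackalloc char[source.Length]
            : new char[source.Length];

        // Lowercase into the buffer without creating an intermediate string.
        int written = source.ToLowerInvariant(buffer);

        // The only unavoidable allocation is the returned string itself.
        return buffer.Slice(0, written).ToString();
    }
}
```

For example, `StackLimitSketch.ToLowerInvariantCopy("HTMLStrip".AsSpan())` takes the stack path and returns `"htmlstrip"`. Keeping the threshold small matters because stack space is limited, and a `stackalloc` that exceeds it terminates the process rather than throwing a catchable exception.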

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@lucenenet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [lucenenet] NightOwl888 merged pull request #725: PERFORMANCE: Reduced char[] and string allocations (mostly in analysis)

Posted by GitBox <gi...@apache.org>.

NightOwl888 merged PR #725:
URL: https://github.com/apache/lucenenet/pull/725

