You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "jpountz (via GitHub)" <gi...@apache.org> on 2023/07/25 13:59:21 UTC

[GitHub] [lucene] jpountz commented on a diff in pull request #12455: Clean up writing String to ByteBuffersDataOutput

jpountz commented on code in PR #12455:
URL: https://github.com/apache/lucene/pull/12455#discussion_r1273592164


##########
lucene/core/src/java/org/apache/lucene/store/ByteBuffersDataOutput.java:
##########
@@ -549,66 +540,22 @@ private static int computeBlockSizeBitsFor(long bytes) {
     return blockBits;
   }
 
-  // TODO: move this block-based conversion to UnicodeUtil.
-
-  private static final long HALF_SHIFT = 10;
-  private static final int SURROGATE_OFFSET =
-      Character.MIN_SUPPLEMENTARY_CODE_POINT
-          - (UnicodeUtil.UNI_SUR_HIGH_START << HALF_SHIFT)
-          - UnicodeUtil.UNI_SUR_LOW_START;
-
-  /** A consumer-based UTF16-UTF8 encoder (writes the input string in smaller buffers.). */
-  private static int UTF16toUTF8(
-      final CharSequence s,
-      final int offset,
-      final int length,
-      byte[] buf,
-      IntConsumer bufferFlusher) {
-    int utf8Len = 0;
-    int j = 0;
-    for (int i = offset, end = offset + length; i < end; i++) {
-      final int chr = (int) s.charAt(i);
-
-      if (j + 4 >= buf.length) {
-        bufferFlusher.accept(j);
-        utf8Len += j;
-        j = 0;
-      }
-
-      if (chr < 0x80) buf[j++] = (byte) chr;
-      else if (chr < 0x800) {
-        buf[j++] = (byte) (0xC0 | (chr >> 6));
-        buf[j++] = (byte) (0x80 | (chr & 0x3F));
-      } else if (chr < 0xD800 || chr > 0xDFFF) {
-        buf[j++] = (byte) (0xE0 | (chr >> 12));
-        buf[j++] = (byte) (0x80 | ((chr >> 6) & 0x3F));
-        buf[j++] = (byte) (0x80 | (chr & 0x3F));
-      } else {
-        // A surrogate pair. Confirm valid high surrogate.
-        if (chr < 0xDC00 && (i < end - 1)) {
-          int utf32 = (int) s.charAt(i + 1);
-          // Confirm valid low surrogate and write pair.
-          if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
-            utf32 = (chr << 10) + utf32 + SURROGATE_OFFSET;
-            i++;
-            buf[j++] = (byte) (0xF0 | (utf32 >> 18));
-            buf[j++] = (byte) (0x80 | ((utf32 >> 12) & 0x3F));
-            buf[j++] = (byte) (0x80 | ((utf32 >> 6) & 0x3F));
-            buf[j++] = (byte) (0x80 | (utf32 & 0x3F));
-            continue;
-          }
-        }
-        // Replace unpaired surrogate or out-of-order low surrogate
-        // with substitution character.
-        buf[j++] = (byte) 0xEF;
-        buf[j++] = (byte) 0xBF;
-        buf[j++] = (byte) 0xBD;
+  /** Writes a long string in chunks */
+  private void writeLongString(final String s) throws IOException {
+    final int byteLen = UnicodeUtil.calcUTF16toUTF8Length(s, 0, s.length());
+    writeVInt(byteLen);
+    final byte[] buf =
+        new byte[Math.min(byteLen, UnicodeUtil.MAX_UTF8_BYTES_PER_CHAR * MAX_CHARS_PER_WINDOW)];
+    for (int i = 0, end = s.length(); i < end; ) {
+      int step = Math.min(end - i, MAX_CHARS_PER_WINDOW);
+      // don't split a surrogate pair, since the pair together takes 2 bytes per char but have space
+      // for 3 bytes per char it's safe to encode one more char here

Review Comment:
   It's true, but at the same time it's a bit counter-intuitive to me to have this MAX_CHARS_PER_WINDOW constant and yet go over it at times. Could we do `step--` below or `Math.min(end - i, MAX_CHARS_PER_WINDOW-1)` above?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org