Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/10 15:33:53 UTC

[GitHub] [lucene] jpountz opened a new issue, #12071: Can we better take advantage of compact strings?

jpountz opened a new issue, #12071:
URL: https://github.com/apache/lucene/issues/12071

   ### Description
   
   We spend a non-negligible amount of time on UTF-16 / UTF-8 conversions using our own `UnicodeUtil`, e.g. via the `BytesRef(String)` constructor. But since the introduction of compact strings, `String#getBytes(StandardCharsets.UTF_8)` has some fast paths, e.g. if none of the string's bytes has its highest bit set, then the string is pure ASCII and its backing bytes are already its UTF-8 representation.
   
   I ran a quick microbenchmark that suggests that `String#getBytes` can indeed be significantly faster on ASCII strings:
    - `charsetEncoder` leverages `StandardCharsets.UTF_8.encode(state.input)`.
    - `stringGetBytes` leverages `String#getBytes(StandardCharsets.UTF_8)`.
    - `unicodeUtil` leverages `UnicodeUtil#UTF16toUTF8`.
   
   ```
   Benchmark                                    (input)   Mode  Cnt    Score    Error   Units
   ConversionBenchmark.charsetEncoder                 a  thrpt    5   36.890 ±  0.572  ops/us
   ConversionBenchmark.charsetEncoder  abcdefghijklmnop  thrpt    5   18.717 ±  1.367  ops/us
   ConversionBenchmark.charsetEncoder         recherché  thrpt    5   10.098 ±  0.328  ops/us
   ConversionBenchmark.stringGetBytes                 a  thrpt    5  142.186 ± 18.849  ops/us
   ConversionBenchmark.stringGetBytes  abcdefghijklmnop  thrpt    5  111.259 ±  2.203  ops/us
   ConversionBenchmark.stringGetBytes         recherché  thrpt    5   53.565 ±  0.483  ops/us
   ConversionBenchmark.unicodeUtil                    a  thrpt    5  103.123 ±  3.970  ops/us
   ConversionBenchmark.unicodeUtil     abcdefghijklmnop  thrpt    5   54.223 ±  1.342  ops/us
   ConversionBenchmark.unicodeUtil            recherché  thrpt    5   58.166 ±  1.504  ops/us
   ```
   
   Yet switching from `UnicodeUtil` to `String#getBytes` cannot be done transparently because they use different replacement characters for unpaired surrogates. So I wonder if we have options for leveraging `String#getBytes` internally to make things a bit faster, or if this sort of thing should be left to applications built on top of Lucene, e.g. by using the `StringField(String, BytesRef, Store)` constructor instead of the `StringField(String, String, Store)` constructor and doing the UTF-8 conversion themselves with `String#getBytes`.
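
The replacement-character difference can be observed on the JDK side with a short sketch. It only uses `String#getBytes`; the class and helper names (`ReplacementDemo`, `jdkUtf8`, `hex`) are hypothetical. The JDK replaces an unpaired surrogate with a single `?` byte (0x3F). `UnicodeUtil` is understood to emit U+FFFD instead; that part is an assumption here and is not exercised, since the sketch avoids a Lucene dependency.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene.
final class ReplacementDemo {

    /** Encodes with the JDK; malformed input is replaced by the charset's default replacement bytes. */
    static byte[] jdkUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Renders bytes as upper-case hex so the replacement byte is easy to spot. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unpaired high surrogate between two ASCII chars.
        System.out.println(hex(jdkUtf8("a\uD800b"))); // prints 613F62 ('?' in the middle)
        // What the bytes look like when U+FFFD is the replacement (assumed UnicodeUtil behavior).
        System.out.println(hex(jdkUtf8("a\uFFFDb"))); // prints 61EFBFBD62
    }
}
```

The two outputs differ in length as well as content, so a byte-for-byte switch would change the indexed terms for malformed input.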
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on issue #12071: Can we better take advantage of compact strings?

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #12071:
URL: https://github.com/apache/lucene/issues/12071#issuecomment-1379313710

   Nor does `UnicodeUtil` allocate anything. That's another problem with `String.getBytes`: it forces an allocation too.
   
   Sorry, I don't see anything here. If you want to speed up `UnicodeUtil` conversions, pressure OpenJDK to release their Vector API.


[GitHub] [lucene] msokolov commented on issue #12071: Can we better take advantage of compact strings?

Posted by GitBox <gi...@apache.org>.
msokolov commented on issue #12071:
URL: https://github.com/apache/lucene/issues/12071#issuecomment-1379144309

   I wonder if we could update `UnicodeUtil` to use `getBytes` internally? It could check the size of the byte array, and if it is equal to the length of the string, then just return it. Otherwise it could scan the string for the replacement char and update it in place before returning, in order to maintain backward compatibility. Not sure if it would give back the gains in those cases though?
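
A rough sketch of a variant of this idea, using only JDK calls. `Utf8FastPath`, `toUtf8` and `indexOfUnpairedSurrogate` are hypothetical names, not Lucene API, and U+FFFD is assumed to be the replacement that `UnicodeUtil` emits. Two details shape the sketch: `getBytes` encodes an unpaired surrogate as a single `?` byte, so a byte length equal to the string length does not by itself prove the input was ASCII, and `?` is also a legitimate character, so the output cannot be patched by scanning the bytes. The sketch therefore scans the chars for unpaired surrogates and only rewrites the input on that rare path. Whether this keeps the gains would need measuring.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene.
final class Utf8FastPath {

    /** Returns the index of the first unpaired surrogate in s, or -1 if s is well-formed UTF-16. */
    static int indexOfUnpairedSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low surrogate
                } else {
                    return i;
                }
            } else if (Character.isLowSurrogate(c)) {
                return i;
            }
        }
        return -1;
    }

    /** Encodes s as UTF-8, replacing unpaired surrogates with U+FFFD instead of the JDK's '?'. */
    static byte[] toUtf8(String s) {
        int bad = indexOfUnpairedSurrogate(s);
        if (bad < 0) {
            // Common case: well-formed input, so the JDK output needs no patching.
            return s.getBytes(StandardCharsets.UTF_8);
        }
        // Rare case: substitute U+FFFD for each unpaired surrogate, then let the JDK encode.
        char[] chars = s.toCharArray();
        for (int i = bad; i < chars.length; i++) {
            char c = chars[i];
            if (Character.isHighSurrogate(c)
                    && i + 1 < chars.length
                    && Character.isLowSurrogate(chars[i + 1])) {
                i++; // keep valid pairs untouched
            } else if (Character.isSurrogate(c)) {
                chars[i] = '\uFFFD';
            }
        }
        return new String(chars).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toUtf8("abc").length);       // prints 3
        System.out.println(toUtf8("a\uD800b").length);  // prints 5 (1 + 3 + 1)
    }
}
```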


[GitHub] [lucene] rmuir commented on issue #12071: Can we better take advantage of compact strings?

Posted by GitBox <gi...@apache.org>.
rmuir commented on issue #12071:
URL: https://github.com/apache/lucene/issues/12071#issuecomment-1379311821

   There seems to be some confusion: the purpose of `UnicodeUtil` is to avoid allocating a String in the first place. It doesn't use String, hence `getBytes` is not really relevant.
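
To illustrate the allocation point, here is a simplified sketch of encoding straight from a `char[]` into a caller-supplied, reusable `byte[]`, with no `String` and no per-call allocation. This is not Lucene's code: `ReusableUtf8Encoder`, `encode` and `maxUtf8Length` are hypothetical names, and replacing unpaired surrogates with U+FFFD is an assumption about what `UnicodeUtil` does.

```java
// Hypothetical encoder in the spirit of a buffer-reusing API; not part of Lucene.
final class ReusableUtf8Encoder {

    /** Worst case is 3 bytes per UTF-16 char (a surrogate pair yields 4 bytes for 2 chars). */
    static int maxUtf8Length(int charCount) {
        return 3 * charCount;
    }

    /**
     * Encodes src[offset, offset + length) as UTF-8 into out, starting at index 0.
     * The caller must size out with maxUtf8Length(length). Returns the number of bytes written.
     */
    static int encode(char[] src, int offset, int length, byte[] out) {
        int end = offset + length;
        int o = 0;
        for (int i = offset; i < end; i++) {
            int c = src[i];
            if (c < 0x80) {
                out[o++] = (byte) c; // ASCII: one byte
            } else if (c < 0x800) {
                out[o++] = (byte) (0xC0 | (c >> 6));
                out[o++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate((char) c)
                    && i + 1 < end
                    && Character.isLowSurrogate(src[i + 1])) {
                // Valid surrogate pair: one supplementary code point, four bytes.
                int cp = Character.toCodePoint((char) c, src[++i]);
                out[o++] = (byte) (0xF0 | (cp >> 18));
                out[o++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                out[o++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                out[o++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                if (Character.isSurrogate((char) c)) {
                    c = 0xFFFD; // unpaired surrogate: substitute U+FFFD (assumed behavior)
                }
                out[o++] = (byte) (0xE0 | (c >> 12));
                out[o++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                out[o++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return o;
    }

    public static void main(String[] args) {
        // One scratch buffer reused across all inputs; \u00E9 is 'é' in "recherché".
        byte[] scratch = new byte[maxUtf8Length(16)];
        String[] inputs = {"a", "abcdefghijklmnop", "recherch\u00E9"};
        for (String input : inputs) {
            char[] chars = input.toCharArray(); // only to build demo input; real callers already hold a char[]
            System.out.println(encode(chars, 0, chars.length, scratch)); // prints 1, 16, 10
        }
    }
}
```

By contrast, `String#getBytes` needs a `String` as input and returns a freshly allocated array on every call.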

