Posted to issues@lucene.apache.org by "Dawid Weiss (Jira)" <ji...@apache.org> on 2020/08/12 08:34:00 UTC

[jira] [Commented] (LUCENE-9457) Why is Kuromoji tokenization throughput bimodal?

    [ https://issues.apache.org/jira/browse/LUCENE-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17176164#comment-17176164 ] 

Dawid Weiss commented on LUCENE-9457:
-------------------------------------

bq. It could be hotspot noise maybe?  

Could be. Or it could be something else running in the background? It'd be good to monitor background CPU activity on the benchmark machine while these benchmarks are running. I'm not much of a sysop, though, so I can't help much here.
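
One possible way to watch for background activity from inside the benchmark JVM is sketched below. It is only an illustration, not part of luceneutil: it assumes a HotSpot/OpenJDK JVM where the platform bean can be cast to com.sun.management.OperatingSystemMXBean, and the class name BackgroundCpuSampler, the 200 ms sampling interval, and the busy-loop workload are hypothetical placeholders for the real tokenization run. The sampler subtracts this JVM's CPU share from the system-wide share, so what remains approximates CPU used by other processes.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of luceneutil: samples CPU load from other processes.
public class BackgroundCpuSampler {

  /** CPU share (0..1) used by everything except this JVM; -1 if a reading is unavailable. */
  public static double backgroundLoad(double systemLoad, double processLoad) {
    if (systemLoad < 0 || processLoad < 0) {
      return -1.0; // the JVM reports negative values when it has no measurement yet
    }
    return Math.max(0.0, systemLoad - processLoad);
  }

  /** Largest sample, or -1 if the list holds no valid (non-negative) reading. */
  public static double maxLoad(List<Double> samples) {
    double max = -1.0;
    for (double s : samples) {
      if (s > max) {
        max = s;
      }
    }
    return max;
  }

  public static void main(String[] args) throws Exception {
    // Assumption: HotSpot/OpenJDK, where this cast succeeds.
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    List<Double> samples = Collections.synchronizedList(new ArrayList<>());
    ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();
    // getSystemCpuLoad() is deprecated since JDK 14 in favor of getCpuLoad(),
    // but it is used here because it also exists on older JDKs.
    sampler.scheduleAtFixedRate(
        () -> samples.add(backgroundLoad(os.getSystemCpuLoad(), os.getProcessCpuLoad())),
        0, 200, TimeUnit.MILLISECONDS);

    // Placeholder workload: replace with the actual benchmark run.
    long sink = 0;
    long end = System.nanoTime() + TimeUnit.SECONDS.toNanos(1);
    while (System.nanoTime() < end) {
      sink += System.nanoTime() % 7;
    }

    sampler.shutdownNow();
    sampler.awaitTermination(1, TimeUnit.SECONDS);
    System.out.println("samples=" + samples.size()
        + " maxBackgroundLoad=" + maxLoad(samples) + " (sink=" + sink + ")");
  }
}
```

If the maximum background load were logged next to each nightly data point, a slow night that coincides with high background load would point to interference from other processes, while a slow night with low background load would make a JIT or other in-JVM explanation more likely. Sampling from inside the benchmark JVM adds a small overhead of its own, so an external monitor may be preferable.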

> Why is Kuromoji tokenization throughput bimodal?
> ------------------------------------------------
>
>                 Key: LUCENE-9457
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9457
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>
> With the recent accidental regression of Japanese (Kuromoji) tokenization throughput due to exciting FST optimizations, we [added new nightly Lucene benchmarks|https://github.com/mikemccand/luceneutil/issues/64] to measure tokenization throughput for {{JapaneseTokenizer}}: [https://home.apache.org/~mikemccand/lucenebench/analyzers.html]
> It has already been running for ~5-6 weeks now!  But for some reason, it looks bi-modal?  "Normally" it is ~.45 M tokens/sec, but for two data points it dropped down to ~.33 M tokens/sec, which is odd.  It could be hotspot noise maybe?  But would be good to get to the root cause and fix it if possible.
> Hotspot noise that randomly steals ~27% of your tokenization throughput is no good!!
> Or does anyone have any other ideas of what could be bi-modal in Kuromoji?  I don't think [this performance test|https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/TestAnalyzerPerf.java] has any randomness in it...


