Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2015/07/17 16:40:04 UTC
[jira] [Commented] (LUCENE-6682) StandardTokenizer performance bug: buffer is unnecessarily copied when maxTokenLength doesn't change
[ https://issues.apache.org/jira/browse/LUCENE-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631414#comment-14631414 ]
Steve Rowe commented on LUCENE-6682:
------------------------------------
In {{setMaxTokenLength()}} we should call {{setBufferSize()}} only if the length has changed.
Also, {{setMaxTokenLength()}} caps {{maxTokenLength}} at 1M chars (i.e. UTF-16 code units), but then {{getMaxTokenLength()}} lies and reports the requested length even when it exceeded 1M chars! Not cool. We should instead throw an exception when the requested {{maxTokenLength}} exceeds 1M.
Here's a patch that fixes both:
{code:java}
Index: lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
===================================================================
--- lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java (revision 1691570)
+++ lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java (working copy)
@@ -88,6 +88,8 @@
"<HANGUL>"
};
+ public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
+
private int skippedPositions;
private int maxTokenLength = StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH;
@@ -97,9 +99,13 @@
public void setMaxTokenLength(int length) {
if (length < 1) {
throw new IllegalArgumentException("maxTokenLength must be greater than zero");
+ } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
+ throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
}
- this.maxTokenLength = length;
- scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
+ if (length != maxTokenLength) {
+ maxTokenLength = length;
+ scanner.setBufferSize(length);
+ }
}
/** @see #setMaxTokenLength */
{code}
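For illustration, here's a minimal standalone sketch of the patched setter logic (class name and the {{bufferResizes}} counter are hypothetical stand-ins, not the actual {{StandardTokenizer}}/scanner code): out-of-range lengths are rejected up front, and the buffer is only resized when the length actually changes:

```java
// Hypothetical sketch of the patched setMaxTokenLength() behavior; the real
// class is org.apache.lucene.analysis.standard.StandardTokenizer.
public class MaxLenSketch {
    public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024; // 1M chars

    private int maxTokenLength = 255; // StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH
    int bufferResizes = 0;            // test hook: counts setBufferSize() calls

    public void setMaxTokenLength(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("maxTokenLength must be greater than zero");
        } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
            throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
        }
        if (length != maxTokenLength) { // skip the buffer copy when nothing changed
            maxTokenLength = length;
            setBufferSize(length);
        }
    }

    public int getMaxTokenLength() {
        return maxTokenLength;
    }

    private void setBufferSize(int numChars) {
        bufferResizes++; // stands in for the scanner's buffer reallocation
    }
}
```

With this guard, the repeated {{setMaxTokenLength()}} calls from {{Analyzer#createComponents}}/{{setReader}} become no-ops after the first one, and {{getMaxTokenLength()}} can no longer report a length the tokenizer isn't actually using.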
> StandardTokenizer performance bug: buffer is unnecessarily copied when maxTokenLength doesn't change
> ----------------------------------------------------------------------------------------------------
>
> Key: LUCENE-6682
> URL: https://issues.apache.org/jira/browse/LUCENE-6682
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Steve Rowe
>
> From Piotr Idzikowski on java-user mailing list [http://markmail.org/message/af26kr7fermt2tfh]:
> {quote}
> I am developing my own analyzer based on StandardAnalyzer.
> I realized that tokenizer.setMaxTokenLength is called many times.
> {code:java}
> protected TokenStreamComponents createComponents(final String fieldName,
>     final Reader reader) {
>   final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
>   src.setMaxTokenLength(maxTokenLength);
>   TokenStream tok = new StandardFilter(getVersion(), src);
>   tok = new LowerCaseFilter(getVersion(), tok);
>   tok = new StopFilter(getVersion(), tok, stopwords);
>   return new TokenStreamComponents(src, tok) {
>     @Override
>     protected void setReader(final Reader reader) throws IOException {
>       src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
>       super.setReader(reader);
>     }
>   };
> }
> {code}
> Does it make sense if the length stays the same? I see it finally calls
> this one (in StandardTokenizerImpl):
> {code:java}
> public final void setBufferSize(int numChars) {
>   ZZ_BUFFERSIZE = numChars;
>   char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
>   System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
>       Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
>   zzBuffer = newZzBuffer;
> }
> {code}
> So it just copies old array content into the new one.
> {quote}
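The copy the quote describes can be reproduced with a standalone sketch (class name and the {{allocations}} counter are hypothetical; the real method is {{StandardTokenizerImpl.setBufferSize()}}): every call allocates a fresh array and copies the old contents, even when {{numChars}} equals the current buffer size:

```java
// Hypothetical sketch of the unconditional-copy behavior being reported:
// each call allocates a new array and copies, even for an unchanged size.
public class BufferSketch {
    private char[] zzBuffer = new char[255];
    int allocations = 0; // test hook: counts array allocations

    public void setBufferSize(int numChars) {
        char[] newZzBuffer = new char[numChars];
        allocations++;
        System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
            Math.min(zzBuffer.length, numChars));
        zzBuffer = newZzBuffer;
    }

    public int bufferLength() {
        return zzBuffer.length;
    }
}
```

Since the analyzer calls the setter on every {{setReader()}}, this allocation-plus-copy happens per document even when the configured length never changes, which is exactly the waste the guard in the patch avoids.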
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)