Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2015/07/17 16:48:04 UTC
[jira] [Comment Edited] (LUCENE-6682) StandardTokenizer performance bug: buffer is unnecessarily copied when maxTokenLength doesn't change

    [ https://issues.apache.org/jira/browse/LUCENE-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631414#comment-14631414 ] 

Steve Rowe edited comment on LUCENE-6682 at 7/17/15 2:47 PM:
-------------------------------------------------------------

In {{setMaxTokenLength()}} we should call {{setBufferSize()}} only if the length has changed.

Also, {{setMaxTokenLength()}} caps the buffer size at 1M chars (i.e. UTF-16 code units), but {{getMaxTokenLength()}} will lie and report the requested length even when it exceeds 1M chars, although the buffer length, capped at 1M chars, is what really controls the max token length.  Not cool.  We should instead throw an exception when the requested {{maxTokenLength}} exceeds 1M.

Here's a patch that fixes both:

{code:java}
Index: lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
===================================================================
--- lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java	(revision 1691570)
+++ lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java	(working copy)
@@ -88,6 +88,8 @@
     "<HANGUL>"
   };
   
+  public static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;
+  
   private int skippedPositions;
 
   private int maxTokenLength = StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH;
@@ -97,9 +99,13 @@
   public void setMaxTokenLength(int length) {
     if (length < 1) {
       throw new IllegalArgumentException("maxTokenLength must be greater than zero");
+    } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
+      throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
     }
-    this.maxTokenLength = length;
-    scanner.setBufferSize(Math.min(length, 1024 * 1024)); // limit buffer size to 1M chars
+    if (length != maxTokenLength) {
+      maxTokenLength = length;
+      scanner.setBufferSize(length);
+    }
   }
 
   /** @see #setMaxTokenLength */
{code}
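
For anyone who wants to poke at the behavior without a Lucene checkout, here is a minimal standalone sketch of the patched setter. It is not the real {{StandardTokenizer}}: the class name {{MaxTokenLengthSketch}}, the {{resizeCount}} counter, and the initial size of 255 are assumptions made for illustration, and the scanner is replaced by a plain {{char[]}}. Only the validation and the "resize only on change" guard follow the patch above.

```java
import java.util.Arrays;

// Hypothetical stand-in for StandardTokenizer; only the setter logic mirrors the patch.
final class MaxTokenLengthSketch {
  static final int MAX_TOKEN_LENGTH_LIMIT = 1024 * 1024;

  private int maxTokenLength = 255;       // assumed default, for illustration
  private char[] buffer = new char[255];  // stand-in for the scanner's buffer
  private int resizeCount = 0;            // counts buffer reallocations

  void setMaxTokenLength(int length) {
    if (length < 1) {
      throw new IllegalArgumentException("maxTokenLength must be greater than zero");
    } else if (length > MAX_TOKEN_LENGTH_LIMIT) {
      throw new IllegalArgumentException("maxTokenLength may not exceed " + MAX_TOKEN_LENGTH_LIMIT);
    }
    if (length != maxTokenLength) {       // skip the reallocation when nothing changed
      maxTokenLength = length;
      setBufferSize(length);
    }
  }

  int getMaxTokenLength() {
    return maxTokenLength;
  }

  int getResizeCount() {
    return resizeCount;
  }

  // Stand-in for the scanner's setBufferSize(): allocate a new array and copy.
  private void setBufferSize(int numChars) {
    buffer = Arrays.copyOf(buffer, numChars);
    resizeCount++;
  }

  public static void main(String[] args) {
    MaxTokenLengthSketch t = new MaxTokenLengthSketch();
    t.setMaxTokenLength(255);
    t.setMaxTokenLength(255);
    System.out.println("resizes after unchanged length: " + t.getResizeCount());
    t.setMaxTokenLength(1000);
    System.out.println("resizes after changed length: " + t.getResizeCount());
    try {
      t.setMaxTokenLength(MAX_TOKEN_LENGTH_LIMIT + 1);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    System.out.println("reported max token length: " + t.getMaxTokenLength());
  }
}
```

With the guard in place, the getter can no longer disagree with the buffer size, since any length that would have been silently capped is now rejected up front.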


was (Author: steve_rowe):
In {{setMaxTokenLength()}} we should call {{setBufferSize()}} only if the length has changed.

Also, {{setMaxTokenLength()}} maxes out {{maxTokenLength}} at 1M chars (i.e. UTF-16 code units), but then will lie and report the requested length (at {{getMaxTokenLength()}}) when it exceeded 1M chars!  Not cool.  We should instead throw an exception when the requested {{maxTokenLength}} exceeds 1M.


> StandardTokenizer performance bug: buffer is unnecessarily copied when maxTokenLength doesn't change
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6682
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6682
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Steve Rowe
>
> From Piotr Idzikowski on java-user mailing list [http://markmail.org/message/af26kr7fermt2tfh]:
> {quote}
> I am developing my own analyzer based on StandardAnalyzer.
> I realized that tokenizer.setMaxTokenLength is called many times.
> {code:java}
> protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
>   final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
>   src.setMaxTokenLength(maxTokenLength);
>   TokenStream tok = new StandardFilter(getVersion(), src);
>   tok = new LowerCaseFilter(getVersion(), tok);
>   tok = new StopFilter(getVersion(), tok, stopwords);
>   return new TokenStreamComponents(src, tok) {
>     @Override
>     protected void setReader(final Reader reader) throws IOException {
>       src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
>       super.setReader(reader);
>     }
>   };
> }
> {code}
> Does it make sense if the length stays the same? I see it finally calls this
> one (in StandardTokenizerImpl):
> {code:java}
> public final void setBufferSize(int numChars) {
>   ZZ_BUFFERSIZE = numChars;
>   char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
>   System.arraycopy(zzBuffer, 0, newZzBuffer, 0, Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
>   zzBuffer = newZzBuffer;
> }
> {code}
> So it just copies the old array content into the new one.
> {quote}
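
To make the cost described in the report concrete, here is a rough standalone model of the unguarded path. The class {{UnguardedResizeSketch}}, its {{allocations}} counter, the initial buffer size of 255, and the loop of 1000 reuses are all assumptions for illustration; only the body of {{setBufferSize()}} follows the quoted snippet, and the setter is reduced to the unconditional forwarding call.

```java
// Hypothetical model of the unguarded path; not the real StandardTokenizerImpl.
final class UnguardedResizeSketch {
  private char[] zzBuffer = new char[255]; // assumed initial size
  private int allocations = 0;             // counts new buffers created

  // Same shape as the quoted setBufferSize(): always allocates and copies.
  void setBufferSize(int numChars) {
    char[] newZzBuffer = new char[numChars];
    System.arraycopy(zzBuffer, 0, newZzBuffer, 0, Math.min(zzBuffer.length, numChars));
    zzBuffer = newZzBuffer;
    allocations++;
  }

  // Unguarded setter shape: forwards to setBufferSize() on every call.
  void setMaxTokenLength(int length) {
    setBufferSize(Math.min(length, 1024 * 1024));
  }

  int getAllocations() {
    return allocations;
  }

  public static void main(String[] args) {
    UnguardedResizeSketch scanner = new UnguardedResizeSketch();
    // Simulate an analyzer that re-applies the same length on every reuse.
    for (int i = 0; i < 1000; i++) {
      scanner.setMaxTokenLength(255);
    }
    System.out.println("allocations for 1000 reuses: " + scanner.getAllocations());
  }
}
```

Every reuse allocates and fills a fresh array even though the size never changes, which is the work the "only if the length has changed" check avoids.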


