Posted to commits@lucene.apache.org by rm...@apache.org on 2014/01/06 17:44:14 UTC

svn commit: r1555907 - in /lucene/dev/trunk/lucene: CHANGES.txt core/src/java/org/apache/lucene/analysis/package.html

Author: rmuir
Date: Mon Jan  6 16:44:14 2014
New Revision: 1555907

URL: http://svn.apache.org/r1555907
Log:
LUCENE-5384: Add some analysis api tips to the package.html (closes #12)

Modified:
    lucene/dev/trunk/lucene/CHANGES.txt
    lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/analysis/package.html

Modified: lucene/dev/trunk/lucene/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/CHANGES.txt?rev=1555907&r1=1555906&r2=1555907&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/CHANGES.txt (original)
+++ lucene/dev/trunk/lucene/CHANGES.txt Mon Jan  6 16:44:14 2014
@@ -141,6 +141,12 @@ Changes in Runtime Behavior
   AlreadyClosedException if the refCount is incremented but
   is less than 1. (Simon Willnauer) 
 
+Documentation
+
+* LUCENE-5384: Add some tips for making token filters and tokenizers
+  to the analysis package overview.
+  (Benson Margulies via Robert Muir - pull request #12)
+
 ======================= Lucene 4.6.0 =======================
 
 New Features

Modified: lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/analysis/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/analysis/package.html?rev=1555907&r1=1555906&r2=1555907&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/analysis/package.html (original)
+++ lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/analysis/package.html Mon Jan  6 16:44:14 2014
@@ -386,7 +386,15 @@ and proximity searches (though sentence 
   <li>The first position increment must be &gt; 0.</li>
   <li>Positions must not go backward.</li>
   <li>Tokens that have the same start position must have the same start offset.</li>
-  <li>Tokens that have the same end position (taking into account the position length) must have the same end offset.</li>
+  <li>Tokens that have the same end position (taking into account the
+  position length) must have the same end offset.</li>
+  <li>Tokenizers must call {@link
+  org.apache.lucene.util.AttributeSource#clearAttributes()} in
+  incrementToken().</li>
+  <li>Tokenizers must override {@link
+  org.apache.lucene.analysis.TokenStream#end()}, and pass the final
+  offset (the total number of input characters processed) to both
+  parameters of {@link org.apache.lucene.analysis.tokenattributes.OffsetAttribute#setOffset(int, int)}.</li>
 </ul>
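
To make the tokenizer rules above concrete, here is a minimal sketch of a tokenizer that emits the entire input as a single token, modeled loosely on Lucene's KeywordTokenizer. The class name WholeInputTokenizer is hypothetical, and the sketch assumes the Lucene 4.x Tokenizer API (Reader-based constructor); it clears attributes at the top of incrementToken() and reports the final offset through both parameters of setOffset() in end().

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/** Emits the whole input as one token (hypothetical example class). */
public final class WholeInputTokenizer extends Tokenizer {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private boolean done = false;
  private int finalOffset = 0;

  public WholeInputTokenizer(Reader input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (done) {
      return false;
    }
    clearAttributes();   // rule: clear all attributes before producing a token
    done = true;
    int length = 0;
    char[] buffer = termAtt.buffer();
    while (true) {
      int read = input.read(buffer, length, buffer.length - length);
      if (read == -1) {
        break;
      }
      length += read;
      if (length == buffer.length) {
        buffer = termAtt.resizeBuffer(length + 1);   // grow, preserving content
      }
    }
    termAtt.setLength(length);
    // correctOffset() maps through any CharFilters in front of this tokenizer
    finalOffset = correctOffset(length);
    offsetAtt.setOffset(correctOffset(0), finalOffset);
    return length > 0;   // do not emit an empty token for empty input
  }

  @Override
  public void end() throws IOException {
    super.end();
    // rule: the final offset is the total number of input characters processed
    offsetAtt.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
    finalOffset = 0;
  }
}
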
 <p>
    Although these rules might seem easy to follow, problems can quickly happen when chaining
@@ -395,7 +403,8 @@ and proximity searches (though sentence 
 </p>
 <ul>
   <li>Token filters should not modify offsets. If you feel that your filter would need to modify offsets, then it should probably be implemented as a tokenizer.</li>
-  <li>Token filters should not insert positions. If a filter needs to add tokens, then they shoud all have a position increment of 0.</li>
+  <li>Token filters should not insert positions. If a filter needs to add tokens, then they should all have a position increment of 0.</li>
+  <li>When they add tokens, token filters should call {@link org.apache.lucene.util.AttributeSource#clearAttributes()} first.</li>
   <li>When they remove tokens, token filters should increment the position increment of the following token.</li>
   <li>Token filters should preserve position lengths.</li>
 </ul>
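
A minimal sketch of a token filter that adds tokens under these rules, again assuming the Lucene 4.x API (the filter and its name, DuplicateTokenFilter, are hypothetical): it re-emits each token as a copy at the same position, calling clearAttributes() before the added token, giving it a position increment of 0, and leaving offsets untouched.

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Emits a copy of each token at the same position (hypothetical example class). */
public final class DuplicateTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);

  private char[] pending;                 // term text to re-emit, or null
  private int pendingStart, pendingEnd;   // offsets of the saved token

  public DuplicateTokenFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      clearAttributes();                             // rule: clear before an added token
      termAtt.copyBuffer(pending, 0, pending.length);
      offsetAtt.setOffset(pendingStart, pendingEnd); // rule: never modify offsets
      posIncAtt.setPositionIncrement(0);             // rule: added tokens get increment 0
      pending = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // save this token so a copy can be emitted on the next call
    pending = new char[termAtt.length()];
    System.arraycopy(termAtt.buffer(), 0, pending, 0, termAtt.length());
    pendingStart = offsetAtt.startOffset();
    pendingEnd = offsetAtt.endOffset();
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}

An alternative is to capture the full attribute state with captureState() and restore it with restoreState(); the explicit copy above keeps the example self-contained.
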
@@ -467,6 +476,14 @@ and proximity searches (though sentence 
     </td>
   </tr>
 </table>
+<h3>Testing Your Analysis Component</h3>
+<p>
+    The lucene-test-framework module defines
+    <a href="{@docRoot}/../test-framework/org/apache/lucene/analysis/BaseTokenStreamTestCase.html">BaseTokenStreamTestCase</a>. By extending
+    this class, you can create JUnit tests that validate that your
+    Analyzer or analysis components correctly implement the
+    protocol described above. The checkRandomData methods of that class are particularly effective in flushing out errors.
+</p>
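
As a rough sketch of such a test, here is what a BaseTokenStreamTestCase subclass could look like for the hypothetical DuplicateTokenFilter above, assuming the Lucene 4.x test framework (MockTokenizer, assertAnalyzesTo with expected terms and position increments, and checkRandomData):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;

public class TestDuplicateTokenFilter extends BaseTokenStreamTestCase {

  private final Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
      return new TokenStreamComponents(tokenizer, new DuplicateTokenFilter(tokenizer));
    }
  };

  public void testBasics() throws Exception {
    // each token is followed by a copy at the same position (increment 0)
    assertAnalyzesTo(analyzer, "Hello world",
        new String[] { "Hello", "Hello", "world", "world" },
        new int[]    { 1,       0,       1,       0 });
  }

  public void testRandomStrings() throws Exception {
    // hammer the component with random text to flush out contract violations
    checkRandomData(random(), analyzer, 1000 * RANDOM_MULTIPLIER);
  }
}
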
 <h3>Using the TokenStream API</h3>
 There are a few important things to know in order to use the new API efficiently, which are summarized here. You may want
 to walk through the example below first and come back to this section afterwards.