Posted to commits@lucene.apache.org by mi...@apache.org on 2018/04/06 19:34:33 UTC

lucene-solr:branch_7x: LUCENE-8238: improve javadocs for WordDelimiterFilter and WordDelimiterGraphFilter

Repository: lucene-solr
Updated Branches:
  refs/heads/branch_7x 1d8313ca8 -> 77e2ed277


LUCENE-8238: improve javadocs for WordDelimiterFilter and WordDelimiterGraphFilter


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/77e2ed27
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/77e2ed27
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/77e2ed27

Branch: refs/heads/branch_7x
Commit: 77e2ed277aa2e606fcd679d7f26e90225b7d3b4f
Parents: 1d8313c
Author: Mike McCandless <mi...@apache.org>
Authored: Fri Apr 6 15:20:22 2018 -0400
Committer: Mike McCandless <mi...@apache.org>
Committed: Fri Apr 6 15:21:06 2018 -0400

----------------------------------------------------------------------
 lucene/CHANGES.txt                                |  5 +++++
 .../miscellaneous/WordDelimiterFilter.java        | 18 ++++++++++++------
 .../miscellaneous/WordDelimiterGraphFilter.java   | 17 +++++++++++------
 3 files changed, 28 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/77e2ed27/lucene/CHANGES.txt
----------------------------------------------------------------------
diff --git a/lucene/CHANGES.txt b/lucene/CHANGES.txt
index 0491597..1d5fe56 100644
--- a/lucene/CHANGES.txt
+++ b/lucene/CHANGES.txt
@@ -63,6 +63,11 @@ Build
 
 * LUCENE-8230: Upgrade forbiddenapis to version 2.5.  (Uwe Schindler)
 
+Documentation
+
+* LUCENE-8238: Improve WordDelimiterFilter and WordDelimiterGraphFilter javadocs
+xo  (Mike Sokolov via Mike McCandless)
+
 ======================= Lucene 7.3.0 =======================
 
 API Changes

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/77e2ed27/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
----------------------------------------------------------------------
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
index aef697c..313386b 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
@@ -55,11 +55,14 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  * 
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -72,12 +75,15 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
+ *
  * One use for {@link WordDelimiterFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and
+ * not in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/77e2ed27/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
----------------------------------------------------------------------
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
index a6ade19..7949fa2 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
@@ -62,11 +62,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  * 
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -79,12 +82,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
  * One use for {@link WordDelimiterGraphFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and not
+ * in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).
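
The updated javadoc describes CATENATE_WORDS as emitting, in addition to the subwords, each maximal run of non-numeric subwords joined together at the position of the last subword in the run. The sketch below is a toy model of that rule in plain Java. It does not use or reproduce Lucene's implementation: the class name CatenateSketch, the split and analyze methods, the simplified splitting rules (delimiters, lower-to-upper case changes, letter/digit boundaries), and the choice to catenate only runs of two or more subwords are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, not part of Lucene.
public class CatenateSketch {

    // Split a token into subwords on non-alphanumeric delimiters,
    // lower-to-upper case changes, and letter/digit boundaries.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(cur, parts); // a delimiter ends the current subword
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isDigit(prev) != Character.isDigit(c);
                if (caseChange || typeChange) {
                    flush(cur, parts);
                }
            }
            cur.append(c);
        }
        flush(cur, parts);
        return parts;
    }

    private static void flush(StringBuilder cur, List<String> parts) {
        if (cur.length() > 0) {
            parts.add(cur.toString());
            cur.setLength(0);
        }
    }

    // Return "position:term" strings. When catenateWords is true, each maximal
    // run of two or more non-numeric subwords is also emitted, joined, at the
    // position of the last subword in the run.
    static List<String> analyze(String token, boolean catenateWords) {
        List<String> parts = split(token);
        List<String> out = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (int pos = 0; pos <= parts.size(); pos++) {
            String part = pos < parts.size() ? parts.get(pos) : null;
            boolean endsRun = part == null || Character.isDigit(part.charAt(0));
            if (endsRun) {
                if (catenateWords && run.size() > 1) {
                    out.add((pos - 1) + ":" + String.join("", run));
                }
                run.clear();
            } else {
                run.add(part);
            }
            if (part != null) {
                out.add(pos + ":" + part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("PowerShot", false)); // no CATENATE option
        System.out.println(analyze("PowerShot", true));  // toy CATENATE_WORDS
        System.out.println(analyze("wi-fi", true));      // index-time side of the "wi-fi" example
        System.out.println(analyze("wifi", false));      // query-time side
    }
}
```

In this model, analyzing "wi-fi" with catenation enabled produces the term "wifi" alongside "wi" and "fi", which is why a query analyzed without catenation can match either spelling, as the javadoc suggests.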