Posted to commits@lucene.apache.org by ab...@apache.org on 2018/04/10 14:12:14 UTC

[35/50] lucene-solr:jira/solr-12181: LUCENE-8238: improve javadocs for WordDelimiterFilter and WordDelimiterGraphFilter

LUCENE-8238: improve javadocs for WordDelimiterFilter and WordDelimiterGraphFilter


Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/0f53adbe
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/0f53adbe
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/0f53adbe

Branch: refs/heads/jira/solr-12181
Commit: 0f53adbee49015aa01e8f66945f82e88a9172c7c
Parents: 5c37b07
Author: Mike McCandless <mi...@apache.org>
Authored: Fri Apr 6 15:20:22 2018 -0400
Committer: Mike McCandless <mi...@apache.org>
Committed: Fri Apr 6 15:20:22 2018 -0400

----------------------------------------------------------------------
 lucene/CHANGES.txt                                |  5 +++++
 .../miscellaneous/WordDelimiterFilter.java        | 18 ++++++++++++------
 .../miscellaneous/WordDelimiterGraphFilter.java   | 17 +++++++++++------
 3 files changed, 28 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/0f53adbe/lucene/CHANGES.txt
----------------------------------------------------------------------
diff --git a/lucene/CHANGES.txt b/lucene/CHANGES.txt
index 84e242d..f90f9e3 100644
--- a/lucene/CHANGES.txt
+++ b/lucene/CHANGES.txt
@@ -153,6 +153,11 @@ Build
 
 * LUCENE-8230: Upgrade forbiddenapis to version 2.5.  (Uwe Schindler)
 
+Documentation
+
+* LUCENE-8238: Improve WordDelimiterFilter and WordDelimiterGraphFilter javadocs
+  (Mike Sokolov via Mike McCandless)
+
 ======================= Lucene 7.3.0 =======================
 
 API Changes

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/0f53adbe/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
----------------------------------------------------------------------
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
index aef697c..313386b 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.java
@@ -55,11 +55,14 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  * 
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -72,12 +75,15 @@ import org.apache.lucene.util.InPlaceMergeSorter;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
+ *
  * One use for {@link WordDelimiterFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and
+ * not in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/0f53adbe/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
----------------------------------------------------------------------
diff --git a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
index a6ade19..7949fa2 100644
--- a/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
+++ b/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.java
@@ -62,11 +62,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  * 
- * The <b>combinations</b> parameter affects how subwords are combined:
+ * The <b>GENERATE...</b> options affect how incoming tokens are broken into parts, and the
+ * various <b>CATENATE_...</b> parameters affect how those parts are combined.
+ *
  * <ul>
- * <li>combinations="0" causes no subword combinations: <code>"PowerShot"</code>
- * &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token positions)</li>
- * <li>combinations="1" means that in addition to the subwords, maximum runs of
+ * <li>If no CATENATE option is set, then no subword combinations are generated:
+ * <code>"PowerShot"</code> &#8594; <code>0:"Power", 1:"Shot"</code> (0 and 1 are the token
+ * positions)</li>
+ * <li>CATENATE_WORDS means that in addition to the subwords, maximum runs of
  * non-numeric subwords are catenated and produced at the same position of the
  * last subword in the run:
  * <ul>
@@ -79,12 +82,14 @@ import org.apache.lucene.util.RamUsageEstimator;
  * </li>
  * </ul>
  * </li>
+ * <li>CATENATE_NUMBERS works like CATENATE_WORDS, but for adjacent digit sequences.</li>
+ * <li>CATENATE_ALL smushes together all the token parts without distinguishing numbers and words.</li>
  * </ul>
  * One use for {@link WordDelimiterGraphFilter} is to help match words with different
  * subword delimiters. For example, if the source text contained "wi-fi" one may
  * want "wifi" "WiFi" "wi-fi" "wi+fi" queries to all match. One way of doing so
- * is to specify combinations="1" in the analyzer used for indexing, and
- * combinations="0" (the default) in the analyzer used for querying. Given that
+ * is to specify CATENATE options in the analyzer used for indexing, and not
+ * in the analyzer used for querying. Given that
  * the current {@link StandardTokenizer} immediately removes many intra-word
  * delimiters, it is recommended that this filter be used after a tokenizer that
  * does not do this (such as {@link WhitespaceTokenizer}).
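
Illustration of the behavior the revised javadocs describe. This is a minimal, self-contained sketch, not Lucene's implementation: the class CatenateSketch and its methods parts() and analyze() are hypothetical names invented for this example, and the splitting rules are simplified assumptions (split on non-alphanumeric characters, on lower-to-upper case changes, and on letter/digit boundaries). It models only the "no CATENATE option" case and a CATENATE_WORDS-like case, printing each token as position:text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of subword splitting and word catenation.
// It does not use or reproduce the real WordDelimiterFilter code.
public class CatenateSketch {

    // Break a token into parts at delimiters, lower->upper case changes,
    // and letter/digit boundaries (simplified rules, assumed for this sketch).
    public static List<String> parts(String token) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(cur, out); // a delimiter closes the current part
                continue;
            }
            if (cur.length() > 0) {
                char prev = cur.charAt(cur.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean kindChange = Character.isDigit(prev) != Character.isDigit(c);
                if (caseChange || kindChange) {
                    flush(cur, out);
                }
            }
            cur.append(c);
        }
        flush(cur, out);
        return out;
    }

    private static void flush(StringBuilder cur, List<String> out) {
        if (cur.length() > 0) {
            out.add(cur.toString());
            cur.setLength(0);
        }
    }

    // Emit "position:text" for each part. When catenateWords is true, each
    // maximal run of two or more non-numeric parts is also emitted, joined,
    // at the position of the last part in the run.
    public static List<String> analyze(String token, boolean catenateWords) {
        List<String> parts = parts(token);
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLength = 0;
        for (int pos = 0; pos < parts.size(); pos++) {
            String part = parts.get(pos);
            boolean isWord = Character.isLetter(part.charAt(0));
            if (!isWord) {
                // A numeric part ends the current run of words.
                if (catenateWords && runLength > 1) {
                    out.add((pos - 1) + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
            }
            out.add(pos + ":" + part);
            if (isWord) {
                run.append(part);
                runLength++;
            }
        }
        if (catenateWords && runLength > 1) {
            out.add((parts.size() - 1) + ":" + run);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("PowerShot", false)); // [0:Power, 1:Shot]
        System.out.println(analyze("PowerShot", true));  // [0:Power, 1:Shot, 1:PowerShot]
        System.out.println(analyze("wi-fi", true));      // [0:wi, 1:fi, 1:wifi]
    }
}
```

In this model, indexing "wi-fi" with catenation enabled yields the extra token "wifi", which is why a query for "wifi" analyzed without catenation can still match, as the javadoc's wi-fi example suggests. With the real filters, the corresponding behavior is selected through the configuration flags named in the javadocs above.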