Posted to commits@lucene.apache.org by rm...@apache.org on 2010/04/22 11:52:01 UTC
svn commit: r936700 - in /lucene/dev/trunk/lucene/contrib/icu/src/java:
org/apache/lucene/analysis/icu/package.html overview.html
Author: rmuir
Date: Thu Apr 22 09:52:01 2010
New Revision: 936700
URL: http://svn.apache.org/viewvc?rev=936700&view=rev
Log:
enhance contrib/icu documentation
Added:
lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html (with props)
Modified:
lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html
Added: lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html?rev=936700&view=auto
==============================================================================
--- lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html (added)
+++ lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html Thu Apr 22 09:52:01 2010
@@ -0,0 +1,22 @@
+<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<html><head></head>
+<body>
+Analysis components based on ICU
+</body>
+</html>
Propchange: lucene/dev/trunk/lucene/contrib/icu/src/java/org/apache/lucene/analysis/icu/package.html
------------------------------------------------------------------------------
svn:eol-style = native
Modified: lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html?rev=936700&r1=936699&r2=936700&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html (original)
+++ lucene/dev/trunk/lucene/contrib/icu/src/java/overview.html Thu Apr 22 09:52:01 2010
@@ -16,12 +16,32 @@
-->
<html>
<head>
+ <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>
- Apache Lucene ICUCollationKeyFilter/Analyzer
+ Apache Lucene ICU integration module
</title>
</head>
<body>
<p>
+This module exposes functionality from
+<a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
+library that enhances Java's internationalization support by improving
+performance, keeping current with the Unicode Standard, and providing richer
+APIs. This module exposes the following functionality:
+</p>
+<ul>
+ <li><a href="#collation">Collation</a>: Compares strings according to the
+ conventions and standards of a particular language, region or country.</li>
+ <li><a href="#normalization">Normalization</a>: Converts text to a unique,
+ equivalent form.</li>
+ <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
+ Unicode's Default Caseless Matching algorithm.</li>
+ <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
+ (such as accent marks) between similar characters for a loose or fuzzy search.</li>
+</ul>
+<hr/>
+<h1><a name="collation">Collation</a></h1>
+<p>
<code>ICUCollationKeyFilter</code>
converts each token into its binary <code>CollationKey</code> using the
provided <code>Collator</code>, and then encodes the <code>CollationKey</code>
@@ -30,11 +50,9 @@
stored as an index term.
</p>
<p>
- <code>ICUCollationKeyFilter</code> depends on ICU4J 4.0 to produce the
- <code>CollationKey</code>s. <code>icu4j-collation-4.0.jar</code>,
- a trimmed-down version of <code>icu4j-4.0.jar</code> that contains only the
- code and data needed to support collation, is included in Lucene's Subversion
- repository at <code>contrib/icu/lib/</code>.
+ <code>ICUCollationKeyFilter</code> depends on ICU4J 4.4 to produce the
+ <code>CollationKey</code>s. <code>icu4j-4.4.jar</code>
+ is included in Lucene's Subversion repository at <code>contrib/icu/lib/</code>.
</p>
<h2>Use Cases</h2>
@@ -176,7 +194,96 @@
you use <code>CollationKeyFilter</code> to generate index terms, do not use
<code>ICUCollationKeyFilter</code> on the query side, or vice versa.
</p>
-<pre>
-</pre>
+<hr/>
+<h1><a name="normalization">Normalization</a></h1>
+<p>
+ <code>ICUNormalizer2Filter</code> normalizes term text to a
+ <a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
+ that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
+ forms are standardized to a unique form.
+</p>
+<h2>Use Cases</h2>
+<ul>
+ <li> Removing differences in width for Asian-language text.
+ </li>
+ <li> Standardizing complex text with non-spacing marks so that characters are
+ ordered consistently.
+ </li>
+</ul>
+<h2>Example Usages</h2>
+<h3>Normalizing text to NFC</h3>
+<code><pre>
+ /**
+ * Normalizer2 objects are unmodifiable and immutable.
+ */
+ Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
+ /**
+ * This filter will normalize to NFC.
+ */
+ TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
+</pre></code>
+<hr/>
+<h1><a name="casefolding">Case Folding</a></h1>
+<p>
+Default caseless matching, or case-folding, is more than just conversion to
+lowercase. For example, it handles cases such as the Greek sigma, so that
+"Μάϊος" and "ΜΆΪΟΣ" will match correctly.
+</p>
+<p>
+Case-folding is still only an approximation of the language-specific rules
+governing case. If the specific language is known, consider using
+ICUCollationKeyFilter and indexing collation keys instead. This implementation
+performs the "full" case-folding specified in the Unicode standard, and this
+may change the length of the term. For example, the German ß is case-folded
+to the string 'ss'.
+</p>
+<p>
+Case folding is related to normalization, and as such is coupled with it in
+this integration. To perform case-folding, you use normalization with the form
+"nfkc_cf" (which is the default).
+</p>
+<h2>Use Cases</h2>
+<ul>
+ <li>
+ As a more thorough replacement for LowerCaseFilter that has good behavior
+ for most languages.
+ </li>
+</ul>
+<h2>Example Usages</h2>
+<h3>Lowercasing text</h3>
+<code><pre>
+ /**
+ * This filter will case-fold and normalize to NFKC.
+ */
+ TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
+</pre></code>
+<hr/>
+<h1><a name="searchfolding">Search Term Folding</a></h1>
+<p>
+Search term folding removes distinctions (such as accent marks) between
+similar characters. It is useful for a fuzzy or loose search.
+</p>
+<p>
+Search term folding implements many of the foldings specified in
+<a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
+as a special normalization form. This folding applies NFKC, Case Folding, and
+many character foldings recursively.
+</p>
+<h2>Use Cases</h2>
+<ul>
+ <li>
+ As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
+ that applies the same ideas to many more languages.
+ </li>
+</ul>
+<h2>Example Usages</h2>
+<h3>Removing accents</h3>
+<code><pre>
+ /**
+ * This filter will case-fold, remove accents and other distinctions, and
+ * normalize to NFKC.
+ */
+ TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
+</pre></code>
</body>
</html>
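
The three transformations documented in this commit (normalization, case folding, search term folding) can be illustrated without ICU4J or Lucene on the classpath. The following is only a rough sketch that uses the JDK's java.text.Normalizer; it is not how ICUNormalizer2Filter or ICUFoldingFilter are implemented. The class name UnicodeFoldingSketch and its helper methods are hypothetical, and the "rough" folds are assumed approximations: uppercase-then-lowercase stands in for Unicode full case folding, and stripping non-spacing marks stands in for the much larger set of foldings that ICU applies.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration class; uses only the JDK, not ICU4J or Lucene.
final class UnicodeFoldingSketch {

    /** Canonical composition (NFC): equivalent sequences collapse to one form. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility composition (NFKC): additionally removes width differences. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /**
     * Assumed approximation of case folding: NFKC, then uppercase, then lowercase.
     * This is NOT the "nfkc_cf" form; it only mimics its effect on simple inputs.
     */
    static String roughCaseFold(String s) {
        return nfkc(s).toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    /**
     * Assumed approximation of search term folding: decompose, drop
     * non-spacing marks (accents), then apply the rough case fold.
     */
    static String roughSearchFold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        String withoutMarks = decomposed.replaceAll("\\p{Mn}+", "");
        return roughCaseFold(withoutMarks);
    }

    public static void main(String[] args) {
        // "e" + combining acute accent versus the precomposed letter.
        System.out.println(nfc("e\u0301").equals("\u00e9"));

        // Fullwidth "Lucene" versus plain ASCII "Lucene".
        System.out.println(nfkc("\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45").equals("Lucene"));

        // German sharp s may change the term length when folded.
        System.out.println(roughCaseFold("Stra\u00dfe"));

        // The Greek sigma example from the overview: both spellings fold alike.
        String mixedCase = "\u039C\u03AC\u03CA\u03BF\u03C2";
        String upperCase = "\u039C\u0386\u03AA\u039F\u03A3";
        System.out.println(roughCaseFold(mixedCase).equals(roughCaseFold(upperCase)));

        // Accent removal for a loose search.
        System.out.println(roughSearchFold("R\u00e9sum\u00e9"));
    }
}
```

In an actual analysis chain the filters shown in the diff above would be applied to a TokenStream instead; the sketch only shows why terms that differ in composition, width, case, or accents can end up as the same index term.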