Posted to commits@opennlp.apache.org by jo...@apache.org on 2017/06/06 10:09:46 UTC

[02/21] opennlp git commit: OPENNLP-979 Update lemmatizer doc after API change

OPENNLP-979 Update lemmatizer doc after API change


Project: http://git-wip-us.apache.org/repos/asf/opennlp/repo
Commit: http://git-wip-us.apache.org/repos/asf/opennlp/commit/ee9fdb8a
Tree: http://git-wip-us.apache.org/repos/asf/opennlp/tree/ee9fdb8a
Diff: http://git-wip-us.apache.org/repos/asf/opennlp/diff/ee9fdb8a

Branch: refs/heads/LangDetect
Commit: ee9fdb8aad0e4c43bba85e50be3687475bf2221d
Parents: 839ff10
Author: Rodrigo Agerri <ra...@apache.org>
Authored: Wed May 17 23:04:23 2017 +0200
Committer: Rodrigo Agerri <ra...@apache.org>
Committed: Wed May 17 23:04:23 2017 +0200

----------------------------------------------------------------------
 opennlp-docs/src/docbkx/lemmatizer.xml | 54 ++++++++++++++++-------------
 1 file changed, 30 insertions(+), 24 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/opennlp/blob/ee9fdb8a/opennlp-docs/src/docbkx/lemmatizer.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/lemmatizer.xml b/opennlp-docs/src/docbkx/lemmatizer.xml
index 1fa5540..630b04d 100644
--- a/opennlp-docs/src/docbkx/lemmatizer.xml
+++ b/opennlp-docs/src/docbkx/lemmatizer.xml
@@ -121,10 +121,9 @@ String[] postags = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN",
     "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS",
     "." };
 
-String[] lemmas = lemmatizer.lemmatize(tokens, postags);
-String[] decodedLemmas = lemmatizer.decodeLemmas(tokens, lemmas);]]>
+String[] lemmas = lemmatizer.lemmatize(tokens, postags);]]>
 		</programlisting>
-				The decodedLemmas array contains one lemma for each token in the
+				The lemmas array contains one lemma for each token in the
 				input array. The corresponding
 				tag and lemma can be found at the same index as the token has in the
 				input array.
@@ -133,29 +132,37 @@ String[] decodedLemmas = lemmatizer.decodeLemmas(tokens, lemmas);]]>
 			<para>
 				The DictionaryLemmatizer is constructed
 				by passing the InputStream of a lemmatizer dictionary. Such dictionary
-				consists of a
-				text file containing, for each row, a word, its postag and the
-				corresponding lemma:
+				consists of a text file containing, for each row, a word, its postag and the
+				corresponding lemma, each column separated by a tab character.
 				<screen>
 		<![CDATA[
-show    NN      show
-showcase        NN      showcase
-showcases       NNS     showcase
-showdown        NN      showdown
-showdowns       NNS     showdown
-shower  NN      shower
-showers NNS     shower
-showman NN      showman
-showmanship     NN      showmanship
-showmen NNS     showman
-showroom        NN      showroom
-showrooms       NNS     showroom
-shows   NNS     show
-showstopper     NN      showstopper
-showstoppers    NNS     showstopper
-shrapnel        NN      shrapnel
+show		NN	show
+showcase	NN	showcase
+showcases	NNS	showcase
+showdown	NN	showdown
+showdowns	NNS	showdown
+shower		NN	shower
+showers		NNS	shower
+showman		NN	showman
+showmanship	NN	showmanship
+showmen		NNS	showman
+showroom	NN	showroom
+showrooms	NNS	showroom
+shows		NNS	show
+shrapnel	NN	shrapnel
 		]]>
 		</screen>
+				Alternatively, if a (word, postag) pair can output multiple lemmas, the
+				lemmatizer dictionary would consist of a text file containing, for
+				each row, a word, its postag and the corresponding lemmas separated by "#":
+				<screen>
+		<![CDATA[
+muestras	NN	muestra
+cantaba		V	cantar
+fue		V	ir#ser
+entramos	V	entrar
+		]]>
+					</screen>
 				First the dictionary must be loaded into memory from disk or another
 				source.
 				In the sample below it is loaded from disk.
@@ -180,8 +187,7 @@ DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictLemmatizer);]]>
 			</para>
 			<para>
 				The following code shows how to find a lemma using a
-				DictionaryLemmatizer. There is no need to decode the
-				lemmas when using the DictionaryLemmatizer.
+				DictionaryLemmatizer.
 				<programlisting language="java">
 		  <![CDATA[
 String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",