Posted to commits@opennlp.apache.org by co...@apache.org on 2016/12/28 03:43:59 UTC

opennlp git commit: Adds a small documentation section for Morfologik add-on

Repository: opennlp
Updated Branches:
  refs/heads/902 001b97068 -> 4f2441bc1


Adds a small documentation section for Morfologik add-on

See issue OPENNLP-902


Project: http://git-wip-us.apache.org/repos/asf/opennlp/repo
Commit: http://git-wip-us.apache.org/repos/asf/opennlp/commit/4f2441bc
Tree: http://git-wip-us.apache.org/repos/asf/opennlp/tree/4f2441bc
Diff: http://git-wip-us.apache.org/repos/asf/opennlp/diff/4f2441bc

Branch: refs/heads/902
Commit: 4f2441bc1b50502b95a86bff94e8a9544322baf5
Parents: 001b970
Author: William Colen <co...@apache.org>
Authored: Wed Dec 28 01:43:55 2016 -0200
Committer: William Colen <co...@apache.org>
Committed: Wed Dec 28 01:43:55 2016 -0200

----------------------------------------------------------------------
 .../src/docbkx/morfologik-addon.out.xml         |   0
 opennlp-docs/src/docbkx/morfologik-addon.xml    | 153 +++++++++++++++++++
 opennlp-docs/src/docbkx/opennlp.xml             |   1 +
 3 files changed, 154 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/morfologik-addon.out.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.out.xml b/opennlp-docs/src/docbkx/morfologik-addon.out.xml
new file mode 100644
index 0000000..e69de29

http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/morfologik-addon.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.xml b/opennlp-docs/src/docbkx/morfologik-addon.xml
new file mode 100644
index 0000000..6f18844
--- /dev/null
+++ b/opennlp-docs/src/docbkx/morfologik-addon.xml
@@ -0,0 +1,153 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor 
+	license agreements. See the NOTICE file distributed with this work for additional 
+	information regarding copyright ownership. The ASF licenses this file to 
+	you under the Apache License, Version 2.0 (the "License"); you may not use 
+	this file except in compliance with the License. You may obtain a copy of 
+	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
+	by applicable law or agreed to in writing, software distributed under the 
+	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
+	OF ANY KIND, either express or implied. See the License for the specific 
+	language governing permissions and limitations under the License. -->
+
+
+<chapter id="tools.morfologik-addon">
+	<title>Morfologik Addon</title>
+		<para>
+			<ulink url="https://github.com/morfologik/morfologik-stemming"><citetitle>Morfologik</citetitle></ulink>
+			provides tools for constructing finite state automata (FSA) and for building FSA-based morphological dictionaries.
+		</para>
+		<para>
+			The Morfologik Addon implements OpenNLP interfaces and extensions to allow the use of FSA Morfologik dictionary tools.
+		</para>
+		<section id="tools.morfologik-addon.api">
+			<title>Morfologik Integration</title>
+			<para>
+			To allow for an easy integration with OpenNLP, the following implementations are provided:
+			<itemizedlist mark='opencircle'>
+				<listitem>
+					<para>
+					The <code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code> and helps create a POSTagger model with an embedded FSA TagDictionary.
+					</para>
+				</listitem>
+				<listitem>
+					<para>
+					The <code>MorfologikTagDictionary</code> implements an FSA-based <code>TagDictionary</code>, allowing for much smaller files and lower memory consumption than the default XML-based dictionary.
+					</para>
+				</listitem>
+				<listitem>
+					<para>
+					The <code>MorfologikLemmatizer</code> implements a <code>Lemmatizer</code> backed by FSA dictionaries.
+					</para>
+				</listitem>
+			</itemizedlist>
+		</para>
+		<para>
+		The first two implementations can be used directly from the command line, as in the example below. Given an FSA Morfologik dictionary (see the next section for how to build one), you can train a POS Tagger
+		model with an embedded FSA dictionary.
+		</para>
+		<para>
+		The example trains a POSTagger with a CONLL corpus named <code>portuguese_bosque_train.conll</code> and an FSA dictionary named
+		<code>pt-morfologik.dict</code>. It outputs a model named <code>pos-pt_fsadic.model</code>.
+		
+		<screen>
+		<![CDATA[
+$ bin/opennlp POSTaggerTrainer -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll \
+	 -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict]]>
+		</screen>
+		
+		</para>
+		<para>
+		The next example shows how to use the <code>MorfologikLemmatizer</code>. You will need a lemma dictionary and an info file; this example uses a very small Portuguese dictionary.
+		Each line of the dictionary has the form <code>lemma,lexeme,postag</code>.
+		</para>
+		<para>
+		File <code>lemmaDictionary.txt:</code>
+		<screen>
+		<![CDATA[
+casa,casa,NOUN
+casar,casa,V
+casar,casar,V-INF
+Casa,Casa,PROP
+casa,casinha,NOUN
+casa,casona,NOUN
+menino,menina,NOUN
+menino,menino,NOUN
+menino,meninão,NOUN
+menino,menininho,NOUN
+carro,carro,NOUN]]>
+		</screen>
+		</para>
+		<para>
+		The mandatory metadata file must have the same name as the dictionary, but with the <code>.info</code> extension. File <code>lemmaDictionary.info:</code>
+		<screen>
+		<![CDATA[
+#
+# REQUIRED PROPERTIES
+#
+
+# Column (lemma, inflected, tag) separator. This must be a single byte in the target encoding.
+fsa.dict.separator=,
+
+# The charset in which the input is encoded. UTF-8 is strongly recommended.
+fsa.dict.encoding=UTF-8
+
+# The type of lemma-inflected form encoding compression that precedes automaton
+# construction. Allowed values: [suffix, infix, prefix, none].
+# Details are in Daciuk's paper and in the code. 
+# Leave at 'prefix' if not sure.
+fsa.dict.encoder=prefix
+		]]>
+		</screen>
+		</para>
+		<para>
+		The following code creates a binary FSA Morfologik dictionary, loads it into <code>MorfologikLemmatizer</code>, and uses it to
+		find the lemma of the word "casa" as a noun and as a verb.
+		
+				<programlisting language="java">
+		<![CDATA[
+// Part 1: compile a FSA lemma dictionary 
+   
+// we need the tabular dictionary. An info file with the same name,
+// but with the .info extension, must exist alongside it
+Path textLemmaDictionary = Paths.get("lemmaDictionary.txt");
+
+// this will build a binary dictionary located in compiledLemmaDictionary
+Path compiledLemmaDictionary = new MorfologikDictionayBuilder()
+    .build(textLemmaDictionary);
+
+// Part 2: load a MorfologikLemmatizer and use it
+MorfologikLemmatizer lemmatizer = new MorfologikLemmatizer(compiledLemmaDictionary);
+
+String[] toks = {"casa", "casa"};
+String[] tags = {"NOUN", "V"};
+
+String[] lemmas = lemmatizer.lemmatize(toks, tags);
+System.out.println(Arrays.toString(lemmas)); // outputs [casa, casar]
+    ]]>
+			</programlisting>
+		
+		</para>
+		</section>
+		<section id="tools.morfologik-addon.cmdline">
+			<title>Morfologik CLI Tools</title>
+			<para>
+				The Morfologik addon provides command line tools. <code>XMLDictionaryToTable</code> makes it easy to convert an OpenNLP XML-based dictionary
+				to a tabular format. <code>MorfologikDictionaryBuilder</code> takes a tabular dictionary and outputs a binary Morfologik FSA dictionary.
+			</para>
+			<screen>
+		<![CDATA[
+$ sh bin/morfologik-addon
+OpenNLP Morfologik Addon. Usage: opennlp-morfologik-addon TOOL
+where TOOL is one of:
+  MorfologikDictionaryBuilder    builds a binary POS Dictionary using Morfologik
+  XMLDictionaryToTable           reads an OpenNLP XML tag dictionary and outputs it in a tabular file
+All tools print help when invoked with help parameter
+Example: opennlp-morfologik-addon POSDictionaryBuilder help
+		]]>
+		</screen>
+		</section>
+</chapter>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/opennlp.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index 257bbb4..172d06c 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -89,5 +89,6 @@ under the License.
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./corpora.xml" />
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./machine-learning.xml" />
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./uima-integration.xml" />
+	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./morfologik-addon.xml" />
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./cli.xml" />
 </book>