Posted to commits@opennlp.apache.org by co...@apache.org on 2016/12/28 03:45:09 UTC
[5/6] opennlp git commit: Adds a small documentation section for
Morfologik add-on
Adds a small documentation section for Morfologik add-on
See issue OPENNLP-902
Project: http://git-wip-us.apache.org/repos/asf/opennlp/repo
Commit: http://git-wip-us.apache.org/repos/asf/opennlp/commit/4f2441bc
Tree: http://git-wip-us.apache.org/repos/asf/opennlp/tree/4f2441bc
Diff: http://git-wip-us.apache.org/repos/asf/opennlp/diff/4f2441bc
Branch: refs/heads/trunk
Commit: 4f2441bc1b50502b95a86bff94e8a9544322baf5
Parents: 001b970
Author: William Colen <co...@apache.org>
Authored: Wed Dec 28 01:43:55 2016 -0200
Committer: William Colen <co...@apache.org>
Committed: Wed Dec 28 01:43:55 2016 -0200
----------------------------------------------------------------------
.../src/docbkx/morfologik-addon.out.xml | 0
opennlp-docs/src/docbkx/morfologik-addon.xml | 153 +++++++++++++++++++
opennlp-docs/src/docbkx/opennlp.xml | 1 +
3 files changed, 154 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/morfologik-addon.out.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.out.xml b/opennlp-docs/src/docbkx/morfologik-addon.out.xml
new file mode 100644
index 0000000..e69de29
http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/morfologik-addon.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/morfologik-addon.xml b/opennlp-docs/src/docbkx/morfologik-addon.xml
new file mode 100644
index 0000000..6f18844
--- /dev/null
+++ b/opennlp-docs/src/docbkx/morfologik-addon.xml
@@ -0,0 +1,153 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
+ license agreements. See the NOTICE file distributed with this work for additional
+ information regarding copyright ownership. The ASF licenses this file to
+ you under the Apache License, Version 2.0 (the "License"); you may not use
+ this file except in compliance with the License. You may obtain a copy of
+ the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
+ by applicable law or agreed to in writing, software distributed under the
+ License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
+ OF ANY KIND, either express or implied. See the License for the specific
+ language governing permissions and limitations under the License. -->
+
+
+<chapter id="tools.morfologik-addon">
+ <title>Morfologik Addon</title>
+ <para>
+ <ulink url="https://github.com/morfologik/morfologik-stemming"><citetitle>Morfologik</citetitle></ulink>
+ provides tools for constructing finite state automata (FSA) and FSA-based morphological dictionaries.
+ </para>
+ <para>
+ The Morfologik Addon implements OpenNLP interfaces and extensions so that Morfologik FSA dictionaries can be used from OpenNLP.
+ </para>
+ <section id="tools.morfologik-addon.api">
+ <title>Morfologik Integration</title>
+ <para>
+ To ease integration with OpenNLP, the following implementations are provided:
+ <itemizedlist mark='opencircle'>
+ <listitem>
+ <para>
+ The <code>MorfologikPOSTaggerFactory</code> extends <code>POSTaggerFactory</code> and helps create a POSTagger model with an embedded FSA TagDictionary.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The <code>MorfologikTagDictionary</code> implements an FSA-based <code>TagDictionary</code>, which allows for much smaller files than the default XML-based dictionary and lower memory consumption.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ The <code>MorfologikLemmatizer</code> implements a <code>Lemmatizer</code> backed by an FSA-based dictionary.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ The first two implementations can be used directly from the command line, as in the example below. Given an FSA Morfologik dictionary (see the next section for how to build one), you can train a POS Tagger
+ model with an embedded FSA dictionary.
+ </para>
+ <para>
+ The example trains a POSTagger on a CoNLL corpus named <code>portuguese_bosque_train.conll</code> with an FSA dictionary named
+ <code>pt-morfologik.dict</code>. It outputs a model named <code>pos-pt_fsadic.model</code>.
+
+ <screen>
+ <![CDATA[
+$ bin/opennlp POSTaggerTrainer -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll \
+ -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict]]>
+ </screen>
+
+ </para>
+ <para>
+ The next example shows how to use the <code>MorfologikLemmatizer</code>. You will need a lemma dictionary and an info file. This example uses a very small Portuguese dictionary.
+ Each line has the form <code>lemma,lexeme,postag</code>, where the lexeme is the inflected word form.
+ </para>
+ <para>
+ File <code>lemmaDictionary.txt</code>:
+ <screen>
+ <![CDATA[
+casa,casa,NOUN
+casar,casa,V
+casar,casar,V-INF
+Casa,Casa,PROP
+casa,casinha,NOUN
+casa,casona,NOUN
+menino,menina,NOUN
+menino,menino,NOUN
+menino,meninão,NOUN
+menino,menininho,NOUN
+carro,carro,NOUN]]>
+ </screen>
+ </para>
+ <para>
+ The mandatory metadata file must have the same base name as the dictionary, with an <code>.info</code> extension. File <code>lemmaDictionary.info</code>:
+ <screen>
+ <![CDATA[
+#
+# REQUIRED PROPERTIES
+#
+
+# Column (lemma, inflected, tag) separator. This must be a single byte in the target encoding.
+fsa.dict.separator=,
+
+# The charset in which the input is encoded. UTF-8 is strongly recommended.
+fsa.dict.encoding=UTF-8
+
+# The type of lemma-inflected form encoding compression that precedes automaton
+# construction. Allowed values: [suffix, infix, prefix, none].
+# Details are in Daciuk's paper and in the code.
+# Leave at 'prefix' if not sure.
+fsa.dict.encoder=prefix
+ ]]>
+ </screen>
+ </para>
+ <para>
+ The following code creates a binary FSA Morfologik dictionary, loads it into a <code>MorfologikLemmatizer</code>, and uses it to
+ find the lemma of the word "casa" both as a noun and as a verb.
+
+ <programlisting language="java">
+ <![CDATA[
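+// JDK imports used by this snippet; MorfologikDictionayBuilder and
+// MorfologikLemmatizer come from the Morfologik add-on.
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.Arrays;
+
+// Part 1: compile an FSA lemma dictionary
+
+// We need the tabular dictionary. An info file with the same name,
+// but with an .info extension, is mandatory.
+Path textLemmaDictionary = Paths.get("lemmaDictionary.txt");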
+
+// this will build a binary dictionary located in compiledLemmaDictionary
+Path compiledLemmaDictionary = new MorfologikDictionayBuilder()
+ .build(textLemmaDictionary);
+
+// Part 2: load a MorfologikLemmatizer and use it
+MorfologikLemmatizer lemmatizer = new MorfologikLemmatizer(compiledLemmaDictionary);
+
+String[] toks = {"casa", "casa"};
+String[] tags = {"NOUN", "V"};
+
+String[] lemmas = lemmatizer.lemmatize(toks, tags);
+System.out.println(Arrays.toString(lemmas)); // outputs [casa, casar]
+ ]]>
+ </programlisting>
+
+ </para>
+ </section>
+ <section id="tools.morfologik-addon.cmdline">
+ <title>Morfologik CLI Tools</title>
+ <para>
+ The Morfologik addon provides command line tools. <code>XMLDictionaryToTable</code> makes it easy to convert an OpenNLP XML-based dictionary
+ to a tabular format. <code>MorfologikDictionaryBuilder</code> takes a tabular dictionary and outputs a binary Morfologik FSA dictionary.
+ </para>
+ <screen>
+ <![CDATA[
+$ sh bin/morfologik-addon
+OpenNLP Morfologik Addon. Usage: opennlp-morfologik-addon TOOL
+where TOOL is one of:
+ MorfologikDictionaryBuilder builds a binary POS Dictionary using Morfologik
+ XMLDictionaryToTable reads an OpenNLP XML tag dictionary and outputs it in a tabular file
+All tools print help when invoked with help parameter
+Example: opennlp-morfologik-addon POSDictionaryBuilder help
+ ]]>
+ </screen>
+ </section>
+</chapter>
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/opennlp/blob/4f2441bc/opennlp-docs/src/docbkx/opennlp.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index 257bbb4..172d06c 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -89,5 +89,6 @@ under the License.
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./corpora.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./machine-learning.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./uima-integration.xml" />
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./morfologik-addon.xml" />
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./cli.xml" />
</book>