Posted to commits@opennlp.apache.org by jz...@apache.org on 2022/03/16 15:15:17 UTC
[opennlp] branch master updated: OPENNLP-1356: Documenting onnx support. (#404)

This is an automated email from the ASF dual-hosted git repository.

jzemerick pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/opennlp.git


The following commit(s) were added to refs/heads/master by this push:
     new 6c5e9b6  OPENNLP-1356: Documenting onnx support. (#404)
6c5e9b6 is described below

commit 6c5e9b6260287227f9dde64c18853196640820ae
Author: Jeff Zemerick <je...@mtnfog.com>
AuthorDate: Wed Mar 16 08:14:52 2022 -0700

    OPENNLP-1356: Documenting onnx support. (#404)
---
 opennlp-docs/src/docbkx/doccat.xml           | 32 ++++++++++++++++++++++++----
 opennlp-docs/src/docbkx/introduction.xml     | 26 ++++++++++++++++++++++
 opennlp-docs/src/docbkx/namefinder.xml       | 27 ++++++++++++++++++++---
 opennlp-docs/src/docbkx/opennlp.xml          |  2 +-
 opennlp-docs/src/main/resources/xsl/html.xsl |  5 ++++-
 5 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/opennlp-docs/src/docbkx/doccat.xml b/opennlp-docs/src/docbkx/doccat.xml
index 3c456b9..3181640 100644
--- a/opennlp-docs/src/docbkx/doccat.xml
+++ b/opennlp-docs/src/docbkx/doccat.xml
@@ -49,8 +49,13 @@ adjustments to obligations towards dealers.]]>
 	<section id="tools.doccat.classifying.cmdline">
 		<title>Document Categorizer Tool</title>
 		<para>
+			Note that ONNX model support is not available through the command line tool. The models that can be trained
+			using the tool are OpenNLP models. ONNX models are trained through deep learning frameworks and then
+			utilized by OpenNLP.
+		</para>
+		<para>
 		The easiest way to try out the document categorizer is the command line tool. The tool is only
-		intended for demonstration and testing. The following command shows how to use the document categorizer tool. 
+		intended for demonstration and testing. The following command shows how to use the document categorizer tool.
 		  <screen>
 			<![CDATA[
 $ opennlp Doccat model]]>
@@ -63,11 +68,11 @@ $ opennlp Doccat model]]>
 		<title>Document Categorizer API</title>
 		<para>
 			To perform classification you will need a maxent model -
-			these are encapsulated in the DoccatModel class of OpenNLP tools.
+			these are encapsulated in the DoccatModel class of OpenNLP tools - or an ONNX model trained
+			for document classification.
 		</para>
 		<para>
-			First you need to grab the bytes from the serialized model on an InputStream - 
-			we'll leave it you to do that, since you were the one who serialized it to begin with. Now for the easy part:
+			Using an OpenNLP model, first you need to grab the bytes from the serialized model on an InputStream:
 						<programlisting language="java">
 				<![CDATA[
 InputStream is = ...
@@ -82,6 +87,25 @@ double[] outcomes = myCategorizer.categorize(inputText);
 String category = myCategorizer.getBestCategory(outcomes);]]>
 				</programlisting>
 		</para>
+		<section id="tools.namefind.api.onnx">
+			<title>Using an ONNX Model</title>
+			<para>
+				Using an ONNX model is similar, except we will utilize the <code>DocumentCategorizerDL</code> class instead.
+				You must provide the path to the model file and the vocabulary file to the document categorizer.
+				(There is no need to load the model as an InputStream as in the previous example.)
+				<programlisting language="java">
+					<![CDATA[
+File model = new File("/path/to/model.onnx");
+File vocab = new File("/path/to/vocab.txt");
+Map<Integer, String> categories = new HashMap<>();
+String[] inputText = new String[]{"My input text is great."};
+final DocumentCategorizerDL myCategorizer = new DocumentCategorizerDL(model, vocab, categories);
+double[] outcomes = myCategorizer.categorize(inputText);
+String category = myCategorizer.getBestCategory(outcomes);]]>
+				</programlisting>
+				For additional examples, refer to the <code>DocumentCategorizerDLEval</code> class.
+			</para>
+		</section>
 	</section>
 	</section>
 	<section id="tools.doccat.training">
diff --git a/opennlp-docs/src/docbkx/introduction.xml b/opennlp-docs/src/docbkx/introduction.xml
index 9ba7727..484e5b0 100644
--- a/opennlp-docs/src/docbkx/introduction.xml
+++ b/opennlp-docs/src/docbkx/introduction.xml
@@ -297,4 +297,30 @@ $ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -
         </section>
     </section>
 
+    <section id="intro.models">
+    <title>OpenNLP Models</title>
+        <section id="intro.models.native">
+            <title>OpenNLP Models</title>
+            <para>
+                OpenNLP supports training NLP models that can be used by OpenNLP. In this
+                documentation we will refer to these models as "OpenNLP models." All NLP
+                components of OpenNLP support this type of model. The sections below in
+                this documentation describe how to train and use these models. <ulink url="https://opennlp.apache.org/models.html">Pre-trained
+                models</ulink> are available for some languages and some of the OpenNLP components.
+            </para>
+        </section>
+        <section id="intro.models.onnx">
+            <title>ONNX Models</title>
+            <para>
+                OpenNLP supports ONNX models via the ONNX Runtime for the <link linkend="tools.namefind">Name Finder</link>
+                and <link linkend="tools.doccat">Document Categorizer</link>. This allows models trained by other frameworks
+                such as PyTorch and TensorFlow to be used by OpenNLP. The documentation for
+                each of the OpenNLP components that supports ONNX models describes how to
+                use ONNX models for inference. Note that OpenNLP does not support training
+                models that can be used by the ONNX Runtime - ONNX models must be created
+                outside of OpenNLP using other tools.
+            </para>
+        </section>
+    </section>
+
 </chapter>
diff --git a/opennlp-docs/src/docbkx/namefinder.xml b/opennlp-docs/src/docbkx/namefinder.xml
index e98dc49..d84e257 100644
--- a/opennlp-docs/src/docbkx/namefinder.xml
+++ b/opennlp-docs/src/docbkx/namefinder.xml
@@ -149,9 +149,25 @@ Span nameSpans[] = nameFinder.find(sentence);]]>
 			Additionally to the statistical Name Finder, OpenNLP also offers a dictionary and a regular
 			expression name finder implementation.
 		</para>
-		<para>
-			TODO: Explain how to retrieve probs from the name finder for names and for non recognized names
-		</para>
+			<section id="tools.namefind.api.onnx">
+			<title>Using an ONNX Model</title>
+				<para>
+					Using an ONNX model is similar, except we will utilize the <code>NameFinderDL</code> class instead.
+					You must provide the path to the model file and the vocabulary file to the name finder.
+					(There is no need to load the model as an InputStream as in the previous example.) The name finder
+					requires a tokenized list of strings as input. The output will be an array of spans.
+					<programlisting language="java">
+						<![CDATA[
+File model = new File("/path/to/model.onnx");
+File vocab = new File("/path/to/vocab.txt");
+Map<Integer, String> categories = new HashMap<>();
+String[] tokens = new String[]{"George", "Washington", "was", "president", "of", "the", "United", "States", "."};
+NameFinderDL nameFinderDL = new NameFinderDL(model, vocab, false, getIds2Labels());
+Span[] spans = nameFinderDL.find(tokens);]]>
+					</programlisting>
+					For additional examples, refer to the <code>NameFinderDLEval</code> class.
+				</para>
+			</section>
 	</section>
 	</section>
 	<section id="tools.namefind.training">
@@ -170,6 +186,11 @@ Span nameSpans[] = nameFinder.find(sentence);]]>
 			download page on various corpora.
 		</para>
 		<para>
+			Note that ONNX model support is not available through the command line tool. The models that can be trained
+			using the tool are OpenNLP models. ONNX models are trained through deep learning frameworks and then
+			utilized by OpenNLP.
+		</para>
+		<para>
 			The data can be converted to the OpenNLP name finder training format. Which is one
             sentence per line. Some other formats are available as well.
 			The sentence must be tokenized and contain spans which mark the entities. Documents are separated by
diff --git a/opennlp-docs/src/docbkx/opennlp.xml b/opennlp-docs/src/docbkx/opennlp.xml
index 2f7e2fa..75184b9 100644
--- a/opennlp-docs/src/docbkx/opennlp.xml
+++ b/opennlp-docs/src/docbkx/opennlp.xml
@@ -64,7 +64,7 @@ under the License.
 
 		<copyright>
 			<year>2011</year>
-			<year>2014</year>
+			<year>2022</year>
 			<holder>The Apache Software Foundation</holder>
 		</copyright>
 		
diff --git a/opennlp-docs/src/main/resources/xsl/html.xsl b/opennlp-docs/src/main/resources/xsl/html.xsl
index 730b6c6..a9f587e 100644
--- a/opennlp-docs/src/main/resources/xsl/html.xsl
+++ b/opennlp-docs/src/main/resources/xsl/html.xsl
@@ -25,6 +25,9 @@
   <xsl:import href="urn:docbkx:stylesheet"/>
 
   <!-- set bellow all your custom xsl configuration -->
-  <xsl:import href="urn:docbkx:stylesheet/highlight.xsl"/> 
+  <xsl:import href="urn:docbkx:stylesheet/highlight.xsl"/>
+
+  <xsl:param name="generate.section.toc.level" select="4"></xsl:param>
+  <xsl:param name="toc.section.depth">4</xsl:param>
 
 </xsl:stylesheet>
\ No newline at end of file