Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/05/31 14:53:33 UTC

svn commit: r1129655 - /incubator/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml

Author: joern
Date: Tue May 31 12:53:33 2011
New Revision: 1129655

URL: http://svn.apache.org/viewvc?rev=1129655&view=rev
Log:
OPENNLP-17 Added documentation about custom feature generator xml

Modified:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml

Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml?rev=1129655&r1=1129654&r2=1129655&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/namefinder.xml Tue May 31 12:53:33 2011
@@ -263,44 +263,168 @@ try {
 		
 		<section id="tools.namefind.training.featuregen">
 		<title>Custom Feature Generation</title>
-		<para>
-			OpenNLP defines a default feature generation which is used when no custom feature
-			generation is specified. Users which want to experiment with the feature generation
-			can provide a custom feature generator. The custom generator must be used for training
-			and for detecting the names. If the feature generation during training time and detection
-			time is different the name finder might not be able to detect names.
-			The following lines show how to construct a custom feature generator
-			<programlisting language="java">
-				<![CDATA[
-AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
-         new AdaptiveFeatureGenerator[]{
-           new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
-           new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
-           new OutcomePriorFeatureGenerator(),
-           new PreviousMapFeatureGenerator(),
-           new BigramNameFeatureGenerator(),
-           new SentenceFeatureGenerator(true, false)
-           });]]>
-			</programlisting>
-			which is similar to the default feature generator.
-			The javadoc of the feature generator classes explain what the individual feature generators do.
-			To write a custom feature generator please implement the AdaptiveFeatureGenerator interface or
-			if it must not be adaptive extend the FeatureGeneratorAdapter.
-			The train method which should be used is defined as
-			<programlisting language="java">
-				<![CDATA[
+			<para>
+				OpenNLP defines a default feature generation which is used when no custom feature
+				generation is specified. Users who want to experiment with the feature generation
+				can provide a custom feature generator, either via the API or via an XML descriptor file.
+			</para>
+			<section id="tools.namefind.training.featuregen.api">
+			<title>Feature Generation defined by API</title>
+			<para>
+				The custom generator must be used both for training
+				and for detecting the names. If the feature generation differs between training time
+				and detection time, the name finder might not be able to detect names.
+				The following lines show how to construct a custom feature generator
+				<programlisting language="java">
+					<![CDATA[
+	AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
+	         new AdaptiveFeatureGenerator[]{
+	           new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
+	           new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
+	           new OutcomePriorFeatureGenerator(),
+	           new PreviousMapFeatureGenerator(),
+	           new BigramNameFeatureGenerator(),
+	           new SentenceFeatureGenerator(true, false)
+	           });]]>
+				</programlisting>
+				which is similar to the default feature generator.
+				The javadoc of the feature generator classes explains what the individual feature generators do.
+				To write a custom feature generator, implement the AdaptiveFeatureGenerator interface or,
+				if it does not need to be adaptive, extend the FeatureGeneratorAdapter.
+				The train method which should be used is defined as
+				<programlisting language="java">
+					<![CDATA[
 public static TokenNameFinderModel train(String languageCode, String type, ObjectStream<NameSample> samples, 
        AdaptiveFeatureGenerator generator, final Map<String, Object> resources, 
        int iterations, int cutoff) throws IOException]]>
-			</programlisting>
-			and can take feature generator as an argument.
-			To detect names the model which was returned from the train method and the
-			feature generator must be passed to the NameFinderME constructor.
-			<programlisting language="java">
-				<![CDATA[
+				</programlisting>
+				and can take a feature generator as an argument.
+				To detect names, the model returned from the train method and the
+				feature generator must be passed to the NameFinderME constructor.
+				<programlisting language="java">
+					<![CDATA[
 new NameFinderME(model, featureGenerator, NameFinderME.DEFAULT_BEAM_SIZE);]]>
-			 </programlisting>	 
-		</para>
+				 </programlisting>	 
+			</para>
+			</section>
+			<section id="tools.namefind.training.featuregen.xml">
+			<title>Feature Generation defined by XML Descriptor</title>
+			<para>
+			OpenNLP can also use an XML descriptor file to configure the feature generation. The descriptor
+			file is stored inside the model after training and the feature generators are configured
+			correctly when the name finder is instantiated.
+			
+			The following sample shows an XML descriptor:
+				<programlisting language="xml">
+					<![CDATA[
+<generators>
+  <cache> 
+    <generators>
+      <window prevLength = "2" nextLength = "2">          
+        <tokenclass/>
+      </window>
+      <window prevLength = "2" nextLength = "2">                
+        <token/>
+      </window>
+      <definition/>
+      <prevmap/>
+      <bigram/>
+      <sentence begin="true" end="false"/>
+    </generators>
+  </cache> 
+</generators>]]>
+				 </programlisting>
+		    The root element must be generators; each sub-element adds a feature generator to the configuration.
+		    The sample XML is equivalent to the generators defined by the API above.
+			</para>
+			<para>
+			The following table shows the supported elements:
+			<table>
+			  <title>Generator elements</title>
+			  <tgroup cols="3">
+			    <colspec colname="c1"/>
+			    <colspec colname="c2"/>
+			    <thead>
+			      <row>
+				<entry>Element</entry>
+				<entry>Aggregated</entry>
+				<entry>Attributes</entry>
+			      </row>
+			    </thead>
+			    <tbody>
+			      <row>
+					<entry>generators</entry>
+					<entry>yes</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>cache</entry>
+					<entry>yes</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>charngram</entry>
+					<entry>no</entry>
+					<entry><emphasis>min</emphasis> and <emphasis>max</emphasis> specify the length of the generated character ngrams</entry>
+			      </row>
+			      <row>
+					<entry>definition</entry>
+					<entry>no</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>dictionary</entry>
+					<entry>no</entry>
+					<entry><emphasis>dict</emphasis> is the key of the dictionary resource to use,
+					       and <emphasis>prefix</emphasis> is a feature prefix string</entry>
+			      </row>
+			      <row>
+					<entry>prevmap</entry>
+					<entry>no</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>sentence</entry>
+					<entry>no</entry>
+					<entry><emphasis>begin</emphasis> and <emphasis>end</emphasis> control whether begin or end features are generated; both are optional boolean values</entry>
+			      </row>
+			      <row>
+					<entry>tokenclass</entry>
+					<entry>no</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>token</entry>
+					<entry>no</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>bigram</entry>
+					<entry>no</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>tokenpattern</entry>
+					<entry>no</entry>
+					<entry>none</entry>
+			      </row>
+			      <row>
+					<entry>window</entry>
+					<entry>yes</entry>
+					<entry><emphasis>prevLength</emphasis> and <emphasis>nextLength</emphasis> must be integers and specify the window size</entry>
+			      </row>
+			      <row>
+					<entry>custom</entry>
+					<entry>no</entry>
+					<entry><emphasis>class</emphasis> is the name of the feature generator class which will be loaded</entry>
+			      </row>
+			    </tbody>
+			  </tgroup>
+			</table>
+			Aggregated feature generators can contain other generators, like the cache or the window feature
+			generator in the sample.
+			</para>
+			</section>
 		</section>
 	</section>
 	<section id="tools.namefind.eval">