Posted to commits@opennlp.apache.org by co...@apache.org on 2011/01/13 19:17:54 UTC

svn commit: r1058698 - /incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml

Author: colen
Date: Thu Jan 13 18:17:53 2011
New Revision: 1058698

URL: http://svn.apache.org/viewvc?rev=1058698&view=rev
Log:
OPENNLP-60 Added documentation about ChunkConverter for AD format.

Modified:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml

Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml?rev=1058698&r1=1058697&r2=1058698&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml Thu Jan 13 18:17:53 2011
@@ -151,29 +151,36 @@ F-Measure: 0.8267557582133971]]>
 	<section id="tools.corpora.arvores-deitadas">
 		<title>Arvores Deitadas</title>
 		<para>
-		TODO: Insert description after discussion on ML is finished.
-		</para>
-		
+		The Portuguese corpora available from the <ulink url="http://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from the AD format to the OpenNLP native format.
+		</para>		
 		<section id="tools.corpora.arvores-deitadas.getting">
 			<title>Getting the data</title>
 			<para>
 			The Corpus can be downloaded from here: http://www.linguateca.pt/floresta/corpus.html
 			</para>
 			<para>
-			The direct link to the corpus file: http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
+			The Name Finder models were trained using the Amazonia corpus: <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
+			The Chunker models were trained using the <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink> corpus.
 			</para>
 		</section>
 		
 		<section id="tools.corpora.arvores-deitadas.converting">
 			<title>Converting the data</title>
 			<para>
-				For now only the Token Name Finder is available:
+				To extract NameFinder training data from the Amazonia corpus:
 			<programlisting>
 			<![CDATA[
-$ bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data ../corpus/amazonia.ad \
+$ bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad \
     -lang pt -types per > corpus.txt]]>
 			</programlisting>
 			</para>
+			<para>
+				To extract Chunker training data from the Bosque_CF_8.0.ad corpus:
+			<programlisting>
+			<![CDATA[
+$ bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data Bosque_CF_8.0.ad.txt > bosque-chunk]]>
+			</programlisting>
+			</para>
 		</section>
 		<section id="tools.corpora.arvores-deitadas.evaluation">
 			<title>Evaluation</title>