Posted to commits@opennlp.apache.org by co...@apache.org on 2011/01/13 19:17:54 UTC
svn commit: r1058698 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
Author: colen
Date: Thu Jan 13 18:17:53 2011
New Revision: 1058698
URL: http://svn.apache.org/viewvc?rev=1058698&view=rev
Log:
OPENNLP-60 Added documentation about ChunkConverter for AD format.
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml?rev=1058698&r1=1058697&r2=1058698&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml Thu Jan 13 18:17:53 2011
@@ -151,29 +151,36 @@ F-Measure: 0.8267557582133971]]>
<section id="tools.corpora.arvores-deitadas">
<title>Arvores Deitadas</title>
<para>
- TODO: Insert description after discussion on ML is finished.
- </para>
-
+ The Portuguese corpora available from the <ulink url="http://www.linguateca.pt">Floresta Sintá(c)tica</ulink> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from the AD format to its native format.
+ </para>
<section id="tools.corpora.arvores-deitadas.getting">
<title>Getting the data</title>
<para>
The Corpus can be downloaded from here: http://www.linguateca.pt/floresta/corpus.html
</para>
<para>
- The direct link to the corpus file: http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
+ The Name Finder models were trained using the Amazonia corpus: <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz">amazonia.ad</ulink>.
+ The Chunker models were trained using the <ulink url="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz">Bosque_CF_8.0.ad</ulink>.
</para>
</section>
<section id="tools.corpora.arvores-deitadas.converting">
<title>Converting the data</title>
<para>
- For now only the Token Name Finder is available:
+ To extract NameFinder training data from the Amazonia corpus:
<programlisting>
<![CDATA[
-$ bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data ../corpus/amazonia.ad \
+$ bin/opennlp TokenNameFinderConverter ad -encoding ISO-8859-1 -data amazonia.ad \
-lang pt -types per > corpus.txt]]>
</programlisting>
</para>
+ <para>
+ To extract Chunker training data from the Bosque_CF_8.0.ad corpus:
+ <programlisting>
+ <![CDATA[
+$ bin/opennlp ChunkerConverter ad -encoding ISO-8859-1 -data Bosque_CF_8.0.ad.txt > bosque-chunk]]>
+ </programlisting>
+ </para>
</section>
<section id="tools.corpora.arvores-deitadas.evaluation">
<title>Evaluation</title>