Posted to commits@ctakes.apache.org by bu...@apache.org on 2012/11/15 23:34:53 UTC
svn commit: r838531 - in /websites/staging/ctakes/trunk/content: ./ ctakes/2.6.0/ctakes-2.6-Chunker.html

Author: buildbot
Date: Thu Nov 15 22:34:52 2012
New Revision: 838531

Log:
Staging update by buildbot for ctakes

Added:
    websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.html
Modified:
    websites/staging/ctakes/trunk/content/   (props changed)

Propchange: websites/staging/ctakes/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 15 22:34:52 2012
@@ -1 +1 @@
-1410069
+1410072

Added: websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.html
==============================================================================
--- websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.html (added)
+++ websites/staging/ctakes/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.html Thu Nov 15 22:34:52 2012
@@ -0,0 +1,258 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<!--
+ 
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+ 
+       http://www.apache.org/licenses/LICENSE-2.0
+ 
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+<link href="/ctakes/css/ctakes.css" rel="stylesheet" type="text/css">
+
+<title></title>
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+
+</head>
+ 
+<body>
+ <div class="banner">
+      <div id="bannerleft">
+		<a href="http://www.apache.org/"><img src="http://www.apache.org/images/asf_logo_wide.gif" alt="The Apache Software Foundation" border="0"/></a>
+	<br/>
+			<img alt="cTAKES logo" src="/ctakes/images/ctakes_logo.jpg" border="0"/>
+      </div>  
+    <div id="bannerright">	
+	      <img id="asf-logo" alt="Apache Incubator" src="http://incubator.apache.org/images/egg-logo.png" border="0"/>
+	  </div>
+ </div>  
+  <div id="clear"></div>
+
+
+  <div id="sidenav">
+    <h1 id="general">General</h1>
+<ul>
+<li><a href="/ctakes/index.html">About</a></li>
+<li><a href="/ctakes/gettingstarted.html">Getting Started</a></li>
+<li><a href="/ctakes/downloads.html">Downloads</a></li>
+<li><a href="/ctakes/glossary.html">Glossary</a></li>
+</ul>
+<h1 id="community">Community</h1>
+<ul>
+<li><a href="/ctakes/get-involved.html">Get Involved</a></li>
+<li><a href="https://issues.apache.org/jira/browse/ctakes">Bug Tracker</a></li>
+<li><a href="/ctakes/mailing-lists.html">Mailing Lists</a></li>
+<li><a href="/ctakes/people.html">People</a></li>
+<li><a href="http://incubator.apache.org/projects/ctakes.html">Incubator page</a></li>
+<li><a href="/ctakes/license.html">License</a></li>
+<li><a href="/ctakes/history.html">History</a></li>
+<li><a href="/ctakes/community-faqs.html">Community FAQs</a></li>
+</ul>
+<h1 id="users">Users</h1>
+<ul>
+<li><a href="/ctakes/userguide.html">User Guide</a></li>
+<li><a href="/ctakes/user-faqs.html">User FAQs</a></li>
+</ul>
+<h1 id="developers">Developers</h1>
+<ul>
+<li><a href="/ctakes/developerguide.html">Developer Guide</a></li>
+<li><a href="/ctakes/developer-faqs.html">Developer FAQs</a></li>
+</ul>
+<h1 id="ppmc">PPMC</h1>
+<ul>
+<li><a href="/ctakes/ppmc-faqs.html">PPMC FAQs</a></li>
+<li><a href="/ctakes/ctakes-release-guide.html">Release Guide</a> <br />
+</li>
+</ul>
+<h1 id="asf">ASF</h1>
+<ul>
+<li><a href="http://www.apache.org">Apache Software Foundation</a></li>
+<li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+<li><a href="http://www.apache.org/foundation/sponsorship.html">Become a Sponsor</a></li>
+</ul>
+  </div>
+  <div id="contenta">
+    <h1 id="ctakes-26-chunker">cTAKES 2.6 - Chunker</h1>
+<h2 id="overview-of-chunker">Overview of Chunker</h2>
+<p>In cTAKES when we refer to a "chunker" we often mean a shallow parser, i.e. a
+component that tags noun phrases, verb phrases, etc.</p>
+<p>This project supports three tasks:</p>
+<ul>
+<li>Building a model from training data;</li>
+<li>Tagging text, using a trained model;</li>
+<li>Adjusting the end offset of certain chunks so they envelop other chunks, for certain patterns of chunks.</li>
+</ul>
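+<p>The third task can be illustrated with a minimal Java sketch. It is not the cTAKES implementation: the <code>Chunk</code> class, the <code>adjust</code> method and the NP PP NP pattern are hypothetical names chosen for illustration. Wherever the sequence of chunk types matches the pattern, the end offset of the first chunk is extended to the end offset of the last chunk in the match, so that the first chunk envelops the others.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkAdjustSketch {

    /** Hypothetical chunk: a type label plus begin and end character offsets. */
    public static final class Chunk {
        public final String type;
        public final int begin;
        public final int end;

        public Chunk(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + begin + "," + end + "]";
        }
    }

    /**
     * Scans the chunks left to right. Wherever the chunk types match the
     * pattern, the first chunk of the match is replaced by a copy whose end
     * offset equals the end offset of the last chunk of the match.
     */
    public static List<Chunk> adjust(List<Chunk> chunks, List<String> pattern) {
        List<Chunk> result = new ArrayList<>(chunks);
        if (pattern.isEmpty()) {
            return result; // nothing to match against
        }
        for (int i = 0; i + pattern.size() <= result.size(); i++) {
            boolean matches = true;
            for (int j = 0; j < pattern.size(); j++) {
                if (!result.get(i + j).type.equals(pattern.get(j))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                Chunk first = result.get(i);
                Chunk last = result.get(i + pattern.size() - 1);
                result.set(i, new Chunk(first.type, first.begin, last.end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "pain in the chest": NP covers "pain", PP covers "in", NP covers "the chest"
        List<Chunk> chunks = Arrays.asList(
                new Chunk("NP", 0, 4), new Chunk("PP", 5, 7), new Chunk("NP", 8, 17));
        System.out.println(adjust(chunks, Arrays.asList("NP", "PP", "NP")));
    }
}
```

+<p>Only the end offset of the first chunk changes; the other chunks in the match are left as they were.</p>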
+<p>This project provides a UIMA wrapper around the popular OpenNLP chunker. The
+UIMA examples project provides default wrappers for several of the components
+in OpenNLP, but not for the chunker. We have borrowed from the UIMA examples
+project liberally. Our wrapper works with our type system. Additionally, we
+added features and supporting components.</p>
+<p>A chunker model is included with this project.</p>
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>The model derives from a combination of GENIA, Penn Treebank (Wall Street
+Journal) and anonymized clinical data per Safe Harbor HIPAA guidelines. Prior
+to model building the clinical data was deidentified for patient names to
+preserve patient confidentiality. Any person name in the model will originate
+from non-patient data sources.</p>
+<h2 id="building-a-model-prepare-genia-training-data">Building a model - Prepare GENIA training data</h2>
+<p>You need to download a copy of GENIA's Treebank corpus from
+<a href="http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/GTB.html">www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/GTB.html</a>. The
+version we used is called "beta". This version is distributed as a set of two
+files, one dated Sept. 22, 2004, with 200 "abstracts", and the other dated July 11,
+2005, with 300 "abstracts". Please download both. After extraction, place all
+the .tree files from the two downloads into one directory, which we'll refer to
+as &lt;genia-trees&gt;.</p>
+<p>Please also download <a href="http://ilk.uvt.nl/team/sabine/homepage/software.html">chunklink from
+ilk.uvt.nl</a>. The version
+we used is chunklink_2-2-2000_for_conll.pl. This tool, from the <a href="http://ilk.uvt.nl/">Induction of
+Linguistic Knowledge (ILK)</a> group of Tilburg University,
+The Netherlands, converts Penn Treebank II files into a one-word-per-line
+format.</p>
+<p>Next, we'll use data.chunk.genia.Genia2PTB to convert the GENIA Treebank corpus to
+Penn Treebank II format, then use chunklink to convert that to chunk data, and
+finally use data.chunk.Chunklink2OpenNLP to convert the chunk data to OpenNLP format.</p>
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>The Genia2PTB class a) renames the .tree files to names that look like
+wsj_0001.mrg, puts them in the directory structure expected by chunklink, and
+creates a mapping of the new names to the original names; b) reformats the
+POS tags; c) adds an extra set of parentheses to each line
+of the data.</p>
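+<p>Step a) can be pictured with the following Java sketch. It is an illustration only, not the Genia2PTB source: the class and method names are hypothetical, and the alphabetical ordering and sequential numbering are assumptions. It assigns each original GENIA file name a name of the form wsj_0001.mrg and records the mapping from the new name back to the original one.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameMappingSketch {

    /**
     * Returns a map from new PTB-style file names (wsj_0001.mrg, wsj_0002.mrg, ...)
     * to the original GENIA file names. The originals are numbered in sorted
     * order, which is an assumption made for this sketch.
     */
    public static Map<String, String> mapNames(List<String> geniaNames) {
        List<String> sorted = new ArrayList<>(geniaNames);
        Collections.sort(sorted);
        Map<String, String> mapping = new LinkedHashMap<>();
        int counter = 1;
        for (String original : sorted) {
            mapping.put(String.format("wsj_%04d.mrg", counter), original);
            counter++;
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> mapping =
                mapNames(Arrays.asList("93172387.tree", "93123257.tree"));
        // Print one "new name <tab> original name" pair per line.
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

+<p>A mapping of this kind is what lets you translate an original GENIA file name into the converted file name later on.</p>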
+<ul>
+<li>Run data.chunk.genia.Genia2PTB:</li>
+</ul>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong> <strong>data.chunk.genia.Genia2PTB</strong> <strong><em>&lt;genia-trees&gt;</em></strong> <strong><em>&lt;ptb-trees&gt;</em></strong>
+<strong><em>&lt;genia-ptb-name-mapping&gt;</em></strong><br />
+Where</p>
+<p><strong><em>&lt;genia-trees&gt;</em></strong> is the directory that holds the GENIA corpus files;<br />
+<strong><em>&lt;ptb-trees&gt;</em></strong> is the directory where the converted PTB trees will be written;<br />
+<strong><em>&lt;genia-ptb-name-mapping&gt;</em></strong> is a file that Genia2PTB will create to save the file name mappings.</p>
+<p><img alt="" src="/images/icons/emoticons/check.png" /></p>
+<p><strong>Tip</strong><br />
+</p>
+<p>A number of <strong>problematic sentences</strong> in the second set of 300
+treebanked abstracts (in &lt;ptb-trees&gt; after processing by
+data.chunk.genia.Genia2PTB) caused the chunklink script to fail. We
+removed them when building our model. The original GENIA file names and the
+numbers of the lines to remove are listed below for your reference. You need to
+remove these lines from the output of Genia2PTB. To find the converted file
+names, look in &lt;genia-ptb-name-mapping&gt;.</p>
+<p>Line numbers are separated by commas.</p>
+<ul>
+<li>93123257.tree - 6</li>
+<li>93172387.tree - 3</li>
+<li>93186809.tree - 5</li>
+<li>93280865.tree - 7</li>
+<li>94085904.tree - 6</li>
+<li>94193110.tree - 2</li>
+<li>96247631.tree - 3, 5</li>
+<li>96353916.tree - 10</li>
+<li>96357043.tree - 4</li>
+<li>97031819.tree - 3, 4</li>
+<li>97054651.tree - 7</li>
+<li>97074532.tree - 6, 7</li>
+</ul>
+<ul>
+<li>Run chunklink:</li>
+</ul>
+<p><strong>perl chunklink_2-2-2000_for_conll.pl -NHhftc</strong> <strong><em>&lt;ptb-trees&gt;</em></strong><strong>/wsj_????.mrg &gt;</strong> <strong><em>&lt;chunklink-chunks&gt;</em></strong><br />
+Where</p>
+<p><strong><em>&lt;chunklink-chunks&gt;</em></strong> is the redirected standard output from chunklink. <br />
+</p>
+<p><img alt="" src="/images/icons/emoticons/information.png" /></p>
+<p>The chunklink script doesn't seem to work on Windows directly, but we did
+manage to run it in a Cygwin session.</p>
+<ul>
+<li>Run data.chunk.Chunklink2OpenNLP</li>
+</ul>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong> <strong>data.chunk.Chunklink2OpenNLP</strong> <strong><em>&lt;chunklink-chunks&gt; &lt;training-data&gt;</em></strong><br />
+Where</p>
+<p><strong><em>&lt;chunklink-chunks&gt;</em></strong> is the output of chunklink from the previous step.<br />
+<strong><em>&lt;training-data&gt;</em></strong> is the resulting training data file.</p>
+<h2 id="building-a-model-prepare-penn-treebank-training-data">Building a model - Prepare Penn Treebank training data</h2>
+<p>Please refer to the section called <a href="http://ohnlp.sourceforge.net/cTAKES/#ftn.id506867">Obtaining training data</a>
+in the cTAKES documentation on SourceForge for
+<a href="http://ohnlp.sourceforge.net/cTAKES/#get_ptb">how to
+obtain the Penn Treebank corpus</a>.</p>
+<p>Preparing Penn Treebank data is similar to preparing GENIA data, as described
+in the section called <a href="http://ohnlp.sourceforge.net/cTAKES/#prepare_genia_chunk">Prepare GENIA training data</a> in the cTAKES documentation
+on SourceForge,
+except that the first step (running Genia2PTB) is not necessary.</p>
+<ul>
+<li>Run chunklink:</li>
+</ul>
+<p><strong>perl chunklink_2-2-2000_for_conll.pl -NHhftc</strong> <strong><em>&lt;ptb-corpus&gt;</em></strong><strong>/wsj_????.mrg &gt;</strong> <strong><em>&lt;chunklink-chunks&gt;</em></strong><br />
+Where</p>
+<p><strong><em>&lt;ptb-corpus&gt;</em></strong> is your Penn Treebank corpus directory.<br />
+<strong><em>&lt;chunklink-chunks&gt;</em></strong> is the redirected standard output from chunklink.</p>
+<ul>
+<li>Run Chunklink2OpenNLP</li>
+</ul>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong> <strong>data.chunk.Chunklink2OpenNLP</strong> <strong><em>&lt;chunklink-chunks&gt;</em></strong> <strong><em>&lt;training-data&gt;</em></strong></p>
+<p>Where</p>
+<p><strong><em>&lt;chunklink-chunks&gt;</em></strong> is the output of chunklink from the previous step.<br />
+<strong><em>&lt;training-data&gt;</em></strong> is the resulting training data file.</p>
+<p><strong>Build a model from your training data</strong><br />
+Building a chunker model is much easier than preparing the training data.
+After you have obtained training data, run the OpenNLP tool:</p>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong> <strong>opennlp.tools.chunker.ChunkerME</strong> <strong><em>&lt;training-data&gt;</em></strong> <strong><em>&lt;model-name&gt;</em></strong> <strong><em>iterations</em></strong> <strong><em>cutoff</em></strong><br />
+Where</p>
+<p><strong><em>&lt;training-data&gt;</em></strong> is an OpenNLP training data file.<br />
+<strong><em>&lt;model-name&gt;</em></strong> is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).<br />
+<strong><em>iterations</em></strong> determines how many training iterations will be performed. The default is 100.<br />
+<strong><em>cutoff</em></strong> determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.<br />
+The iterations and cutoff arguments are optional as a pair: provide both or
+provide neither.</p>
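+<p>The both-or-neither rule for the optional arguments can be sketched in Java as follows. This is not the OpenNLP source; the class and method names are hypothetical, and only the defaults (100 iterations, cutoff 5) and the argument order come from the description above.</p>

```java
public class TrainArgsSketch {

    public static final int DEFAULT_ITERATIONS = 100;
    public static final int DEFAULT_CUTOFF = 5;

    /**
     * Expects either two arguments (training data, model name) or four
     * (training data, model name, iterations, cutoff). Returns a two-element
     * array holding the iterations and the cutoff, in that order.
     */
    public static int[] parseIterationsAndCutoff(String[] args) {
        if (args.length == 2) {
            // Neither optional argument given: fall back to the defaults.
            return new int[] {DEFAULT_ITERATIONS, DEFAULT_CUTOFF};
        }
        if (args.length == 4) {
            // Both optional arguments given.
            return new int[] {Integer.parseInt(args[2]), Integer.parseInt(args[3])};
        }
        throw new IllegalArgumentException(
                "Expected <training-data> <model-name> [iterations cutoff]");
    }

    public static void main(String[] args) {
        int[] settings = parseIterationsAndCutoff(
                new String[] {"train.txt", "chunker.bin.gz"});
        System.out.println("iterations=" + settings[0] + ", cutoff=" + settings[1]);
    }
}
```

+<p>Passing only one of the two optional values (three arguments in total) is rejected, which mirrors the rule that the pair must be given together or omitted together.</p>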
+<h2 id="analysis-engines-annotators">Analysis engines (annotators)</h2>
+<h3 id="chunkerxml">Chunker.xml</h3>
+<p>The file cTAKESdesc/chunkerdesc/analysis_engine/Chunker.xml provides a
+descriptor for the Chunker analysis engine, which is the UIMA component we have
+written to wrap the OpenNLP chunker. It calls
+<strong>edu.mayo.bmi.uima.chunker.Chunker</strong>, whose Javadoc provides information on
+how to customize this descriptor.</p>
+<p><strong>Parameters</strong></p>
+<ul>
+<li>ModelFile: the file that contains the chunker tagging model</li>
+<li>ChunkerCreatorClass: the full class name of an implementation of the interface
+edu.mayo.bmi.uima.chunker.ChunkerCreator</li>
+</ul>
+<h3 id="chunkeraggregatexml">ChunkerAggregate.xml</h3>
+<p>The file cTAKESdesc/chunkerdesc/analysis_engine/ChunkerAggregate.xml provides
+a descriptor that defines a pipeline for shallow parsing so that all the
+necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the
+CAS. It inherits two parameters from
+<a href="http://ohnlp.sourceforge.net/cTAKES/#chunker_xml">Chunker.xml</a> and three from
+<a href="http://ohnlp.sourceforge.net/cTAKES/#postagger_xml">POSTagger.xml</a>.</p>
+<ul>
+<li>Start UIMA CPE GUI.</li>
+</ul>
+<p><strong>java -cp</strong> <strong><em>&lt;classpath&gt;</em></strong> <strong>org.apache.uima.tools.cpm.CpmFrame</strong></p>
+<ul>
+<li>Open this file.</li>
+<li>Set the parameters for the collection reader to point to a local collection of files that you want shallow parsed.</li>
+<li>Set the parameters for the Chunker as appropriate for your environment.</li>
+<li>Set the output directory of the XCAS Writer CAS Consumer.</li>
+</ul>
+<p>The results of running the pipeline are written to the output directory as
+XCAS files. These files can be viewed in the CAS Visual Debugger.</p>
+  </div>
+ 
+ <div id="footera">
+    <div id="copyrighta">
+      <p>Copyright &#169; 2011 The Apache Software Foundation, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version 2.0</a>.<br/>Apache and the Apache feather logo are trademarks of The Apache Software Foundation.</p>
+    </div>
+ </div>
+ 
+</body>
+</html>
+