Posted to commits@ctakes.apache.org by bl...@apache.org on 2012/11/15 23:34:48 UTC

svn commit: r1410072 - /incubator/ctakes/site/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.mdtext

Author: bleeker
Date: Thu Nov 15 22:34:47 2012
New Revision: 1410072

URL: http://svn.apache.org/viewvc?rev=1410072&view=rev
Log:
CMS commit to ctakes by bleeker

Added:
    incubator/ctakes/site/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.mdtext   (with props)

Added: incubator/ctakes/site/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.mdtext
URL: http://svn.apache.org/viewvc/incubator/ctakes/site/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.mdtext?rev=1410072&view=auto
==============================================================================
--- incubator/ctakes/site/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.mdtext (added)
+++ incubator/ctakes/site/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.mdtext Thu Nov 15 22:34:47 2012
@@ -0,0 +1,214 @@
+Title:
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+# cTAKES 2.6 - Chunker
+
+## Overview of Chunker
+
+In cTAKES when we refer to a "chunker" we often mean a shallow parser, i.e. a
+component that tags noun phrases, verb phrases, etc.
+
+This project supports three tasks:
+
+  * Building a model from training data;
+  * Tagging text, using a trained model;
+  * Adjusting the end offset of certain chunks so they envelop other chunks, for certain patterns of chunks.
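+
+As a rough illustration of the third task, the sketch below moves the end
+offset of the first chunk in a matching pattern so that it envelops the chunks
+that follow it. The `Chunk` class, the `adjust_chunks` function and the
+NP PP NP pattern are assumptions made for this example only; they are not the
+project's actual classes or configuration.
+
+```python
+from dataclasses import dataclass, replace
+from typing import List, Sequence
+
+
+@dataclass(frozen=True)
+class Chunk:
+    """Hypothetical chunk record: a type label plus character offsets."""
+    kind: str
+    begin: int
+    end: int
+
+
+def adjust_chunks(chunks: Sequence[Chunk], pattern: Sequence[str],
+                  extend_to: int) -> List[Chunk]:
+    """Return a copy of `chunks` in which, for every run of chunks whose
+    types match `pattern`, the first chunk's end offset is moved to the end
+    of the chunk at index `extend_to` within the pattern."""
+    result = list(chunks)
+    n = len(pattern)
+    i = 0
+    while i + n <= len(result):
+        window = result[i:i + n]
+        if [c.kind for c in window] == list(pattern):
+            result[i] = replace(window[0], end=window[extend_to].end)
+            i += n  # chunks that took part in a match are not matched again
+        else:
+            i += 1
+    return result
+
+
+chunks = [Chunk("NP", 0, 8), Chunk("PP", 9, 11), Chunk("NP", 12, 21),
+          Chunk("VP", 22, 25)]
+adjusted = adjust_chunks(chunks, pattern=["NP", "PP", "NP"], extend_to=2)
+print(adjusted[0])  # the first NP now ends at offset 21
+```
+
+Only the end offset of the first chunk changes; the enveloped chunks are left
+in place, so downstream components can still see them.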
+
+This project provides a UIMA wrapper around the popular OpenNLP chunker. The
+UIMA examples project provides default wrappers for several of the components
+in OpenNLP, but not for the chunker. We have borrowed from the UIMA examples
+project liberally. Our wrapper works with our type system. Additionally, we
+added features and supporting components.
+
+A chunker model is included with this project.
+
+![](/images/icons/emoticons/information.png)
+
+The model derives from a combination of GENIA, Penn Treebank (Wall Street
+Journal) and clinical data anonymized per the HIPAA Safe Harbor guidelines.
+Before the model was built, patient names were removed from the clinical data
+to preserve patient confidentiality. Any person name in the model therefore
+originates from non-patient data sources.
+
+## Building a model
+
+### Prepare GENIA training data
+
+You need to download a copy of GENIA's Treebank corpus from
+[www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/GTB.html](http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/GTB.html). The
+version we used is called "beta". This version is distributed as a set of two
+files, one dated Sept. 22, 2004, with 200 "abstracts", and the other dated July 11,
+2005, with 300 "abstracts". Please download both. After extraction, place all
+the .tree files from the two downloads into one directory, which we'll refer to
+as <genia-trees>.
+
+Please also download [chunklink from
+ilk.uvt.nl](http://ilk.uvt.nl/team/sabine/homepage/software.html). The version
+we used is chunklink_2-2-2000_for_conll.pl. This tool, from the [Induction of
+Linguistic Knowledge (ILK)](http://ilk.uvt.nl/) group of Tilburg University,
+The Netherlands, converts Penn Treebank II files into a one-word-per-line
+format.
+
+Next, we'll use data.chunk.genia.Genia2PTB to convert the GENIA Treebank corpus to
+Penn Treebank II format, then use chunklink to convert that to chunk data, and
+finally use data.chunk.Chunklink2OpenNLP to convert the chunk data to OpenNLP format.
+
+![](/images/icons/emoticons/information.png)
+
+The Genia2PTB Java class a) renames the .tree files to names that look like
+wsj_0001.mrg, puts them in the directory structure expected by chunklink, and
+records a mapping between the new names and the original names; b) reformats
+the POS tags; c) adds an extra set of parentheses to each line of the data.
+
+  * Run data.chunk.genia.Genia2PTB:
+
+**java -cp** **_<classpath>_** **data.chunk.genia.Genia2PTB** **_<genia-trees>_** **_<ptb-trees>_** **_<genia-ptb-name-mapping>_**  
+Where
+
+**_<genia-trees>_** is the directory which holds the GENIA corpus files;  
+**_<ptb-trees>_** is the directory where the converted PTB trees will be written;  
+**_<genia-ptb-name-mapping>_** is a file that will be created by Genia2PTB to save the file name mappings.
+
+![](/images/icons/emoticons/check.png)
+
+**Tip**  
+
+There are a number of **problematic sentences** in the second set of 300
+treebanked abstracts (in <ptb-trees> after processing by
+data.chunk.genia.Genia2PTB) that caused the chunklink script to fail. We
+removed them when building our model. You need to remove the corresponding
+lines from the output of Genia2PTB. The files are listed below by their
+original GENIA names; to find the converted file names, look them up in
+<genia-ptb-name-mapping>.
+
+Each entry gives the file name followed by the numbers of the lines to remove,
+separated by commas.
+
+  * 93123257.tree - 6
+  * 93172387.tree - 3
+  * 93186809.tree - 5
+  * 93280865.tree - 7
+  * 94085904.tree - 6
+  * 94193110.tree - 2
+  * 96247631.tree - 3, 5
+  * 96353916.tree - 10
+  * 96357043.tree - 4
+  * 97031819.tree - 3, 4
+  * 97054651.tree - 7
+  * 97074532.tree - 6, 7
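+
+The removal step can be sketched as follows. This is only an illustration: it
+assumes the line numbers above are 1-based, and the names `PROBLEM_LINES`,
+`drop_lines` and `clean_file` are made up for this example. The format of
+<genia-ptb-name-mapping> is not described here, so the converted file name
+still has to be looked up separately.
+
+```python
+from typing import Dict, Iterable, List
+
+# Lines to remove, keyed by original GENIA file name (copied from the list
+# above). Line numbers are assumed to be 1-based.
+PROBLEM_LINES: Dict[str, List[int]] = {
+    "93123257.tree": [6],
+    "93172387.tree": [3],
+    "93186809.tree": [5],
+    "93280865.tree": [7],
+    "94085904.tree": [6],
+    "94193110.tree": [2],
+    "96247631.tree": [3, 5],
+    "96353916.tree": [10],
+    "96357043.tree": [4],
+    "97031819.tree": [3, 4],
+    "97054651.tree": [7],
+    "97074532.tree": [6, 7],
+}
+
+
+def drop_lines(lines: Iterable[str], to_drop: Iterable[int]) -> List[str]:
+    """Return `lines` without the 1-based line numbers listed in `to_drop`."""
+    skip = set(to_drop)
+    return [line for number, line in enumerate(lines, start=1)
+            if number not in skip]
+
+
+def clean_file(path: str, to_drop: Iterable[int]) -> None:
+    """Rewrite the file at `path` in place with the given lines removed."""
+    with open(path, encoding="utf-8") as handle:
+        kept = drop_lines(handle.readlines(), to_drop)
+    with open(path, "w", encoding="utf-8") as handle:
+        handle.writelines(kept)
+```
+
+All line numbers for one file are dropped in a single pass, so removing one
+line does not shift the numbering of the others. Keep a backup of <ptb-trees>
+before calling `clean_file` on a converted file.
+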
+
+  * Run chunklink:
+
+**perl chunklink_2-2-2000_for_conll.pl -NHhftc** **_<ptb-trees>_** **/wsj_????.mrg >** **_<chunklink-chunks>_**  
+Where
+
+**_<ptb-trees>_** is the directory of converted trees written by Genia2PTB;  
+**_<chunklink-chunks>_** is the redirected standard output from chunklink.
+
+![](/images/icons/emoticons/information.png)
+
+The chunklink script doesn't seem to work on Windows, but we did manage to run
+it in a Cygwin session.
+
+  * Run data.chunk.Chunklink2OpenNLP
+
+**java -cp** **_<classpath>_** **data.chunk.Chunklink2OpenNLP** **_<chunklink-chunks> <training-data>_**  
+Where
+
+**_<chunklink-chunks>_** is the output of chunklink from the previous step.  
+**_<training-data>_** is the resulting training data file.
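+
+To our understanding, OpenNLP chunker training data follows the CoNLL-2000
+layout: one token per line, giving the word, its part-of-speech tag and its
+chunk tag separated by spaces, with a blank line between sentences. Under that
+assumption, a quick sanity check of <training-data> might look like the sketch
+below; `find_bad_lines` is a name invented for this example and is not part of
+the project.
+
+```python
+import re
+from typing import Iterable, List, Tuple
+
+# A chunk tag is either "O" or "B-"/"I-" followed by a chunk type.
+CHUNK_TAG = re.compile(r"^(?:O|[BI]-\S+)$")
+
+
+def find_bad_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
+    """Return (1-based line number, text) for every non-blank line that is
+    not of the form 'word POS chunk-tag'."""
+    bad = []
+    for number, raw in enumerate(lines, start=1):
+        line = raw.strip()
+        if not line:
+            continue  # a blank line marks a sentence boundary
+        fields = line.split()
+        if len(fields) != 3 or not CHUNK_TAG.match(fields[2]):
+            bad.append((number, line))
+    return bad
+
+
+sample = ["He PRP B-NP", "reckons VBZ B-VP", "", "deficit NN", "narrow VB X-VP"]
+print(find_bad_lines(sample))  # [(4, 'deficit NN'), (5, 'narrow VB X-VP')]
+```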
+
+### Prepare Penn Treebank training data
+
+For instructions on [how to obtain the Penn Treebank
+corpus](http://ohnlp.sourceforge.net/cTAKES/#get_ptb), please refer to the
+section called [Obtaining training data in the cTAKES documentation on
+SourceForge](http://ohnlp.sourceforge.net/cTAKES/#ftn.id506867).
+
+Preparing Penn Treebank data is similar to preparing GENIA data, as described
+in the section called [Prepare GENIA training data in the cTAKES documentation
+on SourceForge](http://ohnlp.sourceforge.net/cTAKES/#prepare_genia_chunk),
+except that the first step (running Genia2PTB) is not necessary.
+
+  * Run chunklink:
+
+**perl chunklink_2-2-2000_for_conll.pl -NHhftc** **_<ptb-corpus>_** **/wsj_????.mrg >** **_<chunklink-chunks>_**  
+Where
+
+**_<ptb-corpus>_** is your Penn Treebank corpus directory.  
+**_<chunklink-chunks>_** is the redirected standard output from chunklink.
+
+  * Run Chunklink2OpenNLP
+
+**java -cp** **_<classpath>_** **data.chunk.Chunklink2OpenNLP** **_<chunklink-chunks>_** **_<training-data>_**
+
+Where
+
+**_<chunklink-chunks>_** is the output of chunklink from the previous step.  
+**_<training-data>_** is the resulting training data file.  
+
+### Build a model from your training data
+
+Building a chunker model is much easier than preparing the training data.
+After you have obtained training data, run the OpenNLP tool:
+
+**java -cp** **_<classpath>_** **opennlp.tools.chunker.ChunkerME** **_<training-data>_** **_<model-name>_** **_iterations_** **_cutoff_**  
+Where
+
+**_<training-data>_** is an OpenNLP training data file.  
+**_<model-name>_** is the file name of the resulting model. The name should end with either .txt (for a plain text model) or .bin.gz (for a compressed binary model).  
+**_iterations_** determines how many training iterations will be performed. The default is 100.  
+**_cutoff_** determines the minimum number of times a feature has to be seen to be considered for inclusion in the model. The default cutoff is 5.
+
+The iterations and cutoff arguments are optional, but only as a pair: provide
+both or provide neither.
+
+## Analysis engines (annotators)
+
+### Chunker.xml
+
+The file cTAKESdesc/chunkerdesc/analysis_engine/Chunker.xml provides a
+descriptor for the Chunker analysis engine, which is the UIMA component we have
+written to wrap the OpenNLP chunker. It calls
+**edu.mayo.bmi.uima.chunker.Chunker**, whose Javadoc provides information on
+how to customize this descriptor.
+
+**Parameters**
+
+  * ModelFile: the file that contains the chunker tagging model
+  * ChunkerCreatorClass: the full class name of an implementation of the interface edu.mayo.bmi.uima.chunker.ChunkerCreator
+
+### ChunkerAggregate.xml
+
+The file cTAKESdesc/chunkerdesc/analysis_engine/ChunkerAggregate.xml provides
+a descriptor that defines a pipeline for shallow parsing so that all the
+necessary inputs (e.g. tokens, sentences, and POS tags) have been added to the
+CAS. It inherits two parameters from
+[Chunker.xml](http://ohnlp.sourceforge.net/cTAKES/#chunker_xml) and three from
+[POSTagger.xml](http://ohnlp.sourceforge.net/cTAKES/#postagger_xml).
+
+  * Start UIMA CPE GUI.
+
+**java -cp** **_<classpath>_** **org.apache.uima.tools.cpm.CpmFrame**
+
+  * Open this file.
+  * Set the parameters for the collection reader to point to a local collection of files that you want shallow parsed.
+  * Set the parameters for the Chunker as appropriate for your environment.
+  * Set the output directory of the XCAS Writer CAS Consumer.
+
+The results of running the pipeline are written to the output directory as
+XCAS files. These files can be viewed in the CAS Visual Debugger.
\ No newline at end of file

Propchange: incubator/ctakes/site/trunk/content/ctakes/2.6.0/ctakes-2.6-Chunker.mdtext
------------------------------------------------------------------------------
    svn:eol-style = native