Posted to commits@ctakes.apache.org by ch...@apache.org on 2015/11/02 17:59:52 UTC
svn commit: r1712083 [2/3] - in /ctakes/sandbox/ctakes-clinical-deid: ./ GATE/ GATE/pipeline/ GATE/plugins/ GATE/plugins/ANNIE/ GATE/plugins/ANNIE/.annie-defaults-metadata/ GATE/plugins/ANNIE/resources/ GATE/plugins/ANNIE/resources/gazetteer/ GATE/plug...

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/nationality.lst
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/nationality.lst?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/nationality.lst (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/nationality.lst Mon Nov  2 16:59:48 2015
@@ -0,0 +1,228 @@
+Afghan
+African-American
+Albanian
+Algerian
+Andorran
+Angolan
+Antiguans
+Argentinean
+Argentinian
+Armenian
+Australian
+Austrian
+Azerbaijani
+Bahamian
+Bahraini
+Bangladeshi
+Barbadian
+Barbudans
+Batswana
+Belarusan
+Belarusian
+Belgian
+Belizean
+Beninese
+Bhutanese
+Bolivian
+Bosnian
+Botswanan
+Brazilian
+British
+Bruneian
+Bulgarian
+Burkinabe
+Burkinese
+Burmese
+Burundian
+Cambodian
+Cameroonian
+Canadian
+Chadian
+Chilean
+Chinese
+Colombian
+Columbian
+Comoran
+Congolese
+Croatian
+Cuban
+Cypriot
+Czech
+Danish
+Djibouti
+Djiboutian
+Dominican
+Dutch
+East Timorese
+Ecuadorean
+Ecuadorian
+Egyptian
+Emirati
+Emirian
+Equadorian
+Eritrean
+Estonian
+Ethiopian
+Fijian
+Filipino
+Finnish
+French
+Gabonese
+Gambian
+Georgian
+German
+Ghanaian
+Greek
+Grenadian
+Guatemalan
+Guinea-Bissauan
+Guinean
+Guyanese
+Haitian
+Herzegovinian
+Honduran
+Hungarian
+Icelander
+Icelandic
+I-Kiribati
+Indian
+Indonesian
+Iranian
+Iraqi
+Irish
+Israeli
+Italian
+Ivorian
+Jamaican
+Japanese
+Jordanian
+Kazakh
+Kazakhstani
+Khazakhstani
+Kenyan
+Kittian
+Nevisian
+Kuwaiti
+Kyrgyz
+Laotian
+Latvian
+Lebanese
+Liberian
+Libyan
+Liechtensteiner
+Lithuanian
+Luxembourger
+Macedonian
+Madagascan
+Malagasy
+Malawian
+Malaysian
+Maldivan
+Maldivian
+Malian
+Maltese
+Marshallese
+Mauritanian
+Mauritian
+Mexican
+Micronesian
+Moldovan
+Monacan
+Mongolian
+Montenegrin
+Moroccan
+Mosotho
+Motswana
+Mozambican
+Namibian
+Nauruan
+Nepalese
+New Zealander
+Nicaraguan
+Nigerian
+Nigerien
+Northern Irish
+North Korean
+Norwegian
+Omani
+Pakistani
+Palauan
+Panamanian
+Paraguayan
+Peruvian
+Philippine
+Polish
+Portuguese
+Qatari
+Romanian
+Russian
+Rwandan
+Salvadoran
+Salvadorean
+Samoan
+Scottish
+Senegalese
+Serb
+Serbian
+Seychellois
+Singaporean
+Slovak
+Slovakian
+Slovene
+Slovenian
+Solomon Islander
+Somali
+South African
+South Korean
+Spanish
+Sri Lankan
+Sudanese
+Surinamer
+Surinamese
+Swazi
+Swedish
+Swiss
+Syrian
+Tadjik
+Tadjiki
+Taiwanese
+Tajikistani
+Tajik
+Tajiki
+Tanzanian
+Thai
+Tobagonian
+Togolese
+Tongan
+Trinidadian
+Tunisian
+Turkish
+Turkmen
+Turkoman
+Tuvaluan
+Ugandan
+Ukrainian
+Uruguayan
+Uzbek
+Uzbeki
+Uzbekistani
+Vanuatuan
+Venezuelan
+Vietnamese
+Welsh
+Yemeni
+Yemenite
+Yugoslav
+Zairean
+Zambian
+Zimbabwean
+English
+San Marinese
+Sao Tomean
+Papua New Guinean
+Western Samoan
+Saint Lucian
+Sierra Leonean
+Sierra Leonian
+Equatorial Guinean
+

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/spoken_language.lst
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/spoken_language.lst?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/spoken_language.lst (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/spoken_language.lst Mon Nov  2 16:59:48 2015
@@ -0,0 +1,174 @@
+Afghan-speaking
+Albanian-speaking
+Algerian-speaking
+Andorran-speaking
+Angolan-speaking
+Argentinian-speaking
+Armenian-speaking
+Australian-speaking
+Austrian-speaking
+Azerbaijani-speaking
+Bahamian-speaking
+Bahraini-speaking
+Bangladeshi-speaking
+Barbadian-speaking
+Belarusian-speaking
+Belarusan-speaking
+Belgian-speaking
+Belizean-speaking
+Beninese-speaking
+Bhutanese-speaking
+Bolivian-speaking
+Bosnian-speaking
+Botswanan-speaking
+Brazilian-speaking
+British-speaking
+Bruneian-speaking
+Bulgarian-speaking
+Burkinese-speaking
+Burmese-speaking
+Burundian-speaking
+Cambodian-speaking
+Cameroonian-speaking
+Canadian-speaking
+Cape Verdean-speaking
+Chadian-speaking
+Chilean-speaking
+Chinese-speaking
+Colombian-speaking
+Congolese-speaking
+Croatian-speaking
+Cuban-speaking
+Cypriot-speaking
+Czech-speaking
+Danish-speaking
+Djiboutian-speaking
+Dominican-speaking
+Dominican-speaking
+Ecuadorean-speaking
+English-speaking
+Egyptian-speaking
+Salvadorean-speaking
+Eritrean-speaking
+Estonian-speaking
+Ethiopian-speaking
+Fijian-speaking
+Finnish-speaking
+Gabonese-speaking
+Gambian-speaking
+Georgian-speaking
+German-speaking
+Ghanaian-speaking
+Greek-speaking
+Grenadian-speaking
+Guatemalan-speaking
+Guinean-speaking
+Guyanese-speaking
+Haitian-speaking
+Dutch-speaking
+Honduran-speaking
+Hungarian-speaking
+Icelandic-speaking
+Indian-speaking
+Indonesian-speaking
+Iranian-speaking
+Iraqi-speaking
+Irish-speaking
+Italian-speaking
+Jamaican-speaking
+Japanese-speaking
+Jordanian-speaking
+Kazakh-speaking
+Kenyan-speaking
+Kuwaiti-speaking
+Laotian-speaking
+Latvian-speaking
+Lebanese-speaking
+Liberian-speaking
+Libyan-speaking
+Lithuanian-speaking
+Macedonian-speaking
+Madagascan-speaking
+Malawian-speaking
+Malaysian-speaking
+Maldivian-speaking
+Malian-speaking
+Maltese-speaking
+Mauritanian-speaking
+Mauritian-speaking
+Mexican-speaking
+Moldovan-speaking
+Monacan-speaking
+Mongolian-speaking
+Montenegrin-speaking
+Moroccan-speaking
+Mozambican-speaking
+Namibian-speaking
+Nepalese-speaking
+Dutch-speaking
+Nicaraguan-speaking
+Nigerien-speaking
+Nigerian-speaking
+Norwegian-speaking
+Omani-speaking
+Pakistani-speaking
+Panamanian-speaking
+Guinean-speaking
+Paraguayan-speaking
+Peruvian-speaking
+Persian-speaking
+Philippine-speaking
+Polish-speaking
+Portuguese-speaking
+Qatari-speaking
+Romanian-speaking
+Russian-speaking
+Rwandan-speaking
+Saudi-speaking
+Scottish-speaking
+Senegalese-speaking
+Serb-speaking
+Serbian-speaking
+Seychellois-speaking
+Sierra Leonian-speaking
+Singaporean-speaking
+Slovak-speaking
+Slovene-speaking
+Slovenian-speaking
+Somali-speaking
+Spanish-speaking
+Sri Lankan-speaking
+Sudanese-speaking
+Surinamese-speaking
+Swazi-speaking
+Swedish-speaking
+Swiss-speaking
+Syrian-speaking
+Taiwanese-speaking
+Tajik-speaking
+Tadjik-speaking
+Tanzanian-speaking
+Thai-speaking
+Togolese-speaking
+Tobagonian-speaking
+Turkish-speaking
+Turkoman-speaking
+Turkmen-speaking
+Tuvaluan-speaking
+Ugandan-speaking
+Ukrainian-speaking
+Emirati-speaking
+British-speaking
+Uruguayan-speaking
+Uzbek-speaking
+Vanuatuan-speaking
+Venezuelan-speaking
+Vietnamese-speaking
+Welsh-speaking
+Western Samoan-speaking
+Yemeni-speaking
+Yugoslav-speaking
+Zairean-speaking
+Zambian-speaking
+Zimbabwean-speaking
+Equadorian-speaking

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state.lst
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state.lst?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state.lst (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state.lst Mon Nov  2 16:59:48 2015
@@ -0,0 +1,59 @@
+ALABAMA
+ALASKA
+AMERICAN SAMOA
+ARIZONA
+ARKANSAS
+CALIFORNIA
+COLORADO
+CONNECTICUT
+DELAWARE
+DISTRICT OF COLUMBIA
+FEDERATED STATES OF MICRONESIA
+FLORIDA
+GEORGIA
+GUAM
+HAWAII
+IDAHO
+ILLINOIS
+INDIANA
+IOWA
+KANSAS
+KENTUCKY
+LOUISIANA
+MAINE
+MARSHALL ISLANDS
+MARYLAND
+MASSACHUSETTS
+MICHIGAN
+MINNESOTA
+MISSISSIPPI
+MISSOURI
+MONTANA
+NEBRASKA
+NEVADA
+NEW HAMPSHIRE
+NEW JERSEY
+NEW MEXICO
+NEW YORK
+NORTH CAROLINA
+NORTH DAKOTA
+NORTHERN MARIANA ISLANDS
+OHIO
+OKLAHOMA
+OREGON
+PALAU
+PENNSYLVANIA
+PUERTO RICO
+RHODE ISLAND
+SOUTH CAROLINA
+SOUTH DAKOTA
+TENNESSEE
+TEXAS
+UTAH
+VERMONT
+VIRGINIA
+VIRGIN ISLANDS
+WASHINGTON
+WEST VIRGINIA
+WISCONSIN
+WYOMING

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state_acronym_abbreviation.lst
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state_acronym_abbreviation.lst?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state_acronym_abbreviation.lst (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/gazetteer/us_state_acronym_abbreviation.lst Mon Nov  2 16:59:48 2015
@@ -0,0 +1,104 @@
+AK
+AL
+AR
+AZ
+CA
+CO
+CT
+DC
+FL
+FM
+GA
+GU
+IA
+ID
+IL
+KS
+KY
+LA
+MA
+ME
+MH
+MI
+MN
+MO
+MP
+MS
+MT
+NC
+ND
+NE
+NH
+NJ
+NM
+NV
+NY
+OH
+OK
+PA
+PR
+PW
+RI
+SC
+SD
+TN
+TX
+UT
+VA
+VI
+VT
+WA
+WI
+WV
+WY
+Ala
+Amer. Samoa
+Ariz
+Ark
+Calif
+Colo
+Conn
+C.Z.
+D.C.
+Del
+Fla
+Ill
+Ind
+Kans
+Mass
+Mich
+Minn
+Mo
+Mont
+N.C.
+N.Dak
+N. Dak
+Nebr
+Nev
+N.H.
+N.J.
+N.Mex
+N. Mex
+N.Y.
+Ohio
+Okla
+Ore
+Oreg
+Pa
+P.R.
+R.I.
+S.C.
+S.Dak
+S. Dak
+Tenn
+Tex
+Utah
+Va.
+V.I.
+Vt
+Wash
+Wis
+Wisc
+W.Va
+W. Va
+Wyo

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/abbreviations.lst
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/abbreviations.lst?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/abbreviations.lst (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/abbreviations.lst Mon Nov  2 16:59:48 2015
@@ -0,0 +1,96 @@
+AG
+APR
+Apr
+AUG
+Aug
+Adm
+Brig
+CO
+CORP
+Capt
+Cmdr
+Co
+Col
+Comdr
+DEC
+Dec
+DR
+Dr
+FEB
+Feb
+Fig
+FRI
+GMBH
+Gen
+Gov
+INC
+JAN
+Jan
+JUL
+Jul
+JUN
+Jun
+LTD
+Lt
+Ltd
+MAR
+Mar
+MON
+Mon
+MP
+Maj
+Mr
+Mrs
+Ms
+NA
+NOV
+Nov
+NV
+OCT
+Oct
+Oy
+PLC
+Prof
+Rep
+SA
+SAT
+Sat
+SEP
+Sep
+SIR
+SR
+SUN
+Sun
+Sen
+Sgt
+SpA
+St
+THU
+Thu
+THUR
+Thur
+TUE
+Tue
+VP
+WED
+Wed
+ad
+al
+b
+ed
+eds
+eg
+e.g
+(e.g
+[e.g
+et
+etc
+fig
+i.e
+(i.e
+[i.e
+p
+usu
+vs
+yr
+yrs

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/lists.def
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/lists.def?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/lists.def (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/gazetteer/lists.def Mon Nov  2 16:59:48 2015
@@ -0,0 +1 @@
+abbreviations.lst:splitter_abbreviation
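
The lists.def entry above ties a list file to the majorType that the Abbrev1 rule and cleanup.jape later test for. A minimal Python sketch of reading such an entry, assuming the colon-separated `file:majorType[:minorType]` layout; the helper name `parse_lists_def_line` is hypothetical and not part of GATE.

```python
def parse_lists_def_line(line: str) -> dict:
    """Split one lists.def entry into its list file and type fields.

    Assumption: fields are colon-separated, in the order
    file, majorType, optional minorType.
    """
    parts = line.strip().split(":")
    entry = {
        "file": parts[0],
        "majorType": parts[1] if len(parts) > 1 else None,
    }
    # minorType is optional; the entry in this commit does not set one.
    if len(parts) > 2 and parts[2]:
        entry["minorType"] = parts[2]
    return entry


entry = parse_lists_def_line("abbreviations.lst:splitter_abbreviation")
```

Every line of abbreviations.lst would then yield a Lookup annotation whose majorType is `splitter_abbreviation`.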

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/cleanup.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/cleanup.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/cleanup.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/cleanup.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,33 @@
+/*
+*  cleanup.jape
+*
+* Copyright (c) 1998-2007, The University of Sheffield.
+*
+*  This file is part of GATE (see http://gate.ac.uk/), and is free
+*  software, licenced under the GNU Library General Public License,
+*  Version 2, June 1991 (in the distribution as file licence.html,
+*  and also available at http://gate.ac.uk/gate/licence.html).
+*
+*  Valentin Tablan, March 7th, 2007 
+* 
+*  $Id$
+*/
+
+Phase:	cleanup
+Input: Token
+Options: control = once
+
+//Removes temporary data created by the sentence splitter
+Rule: cleanUp
+{Token}
+-->
+{
+  //if there were any sentences created, then we need to remove the document
+  //feature -> useful for future runs
+  doc.getFeatures().remove("temp-last-sentence-end");
+  //remove all lookups used for abbreviations
+  FeatureMap constraints = Factory.newFeatureMap();
+  constraints.put("majorType", "splitter_abbreviation");
+  AnnotationSet toRemove = inputAS.get("Lookup", constraints);
+  if(toRemove != null) inputAS.removeAll(toRemove);
+}

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find-single-nl.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find-single-nl.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find-single-nl.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find-single-nl.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,134 @@
+/*
+*  find-single-nl.jape
+*
+* Copyright (c) 1998-2004, The University of Sheffield.
+*
+*  This file is part of GATE (see http://gate.ac.uk/), and is free
+*  software, licenced under the GNU Library General Public License,
+*  Version 2, June 1991 (in the distribution as file licence.html,
+*  and also available at http://gate.ac.uk/gate/licence.html).
+*
+*  Diana Maynard, 10 Sep 2001
+* 
+*  $Id: find-single-nl.jape 9798 2008-08-07 15:26:51Z ian_roberts $
+*/
+
+Phase:	find
+Input: Token SpaceToken Lookup DEFAULT_TOKEN
+Options: control = appelt
+
+Macro: FULLSTOP
+(
+ {Token.string=="."}
+)
+
+//we'll allow two, three or four dots 
+Macro: THREEDOTS
+(
+ {Token.string=="."}
+ {Token.string=="."}
+ ({Token.string=="."})?
+ ({Token.string=="."})?
+)
+
+Macro: PUNCT
+(
+  {Token.string == "!"} | 
+  {Token.string == "?"}
+)
+
+Macro: NEWLINE
+(
+  {SpaceToken.string == "\n"} |
+  {SpaceToken.string=="\n\r"} |
+  ({SpaceToken.string=="\n"}{SpaceToken.string=="\r"}) |
+  {SpaceToken.string=="\r\n"} |
+  ({SpaceToken.string=="\r"}{SpaceToken.string=="\n"})
+)
+
+
+//normal sentence split 
+Rule: Split1
+(
+ (PUNCT)+ |
+ FULLSTOP |
+ THREEDOTS
+)
+:split
+-->
+:split.Split = {kind = "internal"}
+
+//a single new line generates an external split
+Rule: CR
+(
+  NEWLINE
+  ({SpaceToken.kind == space})*
+):cr
+-->
+:cr.Split = {kind = "external"}
+  
+//Anything more than four dots is a line of dots (e.g. in tables of contents)
+Rule: Ldots
+  FULLSTOP
+  FULLSTOP
+  FULLSTOP
+  FULLSTOP
+  (FULLSTOP)+
+-->
+{}
+
+//Java class names
+Rule:dottedName
+  {Token.kind == word}
+  (FULLSTOP {Token.kind == word})+
+-->
+{}
+
+// numbers with decimal part or IP addresses
+Rule:Number
+  {Token.kind == number}
+  (FULLSTOP {Token.kind == number})+
+-->
+{}
+
+//full stops in .net, .NET, .Net
+Rule:DotNetStop
+ FULLSTOP
+ (
+   {Token.string == "NET"} |
+   {Token.string == "net"} |
+   {Token.string == "Net"}
+ )
+-->
+{}
+
+//file extensions like .exe or .EXE
+//unfortunately we can't avoid .Exe as this might be a legitimate split
+//even if there is no space after the full stop.
+Rule:DotFileName
+  FULLSTOP
+  (
+    {Token.orth == "lowercase"} | 
+    {Token.orth == "allCaps"} |
+    {Token.orth == "mixedCaps"}
+  )
+-->
+{}
+
+//Known abbreviations like "Prof."
+//this relies on the gazetteer which has a funny sense of whole words
+//hence we need to check for a space before
+//otherwise things like "640p." would not be identified (because p is a
+//known abbreviation).
+Rule: Abbrev1
+ {SpaceToken}
+ {Lookup.majorType == "splitter_abbreviation"}
+ {Token.string == "."}
+-->
+{}
+
+//Abbreviations like "B.B.C."
+Rule: Abbrev2
+({Token.orth=="upperInitial", Token.length=="1"} FULLSTOP)+
+-->
+{}
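
The Abbrev1 rule above suppresses a split only when a known abbreviation is preceded by whitespace, so "Prof." is protected while "640p." still ends a sentence. A rough Python sketch of that one check, assuming a small sample of abbreviations.lst and ignoring the other blocking rules (dottedName, Number, DotFileName, and so on); the function names are hypothetical.

```python
import re

# Small sample of entries from abbreviations.lst (not the full list).
ABBREVIATIONS = {"Dr", "Prof", "Mr", "etc", "p"}


def is_abbreviation_stop(text: str, dot: int) -> bool:
    """True if the full stop at index `dot` follows a known abbreviation
    that is itself preceded by whitespace (the Abbrev1 condition)."""
    before = re.search(r"\s(\S+)$", text[:dot])
    return before is not None and before.group(1) in ABBREVIATIONS


def split_points(text: str) -> list:
    """Indexes of full stops that are not protected by the abbreviation check."""
    return [
        m.start()
        for m in re.finditer(r"\.", text)
        if not is_abbreviation_stop(text, m.start())
    ]


points = split_points("See Prof. Smith at 640p. Next")
```

Only the full stop after "640p" is reported, because "640p" as a whole is not in the abbreviation set even though "p" is.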

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/find.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,137 @@
+/*
+*  find.jape
+*
+* Copyright (c) 1998-2004, The University of Sheffield.
+*
+*  This file is part of GATE (see http://gate.ac.uk/), and is free
+*  software, licenced under the GNU Library General Public License,
+*  Version 2, June 1991 (in the distribution as file licence.html,
+*  and also available at http://gate.ac.uk/gate/licence.html).
+*
+*  Diana Maynard, 10 Sep 2001
+* 
+*  $Id: find.jape 8524 2007-04-05 12:11:15Z valyt $
+*/
+
+Phase:	find
+Input: Token SpaceToken Lookup DEFAULT_TOKEN
+Options: control = appelt
+
+Macro: FULLSTOP
+(
+ {Token.string=="."}
+)
+
+//we'll allow two, three or four dots 
+Macro: THREEDOTS
+(
+ {Token.string=="."}
+ {Token.string=="."}
+ ({Token.string=="."})?
+ ({Token.string=="."})?
+)
+
+Macro: PUNCT
+(
+  {Token.string == "!"} | 
+  {Token.string == "?"}
+)
+
+Macro: NEWLINE
+(
+  {SpaceToken.string == "\n"} |
+  {SpaceToken.string=="\n\r"} |
+  ({SpaceToken.string=="\n"}{SpaceToken.string=="\r"}) |
+  {SpaceToken.string=="\r\n"} |
+  ({SpaceToken.string=="\r"}{SpaceToken.string=="\n"})
+)
+
+
+//normal sentence split 
+Rule: Split1
+(
+ (PUNCT)+ |
+ FULLSTOP |
+ THREEDOTS
+)
+:split
+-->
+:split.Split = {kind = "internal"}
+
+//2 new lines generate an external split
+//must be at least 2 CRs or Newlines plus optional spaces to generate a split
+Rule: CR
+(
+  NEWLINE 
+  ({SpaceToken.kind == space})*
+  NEWLINE
+  ({SpaceToken.kind == space})*
+):cr
+-->
+:cr.Split = {kind = "external"}
+  
+//Anything more than four dots is a line of dots (e.g. in tables of contents)
+Rule: Ldots
+  FULLSTOP
+  FULLSTOP
+  FULLSTOP
+  FULLSTOP
+  (FULLSTOP)+
+-->
+{}
+
+//Java class names
+Rule:dottedName
+  {Token.kind == word}
+  (FULLSTOP {Token.kind == word})+
+-->
+{}
+
+// numbers with decimal part or IP addresses
+Rule:Number
+  {Token.kind == number}
+  (FULLSTOP {Token.kind == number})+
+-->
+{}
+
+//full stops in .net, .NET, .Net
+Rule:DotNetStop
+ FULLSTOP
+ (
+   {Token.string == "NET"} |
+   {Token.string == "net"} |
+   {Token.string == "Net"}
+ )
+-->
+{}
+
+//file extensions like .exe or .EXE
+//unfortunately we can't avoid .Exe as this might be a legitimate split
+//even if there is no space after the full stop.
+Rule:DotFileName
+  FULLSTOP
+  (
+    {Token.orth == "lowercase"} | 
+    {Token.orth == "allCaps"} |
+    {Token.orth == "mixedCaps"}
+  )
+-->
+{}
+
+//Known abbreviations like "Prof."
+//this relies on the gazetteer which has a funny sense of whole words
+//hence we need to check for a space before
+//otherwise things like "640p." would not be identified (because p is a
+//known abbreviation).
+Rule: Abbrev1
+ {SpaceToken}
+ {Lookup.majorType == "splitter_abbreviation"}
+ {Token.string == "."}
+-->
+{}
+
+//Abbreviations like "B.B.C."
+Rule: Abbrev2
+({Token.orth=="upperInitial", Token.length=="1"} FULLSTOP)+
+-->
+{}
\ No newline at end of file
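
The only difference between find-single-nl.jape and find.jape is the CR rule: the former marks an external split at every newline, the latter only where two newlines occur (each optionally followed by spaces). A Python sketch of the two behaviours using regular expressions, as an approximation: it treats only the ASCII space as a `kind == space` token, and `external_splits` is a hypothetical helper.

```python
import re

# One NEWLINE as in the macro: "\n", "\n\r" or "\r\n" (a lone "\r" is not matched).
NEWLINE = r"(?:\r\n|\n\r|\n)"
SINGLE_NL = re.compile(NEWLINE + r" *")                  # CR rule in find-single-nl.jape
DOUBLE_NL = re.compile(NEWLINE + r" *" + NEWLINE + r" *")  # CR rule in find.jape


def external_splits(text: str, single_newline: bool) -> list:
    """Return (start, end) offsets of external splits for the chosen grammar."""
    pattern = SINGLE_NL if single_newline else DOUBLE_NL
    return [m.span() for m in pattern.finditer(text)]


text = "a\nb\n\nc"
single = external_splits(text, single_newline=True)
double = external_splits(text, single_newline=False)
```

With the single-newline grammar every line break separates sentences, which may suit line-oriented text; the default grammar keeps hard-wrapped paragraphs together.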

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main-single-nl.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main-single-nl.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main-single-nl.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main-single-nl.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,12 @@
+// SplitMain
+// Valentin Tablan 17/05/2001
+
+
+//A sentence splitter
+MultiPhase:	main
+Phases: 
+
+prepare
+find-single-nl
+split
+cleanup

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/main.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,12 @@
+// SplitMain
+// Valentin Tablan 17/05/2001
+
+
+//A sentence splitter
+MultiPhase:	main
+Phases: 
+
+prepare
+find
+split
+cleanup

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/no-splits.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/no-splits.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/no-splits.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/no-splits.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,31 @@
+/*
+*  no-splits.jape
+*
+* Copyright (c) 1998-2004, The University of Sheffield.
+*
+*  This file is part of GATE (see http://gate.ac.uk/), and is free
+*  software, licenced under the GNU Library General Public License,
+*  Version 2, June 1991 (in the distribution as file licence.html,
+*  and also available at http://gate.ac.uk/gate/licence.html).
+*
+*  Valentin Tablan, 23 Jan 2007
+* 
+*  $Id$
+*/
+
+//This grammar deals with documents that have no splits
+
+Phase:noSplits
+Input: Token Split
+Options: control = once
+
+Rule: blah
+{Token}
+-->
+{
+  AnnotationSet splits = inputAS.get("Split");
+  if(splits == null || splits.isEmpty()){
+    outputAS.add(outputAS.firstNode(), outputAS.lastNode(), 
+            "TempNoSplitText", Factory.newFeatureMap());
+  }
+}

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/prepare.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/prepare.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/prepare.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/prepare.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,29 @@
+/*
+*  prepare.jape
+*
+* Copyright (c) 1998-2007, The University of Sheffield.
+*
+*  This file is part of GATE (see http://gate.ac.uk/), and is free
+*  software, licenced under the GNU Library General Public License,
+*  Version 2, June 1991 (in the distribution as file licence.html,
+*  and also available at http://gate.ac.uk/gate/licence.html).
+*
+*  Valentin Tablan, June 22nd, 2007 
+* 
+*  $Id$
+*/
+
+Phase:	prepare
+Input: Token
+Options: control = once
+
+//Makes sure there is no temporary data created by a previous run of the 
+//sentence splitter
+Rule: cleanUp
+{Token}
+-->
+{
+  //if there were any sentences created, then we need to remove the document
+  //feature -> useful for future runs
+  doc.getFeatures().remove("temp-last-sentence-end");
+}

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/split.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/split.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/split.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/sentenceSplitter/grammar/split.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,105 @@
+/*
+*  split.jape
+*
+* Copyright (c) 1998-2004, The University of Sheffield.
+*
+*  This file is part of GATE (see http://gate.ac.uk/), and is free
+*  software, licenced under the GNU Library General Public License,
+*  Version 2, June 1991 (in the distribution as file licence.html,
+*  and also available at http://gate.ac.uk/gate/licence.html).
+*
+*  Valentin Tablan, March 7th, 2007
+*
+*  $Id: split.jape 17376 2014-02-21 10:07:24Z dgmaynard $
+*/
+
+Phase:split
+Input: Split TempNoSplitText
+Options: control = first
+
+
+//sentence that consumes a split
+Rule: internalSplits
+({Split.kind == "internal"}):isplit
+-->
+{
+  Long endOffset = ((AnnotationSet)bindings.get("isplit")).
+      lastNode().getOffset();
+  //find the end offset of previous sentences
+  Long lastOffset = (Long)doc.getFeatures().get("temp-last-sentence-end");
+  if(lastOffset == null) lastOffset = new Long(0);
+//  
+//  AnnotationSet sentences = outputAS.get("Sentence");
+//  Long lastOffset = sentences == null || sentences.isEmpty() ?
+//          new Long(0) :
+//          sentences.lastNode().getOffset();  
+  //get the start offset of the first token (the kind==word check below is commented out)
+  AnnotationSet tokens = inputAS.getContained(lastOffset, endOffset);
+  if(tokens != null) tokens = tokens.get("Token");
+  if(tokens != null && tokens.size() > 0){
+    List<Annotation> tokList = new ArrayList<Annotation>(tokens);
+    Collections.sort(tokList, new OffsetComparator());
+    for(Annotation token : tokList){
+      String tokenKind = (String)token.getFeatures().get("kind");
+      //if("word".equals(tokenKind)){
+        Long startOffset = token.getStartNode().getOffset();
+        if(startOffset.compareTo(endOffset) < 0){
+          //create the new sentence
+          try{
+            outputAS.add(startOffset, endOffset, "Sentence", 
+                    Factory.newFeatureMap());
+            //save the new end offset
+            doc.getFeatures().put("temp-last-sentence-end", endOffset);
+          }catch( InvalidOffsetException ioe){
+            throw new GateRuntimeException(ioe);
+          }
+       // }
+        return;
+      }
+    }
+  }
+}
+
+//sentence that doesn't consume a split
+Rule: externalSplits
+({Split.kind == "external"}):esplit
+-->
+{
+  Long endOffset = ((AnnotationSet)bindings.get("esplit")).
+      firstNode().getOffset();
+//  //get the end offset of the previous sentence
+//  AnnotationSet sentences = outputAS.get("Sentence");
+//  Long lastOffset = sentences == null || sentences.isEmpty() ?
+//          new Long(0) :
+//          sentences.lastNode().getOffset();  
+  //find the end offset of previous sentences
+  Long lastOffset = (Long)doc.getFeatures().get("temp-last-sentence-end");
+  if(lastOffset == null) lastOffset = new Long(0);
+  
+  //get the start offset of the first token (the kind==word check below is commented out)
+  AnnotationSet tokens = inputAS.getContained(lastOffset, endOffset);
+  if(tokens != null) tokens = tokens.get("Token");
+  if(tokens != null && tokens.size() > 0){
+    //we have a more precise end offset
+    endOffset = tokens.lastNode().getOffset();
+    List<Annotation> tokList = new ArrayList<Annotation>(tokens);
+    Collections.sort(tokList, new OffsetComparator());
+    for(Annotation token : tokList){
+      String tokenKind = (String)token.getFeatures().get("kind");
+     // if("word".equals(tokenKind)){
+        Long startOffset = token.getStartNode().getOffset();
+        if(startOffset.compareTo(endOffset) < 0){
+          //create the new sentence
+          try{
+            outputAS.add(startOffset, endOffset, "Sentence", 
+                    Factory.newFeatureMap());
+            doc.getFeatures().put("temp-last-sentence-end", endOffset);
+          }catch( InvalidOffsetException ioe){
+            throw new GateRuntimeException(ioe);
+          }
+       // }
+        return;
+      }
+    }
+  }
+}
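
The two rules in split.jape differ in where a sentence ends: an internal split is consumed (the sentence ends at the end of the split), while an external split is not (the sentence ends at the end of the last token before it). A simplified Python sketch of that bookkeeping, with tokens and splits given as plain offset tuples; `build_sentences` and the tuple layout are assumptions for illustration, not GATE API.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def build_sentences(tokens: List[Span],
                    splits: List[Tuple[int, int, str]]) -> List[Span]:
    """Derive sentence spans from token spans and (start, end, kind) splits."""
    sentences = []
    last_end = 0  # plays the role of the "temp-last-sentence-end" feature
    for s_start, s_end, kind in sorted(splits):
        # Internal splits are part of the sentence; external ones are not.
        limit = s_end if kind == "internal" else s_start
        contained = sorted(t for t in tokens
                           if t[0] >= last_end and t[1] <= limit)
        if not contained:
            continue
        start = contained[0][0]
        # For external splits, tighten the end to the last contained token.
        end = limit if kind == "internal" else max(t[1] for t in contained)
        if start < end:
            sentences.append((start, end))
            last_end = end
    return sentences


# Offsets for the text "Hi there. Bye\n\nNext"
tokens = [(0, 2), (3, 8), (8, 9), (10, 13), (15, 19)]
splits = [(8, 9, "internal"), (13, 15, "external")]
sentences = build_sentences(tokens, splits)
```

In this sketch the trailing "Next" gets no sentence because no split follows it; how the real pipeline handles trailing text is outside what these two rules show.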

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/AlternateTokeniser.rules
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/AlternateTokeniser.rules?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/AlternateTokeniser.rules (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/AlternateTokeniser.rules Mon Nov  2 16:59:48 2015
@@ -0,0 +1,103 @@
+#AlternateTokeniser.rules#
+#diana 28/6/00#
+#update 25/04/06#
+
+#Tokeniser rule file
+#Each rule should be on one line
+#Lines that end with "\" are appended with the next one. This facility \
+ is used for longer rules that cannot be written on a single line
+#
+#Lines starting with "#" are treated as comment
+//Lines starting with "//" are treated as comment
+# Empty lines are ignored.
+
+#A rule has a left hand side (LHS) and a right hand side (RHS);
+#the LHS is a regular expression that has to be matched on the input
+#the RHS describes the annotations to be added to the AnnotationSet.
+#LHS is separated from the RHS by '>'
+#LHS knows about the following operators:
+#	+ (1..n)
+#	* (0..n)
+#	| (boolean OR)
+#
+#RHS uses as separator ';' and has the following format
+#{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}
+
+
+#The primitive constructs are:
+#UNASSIGNED
+#UPPERCASE_LETTER
+#LOWERCASE_LETTER
+#TITLECASE_LETTER
+#MODIFIER_LETTER
+#OTHER_LETTER
+#NON_SPACING_MARK
+#ENCLOSING_MARK
+#COMBINING_SPACING_MARK
+#DECIMAL_DIGIT_NUMBER
+#LETTER_NUMBER
+#OTHER_NUMBER
+#SPACE_SEPARATOR
+#LINE_SEPARATOR
+#PARAGRAPH_SEPARATOR
+#CONTROL
+#FORMAT
+#PRIVATE_USE
+#SURROGATE
+#DASH_PUNCTUATION
+#START_PUNCTUATION
+#END_PUNCTUATION
+#CONNECTOR_PUNCTUATION
+#OTHER_PUNCTUATION
+#MATH_SYMBOL
+#CURRENCY_SYMBOL
+#MODIFIER_SYMBOL
+#OTHER_SYMBOL
+#...representing the corresponding enumerated Unicode category types
+# See java.lang.Character for the Java version you are using
+
+#------- The rules start here -----------------
+
+#words#
+// a word can be any combination of letters,
+// excluding hyphens, symbols and punctuation, e.g. apostrophes
+
+
+"UPPERCASE_LETTER" (LOWERCASE_LETTER)* > Token;orth=upperInitial;kind=word;
+
+"UPPERCASE_LETTER" (UPPERCASE_LETTER)+ > Token;orth=allCaps;kind=word;
+
+"LOWERCASE_LETTER" (LOWERCASE_LETTER)* > Token;orth=lowercase;kind=word;
+
+
+// MixedCaps is any mixture of caps and small letters that doesn't 
+// fit in the preceding categories
+
+("LOWERCASE_LETTER" "LOWERCASE_LETTER"+"UPPERCASE_LETTER"+ \
+ (UPPERCASE_LETTER|LOWERCASE_LETTER)*)|\
+("LOWERCASE_LETTER" "LOWERCASE_LETTER"*"UPPERCASE_LETTER"+\
+ (UPPERCASE_LETTER|LOWERCASE_LETTER)*)|\
+("UPPERCASE_LETTER" "UPPERCASE_LETTER" (UPPERCASE_LETTER|LOWERCASE_LETTER)*\
+ ("LOWERCASE_LETTER")+ (UPPERCASE_LETTER|LOWERCASE_LETTER)*)|\
+("UPPERCASE_LETTER" "LOWERCASE_LETTER"+ ("UPPERCASE_LETTER"+ "LOWERCASE_LETTER"+))+\
+> Token;orth=mixedCaps;kind=word;
+
+
+#numbers#
+// a number is any combination of digits
+"DECIMAL_DIGIT_NUMBER"+ >Token;kind=number;
+"OTHER_NUMBER"+ >Token;kind=number;
+
+#whitespace#
+(SPACE_SEPARATOR) >SpaceToken;kind=space;
+(CONTROL) >SpaceToken;kind=control;
+
+#symbols#
+(MODIFIER_SYMBOL|MATH_SYMBOL|OTHER_SYMBOL) > Token;kind=symbol;
+CURRENCY_SYMBOL > Token;kind=symbol;symbolkind=currency;
+
+#punctuation#
+"DASH_PUNCTUATION" >Token;kind=punctuation;subkind=dashpunct;
+(CONNECTOR_PUNCTUATION|OTHER_PUNCTUATION)>Token;kind=punctuation;
+("START_PUNCTUATION"|"INITIAL_QUOTE_PUNCTUATION") >Token;kind=punctuation;position=startpunct;
+("END_PUNCTUATION"|"FINAL_QUOTE_PUNCTUATION") >Token;kind=punctuation;position=endpunct;

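The word rules in the hyphen-free tokeniser file above assign an orth value from the Unicode categories of a token's letters. As a rough illustration only, the Java sketch below applies the same case distinctions to a letters-only string using java.lang.Character. The class OrthSketch and the method classify are hypothetical names and are not part of GATE. The sketch is also broader than the real mixedCaps rule: it labels every remaining mix of upper and lower case as mixedCaps, while the rule file only accepts specific patterns.

```java
public class OrthSketch {
    // Hypothetical helper (not GATE code): approximates the orth value
    // the word rules would give to a string made only of letters.
    static String classify(String word) {
        if (word.isEmpty()) {
            return "none";
        }
        // Only UPPERCASE_LETTER and LOWERCASE_LETTER take part in the word rules shown.
        for (int i = 0; i < word.length(); i++) {
            int type = Character.getType(word.charAt(i));
            if (type != Character.UPPERCASE_LETTER && type != Character.LOWERCASE_LETTER) {
                return "none";
            }
        }
        boolean firstUpper =
            Character.getType(word.charAt(0)) == Character.UPPERCASE_LETTER;
        String rest = word.substring(1);
        boolean restLower = rest.chars()
            .allMatch(c -> Character.getType(c) == Character.LOWERCASE_LETTER);
        boolean restUpper = rest.chars()
            .allMatch(c -> Character.getType(c) == Character.UPPERCASE_LETTER);

        if (firstUpper && restLower) {
            return "upperInitial";   // UPPERCASE_LETTER (LOWERCASE_LETTER)*
        }
        if (firstUpper && restUpper) {
            return "allCaps";        // UPPERCASE_LETTER (UPPERCASE_LETTER)+
        }
        if (!firstUpper && restLower) {
            return "lowercase";      // LOWERCASE_LETTER (LOWERCASE_LETTER)*
        }
        return "mixedCaps";          // simplification: anything else
    }
}
```

Note that a single capital letter such as "A" comes out as upperInitial, because the first rule allows zero lowercase letters after the initial capital.
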
Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/DefaultTokeniser.rules
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/DefaultTokeniser.rules?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/DefaultTokeniser.rules (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/DefaultTokeniser.rules Mon Nov  2 16:59:48 2015
@@ -0,0 +1,104 @@
+#DefaultTokeniser.rules#
+#diana 28/6/00#
+#update 9/7/00#
+
+#Tokeniser rule file
+#Each rule should be on one line
+#Lines that end with "\" are appended with the next one. This facility \
+ is used for longer rules that cannot be written on a single line
+#
+#Lines starting with "#" are treated as comment
+//Lines starting with "//" are treated as comment
+# Empty lines are ignored.
+
+#A rule has a left hand side (LHS) and a right hand side (RHS);
+#the LHS is a regular expression that has to be matched on the input
+#the RHS describes the annotations to be added to the AnnotationSet.
+#LHS is separated from the RHS by '>'
+#LHS knows about the following operators:
+#	+ (1..n)
+#	* (0..n)
+#	| (boolean OR)
+#
+#RHS uses as separator ';' and has the following format
+#{LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}
+
+
+#The primitive constructs are:
+#UNASSIGNED
+#UPPERCASE_LETTER
+#LOWERCASE_LETTER
+#TITLECASE_LETTER
+#MODIFIER_LETTER
+#OTHER_LETTER
+#NON_SPACING_MARK
+#ENCLOSING_MARK
+#COMBINING_SPACING_MARK
+#DECIMAL_DIGIT_NUMBER
+#LETTER_NUMBER
+#OTHER_NUMBER
+#SPACE_SEPARATOR
+#LINE_SEPARATOR
+#PARAGRAPH_SEPARATOR
+#CONTROL
+#FORMAT
+#PRIVATE_USE
+#SURROGATE
+#DASH_PUNCTUATION
+#START_PUNCTUATION
+#END_PUNCTUATION
+#CONNECTOR_PUNCTUATION
+#OTHER_PUNCTUATION
+#MATH_SYMBOL
+#CURRENCY_SYMBOL
+#MODIFIER_SYMBOL
+#OTHER_SYMBOL
+#...representing the corresponding enumerated Unicode category types
+# See java.lang.Character for the Java version you are using
+
+#------- The rules start here -----------------
+
+#words#
+// a word can be any combination of letters, including hyphens,
+// but excluding symbols and punctuation, e.g. apostrophes
+// Note that there is an alternative version of the tokeniser that
+// treats hyphens as separate tokens
+
+
+"UPPERCASE_LETTER" (LOWERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION|FORMAT)*)* > Token;orth=upperInitial;kind=word;
+"UPPERCASE_LETTER" (DASH_PUNCTUATION|FORMAT)* (UPPERCASE_LETTER|DASH_PUNCTUATION|FORMAT)+ > Token;orth=allCaps;kind=word;
+"LOWERCASE_LETTER" (LOWERCASE_LETTER|DASH_PUNCTUATION|FORMAT)* > Token;orth=lowercase;kind=word;
+
+// MixedCaps is any mixture of caps and small letters that doesn't
+// fit in the preceding categories
+
+("LOWERCASE_LETTER" "LOWERCASE_LETTER"+"UPPERCASE_LETTER"+ \
+ (UPPERCASE_LETTER|LOWERCASE_LETTER)*)|\
+("LOWERCASE_LETTER" "LOWERCASE_LETTER"*"UPPERCASE_LETTER"+\
+ (UPPERCASE_LETTER|LOWERCASE_LETTER|DASH_PUNCTUATION|FORMAT)*)|\
+("UPPERCASE_LETTER" (DASH_PUNCTUATION)* "UPPERCASE_LETTER" (UPPERCASE_LETTER|LOWERCASE_LETTER|DASH_PUNCTUATION|FORMAT)*\
+ ("LOWERCASE_LETTER")+ (UPPERCASE_LETTER|LOWERCASE_LETTER|DASH_PUNCTUATION|FORMAT)*)|\
+("UPPERCASE_LETTER" "LOWERCASE_LETTER"+ ("UPPERCASE_LETTER"+ "LOWERCASE_LETTER"+)+)|\
+ ((UPPERCASE_LETTER)+ (LOWERCASE_LETTER)+ (UPPERCASE_LETTER)+)\
+> Token;orth=mixedCaps;kind=word;
+
+(OTHER_LETTER|COMBINING_SPACING_MARK|NON_SPACING_MARK)+ >Token;kind=word;type=other;
+
+#numbers#
+// a number is any combination of digits
+"DECIMAL_DIGIT_NUMBER"+ >Token;kind=number;
+"OTHER_NUMBER"+ >Token;kind=number;
+
+#whitespace#
+(SPACE_SEPARATOR) >SpaceToken;kind=space;
+(CONTROL) >SpaceToken;kind=control;
+
+#symbols#
+(MODIFIER_SYMBOL|MATH_SYMBOL|OTHER_SYMBOL) > Token;kind=symbol;
+CURRENCY_SYMBOL > Token;kind=symbol;symbolkind=currency;
+
+#punctuation#
+(DASH_PUNCTUATION|FORMAT) >Token;kind=punctuation;subkind=dashpunct;
+(CONNECTOR_PUNCTUATION|OTHER_PUNCTUATION)>Token;kind=punctuation;
+("START_PUNCTUATION"|"INITIAL_QUOTE_PUNCTUATION") >Token;kind=punctuation;position=startpunct;
+("END_PUNCTUATION"|"FINAL_QUOTE_PUNCTUATION") >Token;kind=punctuation;position=endpunct;

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/postprocess.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/postprocess.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/postprocess.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/plugins/ANNIE/resources/tokeniser/postprocess.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,246 @@
+// Valentin Tablan, 29/06/2001
+// $id$
+
+
+Phase:postprocess
+Input: Token SpaceToken
+Options: control = appelt
+
+//adjusts the tokeniser output
+
+Rule: simpleJoin
+ (
+  //'30s, ..., 'Cause, 'em, 'N, 'S, 's, 'T, 'd, 'll, 'm, 're, 'til, 've
+  (
+   {Token.string=="'"}
+   ({Token.string=="30s"}|{Token.string=="40s"}|{Token.string=="50s"}|{Token.string=="60s"}
+    |{Token.string=="70s"}|{Token.string=="80s"}|{Token.string=="90s"}|{Token.string=="Cause"}
+    |{Token.string=="cause"}|{Token.string=="Em"}|{Token.string=="em"}|{Token.string=="N"}
+    |{Token.string=="S"}|{Token.string=="s"}|{Token.string=="T"}|{Token.string=="d"}
+    |{Token.string=="ll"}|{Token.string=="m"}|{Token.string=="re"}|{Token.string=="s"}
+    |{Token.string=="til"}|{Token.string=="ve"})
+  )
+  |
+  //'n'
+  ({Token.string=="'"} {Token.string=="n"} {Token.string=="'"})
+  |
+  //C'mon
+  (({Token.string=="C"}|{Token.string=="c"}){Token.string=="'"} {Token.string=="mon"})
+  |
+  //o'clock
+  (({Token.string=="O"}|{Token.string=="o"}){Token.string=="'"} {Token.string=="clock"})
+  |
+  //ma'am
+  (({Token.string=="ma"}|{Token.string=="Ma"}){Token.string=="'"} {Token.string=="am"})
+ ):left
+-->
+{
+  gate.AnnotationSet toRemove = (gate.AnnotationSet)bindings.get("left");
+  outputAS.removeAll(toRemove);
+  //get the tokens
+  java.util.ArrayList tokens = new java.util.ArrayList(toRemove);
+  //define a comparator for annotations by start offset
+  Collections.sort(tokens, new gate.util.OffsetComparator());
+  String text = "";
+  Iterator tokIter = tokens.iterator();
+  while(tokIter.hasNext())
+    text += (String)((Annotation)tokIter.next()).getFeatures().get("string");
+
+  gate.FeatureMap features = Factory.newFeatureMap();
+  features.put("kind", "word");
+  features.put("string", text);
+  features.put("length", Integer.toString(text.length()));
+  features.put("orth", "apostrophe");
+  outputAS.add(toRemove.firstNode(), toRemove.lastNode(), "Token", features);
+}
+
+
+Rule: ordinals
+  //3rd, 1st, 22nd
+  (
+   ({Token.kind=="number"}):number
+   ({Token.string=="st"}|{Token.string=="nd"}|{Token.string=="rd"}|{Token.string=="th"}):ending
+  ):left
+-->
+{
+  Annotation numberAnn = (Annotation)((AnnotationSet)bindings.get("number")).
+	iterator().next();	
+  Annotation endingAnn = (Annotation)((AnnotationSet)bindings.get("ending")).
+	iterator().next();
+  
+  String numberStr = (String)numberAnn.getFeatures().get("string");
+  String endingStr = (String)endingAnn.getFeatures().get("string");
+  if((numberStr.endsWith("1") && endingStr.equals("st"))
+     ||
+     (numberStr.endsWith("2") && endingStr.equals("nd"))
+     ||
+     (numberStr.endsWith("3") && endingStr.equals("rd"))
+     ||
+     (endingStr.equals("th"))
+){
+    //remove old tokens
+    gate.AnnotationSet toRemove = (gate.AnnotationSet)bindings.get("left");
+    inputAS.removeAll(toRemove);
+    //create the new token
+    FeatureMap features = Factory.newFeatureMap();
+    features.put("kind", "word");
+    features.put("string", numberStr + endingStr);
+    features.put("length", Integer.toString((numberStr + endingStr).length()));
+    outputAS.add(toRemove.firstNode(), toRemove.lastNode(), "Token", features);
+  }
+  
+}
+
+
+//?n't
+Rule: VBneg
+   ({Token}):one
+   ({Token.string=="'"}{Token.string=="t"}):two
+-->
+{
+  gate.Annotation firstToken = (gate.Annotation)
+                               ((gate.AnnotationSet)bindings.get("one")).iterator().next();
+  String firstTokenText = (String)firstToken.getFeatures().get("string");
+  if(firstTokenText.endsWith("n")){
+    //remove the old tokens
+    outputAS.removeAll((gate.AnnotationSet)bindings.get("one"));
+    outputAS.removeAll((gate.AnnotationSet)bindings.get("two"));
+    //create the new tokens
+    Long ofs0 = firstToken.getStartNode().getOffset();
+    Long ofs1 = new Long(firstToken.getEndNode().getOffset().longValue() - 1);
+    Long ofs2 = ((gate.AnnotationSet)bindings.get("two")).lastNode().getOffset();
+    try{
+      gate.FeatureMap features;
+      if(!ofs0.equals(ofs1)){
+        features = Factory.newFeatureMap();
+        features.put("kind", "word");
+        String text = firstTokenText.substring(0, firstTokenText.length() - 1);
+        features.put("string", text);
+        features.put("length", Integer.toString(text.length()));
+        features.put("orth", firstToken.getFeatures().get("orth"));
+        outputAS.add(ofs0, ofs1, "Token", features);
+      }
+
+      features = Factory.newFeatureMap();
+      features.put("kind", "word");
+      features.put("string", "n't");
+      features.put("length", "3");
+      features.put("orth", "lowercase");
+      outputAS.add(ofs1, ofs2, "Token", features);
+    }catch(Exception e){
+      e.printStackTrace();
+    }
+  }//if first token ends with "n"
+}
+
+
+/* ?N'T (AF, 2011-01)
+ * copied & slightly modified from ?n't rule above   */
+Rule: VBnegUppercase
+   ({Token}):one
+   ({Token.string=="'"}{Token.string=="T"}):two
+-->
+{
+  gate.Annotation firstToken = (gate.Annotation)
+                               ((gate.AnnotationSet)bindings.get("one")).iterator().next();
+  String firstTokenText = (String)firstToken.getFeatures().get("string");
+  if(firstTokenText.endsWith("N")){
+    //remove the old tokens
+    outputAS.removeAll((gate.AnnotationSet)bindings.get("one"));
+    outputAS.removeAll((gate.AnnotationSet)bindings.get("two"));
+    //create the new tokens
+    Long ofs0 = firstToken.getStartNode().getOffset();
+    Long ofs1 = new Long(firstToken.getEndNode().getOffset().longValue() - 1);
+    Long ofs2 = ((gate.AnnotationSet)bindings.get("two")).lastNode().getOffset();
+    try{
+      gate.FeatureMap features;
+      if(!ofs0.equals(ofs1)){
+        features = Factory.newFeatureMap();
+        features.put("kind", "word");
+        String text = firstTokenText.substring(0, firstTokenText.length() - 1);
+        features.put("string", text);
+        features.put("length", Integer.toString(text.length()));
+        features.put("orth", firstToken.getFeatures().get("orth"));
+        outputAS.add(ofs0, ofs1, "Token", features);
+      }
+
+      features = Factory.newFeatureMap();
+      features.put("kind", "word");
+      features.put("string", "N'T");
+      features.put("length", "3");
+      features.put("orth", "uppercase");
+      outputAS.add(ofs1, ofs2, "Token", features);
+    }catch(Exception e){
+      e.printStackTrace();
+    }
+  }//if first token ends with "N"
+}
+
+
+/* "cannot" (AF, 2011-01) */
+Rule: Cannot
+({Token.string ==~ "[Cc][Aa][Nn][Nn][Oo][Tt]"}):cannot
+-->
+:cannot  {
+  Annotation cannot = cannotAnnots.iterator().next();
+  String cannotStr = cannot.getFeatures().get("string").toString();
+  String canStr = cannotStr.substring(0,3);
+  String notStr = cannotStr.substring(3,6);
+
+  Long start = cannot.getStartNode().getOffset();
+  Long end   = cannot.getEndNode().getOffset();
+  Long middle = start + 3L;
+
+  /* Copy orth, &c., from the original Token;
+   * overwrite the others appropriately.  */
+  FeatureMap canFM = Factory.newFeatureMap();
+  FeatureMap notFM = Factory.newFeatureMap();
+  canFM.putAll(cannot.getFeatures());
+  notFM.putAll(cannot.getFeatures());
+
+  canFM.put("string", canStr);
+  notFM.put("string", notStr);
+  canFM.put("length", Integer.toString(3));
+  notFM.put("length", Integer.toString(3));
+
+  try {
+    outputAS.add(start, middle, "Token", canFM);
+    outputAS.add(middle, end, "Token", notFM);
+  }
+  catch (InvalidOffsetException e) {
+    /* This should never happen */
+    e.printStackTrace();
+  }
+
+  outputAS.remove(cannot);
+}
+
+
+// CR+LF | CR |LF+CR -> One single SpaceToken
+Rule: NewLine
+ (
+  ({SpaceToken.string=="\n"}) |
+  ({SpaceToken.string=="\r"}) |
+  ({SpaceToken.string=="\n"}{SpaceToken.string=="\r"}) |
+  ({SpaceToken.string=="\r"}{SpaceToken.string=="\n"})
+  ):left
+-->
+{
+  gate.AnnotationSet toRemove = (gate.AnnotationSet)bindings.get("left");
+  outputAS.removeAll(toRemove);
+  //get the tokens
+  java.util.ArrayList tokens = new java.util.ArrayList(toRemove);
+  //define a comparator for annotations by start offset
+  Collections.sort(tokens, new gate.util.OffsetComparator());
+  String text = "";
+  Iterator tokIter = tokens.iterator();
+  while(tokIter.hasNext())
+    text += (String)((Annotation)tokIter.next()).getFeatures().get("string");
+
+  gate.FeatureMap features = Factory.newFeatureMap();
+  features.put("kind", "control");
+  features.put("string", text);
+  features.put("length", Integer.toString(text.length()));
+  outputAS.add(toRemove.firstNode(), toRemove.lastNode(), "SpaceToken", features);
+}
+

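The ordinals and VBneg rules in postprocess.jape above decide on plain strings before any annotations are touched. The Java sketch below isolates those two decisions so they can be tried outside GATE. PostprocessSketch, isOrdinal and splitNegation are hypothetical names introduced here, and the offset and annotation handling of the real rules is left out.

```java
public class PostprocessSketch {
    // Mirrors the suffix test in Rule: ordinals. Like the rule, it accepts
    // "th" after any number and checks only the last digit for st/nd/rd.
    static boolean isOrdinal(String number, String ending) {
        return (number.endsWith("1") && ending.equals("st"))
            || (number.endsWith("2") && ending.equals("nd"))
            || (number.endsWith("3") && ending.equals("rd"))
            || ending.equals("th");
    }

    // Mirrors the split in Rule: VBneg for a token that was followed by
    // the tokens "'" and "t". Returns null when the rule would do nothing.
    static String[] splitNegation(String firstToken) {
        if (!firstToken.endsWith("n")) {
            return null;
        }
        String stem = firstToken.substring(0, firstToken.length() - 1);
        // The rule skips the stem token when it would be empty.
        return stem.isEmpty()
            ? new String[] {"n't"}
            : new String[] {stem, "n't"};
    }
}
```

With these definitions, the token "don" followed by "'" and "t" is split into "do" and "n't". Because only the last digit is checked, a pair such as "11" and "st" is also accepted as an ordinal.
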
Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/AGE.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/AGE.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/AGE.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/AGE.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,124 @@
+Phase: AGE
+Input: Token
+Options: control = appelt
+
+Rule: getAGE1
+(
+{Token.string ==~ "[0-9]{1,2}"}({Token.string ==~ "'?s"})?
+):label
+(
+({Token.string == "-"})? {Token.string ==~ "(?i)yo(RHM)?|yr|year|yo?F|yo?M|(year|yr|yrs)-old"}| 
+{Token.string ==~ "(?i)years?"}{Token.string ==~ "(?i)old"}|
+{Token.string ==~ "(?i)y"}({Token.string == "."}{Token.string ==~ "(?i)o"}|{Token.string ==~ "(?i)/|m|f"})
+):post_context
+-->
+:label.AGE = {CATEGORY="AGE"}
+
+
+Rule: getAGE2
+(
+{Token.string ==~ "(?i)age"}({Token})? 
+):pre_context
+(
+{Token.string ==~ "[0-9]{1,2}"}({Token.string ==~ "'?s"})?
+):label
+-->
+:label.AGE = {CATEGORY="AGE"}
+
+
+Rule: getAGE3
+(
+({Token.string ==~ "(?i)passed"}{Token.string ==~ "(?i)away"}|{Token.string ==~ "(?i)died|deceased"})
+({Token.string ==~ "(?i)from|with|of"}{Token})?({Token.string ==~ "(?i)at"})?
+):pre_context
+(
+{Token.string ==~ "[0-9]{1,2}"}({Token.string ==~ "'?s"})?
+):label
+-->
+:label.AGE = {CATEGORY="AGE"}
+
+Rule: getAGE4
+(
+{Token.string ==~ "[0-9]{2}"}
+):label
+(
+{Token.string ==~ "(?i)M|F|male|female"} ({Token.string ==~ "(?i)with|/|h|hx|s|w|p|who|comes|admitted"}|{Token.orth == allCaps})
+):post_nolabel
+-->
+:label.AGE = {CATEGORY="AGE"}
+
+Rule: getAGE5
+(
+{Token.string ==~ "(?i)in|by"}({Token.string ==~ "(?i)his|her"})({Token})?|
+{Token.string ==~ "(?i)lived"}{Token.string ==~ "(?i)into|to"}({Token.string ==~ "(?i)his|her|their"})?|
+{Token.string ==~ "(?i)who|she|he"}{Token.string ==~ "(?i)is"}{Token.string == "now"}
+):pre_context
+(
+{Token.string ==~ "[0-9]{2}"}({Token.string ==~ "'?s"})?
+):label
+
+-->
+:label.AGE = {CATEGORY="AGE"}
+
+//Family age
+Rule: getAGE6
+Priority: 80
+(
+{Token.string ==~ "(?i)brother|sister|grandmother|grandfather|father|mother"}({Token.string ==~ "(?i)at"})?
+):pre_context
+(
+{Token.string ==~ "[0-9]{2}"}
+):label
+-->
+:label.AGE = {CATEGORY="AGE"}
+
+// Expand pre_context
+Rule: getAGE_7
+(
+({Token.string == "MI"}|{Token.orth == upperInitial, Token.length <=3}|{Token.string ==~ "(?i)myocardial"}{Token.string ==~ "(?i)infarction"}|{Token.string ==~ "(?i)cancer"})
+{Token.string == "at"}
+):pre_context
+(
+{Token.string ==~ "[0-9]{2}"}
+):label
+-->
+:label.AGE = {CATEGORY="AGE"}
+
+Rule: getAGE8
+(
+{Token.string ==~ "[0-9]{1,2}"}
+):label
+(
+{Token.string ==~ "st|th"}{Token.string ==~ "[Bb]irthday"}
+):post_context
+-->
+:label.AGE = {CATEGORY="AGE", RULE="8"}
+
+Rule: getAGE_9
+(
+{Token.string ==~ "(?i)ages?|aged"}
+):post_nolabel
+(({Token.string ==~ "[0-9]{1,2}"})?):label1
+(({Token.string ==~ ",|/|and"})?):no_label1
+(({Token.string ==~ "[0-9]{1,2}"})?):label2
+(({Token.string ==~ ",|/|and"})?):no_label2
+(({Token.string ==~ "[0-9]{1,2}"})?):label3
+(({Token.string ==~ ",|/|and"})?):no_label3
+({Token.string ==~ "[0-9]{1,2}"}):label
+-->
+:label1.AGE = {CATEGORY="AGE"},
+:label2.AGE = {CATEGORY="AGE"},
+:label3.AGE = {CATEGORY="AGE"},
+:label.AGE = {CATEGORY="AGE"}
+
+Rule: getAGE_8
+Priority: 80
+(({Token.string ==~ "[0-9]{1,2}"})?):label1
+(({Token.string ==~ ",|/|and"})?):no_label1
+({Token.string ==~ "[0-9]{1,2}"}):label
+(
+{Token.string ==~ "(?i)years"}{Token.string ==~ "(?i)of"}{Token.string ==~ "(?i)age"}
+):post_nolabel
+-->
+:label1.AGE = {CATEGORY="AGE"},
+:label.AGE = {CATEGORY="AGE"}

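The AGE rules above rely on the JAPE operator ==~, which matches the whole token string against a Java regular expression (the same behaviour as Matcher.matches()). As a hedged illustration, the Java sketch below tests the number pattern and the first unit alternative of getAGE1 on two adjacent token strings. AgeSketch and looksLikeAge are hypothetical names, and the optional "-" and "'s" tokens of the real rule are omitted.

```java
import java.util.regex.Pattern;

public class AgeSketch {
    // Patterns copied from Rule: getAGE1; the leading (?i) applies to
    // every alternative of the unit pattern.
    private static final Pattern NUMBER = Pattern.compile("[0-9]{1,2}");
    private static final Pattern UNIT = Pattern.compile(
        "(?i)yo(RHM)?|yr|year|yo?F|yo?M|(year|yr|yrs)-old");

    // True when the first token is a one- or two-digit number and the
    // next token is one of the age units, both matched as whole strings.
    static boolean looksLikeAge(String numberToken, String nextToken) {
        return NUMBER.matcher(numberToken).matches()
            && UNIT.matcher(nextToken).matches();
    }
}
```

Whole-string matching is what keeps a three-digit token such as "145" from being labelled, even though it contains a two-digit substring.
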
Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/COUNTRY.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/COUNTRY.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/COUNTRY.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/COUNTRY.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,60 @@
+Imports: {
+import static gate.Utils.*;
+}
+
+Phase: COUNTRY
+Input: Token Split Lookup
+Options: control = appelt
+
+Rule: getCountry
+({!Split}):pre_context
+(
+ {Lookup.minorType=="country"}
+):label
+-->
+:label.COUNTRY = {CATEGORY="LOCATION"}
+
+
+Macro: PRENEG(
+ {!Token.string ==~ "(?i)speaks?|some"}
+)
+
+Macro: POSTNEG(
+ {!Token.string ==~ "(?i)general|hospital|clinic|city|restaurant|area|street|road"}
+)
+
+/* 
+ * Nationality is considered as COUNTRY
+ */
+Rule: getNationality
+(PRENEG):pre_context //negated context; if given context, do not annotate
+(
+ {Lookup.minorType=="nationality"}
+):label
+(POSTNEG):post_context //negated context; if given context, do not annotate
+-->
+:label.COUNTRY = {CATEGORY="LOCATION"}
+
+
+/* 
+ * Spoken language is considered as COUNTRY
+ */
+Rule: getSpokenLanguage
+(
+ {Lookup.minorType=="nationality"}
+):label
+-->
+:label
+{  
+    AnnotationSet matchedAnns = bindings.get("label");  
+    Annotation annotation = matchedAnns.iterator().next();
+ 
+	try{
+		FeatureMap newFeatures = Factory.newFeatureMap();
+		newFeatures.put("CATEGORY", "LOCATION");   
+    		outputAS.add(start(annotation), end(annotation)-9, "COUNTRY", newFeatures);
+	   } catch (InvalidOffsetException e) {
+   		throw new LuckyException(e);
+  	}
+  	//outputAS.remove(annotation);
+}  

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/DOCTOR.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/DOCTOR.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/DOCTOR.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/DOCTOR.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,118 @@
+Phase: DrNAME
+Input: Token Split
+Options: control = appelt
+
+Rule: getDrNAME
+Priority: 100
+(
+{Token.string == "M"}{Token.string == "."}{Token.string == "D"}{Token.string == "."}
+{Split.kind == external}{Split.kind == external}({Split.kind == external}{Split.kind == external})?
+):pre_context
+(({Token.orth == allCaps})?):drname
+(({Token.string == "/"})):nolabel
+(({Token.orth == allCaps}|{Token.kind == word})):drname1
+(({Token.string == "/"})?):nolabel2
+(({Token.orth == allCaps}|{Token.kind == word})?):drname2
+(({Token.string == "/"})?):nolabel3
+(({Token.orth == allCaps}|{Token.kind == word})?):drname3
+({Split.kind == external}{Split.kind == external}):post_context
+-->
+:drname.DOCTOR_rule = {CATEGORY="NAME"},
+:drname1.DOCTOR_rule = {CATEGORY="NAME"},
+:drname2.DOCTOR_rule = {CATEGORY="NAME"},
+:drname3.DOCTOR_rule = {CATEGORY="NAME"}
+
+Rule: getDrNAME2
+Priority: 100
+({Token.string == "^"}):pre_context
+(
+{Token.orth == allCaps}({Token.string == ","})?{Token.orth == allCaps}
+):label
+({Split}):post_context
+-->
+:label.DOCTOR = {CATEGORY="NAME"}
+
+Rule: getDrNAME3
+Priority: 100
+({Token.string ==~ "(?i)Drs?"}({Token.string == "."})?):pre_context
+(
+{Token.orth == upperInitial, !Token.string ==~ "(?i)Done|Take|PO"} 
+{Token.orth == upperInitial}
+({Token.orth == upperInitial})? 
+({Token.orth == upperInitial})?|
+{Token.orth == upperInitial, !Token.string ==~ "(?i)Done|Take|PO"}{Token.string == ".", !Split}{Token.orth == upperInitial}|
+{Token.kind == word, !Token.string ==~ "(?i)take|dr"}
+):label
+-->
+:label.DOCTOR = {CATEGORY="NAME"}
+
+Rule: getDrNAME4
+Priority: 100
+(
+{Token.orth == upperInitial, !Token.string ==~ "Dr|Name"}({Token.string == ","})? {Token.orth == upperInitial}({Token.orth == upperInitial})? ({Token.orth == upperInitial})?|
+
+{Token.orth == upperInitial, !Token.string ==~ "Dr|Name"}({Token.orth == upperInitial})?{Token.string == ".", !Split}{Token.orth == upperInitial} ({Token.string == ".", !Split}{Token.orth == upperInitial})? ({Token.orth == upperInitial})?|
+
+({Token.orth == upperInitial, !Token.string == "Dr"}{Token.string == ".", !Split})?{Token.orth == allCaps} {Token.orth == allCaps} |
+
+{Token.orth == allCaps, !Token.string == "DR"}({Token.string == ","})? {Token.orth == allCaps} ({Token.orth == upperInitial}{Token.string == "."})?|
+
+{Token.orth == allCaps, !Token.string == "DR"}({Token.string == ","}|{Token.orth == upperInitial}{Token.string == "."})?{Token.orth == allCaps} ({Token.orth == allCaps})?|
+
+{Token.orth == allCaps}{Token.orth == upperInitial}
+):label
+(
+({Split})?
+({Token.string == ","})?{Token.string ==~ "MD|NP|PA-C|MDA|MD-Attending|MSN|ANP|NP"}| //add MS TP:5, FP:2 | MSN TP:3 | RN ?
+({Token.string == ","})?{Token.string ==~ "M|Ph"}{Token.string == "."}{Token.string == "D"}{Token.string == "."}|
+({Token.string == ","})?{Token.string == "N"}{Token.string == "."}{Token.string == "P"}{Token.string == "."}
+):post_context
+-->
+:label.DOCTOR = {CATEGORY="NAME"}
+
+
+//Very few occurrences.
+Rule: getDrNAME5
+Priority: 90
+(({Token.string ==~ "(?i)PCP|PA|PRS"}|{Token.string ==~ "(?i)Transcribed|Dictated|electronically|signed|recommended"}{Token.string ==~ "(?i)by|for"}|
+{Token.string ==~ "(?i)physician"})
+({Token.string == ":"})?):pre_context
+(
+{Token.orth == upperInitial, !Token.string == "Dr"}{Token.orth == upperInitial}({Token.string == "."})?{Token.orth == upperInitial}|
+{Token.orth == upperInitial, !Token.string == "Dr"}({Token.string == ","})? {Token.orth == upperInitial}({Token.orth == upperInitial})?|
+{Token.orth == allCaps, !Token.string == "Dr"}({Token.string == ","}|{Token.orth == upperInitial}{Token.string == "."})?{Token.orth == allCaps} ({Token.orth == allCaps})?
+):label
+-->
+:label.DOCTOR = {CATEGORY="NAME"}
+
+Rule: getDrNAME6
+Priority: 80
+({Split}):pre_context
+({Token.orth == allCaps}|{Token.kind == word}):label
+(({Token.string ==~ "/|:"})?):nolabel
+(({Token.orth == allCaps}|{Token.kind == word})?):label1
+(({Token.string ==~ "/|:"})?):nolabel2
+(({Token.orth == allCaps}|{Token.kind == word})?):label2
+(({Token.string ==~ "/|:"})?):nolabel3
+(({Token.orth == allCaps}|{Token.kind == word})?):label3
+(
+({Token.string == ";"})?({Token.string ==~ "\\d{2}"}{Token.string == "-"})?{Token.string ==~ "\\d{6,9}"}{Token.string == "."}{Token.string == "doc"}
+):post_context
+-->
+:label.DOCTOR = {CATEGORY="NAME"},
+:label1.DOCTOR = {CATEGORY="NAME"},
+:label2.DOCTOR = {CATEGORY="NAME"},
+:label3.DOCTOR = {CATEGORY="NAME"}
+
+Rule: getDrNAME7
+Priority: 80
+({Token.string ==~ "(?i)Attending|Residents?|Provider|Intern|Att|Surgeon|Cardiologist|MD|Staff"}({Token.string ==~ "(?i)physician"})?{Token.string == ":"}):pre_context
+(
+{Token.orth == upperInitial, !Token.string == "Dr"}{Token.orth == upperInitial}({Token.string == "."})?{Token.orth == upperInitial}|
+{Token.orth == upperInitial, !Token.string == "Dr"}({Token.string == ","})? {Token.orth == upperInitial}({Token.orth == upperInitial})?|
+{Token.orth == allCaps, !Token.string == "Dr"} ({Token.string == ","})?{Token.orth == allCaps}|
+{Token.orth == upperInitial, !Token.string == "Dr"}
+):label
+-->
+:label.DOCTOR = {CATEGORY="NAME"}
+

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/FAX.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/FAX.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/FAX.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/FAX.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,24 @@
+Phase: CONTACT
+Input: Token Split
+Options: control = appelt
+
+
+Macro: FAXPATTERN(
+({Token.string == "("})?{Token.string ==~ "[0-9]{3}"}({Token.string == ")"})?({Token.string == "-"})?
+{Token.string ==~ "[0-9]{3,4}"}(({Token.string == "-"})?{Token.string == "-"}{Token.string ==~ "[0-9]{3,4}"})?|
+{Token.string ==~ "[0-9]{3}"}({Token.string ==~ "-|\\."})?{Token.string ==~ "[0-9]{4}"}({Token.string ==~ "-|\\."}{Token.string ==~ "[0-9]{3}"})?| 
+{Token.string ==~ "[0-9]{3}"}({Token.string ==~ "-|\\."})?{Token.string ==~ "[0-9]{3}"}({Token.string ==~ "-|\\."}{Token.string ==~ "[0-9]{4}"})?|
+{Token.string ==~ "[0-9]{3}"}{Token.string ==~ "[0-9]{3}"}({Token.string ==~ "[0-9]{4}"})?|
+{Token.string ==~ "[0-9]{3}"}{Token.string ==~ "[0-9]{4}"}({Token.string ==~ "[0-9]{3}"})?
+)
+
+Rule: getFAX
+(
+({Token.string ==~ "(?i)fax"}({Token.string ==~ "(?i)No|Num|Number"})?)
+({Token.string == ":"}|{Token.string == "#"}|{Token.string == "."})?
+):pre_context
+(
+FAXPATTERN
+):label
+-->
+:label.FAX = {CATEGORY="CONTACT"}

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ID_NUM.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ID_NUM.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ID_NUM.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ID_NUM.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,81 @@
+Phase: ID_NUMBER
+Input: Token Split
+Options: control = appelt
+
+Rule: getIDNUM1
+({Token.string ==~ "(?i)eScription"}{Token.string ==~ "(?i)document"}{Token.string == ":"}):pre_context
+(
+{Token.string ==~ "\\d{1}"}{Token.string == "-"}{Token.string ==~ "\\d{6,8}"}
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+
+Rule: getIDNUM2
+(
+{Token.string ==~ "(?i)Member"}{Token.string ==~ "(?i)ID"}({Token.string == ":"}|{Token.string == "#"})
+):pre_context
+(
+{Token.string ==~ "\\d{8,12}"}
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+Rule: getIDNUM3
+(({Token.string == "_"})[3]):pre_context
+(
+{Token.kind == word, Token.orth == allCaps}{Token.string ==~ "\\d{3}"}{Token.string == "/"}{Token.string ==~ "\\d{3,5}"}|
+{Token.kind == word, Token.orth == allCaps}{Token.string ==~ "\\d{4}"}{Token.string == "/"}{Token.string ==~ "\\d{3,5}"}|
+{Token.string ==~ "\\d{4,5}"}{Token.string == "/"}{Token.string ==~ "\\d{4,5}"}
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+Rule: getIDNUM4
+(
+{Token.kind == word, Token.length == 2}{Token.string == ":"}{Token.kind == word, Token.length == 2}
+{Token.string == ":"}{Token.string ==~ "\\d{3,5}"}({Token.orth == upperInitial, Token.length == 1}|{Token.string == "/"}{Token.string ==~ "[0-9]{4}"})?
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+Rule: getIDNUM5
+(
+({Token.string ==~ "\\d{2}"}{Token.string == "-"})?{Token.string ==~ "\\d{6,9}"}
+):label
+({Token.string == "."}{Token.string ==~ "doc?"}):post_context
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+Rule: getIDNUM6
+({Token.string ==~ "(?i)Backjob|Voicejob|Job|Exam|Cardiology|Specimen|TR:?"}
+({Token.string ==~ "(?i)ID|Number"})?{Token.string == ":"}):pre_context
+(
+{Token.string ==~ "\\d{6,8}"}|{Token.orth == allCaps}{Token.string ==~ "\\d{6,9}"}
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+Rule: getIDNUM7
+({Token.string ==~ "(?i)LOT|FI|PA"}({Token.string ==~ "#|:"})?):pre_context
+(
+{Token.string ==~ "\\d{4,6}"}
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+Rule: getIDNUM8
+(
+{Token.orth == allCaps, Token.length == 2}{Token.string == ":"}{Token.string ==~ "\\d{4,6}"}{Token.string == ":"}{Token.string ==~ "\\d{2,4}"}
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}
+
+
+Rule: getIDNUM9
+({Token.string ==~ "(?i)Exam"}{Token.string ==~ "(?i)Code"}{Token.string == ":"}):pre_context
+(
+({Token.string ==~ "\\d{2,3}"})?({Token.orth == allCaps})?({Token.string ==~ "\\d{1}"})?
+):label
+-->
+:label.IDNUM = {CATEGORY="ID"}

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/MEDICALREC_NUM.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/MEDICALREC_NUM.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/MEDICALREC_NUM.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/MEDICALREC_NUM.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,67 @@
+Phase: MEDICALRECORD_NUMBER
+Input: Token Split
+Options: control = appelt
+
+
+Rule: getMEDRECORDNUM1
+(
+{Token.string ==~ "\\d{3}"}{Token.string == "-"}{Token.string ==~ "\\d{2}"}{Token.string == "-"}{Token.string ==~ "\\d{2}"}
+({Token.string == "-"}{Token.string ==~ "\\d{1}"})?
+):label
+-->
+:label.MEDICALRECORD = {CATEGORY="ID"}
+
+
+Rule: getMEDRECORDNUM2
+(({Token.string == "MRN"}|{Token.string == "Medical"}{Token.string == "Record"}{Token.string == "Number"})({Token.string == "#"})?({Token.string == ":"})?):pre_context
+(
+{Token.string ==~ "\\d{7,8}"}|{Token.string ==~ "\\d{3}"}{Token.string ==~ "\\d{2}"}{Token.string ==~ "\\d{2}"}
+):label
+-->
+:label.MEDICALRECORD = {CATEGORY="ID"}
+
+
+Rule: getMEDRECORDNUM3 
+({Token.string ==~ "MR|PH|DHN|LMH|Unit"}({Token.string == "#"})?({Token.string == ":"})?):pre_context
+(
+{Token.string ==~ "\\d{7,8}"}|{Token.string ==~ "\\d{3}"}{Token.string ==~ "\\d{2}"}{Token.string ==~ "\\d{2}"}
+):label
+-->
+:label.MEDICALRECORD = {CATEGORY="ID"}
+
+Rule: getMEDRECORDNUM4
+({Token.string == "^"}):pre_context
+(
+{Token.string ==~ "\\d{7,8}"}
+):label
+({Token.string == "^"}):post_context
+-->
+:label.MEDICALRECORD = {CATEGORY="ID"}
+
+Rule: getMEDRECORDNUM5
+(({Split})?):nolabel
+(
+{Token.string ==~ "\\d{8}"}|{Token.kind == word, Token.orth == allCaps}{Token.string ==~ "\\d{6,9}"} //N.B. only one example <WORD><NUM>
+):label
+({Split.kind == external}):post_context
+-->
+:label.MEDICALRECORD = {CATEGORY="ID"}
+
+
+//OVERLAPS with IDNUM
+Rule: getMEDRECORDNUM_6
+({Token.string ==~ "(?i)Report|Unit"}{Token.string ==~ "(?i)Number"}{Token.string == ":"}):pre_context
+(
+{Token.kind == word, Token.orth == allCaps}{Token.string ==~ "\\d{6,9}"}
+):label
+-->
+:label.MEDICALRECORD = {CATEGORY="ID"}
+
+Rule: getMEDRECORDNUM_7
+//OVERLAPS with IDNUM
+({Token.string ==~ "(?i)Accession"}{Token.string ==~ "#|:"}):pre_context
+(
+{Token.string ==~ "\\d{4,5}"}({Token.string == ":"})?{Token.kind == word, Token.orth == upperInitial, Token.length == 1}{Token.string ==~ "\\d{4,5}"}|{Token.string ==~ "\\d{5,8}"}
+):label
+-->
+:label.MEDICALRECORD = {CATEGORY="ID"}
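Reviewer note: the token sequence in getMEDRECORDNUM1 can be approximated at the string level for quick experiments outside GATE. The sketch below is an assumption-laden illustration, not the rule itself: the class name `MrnSketch` and helper `firstMatch` are hypothetical, and the digit lookarounds only imitate the tokenizer's behaviour of keeping a run of digits as one token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MrnSketch {
    // String-level approximation of getMEDRECORDNUM1:
    // ddd-dd-dd with an optional single-digit suffix (-d).
    // The lookarounds stop a match from starting or ending inside a longer digit run.
    private static final Pattern MRN =
        Pattern.compile("(?<!\\d)\\d{3}-\\d{2}-\\d{2}(?:-\\d)?(?!\\d)");

    // Returns the first matching substring, or null when nothing matches.
    static String firstMatch(String text) {
        Matcher m = MRN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("MRN 123-45-67-8 seen today"));
        System.out.println(firstMatch("no identifier here"));
    }
}
```

Unlike the JAPE rule, this pattern does not tolerate whitespace between the digit groups and hyphens.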

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PATIENT.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PATIENT.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PATIENT.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PATIENT.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,91 @@
+Phase: PATIENT
+Input: Token Split
+Options: control = appelt
+
+Rule: getPatient1
+Priority: 100
+({Token.string ==~ "Mr|Mrs|Ms|Miss|R[Ee]"}({Token.string ==~ "\\.|:", !Split.kind == external})? ):pre_context
+(
+{Token.orth == upperInitial, !Token.string ==~ "(?i)Name|Dr|Mr|Mrs|Ms|Miss|contin", Token.length > 1} 
+({Token.orth == upperInitial, !Token.string == "Done"})? 
+({Token.orth == upperInitial, !Token.string == "Done"})? 
+({Token.orth == upperInitial, !Token.string == "Done"})?|
+
+{Token.orth == upperInitial, !Token.string ==~ "Name|Dr|Mr|Mrs|Ms|Miss"}
+{Token.string == ".", !Split}
+{Token.orth == upperInitial} 
+(({Token.orth == upperInitial})?{Token.string == ".", !Split})?| 
+
+{Token.orth == upperInitial, !Token.string ==~ "Name|Dr"}
+{Token.string == ","}
+{Token.orth == upperInitial} ({Token.orth == upperInitial, Token.length == 1} {Token.string == "."})? |
+
+{Token.orth == allCaps}
+({Token.string == ","})?
+{Token.orth == allCaps}  ({Token.orth == upperInitial, Token.length == 1} ({Token.string == "."})?)?
+):label
+-->
+:label.PATIENT = {CATEGORY="NAME"}
+
+
+Rule: getPatient2
+Priority: 100
+({Token.string ==~ "(?i)patient|name|pt"}({Token.string ==~ "(?i)name"})?{Token.string == ":"}):pre_context
+(
+{Token.orth == upperInitial, !Token.string ==~ "(?i)Name|Dr|Mr|Mrs|Ms|Miss|contin"} 
+({Token.orth == upperInitial, !Token.string == "Done"})? 
+({Token.orth == upperInitial, !Token.string == "Done"})? 
+({Token.orth == upperInitial, !Token.string == "Done"})?|
+
+{Token.orth == upperInitial, !Token.string ==~ "Name|Dr|Mr|Mrs|Ms|Miss"}
+{Token.string == ".", !Split}
+{Token.orth == upperInitial} 
+(({Token.orth == upperInitial})?{Token.string == ".", !Split})?| 
+
+{Token.orth == upperInitial, !Token.string ==~ "Name|Dr"}
+{Token.string == ","}
+{Token.orth == upperInitial} ({Token.orth == upperInitial, Token.length == 1} {Token.string == "."})?|
+
+{Token.orth == allCaps}
+{Token.string == ","}
+{Token.orth == allCaps} ({Token.orth == upperInitial, Token.length == 1} ({Token.string == "."})?)?
+):label
+-->
+:label.PATIENT = {CATEGORY="NAME"}
+
+
+Rule: getPatient3
+Priority: 80
+({Token.string == "seeing"}):pre_context
+(
+{Token.orth == upperInitial, !Token.string ==~ "(?i)Done|Dr|Mr|Mrs|Miss|Ms|Pt|Patients"}
+({Token.orth == upperInitial, !Token.string == "Done"})?
+):label
+-->
+:label.PATIENT = {CATEGORY="NAME"}
+
+Rule: getPatient4
+Priority: 80
+({Split.kind == external}{Split.kind == external}):pre_context
+(
+{Token.orth == allCaps}
+({Token.string == ","})?
+{Token.orth == allCaps} ({Token.orth == upperInitial} {Token.string == "."})?|
+{Token.orth == upperInitial}({Token.string == ","})?{Token.orth == upperInitial}
+):label
+(
+{Split.kind == external}({Split.kind == external})?
+{Token.string ==~ "[0-9]{7,8}"}
+{Split.kind == external}
+):post_context
+-->
+:label.PATIENT = {CATEGORY="NAME"}
+
+Rule: getPatient5
+Priority: 90
+({Token.string ==~ "Mr|Mrs|Ms|Miss"}({Token.string ==~ "\\."})?):pre_context
+(
+{Token.kind == word, !Token.string ==~ "(?i)take|pt"}
+):label
+-->
+:label.PATIENT = {CATEGORY="NAME"}
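Reviewer note: the title-plus-name idea behind getPatient5 can be tried on raw strings. This is a minimal sketch, assuming the optional token after the title is meant to be a literal period; `PatientTitleSketch` and `nameAfterTitle` are hypothetical names, and a plain letter run stands in for `Token.kind == word`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatientTitleSketch {
    // Title, optional literal period, whitespace, then one word that is
    // not "take" or "pt" (case-insensitive), mirroring the rule's negation.
    private static final Pattern TITLE_NAME = Pattern.compile(
        "\\b(?:Mr|Mrs|Ms|Miss)\\b\\.?\\s+(?!(?i:take|pt)\\b)([A-Za-z]+)");

    // Returns the word following the title, or null when the pattern does not apply.
    static String nameAfterTitle(String text) {
        Matcher m = TITLE_NAME.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(nameAfterTitle("Mrs. Smith was seen today"));
        System.out.println(nameAfterTitle("Mr take aspirin daily"));
    }
}
```

Only the captured group corresponds to the `:label` binding; the title plays the role of `pre_context` and is not returned.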

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PHONE.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PHONE.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PHONE.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/PHONE.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,101 @@
+Phase: PHONE
+Input: Token Split
+Options: control = appelt
+
+Rule: getPHONE1
+Priority: 100 
+(
+({Token.string ==~ "(?i)(tele)?phone"}|{Token.string ==~ "(?i)tele?"}|{Token.string ==~ "(?i)contact"})
+({Token.string == ":"}|{Token.string == "#"}|{Token.string == "."})?
+):pre_context
+(
+({Token.string == "("})?{Token.string ==~ "[0-9]{3}"}({Token.string == ")"})?({Token.string == "-"})?
+{Token.string ==~ "[0-9]{4}"}{Token.string == "-"}{Token.string ==~ "[0-9]{3}"}|
+{Token.string ==~ "[0-9]{3}"}{Token.string ==~ "-|\\."}{Token.string ==~ "[0-9]{4}"}({Token.string ==~ "-|\\."}{Token.string ==~ "[0-9]{3}"})?| //ADDED \.
+{Token.string ==~ "[0-9]{3}"}{Token.string ==~ "-|\\."}{Token.string ==~ "[0-9]{3}"}({Token.string ==~ "-|\\."}{Token.string ==~ "[0-9]{4}"})?| //ADDED \.
+{Token.string ==~ "[0-9]{3}"}{Token.string ==~ "[0-9]{3}"}({Token.string ==~ "[0-9]{4}"})?|
+{Token.string ==~ "[0-9]{3}"}{Token.string ==~ "[0-9]{4}"}({Token.string ==~ "[0-9]{3}"})?
+):label
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
+
+Rule: getPHONE2
+Priority: 90
+(
+(({Token.string ==~ "(?i)beeper|phone|contact"}|{Token.string ==~ "(?i)pager?-?"})({Token.kind == word})[0,3])
+({Token.string == ":"}|{Token.string == "#"}({Token.string == ":"})?)?
+):pre_context
+(
+{Token.string ==~ "[0-9]{5}"}|{Token.string ==~ "[0-9]{1}"}{Token.string == "-"}{Token.string ==~ "[0-9]{4}"}
+):label
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
+
+Rule: getPHONE3
+Priority: 90
+(
+({Token.string == "("})?{Token.string ==~ "[0-9]{3}"}({Token.string == ")"})?({Token.string == "-"})?{Token.string ==~ "[0-9]{3}"}({Token.string == "-"})?{Token.string ==~ "[0-9]{4}"}
+|{Token.string ==~ "[0-9]{3}"}{Token.string == "-"}{Token.string ==~ "[0-9]{4}"}({Token.string == "-"}{Token.string ==~ "[0-9]{3}"})? //OBS! TP:5, FP:2, OL:1
+):label
+({!Token.string ==~ "(?i)cc|Units?"}):post_context //POSTNEG
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
+Rule: getPHONE4
+Priority: 100 
+({Token.string == "("}({Token.orth == upperInitial})?):pre_context
+(
+{Token.string ==~ "[0-9]{8}"}//|{Token.string ==~ "[0-9]{5}"}
+):label
+({Token.string == ")"}):post_context
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
+Rule: getPHONE5
+Priority: 100 
+({Token.string == "("}{Token.orth == upperInitial}):pre_context
+(
+{Token.string ==~ "[0-9]{8}"}|{Token.string ==~ "[0-9]{5}"}
+):label
+({Token.string == ")"}):post_context
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
+
+Rule: getPHONE6
+Priority: 80
+(({!Token.string ==~ "(?i)Accession|[0-9]{2,5}"}){Token.string ==~ "#|B|b|X|x|P|pgr?"}({Token.string == "."})?):pre_context
+(
+{Token.string ==~ "[0-9]{5}"}|{Token.string ==~ "[0-9]{1}"}{Token.string == "-"}{Token.string ==~ "[0-9]{4}"}
+):label
+(({Split})?):post_context
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
+
+Rule: getPHONE7
+Priority: 70
+(
+({Token.string ==~ "(?i)MD|phd|PA-C|Intern"}|{Token.string == "M"}{Token.string == "."}{Token.string == "D"}{Token.string == "."}|
+{Token.string == "Dr"}{Token.string == "."}{Token.orth == upperInitial} ({Token.orth == upperInitial})?)
+({Token.string ==~ ",|#"}|{Split})?
+):pre_context
+(
+{Token.string ==~ "[0-9]{5}"}|{Token.string ==~ "[0-9]{1}"}{Token.string == "-"}{Token.string ==~ "[0-9]{4}"}
+):label
+({Split}):post_context
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
+Rule: getPHONE8
+Priority: 70
+(({!Token.orth == allCaps}){Split}):pre_context
+(
+{Token.string ==~ "[0-9]{5}"}|{Token.string ==~ "[0-9]{1}"}{Token.string == "-"}{Token.string ==~ "[0-9]{4}"}
+):label
+({Split}):post_context
+-->
+:label.PHONE = {CATEGORY="CONTACT"}
+
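Reviewer note: getPHONE3 pairs a phone-shaped token sequence with a negated post-context so that dosage-like numbers followed by "cc" or "Units" are skipped. The sketch below approximates only the first alternative of that rule on raw strings; `PhoneSketch` and `findPhone` are hypothetical names, and the separator handling is a simplifying assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneSketch {
    // (ddd) ddd-dddd with optional parentheses and hyphens, rejected when
    // the next word is "cc", "unit" or "units" (case-insensitive).
    private static final Pattern PHONE = Pattern.compile(
        "(?<!\\d)\\(?\\d{3}\\)?[- ]?\\d{3}-?\\d{4}(?!\\d)"
        + "(?!\\s*(?i:cc|units?)\\b)");

    // Returns the first phone-like substring, or null when none qualifies.
    static String findPhone(String text) {
        Matcher m = PHONE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findPhone("call (617) 555-1234 tomorrow"));
        System.out.println(findPhone("given 6175551234 units"));
    }
}
```

The negative lookahead plays the role of the `//POSTNEG` post-context: it inspects the following word without making it part of the match.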

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STATE.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STATE.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STATE.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STATE.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,88 @@
+/**
+ * Note: the following STATEs produced many FPs (due to their ambiguity and the gazetteer's case-insensitive matching): IN|AS|DE|HI|OR
+ * Hence, they have been removed from ANNIE/resources/gazetteer/us_state_acronym_abbreviation.lst
+ **/
+Phase: STATE
+Input: Token Split Lookup
+Options: control = appelt
+
+Macro: PRENEG(
+ {!Token.kind == number}
+)
+
+Macro: POSTNEG(
+ {!Token.string ==~ "(?i)hospital|heart|association|medical|care|nursing|avenue|street|road|drive|boulevard"}|{!Token.orth == upperInitial}
+)
+
+Rule: getSTATE
+Priority:100
+(
+ {Lookup.minorType=="state"}|{Lookup.minorType=="state_a_a"}
+):label
+(
+ ({Token.string == ","})? {Token.string ==~ "[0-9]{5}"}({Token.string == "-"}{Token.string ==~ "[0-9]{4}"})?
+):post_context
+-->
+:label.STATE = {CATEGORY="LOCATION"}
+
+Rule: getSTATE2
+Priority:90
+(PRENEG):pre_context //negated context; if given context, do not annotate
+(
+{Lookup.minorType=="state"}
+):label
+(POSTNEG):post_context //negated context; if given context, do not annotate
+-->
+:label.STATE = {CATEGORY="LOCATION"}
+
+/**
+ * Tag acronyms and abbreviations used in narrative text by contextual clues (pre_context)
+ */
+Rule: getSTATE3
+Priority:90
+(
+{Token.string ==~ "(?i)originally"}{Token.string == "from"}({Token})[0,3]|
+{Token.string ==~ "(?i)home|son|daughter|mother|father|raised|grew|lived|lives"}({Token})[0,3]{Token.string == "in"}({Token})[0,3]
+):pre_context
+(
+ {Lookup.minorType=="state_a_a"}
+):label
+-->
+:label.STATE = {CATEGORY="LOCATION"}
+
+
+Rule: getSTATE4
+Priority:90
+({Token.orth == upperInitial}({Token.orth == upperInitial})?{Token.string == ","}):pre_context
+(
+ {Lookup.minorType=="state_a_a"}
+):label
+-->
+:label.STATE = {CATEGORY="LOCATION"}
+
+
+Rule: getSTATE5
+Priority:90
+(
+ {Lookup.minorType=="state"}
+):label
+(POSTNEG):post_context
+-->
+:label.STATE = {CATEGORY="LOCATION"}
+
+
+/**
+ * What about: IN|AS|DE|HI|OR
+ * "MD" not included in the gazetteer, hence: 
+ * Tag "MD" that occurs between assumed STREET and ZIP 
+ * TODO: use actual STREET and ZIP annotations! 
+ */ 
+Rule: getSTATE6
+Priority:90
+({Split}{Token.orth == upperInitial}{Token.string == ","}):pre_context
+(
+  {Token.string ==~ "MD|IN|AS|DE|HI|OR"}
+):label
+({Token.string ==~ "[0-9]{5}"}):post_context
+-->
+:label.STATE = {CATEGORY="LOCATION"}
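Reviewer note: getSTATE labels a state only when a ZIP-shaped token follows. A minimal string-level sketch of that idea is shown below, assuming a tiny hard-coded state list in place of the gazetteer `Lookup` annotations; `StateZipSketch`, `STATES` and `findState` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StateZipSketch {
    // Hypothetical stand-in for the gazetteer lists; the real lists are much longer.
    private static final String STATES = "MA|NY|CA|Texas|Ohio";

    // A state name, then (as lookahead only) an optional comma and a
    // five-digit ZIP with an optional -dddd extension.
    private static final Pattern STATE_BEFORE_ZIP = Pattern.compile(
        "\\b(" + STATES + ")\\b(?=\\s*,?\\s*\\d{5}(?:-\\d{4})?\\b)");

    // Returns the state that is followed by a ZIP, or null if there is none.
    static String findState(String text) {
        Matcher m = STATE_BEFORE_ZIP.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findState("Boston, MA 02114"));
        System.out.println(findState("She holds an MA degree"));
    }
}
```

The lookahead mirrors `post_context`: the ZIP must be present, but only the state is returned as the labelled span.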

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STREET.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STREET.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STREET.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/STREET.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,60 @@
+Phase: STREET
+Input: Token Split
+Options: control = appelt
+
+// street1-3 can be joined.
+
+Rule: getSTREET1
+Priority:100
+(
+{Token.string ==~ "[0-9]{1,4}"} 
+({Token.orth == upperInitial}|{Token.orth == allCaps}) 
+({Token.orth == upperInitial})? 
+{Token.string ==~ "(?i)street|st|road|rd|lane|avenue|ave|court|boulevard|terrace|circle|place|drive|way"}
+):label
+-->
+:label.STREET = {CATEGORY="LOCATION"}
+
+Rule: getSTREET2
+Priority:90
+(
+{Token.string ==~ "[0-9]{1,4}"} 
+({Token.orth == upperInitial}|{Token.orth == allCaps}) 
+({Token.orth == upperInitial})? 
+{Token.string ==~ "(?i)st|dr|ave|rd|ln|ct"}{Token.string == ".", Split}
+):label
+-->
+:label.STREET = {CATEGORY="LOCATION"}
+
+Rule: getSTREET3
+Priority:80
+(
+{Token.string ==~ "[0-9]{1,4}"} 
+({Token.orth == upperInitial}|{Token.orth == allCaps}) 
+({Token.orth == upperInitial})? 
+{Token.string ==~ "(?i)St|Ave|Rd|Ln|Ct"}
+):label
+-->
+:label.STREET = {CATEGORY="LOCATION"}
+
+Rule: getSTREET4
+Priority:90
+({!Token.string == "/"} ):pre_context //PRENEG
+(
+{Token.string ==~ "[0-9]{1,4}"} 
+({Token.orth == upperInitial}|{Token.orth == allCaps}) 
+{Token.string ==~ "(?i)Dr"}({Token.string == "."})?
+):label
+({!Token.orth == upperInitial}):post_context //POSTNEG
+-->
+:label.STREET = {CATEGORY="LOCATION"}
+
+
+Rule: getSTREET5
+Priority:70
+(
+({Token.orth == upperInitial}|{Token.orth == allCaps}) 
+{Token.string ==~ "(?i)street|road|avenue"}
+):label
+-->
+:label.STREET = {CATEGORY="LOCATION"}
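Reviewer note: getSTREET1 reads as "house number, one or two capitalised words, street suffix". The sketch below tries the same shape on raw strings; it is an approximation under assumptions: `StreetSketch` and `findStreet` are hypothetical names, and character classes stand in for `Token.orth == upperInitial` and `Token.orth == allCaps`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreetSketch {
    // 1-4 digit number, a capitalised or all-caps word, an optional second
    // capitalised word, then a case-insensitive street suffix.
    private static final Pattern STREET = Pattern.compile(
        "\\b\\d{1,4}\\s+(?:[A-Z][a-z]+|[A-Z]{2,})(?:\\s+[A-Z][a-z]+)?\\s+"
        + "(?i:street|st|road|rd|lane|avenue|ave|court|boulevard|terrace|circle|place|drive|way)\\b");

    // Returns the first street-like substring, or null when none is found.
    static String findStreet(String text) {
        Matcher m = STREET.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findStreet("lives at 55 Fruit Street in Boston"));
        System.out.println(findStreet("took 55 mg daily"));
    }
}
```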

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/USERNAME.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/USERNAME.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/USERNAME.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/USERNAME.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,20 @@
+Phase: USERNAME
+Input: Token 
+Options: control = appelt
+
+Rule: getUSERNAME1
+({Token.string == "["}):pre_context
+(
+{Token.string ==~ "[A-Za-z]{2,3}"}{Token.string ==~ "\\d{1,3}"}
+):label
+({Token.string == "]"}):post_context
+-->
+:label.USERNAME = {CATEGORY="NAME"}
+
+Rule: getUSERNAME2
+({Token.string == "M"}{Token.string == "."}{Token.string == "D"}{Token.string == "."}):pre_context
+(
+{Token.string ==~ "[A-Za-z]{2}", !Token.string ==~ "[Oo]n"}{Token.string ==~ "\\d{1,3}"}
+):label
+-->
+:label.USERNAME = {CATEGORY="NAME"}
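Reviewer note: getUSERNAME1 looks for a short letter run followed by a short digit run inside square brackets. A minimal string-level sketch follows, assuming the letters and digits are adjacent with no whitespace (the JAPE rule works on separate tokens and would also accept a space between them); `UsernameSketch` and `findUsername` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UsernameSketch {
    // "[" + 2-3 letters + 1-3 digits + "]"; only the part inside the
    // brackets is captured, matching the rule's label span.
    private static final Pattern USERNAME =
        Pattern.compile("\\[([A-Za-z]{2,3}\\d{1,3})\\]");

    // Returns the bracketed username without the brackets, or null.
    static String findUsername(String text) {
        Matcher m = USERNAME.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findUsername("Dictated by [jd42] on Monday"));
        System.out.println(findUsername("[abcd1234]"));
    }
}
```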

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ZIP.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ZIP.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ZIP.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/ZIP.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,11 @@
+Phase: ZIP
+Input: Token Split Lookup
+Options: control = appelt
+
+Rule: getZIP
+(({Lookup.minorType=="state_a_a"}|{Lookup.minorType=="state"})({Token.string == ","})?):pre_context
+(
+{Token.string ==~ "[0-9]{5}"}({Token.string == "-"}{Token.string ==~ "[0-9]{4}"})?
+):label
+-->
+  :label.ZIP = {CATEGORY="LOCATION"}

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_boundary_adjustment.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_boundary_adjustment.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_boundary_adjustment.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_boundary_adjustment.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,17 @@
+/*
+ * Including this in the pipeline to improve the PATIENT NER.
+ *
+ */
+Phase: BoundaryAdjustment
+Input: Token Split PATIENT
+Options: control = appelt
+
+Rule: expand
+(
+{PATIENT} {Token.orth == upperInitial}|
+{PATIENT}({Token.string == "."}{PATIENT})?| 
+{PATIENT}({Token.string == ","}({PATIENT}|{Token.orth == allCaps}|{Token.orth == upperInitial}))?
+):label
+-->
+:label.PATIENT={CATEGORY="NAME"}
+

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann1.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann1.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann1.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann1.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,41 @@
+//Post-processing
+//Move from AnnSet:passTwo to AnnSet:final_predictions
+Phase: Label
+Input: DOCTOR IDNUM MEDICALRECORD ZIP PATIENT
+Options: control = all
+
+Rule: copyIDNUM
+(
+ {IDNUM}
+):label
+-->
+  :label.IDNUM = {CATEGORY="ID"}
+
+Rule: copyMEDICALRECORD
+(
+ {MEDICALRECORD}
+):label
+-->
+  :label.MEDICALRECORD = {CATEGORY="ID"}
+
+Rule: copyZIP
+(
+ {ZIP}
+):label
+-->
+  :label.ZIP = {CATEGORY="LOCATION"}
+
+Rule: copyDOCTOR
+(
+ {DOCTOR}
+):label
+-->
+  :label.DOCTOR = {CATEGORY="NAME"}
+
+Rule: copyPATIENT
+(
+ {PATIENT}
+):label
+-->
+  :label.PATIENT = {CATEGORY="NAME"}
+

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann2.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann2.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann2.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_copy_ann2.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,68 @@
+//Post-processing
+//Move from AnnSet:passOne to AnnSet:final_predictions
+Phase: Label
+Input: AGE COUNTRY EMAIL FAX PHONE STATE STREET URL USERNAME
+Options: control = all
+
+Rule: copyAGE
+(
+ {AGE}
+):label
+-->
+  :label.AGE = {CATEGORY="AGE"}
+
+Rule: copyCOUNTRY
+(
+ {COUNTRY}
+):label
+-->
+  :label.COUNTRY = {CATEGORY="LOCATION"}
+
+Rule: copyEMAIL
+(
+ {EMAIL}
+):label
+-->
+  :label.EMAIL = {CATEGORY="CONTACT"}
+
+Rule: copyFAX
+(
+ {FAX}
+):label
+-->
+  :label.FAX = {CATEGORY="CONTACT"}
+
+Rule: copyPHONE
+(
+ {PHONE}
+):label
+-->
+  :label.PHONE = {CATEGORY="CONTACT"}
+
+Rule: copySTATE
+(
+ {STATE}
+):label
+-->
+  :label.STATE = {CATEGORY="LOCATION"}
+
+Rule: copySTREET
+(
+ {STREET}
+):label
+-->
+  :label.STREET = {CATEGORY="LOCATION"}
+
+Rule: copyURL
+(
+ {URL}
+):label
+-->
+  :label.URL = {CATEGORY="CONTACT"}
+
+Rule: copyUSERNAME
+(
+ {USERNAME}
+):label
+-->
+  :label.USERNAME = {CATEGORY="NAME"}

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_priority_sorting.jape
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_priority_sorting.jape?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_priority_sorting.jape (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/PostProc_priority_sorting.jape Mon Nov  2 16:59:48 2015
@@ -0,0 +1,13 @@
+//Post-processing
+//i.e., the PATIENT and DOCTOR annotations used are in AnnSet:passTwo
+Phase: Label
+Input: COUNTRY PATIENT DOCTOR
+Options: control = appelt
+
+//Priority sorting: COUNTRY over PATIENT,DOCTOR
+Rule: copyCOUNTRY
+(
+{COUNTRY notWithin PATIENT, COUNTRY notWithin DOCTOR} //overlapping?
+):label
+-->
+:label.COUNTRY = {}
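Reviewer note: the priority-sorting rule keeps a COUNTRY annotation only when it is not contained in a PATIENT or DOCTOR annotation. The sketch below illustrates that filter on plain character offsets, assuming `notWithin` means "not fully contained"; under that assumption a partially overlapping span is kept. `NotWithinSketch`, `within` and `keepNotWithin` are hypothetical names, and spans are `{start, end}` pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NotWithinSketch {
    // True if span inner = {start, end} lies entirely inside span outer.
    static boolean within(int[] inner, int[] outer) {
        return outer[0] <= inner[0] && inner[1] <= outer[1];
    }

    // Keeps the candidates that are not contained in any blocker span.
    static List<int[]> keepNotWithin(List<int[]> candidates, List<int[]> blockers) {
        List<int[]> kept = new ArrayList<>();
        for (int[] c : candidates) {
            boolean covered = false;
            for (int[] b : blockers) {
                if (within(c, b)) {
                    covered = true;
                    break;
                }
            }
            if (!covered) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> countries = Arrays.asList(new int[]{10, 16}, new int[]{40, 46});
        List<int[]> names = Arrays.asList(new int[]{5, 20});
        for (int[] span : keepNotWithin(countries, names)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```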

Added: ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/README
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/README?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/README (added)
+++ ctakes/sandbox/ctakes-clinical-deid/GATE/rule-set/postproc/README Mon Nov  2 16:59:48 2015
@@ -0,0 +1,4 @@
+To improve the PATIENT NER (13%+ F1-measure), include PostProc_boundary_adjustment.jape in the post-processing pipeline:
+
+1. Remove copyPATIENT from PostProc_copy_ann1.jape.
+2. Ensure the input annotation set for PostProc_boundary_adjustment.jape is passTwo and the output annotation set is final_predictions.

Added: ctakes/sandbox/ctakes-clinical-deid/LICENSE
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/LICENSE?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/LICENSE (added)
+++ ctakes/sandbox/ctakes-clinical-deid/LICENSE Mon Nov  2 16:59:48 2015
@@ -0,0 +1,14 @@
+Copyright 2015 Azad Dehghan
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+

Added: ctakes/sandbox/ctakes-clinical-deid/README
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/README?rev=1712083&view=auto
==============================================================================
--- ctakes/sandbox/ctakes-clinical-deid/README (added)
+++ ctakes/sandbox/ctakes-clinical-deid/README Mon Nov  2 16:59:48 2015
@@ -0,0 +1,36 @@
+	cDeid is a de-identification tool for clinical letters.
+	Copyright (C) 2015  Azad Dehghan.
+
+CONTACT: azad.dehghan@gmail.com
+
+SUMMARY:
+The cDeid v0.1 (US version) is a de-identification tool with state-of-the-art performance. The current version includes the following NERs: PATIENT, DOCTOR, USERNAME, STREET, ZIP, STATE, COUNTRY, PHONE, FAX, URL, EMAIL, AGE, MEDICALRECORD and IDNUM. 
+
+This tool was developed and validated using i2b2/UTHealth 2014 Track I data. 
+
+TODO:
+- Include NERs: DATE, HOSPITAL, ORGANIZATION and PROFESSION
+- Include ML models
+- Include priority sorting
+
+USAGE:
+cDeid is a simple command-line tool. The source code should be straightforward to dissect and integrate into your own application (see Controller.java for an example). In addition, validationtools.jar can be used for further development and validation (see the commented-out 'TESTING' code in Controller.java).
+
+Basic usage with the cDeid executable:
+
+Use case 1: Print help screen.
+ java -jar cDeid.jar -h
+
+Use case 2: Process a set of plain text files in directory inputdir/ and save results in outputdir/. 
+ java -jar cDeid.jar --xml inputdir/ outputdir/ 
+
+
+
+CONTRIBUTORS:
+...
+
+REFERENCE:
+I would appreciate it if you would cite the following paper when using or referring to the cDeid:
+
+[1] A. Dehghan et al., Combining knowledge- and data-driven methods for de-identification of clinical narratives, J Biomed Inform (2015), http://dx.doi.org/10.1016/j.jbi.2015.06.029
+
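Reviewer note: the README's second use case can also be driven from another Java program. The sketch below only assembles the command line quoted in the README; `CDeidLauncherSketch` and `buildCommand` are hypothetical names, the jar path is an assumption, and the launch itself is left commented out because it needs cDeid.jar on disk.

```java
import java.util.Arrays;
import java.util.List;

public class CDeidLauncherSketch {
    // Assembles: java -jar <jar> --xml <inputDir> <outputDir>
    static List<String> buildCommand(String jarPath, String inputDir, String outputDir) {
        return Arrays.asList("java", "-jar", jarPath, "--xml", inputDir, outputDir);
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("cDeid.jar", "inputdir/", "outputdir/");
        System.out.println(String.join(" ", cmd));
        // Uncomment to run the tool; this requires cDeid.jar in the working directory.
        // int exitCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```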

Added: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/controller/Controller.class
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/controller/Controller.class?rev=1712083&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/controller/Controller.class
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/controller/Document.class
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/controller/Document.class?rev=1712083&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/controller/Document.class
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/io/Output.class
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/io/Output.class?rev=1712083&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/io/Output.class
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/PassOne.class
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/PassOne.class?rev=1712083&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/PassOne.class
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/ner/EmailNER.class
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/ner/EmailNER.class?rev=1712083&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/ner/EmailNER.class
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/ner/UrlNER.class
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/ner/UrlNER.class?rev=1712083&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/firstpass/ner/UrlNER.class
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/other/PostProcess.class
URL: http://svn.apache.org/viewvc/ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/other/PostProcess.class?rev=1712083&view=auto
==============================================================================
Binary file - no diff available.

Propchange: ctakes/sandbox/ctakes-clinical-deid/bin/co/dehghan/cdeid/pipeline/other/PostProcess.class
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream