Posted to dev@ctakes.apache.org by Peter Abramowitsch <pa...@gmail.com> on 2020/08/25 20:54:11 UTC
Short gene term collisions
As a thank you for your suggestions, here's a little file that may help.
It's a command file for sed that removes all short HGNC gene synonyms
that collide with common English words of 2, 3, or 4 characters in
length. You will only need it if you've included HGNC in your
vocabularies and the Gene & Receptor TUIs in your dictionary.
The common words list is a bit weird, containing some contemporary acronyms
that are not, strictly speaking, words. But feel free to improve it:
https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt
sed -f deletion_short_gene_terms_script < original_dict.script > scrubbed_dict.script
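If you want to regenerate or extend the command file yourself, something along these lines might work. It is only a rough sketch, not how the attached file was actually built. The function names and the gene synonym input file are hypothetical, and it assumes each single-word dictionary row ends with ,'term','term') so check that against your own .script file first.

```shell
# Sketch only. Assumptions (not from the original post):
#   $1 = a file of HGNC synonyms, one per line (hypothetical input)
#   $2 = the common words list, one word per line
#   dictionary rows for single-word terms end with ,'term','term')

# Lower-case a list and keep purely alphabetic entries of 2 to 4 letters.
short_words() {
    tr 'A-Z' 'a-z' < "$1" | grep -E '^[a-z]{2,4}$' | sort -u
}

# Print one sed delete command per short synonym that is also a common word.
make_deletion_script() {
    tmp=$(mktemp)
    short_words "$2" > "$tmp"
    # grep -Fx keeps only synonyms that exactly match a common word;
    # sed then wraps each match as /,'word','word')$/d
    short_words "$1" | grep -Fxf "$tmp" | sed "s|.*|/,'&','&')\$/d|"
    rm -f "$tmp"
}
```

Redirect the output of make_deletion_script into deletion_short_gene_terms_script and run the sed command above. Note that a pattern like this matches on the term text alone, so it would also drop any non-gene row with the same text. Review the generated commands before using them.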
Peter