Posted to dev@ctakes.apache.org by Peter Abramowitsch <pa...@gmail.com> on 2020/08/25 20:54:11 UTC

Short gene term collisions

As a thank you for your suggestions, here's a little file that may help.

It's a command file for sed that removes all short HGNC gene synonyms that
collide with common English words of 2, 3, or 4 characters in length. You
will only need it if you've included HGNC in your vocabularies and the Gene &
Receptor TUIs in your dictionary.

The common words list is a bit weird, containing some contemporary acronyms
that are not, strictly speaking, words. Feel free to improve it.

https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt

sed -f deletion_short_gene_terms_script < original_dict.script > scrubbed_dict.script
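
For anyone who wants to regenerate a similar command file from a different
word list, here is a rough sketch of one way to do it. It is not the attached
file and not necessarily how that file was built. The function names are made
up for illustration, and the row format is an assumption: it supposes that
each term row in the dictionary script looks something like
INSERT INTO CUI_TERMS VALUES(123,0,1,'term','term'), with the term text as a
lowercase single-quoted value followed by another quoted column. Check your
own .script file before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers; names and row format are assumptions, not cTAKES API.

# Read a word list on stdin (one word per line) and keep only purely
# alphabetic words of 2 to 4 characters, lowercased and de-duplicated.
short_common_words() {
    tr -d '\r' | tr '[:upper:]' '[:lower:]' | grep -E '^[a-z]{2,4}$' | sort -u
}

# Read words on stdin and emit one sed delete command per word.
# ASSUMPTION: the term appears in a row as ,'word',' (a quoted column
# followed by another quoted column). Adjust the pattern to your rows.
make_sed_deletions() {
    while IFS= read -r word; do
        printf "/,'%s','/d\n" "$word"
    done
}
```

With the word list above saved locally, something like
short_common_words < google-10000-english-usa.txt | make_sed_deletions > my_deletion_script
would produce a command file usable with sed -f as shown earlier. Note that
this sketch deletes every row whose term matches a short common word, whatever
vocabulary it came from, so it is blunter than a list restricted to HGNC
synonyms.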

Peter