Posted to dev@ctakes.apache.org by Peter Abramowitsch <pa...@gmail.com> on 2020/08/25 01:07:11 UTC

Question about window size in term lookup

Hello all

Is there a mechanism, a lookup file, etc. which overrides the window size
set on the term annotator or the chunker?  Changing the window size from
the default of 3 to 2 opens the floodgates to false acronym annotations.  So
my question is whether there's a place where one can register specific
two-character terms, for example BP or PT, which will be found even with a
window size set to three.

A similar question about genes.  On adding the HGNC vocabulary I notice
that there are many thousands of aliases for genes which overlap other
common acronyms and English words such as trip, spring, plan, bed, yes,
rip, prn, etc.  I'm not sure if these aliases are ever used.  So I created
a sed script with 4000 regular expressions to remove the 2- and 3-letter gene
synonyms from a script file.  I will only suppress the 4-letter synonyms
manually where they cause trouble.  But does anyone have a more elegant
solution?

Peter

Re: Question about window size in term lookup [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.

> -- What was your english dictionary source?  I suppose that there could
> be some blacklisting in a dictionary creator.

I found these two:

https://raw.githubusercontent.com/first20hours/google-10000-english/master/google-10000-english-usa.txt
https://www.mit.edu/~ecprice/wordlist.10000

But these were probably derived from internet usage.  I was surprised by
some of the words that showed up.

P.


Re: Question about window size in term lookup [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

Hi Peter,

> I'm inferring that there's no way to set the window size to N and have an
> exception list of a few items that are of length < N.
-- As far as I can recall there isn't any such method in the lookup.

> Join all the 2&3 character gene terms with the 10,000 most common english
> words
-- I have seen this done elsewhere, and can't remember if anybody tested precision gained vs. recall lost.  It would be highly related to note/specialty type.
-- What was your english dictionary source?  I suppose that there could be some blacklisting in a dictionary creator.

> It reduced the number of items to remove by an order of magnitude.   ~4000
> down to ~400
-- Very nice.

> performance is a big factor in our project.
-- Yup.


If only the dictionary lookup differentiated between all-caps words and lower or mixed case ...

Thanks for sharing your ideas,
Sean




Re: Question about window size in term lookup [EXTERNAL]

Posted by Peter Abramowitsch <pa...@gmail.com>.

Thanks Sean.  A lot of good ideas.  I hadn't even been thinking of
post-filtering, but that's a very viable approach.  Something like using
tweezers to remove a splinter instead of removing splinters from all the
pieces of wood you might encounter.  I like how you use the functor approach
on the filters.

Yesterday I tried another method too.  Join (intersect) all the 2- and
3-character gene terms with the 10,000 most common English words, then take
the resulting list and use it to create a deletion list in the dictionary
creation step.  It reduced the number of items to remove by an order of
magnitude: ~4000 down to ~400.
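
[Editor's sketch] The join described above can be illustrated in a few lines of Java.  This is only a minimal sketch under stated assumptions: the class name ShortAliasDeletionList, the method build, and the tiny in-memory lists are hypothetical stand-ins for the real HGNC alias export and the 10,000-word list, and the comparison is assumed to be case-insensitive.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of cTAKES.
final class ShortAliasDeletionList {

    // Returns the 2- and 3-character aliases that also appear, ignoring case,
    // in the common-word list.  The result is lower-cased and sorted.
    static Set<String> build(List<String> geneAliases, List<String> commonWords) {
        Set<String> common = commonWords.stream()
                .map(w -> w.trim().toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
        return geneAliases.stream()
                .map(a -> a.trim().toLowerCase(Locale.ROOT))
                .filter(a -> a.length() == 2 || a.length() == 3)
                .filter(common::contains)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<String> aliases = List.of("BED", "RIP", "YES", "TRIP", "BRCA1", "PRN", "TNF");
        List<String> words = List.of("bed", "rip", "yes", "trip", "plan", "the");
        System.out.println(build(aliases, words)); // prints [bed, rip, yes]
    }
}
```

The four-letter alias TRIP is left out on purpose, matching the plan to handle 4-letter synonyms manually.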

Deleting it in the dictionary is more painful up front, but more performant
than post-filtering, for two obvious reasons.  But using your approach and
checking if the number of gene references is > 0, one can choose to filter
only specific notes, and that would increase performance again.
Unfortunately performance is a big factor in our project.

From your response and Kean's I'm inferring that there's no way to set the
window size to N and have an exception list of a few items that are of
length < N.  Right?  If there were, it would be in the chunker, not the
term lookup.

Thanks again for your suggestions!

Peter


Re: Question about window size in term lookup [EXTERNAL]

Posted by "Finan, Sean" <Se...@childrens.harvard.edu>.

I think that Kean is correct.  I usually create an annotator that removes terms that I don't want.  It is usually fairly easy.

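      // Matches any annotation whose covered text is exactly two characters long.
      final Predicate<IdentifiedAnnotation> is2char
            = a -> a.getCoveredText().length() == 2;

      final String geneTui = SemanticTui.getTui( "Gene or Genome" ).name();

      // Remove every two-character annotation that carries the gene TUI.
      OntologyConceptUtil.getAnnotationsByTui( jCas, geneTui )
                         .stream()
                         .filter( is2char )
                         .forEach( Annotation::removeFromIndexes );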


Or, if you want to grab a few that aren't specifically "Gene" but are in the same semantic group (without looking it up in class SemanticGroup), and in the HGNC vocabulary:

      final Class<? extends IdentifiedAnnotation> geneClass
            = SemanticTui.getTui( "Gene or Genome" )
                         .getGroup()
                         .getCtakesClass();

      final Predicate<IdentifiedAnnotation> isHgnc
            = a -> OntologyConceptUtil.getSchemeCodes( a ).containsKey( "hgnc" );

      JCasUtil.select( jCas, geneClass )
              .stream()
              .filter( is2char )
              .filter( isHgnc )
              .forEach( Annotation::removeFromIndexes );


"hgnc" may need to be "HGNC" ... and will only exist if you stored the HGNC codes in your dictionary.


Or you can do it focusing on what you do want.  

      final Collection<SemanticGroup> WANTED_GROUP = EnumSet.of( SemanticGroup.DRUG, SemanticGroup.LAB );
      
      final Predicate<IdentifiedAnnotation> isTrashGroup
            = a -> SemanticGroup.getGroups( a )
                                .stream()
                                .noneMatch( WANTED_GROUP::contains );
      
      JCasUtil.select( jCas, IdentifiedAnnotation.class )
              .stream()
              .filter( is2char )
              .filter( isTrashGroup )
              .forEach( Annotation::removeFromIndexes );

Or if you want to cover all combinations that aren't all uppercase:

      // True when the covered text contains at least one lower-case letter.
      final Predicate<IdentifiedAnnotation> notCaps
            = a -> a.getCoveredText()
                    .chars()
                    .anyMatch( Character::isLowerCase );

      JCasUtil.select( jCas, IdentifiedAnnotation.class )
              .stream()
              .filter( is2char )
              .filter( notCaps )
              .forEach( Annotation::removeFromIndexes );

Or mix and modify.  For instance, ignore character length but require that the TUI is Gene and the text is not all caps.
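
[Editor's sketch] The "mix and modify" case can be tried without a cTAKES install.  The block below is a minimal sketch, not the cTAKES API: the Mention class and the GeneCaseFilter helper are hypothetical stand-ins for IdentifiedAnnotation and a real annotator, and the gene check assumes the UMLS TUI T028 ("Gene or Genome").  It collects the matches before removing them, a conservative choice in case the collection being streamed is a live view that should not be modified during iteration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper, not part of cTAKES.
final class GeneCaseFilter {

    // Hypothetical stand-in for an annotation: covered text plus a UMLS TUI.
    static final class Mention {
        final String text;
        final String tui;
        Mention(String text, String tui) { this.text = text; this.tui = tui; }
    }

    // T028 is the UMLS TUI for "Gene or Genome".
    static final Predicate<Mention> IS_GENE = m -> "T028".equals(m.tui);

    // True when the text contains at least one lower-case letter.
    static final Predicate<Mention> NOT_CAPS
            = m -> m.text.chars().anyMatch(Character::isLowerCase);

    // Removes gene mentions that are not all caps, regardless of length,
    // and returns the texts that were removed.
    static List<String> dropLowerCaseGenes(List<Mention> mentions) {
        // Collect first, then remove, so the list is not modified while streamed.
        List<Mention> doomed = mentions.stream()
                .filter(IS_GENE.and(NOT_CAPS))
                .collect(Collectors.toList());
        mentions.removeAll(doomed);
        return doomed.stream().map(m -> m.text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<Mention> mentions = new ArrayList<>(List.of(
                new Mention("trip", "T028"),
                new Mention("BRCA1", "T028"),
                new Mention("bed", "T028"),
                new Mention("bed rest", "T061")));
        System.out.println(dropLowerCaseGenes(mentions)); // prints [trip, bed]
        System.out.println(mentions.size());              // prints 2
    }
}
```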

Sometimes I enjoy mocking up code ...

Sean


Re: Question about window size in term lookup

Posted by Kean Kaufmann <ke...@recordsone.com>.

>
> my question is whether there's a place where one can register specific two
> character terms, for example BP or PT which will be found even with a
> window size set to three.


My brute-force approach is pretty brutal: Change the window size to two,
annotate terms, then remove all two-letter annotations except the very few
I'm interested in.
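
[Editor's sketch] The brute-force approach above amounts to an allow-list applied after lookup.  The following is a minimal sketch on plain strings, assuming a case-sensitive match; the class TwoLetterWhitelist and its KEEP set are hypothetical, and in a real pipeline the rejected terms would be annotations removed from the CAS indexes rather than strings dropped from a list.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of cTAKES.
final class TwoLetterWhitelist {

    // Hypothetical allow-list; BP and PT are the examples from the question.
    static final Set<String> KEEP = Set.of("BP", "PT");

    // Keeps every term that is not two characters long, and keeps a
    // two-character term only when it is on the allow-list.
    static List<String> prune(List<String> annotatedTerms) {
        return annotatedTerms.stream()
                .filter(t -> t.length() != 2 || KEEP.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration only.
        List<String> found = List.of("BP", "he", "PT", "aspirin", "IT", "MI");
        System.out.println(prune(found)); // prints [BP, PT, aspirin]
    }
}
```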
