Posted to solr-user@lucene.apache.org by F Knudson <fk...@lanl.gov> on 2008/07/24 15:53:35 UTC

Tokenizing and searching named character entity references

Greetings:

I am working with many different data sources - some sources employ "entity
references"; others do not.  My goal is to make searching across
sources as consistent as possible.

Example text - 

Source1:   weakening H&delta; absorption
Source1:   zero-field gap &omega;

Source2:  weakening H delta absorption
Source2:  zero-field gap omega

Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1,
each entity reference is replaced with the character it names.

This works great.

But I want the searching tokens to be identical for each source.  I need to
capture &delta;  as a token.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
</fieldType>
 
Is this possible with the SOLR supplied tokenizers?  I experimented with
different combinations and orders and was not successful.

Is this possible using synonyms?  I also experimented with this route but
again was not successful.

Do I need to create a custom tokenizer?

Thanks
Frances

RE: Tokenizing and searching named character entity references

Posted by Chris Hostetter <ho...@fucit.org>.

: You could extend HTMLStripReader to not decode named character entities, 
: e.g. by overriding HTMLStripReader.read() so that it calls an 
: alternative readEntity(), which instead of converting entity references 
: to characters would just leave the entity references as-is, something 
: like:

Alternately: use SynonymFilterFactory to map any entity "names" to the 
real Unicode character so your "Source2" style docs get "omega" replaced 
with the same character the HTMLStrip*TokenizerFactories generate when 
they encounter the HTML entities.

Generating the list of synonyms from the comment at the end of 
HTMLStripReader.java should be easy.
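
As a rough illustration of that idea, the sketch below prints synonym lines
in the "name => character" form that synonyms.txt accepts for explicit
mappings. The class name EntitySynonyms and the two-entry table are
hypothetical; a real list would be built from the full entity table rather
than typed by hand.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EntitySynonyms {

    // Hypothetical, hand-picked subset of the HTML named entities,
    // limited to the two names used in the examples in this thread.
    static Map<String, Character> sampleEntities() {
        Map<String, Character> entities = new LinkedHashMap<>();
        entities.put("delta", '\u03B4'); // GREEK SMALL LETTER DELTA
        entities.put("omega", '\u03C9'); // GREEK SMALL LETTER OMEGA
        return entities;
    }

    // Builds one explicit mapping per line: "name => character".
    static String toSynonymLines(Map<String, Character> entities) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Character> e : entities.entrySet()) {
            sb.append(e.getKey()).append(" => ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Redirect the output into synonyms.txt, saved as UTF-8.
        System.out.print(toSynonymLines(sampleEntities()));
    }
}
```

One thing to watch with a full list: some HTML entity names are also ordinary
words ("and", "or", "not"), so mapping every name blindly would rewrite those
words too. Trimming the generated list to the names that matter for the data
is probably safer.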


: > Source1:   weakening H&delta; absorption
: > Source1:   zero-field gap &omega;
: > 
: > Source2:  weakening H delta absorption
: > Source2:  zero-field gap omega



-Hoss

RE: Tokenizing and searching named character entity references

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Frances,

HTMLStripWhitespaceTokenizerFactory wraps a WhitespaceTokenizer around an HTMLStripReader.

You could extend HTMLStripReader to not decode named character entities, e.g. by overriding HTMLStripReader.read() so that it calls an alternative readEntity(), which instead of converting entity references to characters would just leave the entity references as-is, something like:

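public class MyHTMLStripReader extends HTMLStripReader {

  ///// constructor taking the Reader, as used by the factory below
  public MyHTMLStripReader(Reader source) {
    super(source);
  }

  ///// override read() to call myReadEntity(), but no other changes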
  public int read() throws IOException {
    ...
    switch (ch) {
      case '&':
        saveState();
        ch = myReadEntity(); ///// Change this line to call new method
        if (ch>=0) return ch;
        if (ch==MISMATCH) {
          restoreState();
          return '&';
        }
        break;
      ...
    }
  }

  private int myReadEntity() throws IOException {
    int ch = next();
    if (ch=='#') return readNumericEntity();
    return MISMATCH;  ///// Always a mismatch, except for numeric entities
  }
}

Then you could create a new Factory, something like:

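import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class MyHTMLStripWhitespaceTokenizerFactory extends BaseTokenizerFactory {
  public TokenStream create(Reader input) {
    return new WhitespaceTokenizer(new MyHTMLStripReader(input));
  }
}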
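
To make the intended effect concrete, here is a stand-alone sketch that does
not use any Solr classes. It assumes the goal is exactly what myReadEntity()
aims for: decimal numeric references are decoded, and named references are
passed through untouched. The class and method names are hypothetical, and
tag stripping is left out entirely.

```java
public class EntityPassThrough {

    // Decodes decimal numeric references such as "&#948;" and copies
    // everything else, including named references such as "&delta;",
    // to the output unchanged.
    static String decodeNumericOnly(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&' && i + 2 < text.length() && text.charAt(i + 1) == '#') {
                int start = i + 2;
                int j = start;
                // Accept at most 7 ASCII digits so parseInt cannot overflow.
                while (j < text.length() && j - start < 7
                        && text.charAt(j) >= '0' && text.charAt(j) <= '9') {
                    j++;
                }
                if (j > start && j < text.length() && text.charAt(j) == ';') {
                    int codePoint = Integer.parseInt(text.substring(start, j));
                    if (Character.isValidCodePoint(codePoint)) {
                        out.appendCodePoint(codePoint);
                        i = j + 1;
                        continue;
                    }
                }
            }
            out.append(c); // mismatch: emit the character as-is
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decodeNumericOnly("zero-field gap &omega; &#948;");
        // Whitespace splitting stands in for WhitespaceTokenizer here.
        for (String token : decoded.split("\\s+")) {
            System.out.println(token);
        }
    }
}
```

With this behavior the whitespace tokenizer emits &omega; as a single token,
so the later filters in the chain see the entity name rather than the decoded
character.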

Steve
