Posted to solr-user@lucene.apache.org by Paul <pa...@nines.org> on 2011/04/28 18:10:27 UTC

Searching for escaped characters

I'm trying to create a test to make sure that character sequences like
"&egrave;" are successfully converted to their equivalent UTF-8
character (in this case, "è").

So, to find any escaped sequences that might have slipped through, I'd
like to search my Solr index using the equivalent of the following
regular expression:

&\w{1,6};
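
As a sanity check outside Solr, the same pattern can be run over the raw text before it is indexed. This is only a minimal sketch in Python; the function name find_unconverted_entities and the sample string are made up for illustration and are not part of any Solr API.

```python
import re

# Same pattern as above: '&', then 1 to 6 word characters, then ';'.
ENTITY_RE = re.compile(r"&\w{1,6};")


def find_unconverted_entities(text):
    """Return every substring that still looks like an escaped entity."""
    return ENTITY_RE.findall(text)


if __name__ == "__main__":
    # Hypothetical sample: two entities left unconverted, one converted.
    sample = "caf&eacute; cr&egrave;me, tr\u00e8s bien"
    print(find_unconverted_entities(sample))
```

Note that the {1,6} limit means an entity with a longer name, such as "&thetasym;", would not be reported by this pattern.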

Is this possible? I have indexed these fields with text_lu, which
looks like this:

    <fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

Thanks,
Paul

Re: Searching for escaped characters

Posted by Mike Sokolov <so...@ifactory.com>.
I think StandardTokenizer will have stripped the punctuation, so the "&"
and ";" won't be in the index. You might try searching for all the
entity names instead, though:

(agrave | egrave | omacron | etc... )

The names are pretty distinctive, although you might have problems with
Greek letters, since names like "alpha" and "beta" also turn up as
ordinary words.
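
A rough sketch of how such a query could be generated rather than typed by hand, assuming Python's standard html.entities table is an acceptable source of names. The field name "text" and the function build_entity_query are hypothetical, and the table only covers the HTML 4 names, so something like omacron would have to be added separately.

```python
from html.entities import name2codepoint


def build_entity_query(field="text", names=None, exclude=()):
    """Build an OR query over entity names for the given (hypothetical) field."""
    if names is None:
        # HTML 4 entity names from the standard library, e.g. "egrave".
        names = name2codepoint.keys()
    # Lowercase everything to mirror LowerCaseFilterFactory at index time.
    skip = {name.lower() for name in exclude}
    terms = sorted({name.lower() for name in names} - skip)
    return "{}:({})".format(field, " OR ".join(terms))


if __name__ == "__main__":
    # Leave out a few names that are also ordinary words.
    print(build_entity_query(exclude=["alpha", "beta", "and", "or", "not"]))
```

The exclude argument is one way to deal with the Greek letter problem: names that are likely to appear as normal words can be left out of the query.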

-Mike
