Posted to solr-user@lucene.apache.org by Steve Rowe <sa...@gmail.com> on 2013/03/01 18:21:30 UTC

Re: Defining tokenizer pattern with < character

Kristian,

I think what you want is pattern="&lt;[^&gt;]*&gt;" (untested) - that is, you probably don't want to regex-escape the character class brackets "[" and "]", and you should XML-escape the angle brackets, since the schema is an XML file and a literal "<" is not allowed inside an attribute value.
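
A minimal sketch of why the escaping matters, using Python's standard XML parser and `re` module rather than Solr itself (Solr uses Java regexes, which agree with Python's for a pattern this simple). The sample input string is hypothetical. It shows that a raw "<" in the attribute makes the element unparseable, while the entity-escaped form parses and yields the plain regex `<[^>]*>`:

```python
import re
import xml.etree.ElementTree as ET

# A raw "<" inside an attribute value is not well-formed XML, so parsing fails.
# This mirrors the "must not contain the '<' character" schema error.
raw = '<tokenizer class="solr.PatternTokenizerFactory" pattern="<[^>]*>" group="0"/>'
try:
    ET.fromstring(raw)
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# With the angle brackets written as entities, the element parses and the
# XML parser hands back the unescaped regex.
escaped = '<tokenizer class="solr.PatternTokenizerFactory" pattern="&lt;[^&gt;]*&gt;" group="0"/>'
pattern = ET.fromstring(escaped).get("pattern")

# Hypothetical sample input, only to show what the regex matches.
sample = "<p>Hello <b>world</b></p>"
tags = re.findall(pattern, sample)

print(raw_parses)  # whether the unescaped element parsed
print(pattern)     # the regex as the application would see it
print(tags)        # the substrings the regex matches
```

The square brackets need no backslashes because they are meant to form a character class, not to match literal "[" and "]".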

Steve
 
On Mar 1, 2013, at 11:42 AM, "Van Tassell, Kristian" <kr...@siemens.com> wrote:

> I'm trying to define the pattern:
> 
>   <tokenizer class="solr.PatternTokenizerFactory" pattern="<\[^\>\]*>" group="0"/>
> 
> But getting an error from Solr:
> 
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Schema Parsing Failed: The value of attribute "pattern" associated with an element type "null" must not contain the '<' character.
> 
> I'm trying to tokenize a CDATA section I am indexing. I've tried escaping the < character numerous ways (and used the &lt; entity...) but can't get it to work.
> 
> Any ideas? Thanks in advance!


RE: Defining tokenizer pattern with < character

Posted by "Van Tassell, Kristian" <kr...@siemens.com>.
It was a subset of HTML, yes, and it appears to work for my needs, thank you!

-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Friday, March 01, 2013 11:31 AM
To: solr-user@lucene.apache.org
Subject: Re: Defining tokenizer pattern with < character

Are you trying to strip out HTML tags? There are built-in classes that do that.

Or you might want to parse the XML or HTML before you pass it to Solr. An XML parser will interpret CDATA so that you never have to think about it. The parsed data is just text.

wunder


Re: Defining tokenizer pattern with < character

Posted by Walter Underwood <wu...@wunderwood.org>.
Are you trying to strip out HTML tags? There are built-in classes that do that.

Or you might want to parse the XML or HTML before you pass it to Solr. An XML parser will interpret CDATA so that you never have to think about it. The parsed data is just text.
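
A small sketch of the "parse it first" approach, assuming a hypothetical document whose CDATA section holds well-formed, HTML-like markup. It uses Python's standard XML parser purely for illustration; any XML parser should behave similarly. On the Solr side, the built-in class meant here is probably solr.HTMLStripCharFilterFactory, though that is an assumption since the message does not name one.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a CDATA section that contains markup.
doc = "<doc><body><![CDATA[<p>Hello <b>world</b></p>]]></body></doc>"

# The parser resolves the CDATA section; the result is an ordinary string
# with no CDATA wrapper left to deal with.
body_text = ET.fromstring(doc).findtext("body")

# If the embedded markup is itself well-formed, it can be parsed again and
# reduced to its text content. Real-world HTML often is not well-formed,
# in which case a dedicated HTML parser would be needed instead.
plain = "".join(ET.fromstring(body_text).itertext())

print(body_text)
print(plain)
```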

wunder
