Posted to solr-dev@lucene.apache.org by Erik Hatcher <er...@gmail.com> on 2009/08/18 04:17:51 UTC

CharFilter, analysis.jsp

I'm interested in using a CharFilter, something like this:

     <fieldType name="html_text" class="solr.TextField">
       <analyzer>
         <charFilter class="solr.HTMLStripCharFilterFactory"/>
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       </analyzer>
     </fieldType>

In hopes of being able to put in a value like
"<html><body>whatever</body></html>" and have "whatever" come back out.
In analysis.jsp, I see that happening in the verbose output but it
doesn't make it to the tokenizer input - the original string makes it
there.

I must be misunderstanding something about CharFilters and how to use
them in Solr.  HTMLStripWhitespaceTokenizerFactory is deprecated in
favor of the above design, I think, but does what I'm after.
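
For comparison, the deprecated single-factory setup I mean would look
roughly like the sketch below.  The field type name "html_text_legacy"
is made up for illustration, and I'm assuming the factory needs no
extra attributes:

```xml
<!-- Sketch only: the deprecated approach, with HTML stripping built into the tokenizer. -->
<!-- "html_text_legacy" is a hypothetical name, not taken from any shipped schema. -->
<fieldType name="html_text_legacy" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With this, stripping and tokenizing happen in one step, so there is no
separate CharFilter stage that could be skipped along the way.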

Solr only seems to use CharFilters in analysis.jsp.  Is that
correct?  Shouldn't they be factored into the analyzer for each
field?  (like in FieldAnalysisRequestHandler)

Thanks,
	Erik


Re: CharFilter, analysis.jsp

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
I've opened a JIRA issue for this:
https://issues.apache.org/jira/browse/SOLR-1370

I'll look into it tomorrow.
Thanks,

Koji

Yonik Seeley wrote:
> On Mon, Aug 17, 2009 at 11:03 PM, Erik Hatcher<er...@gmail.com> wrote:
>   
>> That fixes it with analysis.jsp, but not with FieldAnalysisRequestHandler I
>> don't think.  Using that field definition below, and this request -
>>
>> http://localhost:8983/solr/analysis/field?analysis.fieldtype=html_text&analysis.fieldvalue=%3Chtml%3E%3Cbody%3Ewhatever%3C/body%3E%3C/html%3E
>>
>> I still see <str name="text"><html><body>whatever</body></html></str> come
>> out of WhitespaceTokenizer.
>>
>> Does the consumer of an Analyzer from a FieldType have to do anything
>> special to utilize CharFilters?  Or should it all "just work"?
>>     
>
> Normal users of the Analyzer should see it just work - but
> FieldAnalysisRequestHandler doesn't use the Analyzer... it pulls it
> apart and uses the parts separately.  It would be up to that code to
> apply any char filters, and apparently it doesn't.
>
> -Yonik
> http://www.lucidimagination.com
>
>   


Re: CharFilter, analysis.jsp

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Aug 17, 2009 at 11:03 PM, Erik Hatcher<er...@gmail.com> wrote:
> That fixes it with analysis.jsp, but not with FieldAnalysisRequestHandler I
> don't think.  Using that field definition below, and this request -
>
> http://localhost:8983/solr/analysis/field?analysis.fieldtype=html_text&analysis.fieldvalue=%3Chtml%3E%3Cbody%3Ewhatever%3C/body%3E%3C/html%3E
>
> I still see <str name="text"><html><body>whatever</body></html></str> come
> out of WhitespaceTokenizer.
>
> Does the consumer of an Analyzer from a FieldType have to do anything
> special to utilize CharFilters?  Or should it all "just work"?

Normal users of the Analyzer should see it just work - but
FieldAnalysisRequestHandler doesn't use the Analyzer... it pulls it
apart and uses the parts separately.  It would be up to that code to
apply any char filters, and apparently it doesn't.

-Yonik
http://www.lucidimagination.com

Re: CharFilter, analysis.jsp

Posted by Erik Hatcher <er...@gmail.com>.
That fixes it with analysis.jsp, but I don't think it fixes
FieldAnalysisRequestHandler.  Using the field definition below, and
this request -

http://localhost:8983/solr/analysis/field?analysis.fieldtype=html_text&analysis.fieldvalue=%3Chtml%3E%3Cbody%3Ewhatever%3C/body%3E%3C/html%3E

I still see <str name="text"><html><body>whatever</body></html></str>  
come out of WhitespaceTokenizer.
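
(That URL assumes the handler is registered in solrconfig.xml along the
lines of the fragment below, which I believe matches the example
config; treat the exact entry as an assumption about my setup.)

```xml
<!-- Assumed registration: maps the /analysis/field path used in the request above -->
<!-- to FieldAnalysisRequestHandler. -->
<requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler"/>
```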

Does the consumer of an Analyzer from a FieldType have to do anything
special to utilize CharFilters?  Or should it all "just work"?

	Erik


On Aug 17, 2009, at 10:52 PM, Yonik Seeley wrote:

> I broke it with reusable token streams.  Just checked in a fix - can
> you try now?
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Aug 17, 2009 at 10:17 PM, Erik Hatcher<er...@gmail.com> wrote:
>> I'm interested in using a CharFilter, something like this:
>>
>>    <fieldType name="html_text" class="solr.TextField">
>>      <analyzer>
>>        <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>      </analyzer>
>>    </fieldType>
>>
>> In hopes of being able to put in a value like
>> "<html><body>whatever</body></html>" and have "whatever" come back out.  In
>> analysis.jsp, I see that happening in the verbose output but it doesn't make
>> it to the tokenizer input - the original string makes it there.
>>
>> I must be misunderstanding something about CharFilters and how to use them
>> in Solr.  HTMLStripWhitespaceTokenizerFactory is deprecated in favor of the
>> above design, I think, but does what I'm after.
>>
>> Solr only seems to use CharFilters in analysis.jsp.  Is that correct?
>>  Shouldn't they be factored into the analyzer for each field?  (like in
>> FieldAnalysisRequestHandler)
>>
>> Thanks,
>>        Erik
>>
>>


Re: CharFilter, analysis.jsp

Posted by Yonik Seeley <yo...@lucidimagination.com>.
I broke it with reusable token streams.  Just checked in a fix - can
you try now?

-Yonik
http://www.lucidimagination.com


On Mon, Aug 17, 2009 at 10:17 PM, Erik Hatcher<er...@gmail.com> wrote:
> I'm interested in using a CharFilter, something like this:
>
>    <fieldType name="html_text" class="solr.TextField">
>      <analyzer>
>        <charFilter class="solr.HTMLStripCharFilterFactory"/>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      </analyzer>
>    </fieldType>
>
> In hopes of being able to put in a value like
> "<html><body>whatever</body></html>" and have "whatever" come back out.  In
> analysis.jsp, I see that happening in the verbose output but it doesn't make
> it to the tokenizer input - the original string makes it there.
>
> I must be misunderstanding something about CharFilters and how to use them
> in Solr.  HTMLStripWhitespaceTokenizerFactory is deprecated in favor of the
> above design, I think, but does what I'm after.
>
> Solr only seems to use CharFilters in analysis.jsp.  Is that correct?
>  Shouldn't they be factored into the analyzer for each field?  (like in
> FieldAnalysisRequestHandler)
>
> Thanks,
>        Erik
>
>