Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2012/06/01 00:04:00 UTC

Re: Strip html

: I make a transformation XSLT which return :
: ---------------------------------------
: si les ruches d’abeilles prouvent la
:                   monarchie, les fourmillières, les troupes d’éléphants ou
: de castors prouvent la république.
: ---------------------------------------
: i put this html in solr:  $doc->addField('body_strip_html', $body_norm);   
	...
: But this don't work!
: I want to return this xml files (look exemple) if i search "castor".

I'm confused.

a) you said you've already transformed your input XML into plain text -- 
so i don't see why you need HTML stripping at all.
b) your current problem doesn't seem to have anything to do with HTML or 
XML ... you're asking why a document containing "castors" (plural) doesn't 
match a query for "castor" (singular), but the field type you say you are 
using has a very simple analyzer that doesn't do any stemming of any kind...

>>        <analyzer>
>>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>                <tokenizer class="solr.StandardTokenizerFactory"/>
>>        </analyzer>

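..since there is no HTML in your input, HTMLStripCharFilterFactory is a 
no-op, which leaves StandardTokenizerFactory, which just does 
tokenization.  So "castors" is indexed exactly as written, and a query 
for "castor" has nothing to match against.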

It seems like all you need to do is add a stemmer (and for efficiency: 
remove the HTMLStripCharFilterFactory).  I'm no expert, but it looks like 
you are indexing french, so i would suggest using a french stemmer...

https://wiki.apache.org/solr/LanguageAnalysis#French
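
As a rough sketch of what that could look like (the field type name 
"text_fr_stemmed" is made up for illustration, and the exact filter chain 
is an assumption on my part, so check it against the wiki page above for 
your Solr version):

```xml
<!-- hypothetical field type; the name is illustrative only -->
<fieldType name="text_fr_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split the plain text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- lowercase so "Castor" and "castor" are treated the same -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- french stemming, intended to reduce plurals like "castors" 
         and the singular "castor" to a common form -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at index time and at query time, both the 
indexed text and the query term go through the stemmer.  You will need to 
reindex your documents after changing the field type, since the terms 
already in the index were produced by the old analyzer.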



-Hoss