Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2012/06/01 00:04:00 UTC
Re: Strip html
: I wrote an XSLT transformation which returns:
: ---------------------------------------
: si les ruches d’abeilles prouvent la
: monarchie, les fourmillières, les troupes d’éléphants ou
: de castors prouvent la république.
: ---------------------------------------
: i put this html in solr: $doc->addField('body_strip_html', $body_norm);
...
: But this doesn't work!
: I want this XML file returned (see example) if i search for "castor".
I'm confused.
a) you said you've already transformed your input XML into plain text --
so i don't see why you need HTML stripping at all.
b) your current problem doesn't seem to have anything to do with HTML or
XML ... you're asking why a document containing "castors" (plural) doesn't
match a query for "castor" (singular), but the field type you say you are
using has a very simple analyzer that doesn't do any stemming of any kind...
>> <analyzer>
>> <charFilter class="solr.HTMLStripCharFilterFactory"/>
>> <tokenizer class="solr.StandardTokenizerFactory"/>
>> </analyzer>
...since there is no HTML in your input, HTMLStripCharFilterFactory is a
no-op, which leaves StandardTokenizerFactory, which only does
tokenization.
It seems like all you need to do is add a stemmer (and for efficiency:
remove the HTMLStripCharFilterFactory). I'm no expert, but it looks like
you are indexing French, so i would suggest using a French stemmer...
https://wiki.apache.org/solr/LanguageAnalysis#French
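As a rough illustration of the suggestion above, here is a minimal sketch
of a schema.xml field type with a French stemmer. The type name
"text_fr_stemmed" is made up for this example, and the filter chain
(lowercasing plus the Snowball French stemmer) is only one possible
choice, not necessarily the best one for your data; the wiki page above
lists the alternatives.

```xml
<!-- Hypothetical field type name; pick whatever fits your schema. -->
<fieldType name="text_fr_stemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- No HTMLStripCharFilterFactory: the input is already plain text. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first so stemming is case-insensitive. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Snowball French stemmer; it should reduce "castors" and "castor"
         to the same term, so a query for either can match. -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<!-- Assumes the body_strip_html field from the original message uses this type. -->
<field name="body_strip_html" type="text_fr_stemmed" indexed="true" stored="true"/>
```

Since a single analyzer is declared, it applies at both index and query
time. Documents indexed under the old field type have to be reindexed
before the stemmed terms show up in the index.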
-Hoss