Posted to java-user@lucene.apache.org by Ziqi Zhang <zi...@sheffield.ac.uk> on 2015/09/25 11:15:47 UTC
token n-gram: leading and trailing stopwords removal only
Hi
Is there a way to remove just the leading and trailing stopwords from a
token n-gram?
Currently I have the following combination which removes any n-gram that
contains a stopword:
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory" />
  <filter class="solr.StopFilterFactory"
          ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory" />
  <filter class="solr.ShingleFilterFactory"
          minShingleSize="2" maxShingleSize="3"
          outputUnigrams="true"
          outputUnigramsIfNoShingles="false" tokenSeparator=" "/>
  <filter class="solr.PatternReplaceFilterFactory"
          pattern=".*_.*" replacement=""/>
</analyzer>
For example, if my document contains these n-grams: "Tower of London",
"Tower in London", "and London", "London", with "of", "in" and "and" as
stopwords, the shingle filter will produce:
tower _ london, tower _ london, _ london, london
(Note that the second "tower _ london" comes from a different phrase than
the first, but this information is lost.)
The pattern filter will then delete the first three n-grams.
What I really want, though, is to keep "tower of london", "tower in
london", "london", "london".
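To make the desired behaviour concrete, here is a minimal sketch in plain Java of the trimming I have in mind. It is not a Lucene TokenFilter and uses no Lucene or Solr API; the class name StopwordTrim and the method trimStopwords are made up for illustration, and the stopword set is assumed to be "of", "in" and "and".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopwordTrim {

    // Hypothetical helper (not part of Lucene/Solr): lowercases a
    // whitespace-separated n-gram and strips stopwords from its two ends only.
    // Stopwords in the middle of the n-gram are kept.
    public static String trimStopwords(String ngram, Set<String> stopwords) {
        String[] tokens = ngram.trim().toLowerCase(Locale.ROOT).split("\\s+");
        int start = 0;
        int end = tokens.length; // exclusive upper bound

        // Skip leading stopwords.
        while (start < end && stopwords.contains(tokens[start])) {
            start++;
        }
        // Skip trailing stopwords.
        while (end > start && stopwords.contains(tokens[end - 1])) {
            end--;
        }
        // Returns the empty string if every token was a stopword.
        return String.join(" ", Arrays.asList(tokens).subList(start, end));
    }

    public static void main(String[] args) {
        // Assumed stopword list for this example only.
        Set<String> stopwords = new HashSet<>(Arrays.asList("of", "in", "and"));
        String[] ngrams = {"Tower of London", "Tower in London", "and London", "London"};
        for (String ngram : ngrams) {
            System.out.println(trimStopwords(ngram, stopwords));
        }
    }
}
```

Applied to the four n-grams above, this prints "tower of london", "tower in london", "london" and "london". In an analyzer chain, a step like this would presumably have to run on shingles built from the unfiltered tokens, that is, without the StopFilterFactory placed before the ShingleFilterFactory.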
Is this possible?
Many thanks!