Posted to java-user@lucene.apache.org by Ziqi Zhang <zi...@sheffield.ac.uk> on 2015/09/25 11:15:47 UTC

token n-gram: leading and trailing stopwords removal only

Hi

Is there a way to remove just the leading and trailing stopwords from a 
token n-gram?

Currently I have the following combination which removes any n-gram that 
contains a stopword:

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3"
            outputUnigrams="true" outputUnigramsIfNoShingles="false" tokenSeparator=" " />
    <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement="" />
</analyzer>

For example, if my document contains these n-grams: "Tower of London", 
"Tower in London", "and London", "London", with "of", "in" and "and" as 
stopwords, the shingle filter will produce:
tower _ london, tower _ london, _ london, london
(note that the second "tower _ london" comes from a different n-gram than 
the first, but this information is lost)

and the pattern filter will then delete the first 3 n-grams.

What I really want to do though is to keep "tower of london", "tower in 
london", "london", "london".

Is this possible?

Many thanks!

