Posted to solr-user@lucene.apache.org by Bertrand Mathieu <bm...@universcine.com> on 2009/06/03 12:48:28 UTC

Alphabetical index for faceting

Hello,

My goal is to build an index for alphabetical faceting of titles. To do this I
am trying to define a fieldType that indexes only the first letter of the text,
with stopwords removed. My problem is that without WordDelimiterFilterFactory
the stopwords are not removed, and with it I end up with several tokens (and I
would like to keep just the first one).

For example, the string "The Curse of Monkey Island" should be indexed as
"c".

Here is my field type definition as of now:

    <fieldType name="alphabetical" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">

      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_fr.txt"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([0-9a-z]).*" replacement="$1" replace="all" />
      </analyzer>

    </fieldType>

With my example, this produces 3 tokens: "c", "m", "i".

I have not been able to find any documentation related to what I want to do
(wrong keywords in Google?). At this point I am beginning to think that I
will have to write a custom filter to replace the
PatternReplaceFilterFactory: it would keep the first character of the first
token and discard everything else. Unfortunately I have not programmed in
Java for years, so I would like to avoid that solution if possible.
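
To make the intended behaviour concrete, here is a rough sketch in plain Java.
It is not a Solr or Lucene filter and uses none of their APIs; the class name,
the method name and the tiny stopword list are all hypothetical, chosen only
for illustration (the real list would come from stopwords_fr.txt). It folds
accents, lowercases, drops stopwords, and keeps only the first character of
the first remaining word.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class FirstLetterSketch {
    // Illustrative stopwords only (assumption); not the contents of stopwords_fr.txt.
    public static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "of", "le", "la", "les"));

    /**
     * Returns the first [0-9a-z] character of the first non-stopword in the
     * title, or an empty string if no such word exists.
     */
    public static String firstLetter(String title) {
        // Fold accents: decompose, then strip the combining marks.
        String folded = Normalizer.normalize(title, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
        // Split on anything that is not a digit or an ASCII letter.
        for (String word : folded.split("[^0-9a-z]+")) {
            if (word.isEmpty() || STOPWORDS.contains(word)) {
                continue; // skip leading empty pieces and stopwords
            }
            return word.substring(0, 1); // first character of the first kept word
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(firstLetter("The Curse of Monkey Island")); // prints c
    }
}
```

What I am looking for is a way to get this result from the analyzer chain
itself, without having to write and maintain custom code.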

And since I don't think my need is particularly uncommon, I am wondering what
I am missing. Any ideas?

-- 
Bertrand Mathieu