Posted to solr-user@lucene.apache.org by HL <fr...@gmail.com> on 2014/05/24 15:56:48 UTC
special Title Sorting etc
I am trying to sort by the title field, ascending or descending,
in a way that takes a language's stopword list into account.
For instance, I would like the titles
"The Book" and "A Wallet" to appear, when sorted, as
title
---------
The Book
A Wallet
but so far I have only managed to smash my head against the Solr wall,
with NO SUCCESS whatsoever!
So far I've tried to do this from Solr with various fieldType definitions,
either copying the contents of title to BIB_title_sort,
or via a dynamicField with a suffix or a prefix,
or even importing the title straight into the field.
Here is my last FAILED attempt to do that:
<fieldType name="sortString" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_el.txt,lang/stopwords_en.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
My question is:
Is there a way to do that in Solr?
OR
Do I HAVE TO remove the stop words and so on during the import process,
by writing custom scripts?
Thanks in advance,
Harry
Re: special Title Sorting etc
Posted by Steve Rowe <sa...@gmail.com>.
Hi Harry,
You should be using solr.StrField, or KeywordTokenizer with solr.TextField - otherwise you’ll get multiple tokens, and for sorting, you want just one.
Here’s one way to get what you want: copyField your title to a sortable field with a fieldType something like this (untested):
<fieldType name="titleSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(?i)(a|an|the)\s+"
                replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
The “(?i)” thing at the start of the pattern will cause it to match case-insensitively.
A common strategy for sorting titles while ignoring initial articles is to place the article at the end, separated by a comma, e.g. “Book, The” and “Wallet, A”. Such a sorting mechanism would allow you to consistently sort “Book”, “The Book”, and “A Book”. Here’s a slightly different version of the above field type that achieves this (again, untested):
<fieldType name="titleSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(?i)(a|an|the)\s+(.*)"
                replacement="$2, $1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
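Either field type above still has to be wired into the schema with a field and a copyField. A rough sketch of that wiring follows (untested). It reuses the title and BIB_title_sort names from the original message; the indexed, stored, and multiValued settings are assumptions and may need adjusting for your schema.

```xml
<!-- Sort-only field: indexed so it can be sorted on, not stored because
     the original title is returned from the "title" field instead.
     A field used for sorting must be single-valued.
     (Attribute values here are assumptions, not taken from the thread.) -->
<field name="BIB_title_sort" type="titleSort"
       indexed="true" stored="false" multiValued="false"/>

<!-- Copy the raw title into the sort field at index time; the titleSort
     analyzer then strips or moves the leading article. -->
<copyField source="title" dest="BIB_title_sort"/>
```

After reindexing, a request parameter such as sort=BIB_title_sort asc should then order results by the transformed title while the displayed title stays unchanged.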
Steve