Posted to solr-user@lucene.apache.org by HL <fr...@gmail.com> on 2014/05/24 15:56:48 UTC

Special title sorting etc

I am trying to sort by the title field, ascending or descending,
in a manner that takes a language's stopword list into account.

For instance, I would like the titles
"The Book" and "A Wallet" to appear, when sorted, as

title
---------
The Book
A Wallet

So far, though, I have only managed to smash my head against the Solr wall,
with NO SUCCESS whatsoever!


So far I've tried to do this in Solr with various fieldType definitions,
either copying the contents of title to BIB_title_sort,
or via a dynamicField with a suffix or a prefix,
or even importing the title straight into the field.

Here is my last FAILED attempt to do that:

<fieldType name="sortString" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_el.txt,lang/stopwords_en.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

My question is:

Is there a way to do that in Solr?
OR
Do I HAVE TO remove the stop words and so on during the import process,
by writing custom scripts?

Thanks in advance,
Harry

Re: Special title sorting etc

Posted by Steve Rowe <sa...@gmail.com>.

Hi Harry,

You should be using solr.StrField, or KeywordTokenizer with solr.TextField - otherwise you’ll get multiple tokens, and for sorting, you want just one.

Here’s one way to get what you want: use a copyField to copy your title to a sortable field with a fieldType something like (untested):

<fieldType name="titleSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(?i)(a|an|the)\s+"
                replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>

The “(?i)” thing at the start of the pattern will cause it to match case-insensitively.
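
To wire this up, the schema also needs a destination field and a copyField rule. Here is a minimal, untested sketch. It assumes the source field is called title and reuses the BIB_title_sort name from the original question; the indexed/stored/multiValued settings are assumptions, not something stated in this thread:

```xml
<!-- Sort-only field: indexed so it can be sorted on, not stored since it is never displayed.
     It must be single-valued to be sortable. -->
<field name="BIB_title_sort" type="titleSort" indexed="true" stored="false" multiValued="false"/>

<!-- copyField copies the raw title value; the titleSort analyzer then strips the leading article. -->
<copyField source="title" dest="BIB_title_sort"/>
```

A query would then sort with sort=BIB_title_sort asc (or desc) while still displaying the original title field. Existing documents would need to be reindexed before the new field is populated.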

A common strategy for sorting titles while ignoring initial articles is to place the article at the end, separated by a comma, e.g. “Book, The” and “Wallet, A”; such a sorting mechanism would allow you to consistently sort “Book”, “The Book”, and “A Book” - here’s a slightly different version of the above field type that achieves this (again, untested):

<fieldType name="titleSort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(?i)(a|an|the)\s+(.*)"
                replacement="$2, $1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
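
Since the original question also loads a Greek stopword list, the same idea could in principle be extended to Greek articles. The following is an untested sketch under stated assumptions: the type name titleSortEl is hypothetical, and the article list is a small illustrative subset that is not taken from lang/stopwords_el.txt. One detail worth noting is that Java's (?i) flag by default folds case only for US-ASCII characters, so the pattern adds the u flag (Unicode-aware case folding) to cover Greek letters:

```xml
<!-- Hypothetical variant of titleSort that also strips a few leading Greek articles. -->
<fieldType name="titleSortEl" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- (?iu): case-insensitive matching with Unicode case folding.
         The article list below is illustrative only and should be adjusted to your data. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(?iu)(a|an|the|ο|η|το|οι|τα)\s+"
                replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

Because a charFilter runs before the tokenizer and before ICUFoldingFilterFactory, the pattern sees the original text with its accents intact, so any accented forms would have to be listed explicitly.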

Steve
