Posted to solr-user@lucene.apache.org by Paul Rosen <pa...@performantsoftware.com> on 2009/08/19 20:45:45 UTC

strange sorting results: each word in field is sorted

I'm trying to sort, but I am not always getting the correct results and 
I'm not sure where to start tracking down the problem.

You can see the problem here (at least until it's fixed!): 
http://nines.performantsoftware.com/search/saved?user=paul&name=poem

If you sort by Title/Ascending, you get partially sorted results, but it 
seems to be using a random word to sort on instead of sorting on the 
entire title.

Page one starts out fine:

(blank)
Adieu
Advertisement
Afterwards
etc....

but by page 6 it starts to break down:

Elizabeth Barrett Browning
Albert and Elweena
Emerson and Bacon
etc...
Errata
Anne Bannerman: Biographical Essay
Aboringines (Estonia)
etc...

I notice in the above list that each title IS sorted on some word, just 
not the first one. (In fact, it seems to be whichever word in the title 
comes last in the sort order.)

Then at the very end (page 336, for instance), it puts some titles that 
contain diacritical marks:

Roman à Clef
The Forgotten Reaping-Hook: Sex in My Ántonia
Social (Re)Visioning in the Fields of My Ántonia
etc...

I'm not sure what info would be useful to help debug. Here is what seems 
to be the relevant part of my schema.xml file:

<fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.StandardFilterFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldtype>

<field name="title" type="text_lu" indexed="true" stored="true" 
multiValued="true"/>

Thanks,
Paul

Re: strange sorting results: each word in field is sorted

Posted by Erik Hatcher <er...@gmail.com>.

On Aug 19, 2009, at 3:50 PM, Paul Rosen wrote:
>> I'm surprised you're not seeing an exception when trying to sort on  
>> title given this configuration.  Sorting must be done on single  
>> valued indexed fields, that have at most a single term indexed per  
>> document.  I recommend you use copyField to copy title to  
>> title_sort and configure a title_sort field as a "string" or a  
>> field type that analyzes only to a single term (like simply keyword  
>> tokenizing -> lower case filter).
>>    Erik
>
> I want to double check this (since you probably remember how long it  
> takes to recreate the indexes). I think you're saying to add these  
> two lines, then re-index:
>
> <field name="title_sort" type="string" indexed="true" stored="true"/>
> <copyField source="title" dest="title_sort"/>

For the simplest case, yes.  You do have to be careful that the sort
field is not multiValued, though - and I believe the NINES model allowed
for multiple titles.  If so, copyField would copy every title into
title_sort, so it might be necessary for your indexing client to
specify the single sort field value itself instead of leveraging
copyField.

> Now, this is case-sensitive, right? So would this make it case- 
> insensitive?

Yes, the above would be case sensitive.

> <fieldtype name="sort_string" class="solr.StrField"  
> sortMissingLast="true">
>  <analyzer>
>    <filter class="solr.LowerCaseFilterFactory"/>
>  </analyzer>
> </fieldtype>
> <field name="title_sort" type="sort_string" indexed="true"  
> stored="true"/>
> <copyField source="title" dest="title_sort"/>

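That <analyzer> definition isn't quite right - an analyzer must have at
least a tokenizer, and analysis only applies to solr.TextField types (a
solr.StrField is indexed as-is).  The KeywordTokenizer "tokenizes" the
entire string into a single token, which is what you want for sorting.
In Solr's example schema there is a field type like this: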

     <fieldType name="alphaOnlySort" class="solr.TextField"
                sortMissingLast="true" omitNorms="true">
       <analyzer>
         <!-- KeywordTokenizer does no actual tokenizing, so the entire
              input string is preserved as a single token
           -->
         <tokenizer class="solr.KeywordTokenizerFactory"/>
         <!-- The LowerCase TokenFilter does what you expect, which can be
              useful when you want your sorting to be case insensitive
           -->
         <filter class="solr.LowerCaseFilterFactory" />
         <!-- The TrimFilter removes any leading or trailing whitespace -->
         <filter class="solr.TrimFilterFactory" />
         <!-- The PatternReplaceFilter gives you the flexibility to use
              Java regular expressions to replace any sequence of characters
              matching a pattern with an arbitrary replacement string,
              which may include back references to portions of the original
              string matched by the pattern.

              See the Java Regular Expression documentation for more
              information on pattern and replacement string syntax.

              http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
           -->
         <filter class="solr.PatternReplaceFilterFactory"
                 pattern="([^a-z])" replacement="" replace="all"
         />
       </analyzer>
     </fieldType>
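
To wire that in, here is a minimal sketch. It assumes the alphaOnlySort
type above is present in your schema.xml and that title_sort is the field
name you settle on. The copyField line is only safe if each document has
at most one title.

```xml
<!-- Sort-only companion to the tokenized, multiValued "title" field.
     Declared single-valued; stored="false" since it is never displayed. -->
<field name="title_sort" type="alphaOnlySort" indexed="true" stored="false"
       multiValued="false"/>

<!-- Only appropriate when every document has at most one title; otherwise
     drop this line and have the indexing client send title_sort itself. -->
<copyField source="title" dest="title_sort"/>
```

After re-indexing, the request would sort with sort=title_sort asc instead
of sorting on title. Note that the ([^a-z]) pattern in alphaOnlySort strips
everything except a-z, so spaces, digits and punctuation play no part in
the ordering.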

> Also, I'm guessing from seeing the current results that this  
> wouldn't collate the characters with diacritical marks correctly. Is  
> there a way to indicate that, for instance, A-grave would sort next  
> to A?

Yes, you can incorporate a diacritic-normalizing filter into the
analyzer definition above: ASCIIFoldingFilter, or the ISO Latin1 accent
filter (ISOLatin1AccentFilter).
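
As a sketch of what that could look like: the type name title_sort_folded
is made up for this example, and solr.ASCIIFoldingFilterFactory assumes
your Solr version ships it. If it doesn't,
solr.ISOLatin1AccentFilterFactory is the one to try in its place.

```xml
<fieldType name="title_sort_folded" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Keep the whole title as one token so there is a single sort term -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Fold accented characters to their ASCII base letter,
         e.g. A-grave becomes A -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

If you also keep a PatternReplaceFilter with a pattern like ([^a-z]), the
folding filter needs to come before it, since that pattern would otherwise
delete the accented letters outright instead of sorting them next to the
plain ones.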

> And, while I'm on the subject, I have to do the same thing with the  
> Author field, but unfortunately, that is sometimes "First Last" and  
> sometimes "Last, First". Is there any way to sort those by last  
> name, or do I just have to encourage the index people to be more  
> consistent?

Good luck with getting consistency in your domain!  :)

But it certainly makes sense to request that from the data providers,  
in at least some form that can be turned into the sortable value.

> I can think of a fairly simple algorithm, but am not sure where to  
> implement it:
>
> - if the word "and" or "&" appears, just look at the left side of  
> the field (in other words, sort by the first name that appears.)
> - if there is a comma, but it is part of ", jr." or some other  
> common suffixes like that, ignore it.
> - otherwise, if there is no comma, sort by the last word, unless it  
> is "jr", "sr", "III", etc., then sort by the word before that.
> - otherwise, sort by the first word.

Probably best to implement that in the indexing client code, but  
simple transformations could be implemented using the  
PatternReplaceFilter like above.
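
For instance, here is a rough sketch of just the "no comma, so sort by
the last word" rule as a filter. The type name author_sort and the regex
are my own illustration, not something from the example schema, and it
does not attempt the "and"/"&" or "jr."/"III" cases.

```xml
<fieldType name="author_sort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- If the value has no comma, move the last word to the front:
         "first last" becomes "last, first". Values that already contain
         a comma ("last, first") do not match and are left unchanged,
         as are single-word values. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([^,]+?)\s+([^,\s]+)$" replacement="$2, $1"
            replace="first"/>
  </analyzer>
</fieldType>
```

Anything beyond a rule of this size is probably easier to write and test
in the indexing client.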

	Erik

Re: strange sorting results: each word in field is sorted

Posted by Paul Rosen <pa...@performantsoftware.com>.

Erik Hatcher wrote:
> 
> On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:
>> You can see the problem here (at least until it's fixed!): 
>> http://nines.performantsoftware.com/search/saved?user=paul&name=poem
> 
> Hi Paul - that project looks familiar!  :)

Hi Erik! I should hope so! And I've gone a year without having to delve 
into solr much since it has just plain worked.

Thanks for the speedy reply.

> I'm surprised you're not seeing an exception when trying to sort on 
> title given this configuration.  Sorting must be done on single valued 
> indexed fields, that have at most a single term indexed per document.  I 
> recommend you use copyField to copy title to title_sort and configure a 
> title_sort field as a "string" or a field type that analyzes only to a 
> single term (like simply keyword tokenizing -> lower case filter).
> 
>     Erik

I want to double check this (since you probably remember how long it 
takes to recreate the indexes). I think you're saying to add these two 
lines, then re-index:

<field name="title_sort" type="string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>

Now, this is case-sensitive, right? So would this make it case-insensitive?

<fieldtype name="sort_string" class="solr.StrField" sortMissingLast="true">
   <analyzer>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
</fieldtype>
<field name="title_sort" type="sort_string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>

Also, I'm guessing from seeing the current results that this wouldn't 
collate the characters with diacritical marks correctly. Is there a way 
to indicate that, for instance, A-grave would sort next to A?

And, while I'm on the subject, I have to do the same thing with the 
Author field, but unfortunately, that is sometimes "First Last" and 
sometimes "Last, First". Is there any way to sort those by last name, or 
do I just have to encourage the index people to be more consistent?

I can think of a fairly simple algorithm, but am not sure where to 
implement it:

- if the word "and" or "&" appears, just look at the left side of the 
field (in other words, sort by the first name that appears.)
- if there is a comma, but it is part of ", jr." or some other common 
suffixes like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it is 
"jr", "sr", "III", etc., then sort by the word before that.
- otherwise, sort by the first word.

That would get most of the cases.

Thanks,
Paul

Re: strange sorting results: each word in field is sorted

Posted by Erik Hatcher <er...@gmail.com>.

On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:
> You can see the problem here (at least until it's fixed!): http://nines.performantsoftware.com/search/saved?user=paul&name=poem

Hi Paul - that project looks familiar!  :)

> If you sort by Title/Ascending, you get partially sorted results,  
> but it seems to be using a random word to sort on instead of sorting  
> on the entire title.
>
> I'm not sure what info would be useful to help debug. In my  
> schema.xml file, I've clipped what seems to be the relevant part:
>
> <fieldtype name="text_lu" class="solr.TextField"  
> positionIncrementGap="100">
>  <analyzer>
>    <tokenizer class="solr.StandardTokenizerFactory"/>
>    <filter class="solr.StandardFilterFactory"/>
>    <filter class="solr.LowerCaseFilterFactory"/>
>  </analyzer>
> </fieldtype>
>
> <field name="title" type="text_lu" indexed="true" stored="true"  
> multiValued="true"/>

I'm surprised you're not seeing an exception when trying to sort on
title given this configuration.  Sorting must be done on single-valued
indexed fields that have at most a single term indexed per document.
Your title field is both multiValued and tokenized, so each document
ends up being sorted on just one of its many terms, which is what
you're seeing.  I recommend you use copyField to copy title to
title_sort and configure title_sort as a "string" or a field type that
analyzes only to a single term (like simply keyword tokenizing ->
lower case filter).
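
Roughly like this, as a sketch. The type name lowercase_sort is just a
placeholder, and the copyField line assumes each document has at most one
title.

```xml
<!-- Analyzes each value to exactly one lower-cased term -->
<fieldType name="lowercase_sort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Single-valued field used only for sorting -->
<field name="title_sort" type="lowercase_sort" indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>
```

The existing title field stays as it is for searching, and only the sort
parameter changes to point at title_sort.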

	Erik