Posted to solr-user@lucene.apache.org by Paul Rosen <pa...@performantsoftware.com> on 2009/08/19 20:45:45 UTC
strange sorting results: each word in field is sorted
I'm trying to sort, but I am not always getting the correct results and
I'm not sure where to start tracking down the problem.
You can see the problem here (at least until it's fixed!):
http://nines.performantsoftware.com/search/saved?user=paul&name=poem
If you sort by Title/Ascending, you get partially sorted results, but it
seems to be using a random word to sort on instead of sorting on the
entire title.
Page one starts off fine:
(blank)
Adieu
Advertisement
Afterwards
etc....
but by page 6 it starts to break down:
Elizabeth Barrett Browning
Albert and Elweena
Emerson and Bacon
etc...
Errata
Anne Bannerman: Biographical Essay
Aboringines (Estonia)
etc...
I notice in the above list that each title does contain SOME word that is
in sorted order, just not the first one. (In fact, it seems to be whichever
word in the title comes last alphabetically.)
Then at the end, for instance page 336, it sorts some titles with
diacritical marks:
Roman à Clef
The Forgotten Reaping-Hook: Sex in My Ántonia
Social (Re)Visioning in the Fields of My Ántonia
etc...
I'm not sure what info would be useful to help debug. In my schema.xml
file, I've clipped what seems to be the relevant part:
<fieldtype name="text_lu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
<field name="title" type="text_lu" indexed="true" stored="true"
       multiValued="true"/>
Thanks,
Paul
Re: strange sorting results: each word in field is sorted
Posted by Erik Hatcher <er...@gmail.com>.
On Aug 19, 2009, at 3:50 PM, Paul Rosen wrote:
>> I'm surprised you're not seeing an exception when trying to sort on
>> title given this configuration. Sorting must be done on single
>> valued indexed fields, that have at most a single term indexed per
>> document. I recommend you use copyField to copy title to
>> title_sort and configure a title_sort field as a "string" or a
>> field type that analyzes only to a single term (like simply keyword
>> tokenizing -> lower case filter).
>> Erik
>
> I want to double check this (since you probably remember how long it
> takes to recreate the indexes). I think you're saying to add these
> two lines, then re-index:
>
> <field name="title_sort" type="string" indexed="true" stored="true"/>
> <copyField source="title" dest="title_sort"/>
For the simplest case, yes. You do have to be careful the sort field
is not multiValued - and I believe the NINES model allowed for
multiple titles. So it might be necessary for your indexing client to
specify the single sort field value instead of leveraging copyField.
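If the client does supply the sort value itself, the update message might look roughly like the sketch below. The "id" field and all of the values are placeholders for illustration, not taken from the NINES index; the point is only that title may repeat while title_sort appears exactly once, and that no copyField directive is involved in this arrangement.

```xml
<add>
  <doc>
    <!-- "id" and every value below are hypothetical placeholders -->
    <field name="id">example-1</field>
    <!-- title stays multiValued for searching and display -->
    <field name="title">The Forgotten Reaping-Hook: Sex in My Ántonia</field>
    <field name="title">An Alternate Title</field>
    <!-- exactly one value, chosen by the indexing client, drives sorting -->
    <field name="title_sort">The Forgotten Reaping-Hook: Sex in My Ántonia</field>
  </doc>
</add>
```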
> Now, this is case-sensitive, right? So would this make it case-
> insensitive?
Yes, the above would be case sensitive.
> <fieldtype name="sort_string" class="solr.StrField"
> sortMissingLast="true">
> <analyzer>
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldtype>
> <field name="title_sort" type="sort_string" indexed="true"
> stored="true"/>
> <copyField source="title" dest="title_sort"/>
That <analyzer> definition isn't quite right - you must have at least
a tokenizer. The KeywordTokenizer "tokenizes" the entire string into
a single token, though. In Solr's example schema there is a field
type like this:
<fieldType name="alphaOnlySort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         useful when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- The PatternReplaceFilter gives you the flexibility to use
         Java regular expressions to replace any sequence of characters
         matching a pattern with an arbitrary replacement string,
         which may include back references to portions of the original
         string matched by the pattern.
         See the Java Regular Expression documentation for more
         information on pattern and replacement string syntax.
         http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html
      -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>
> Also, I'm guessing from seeing the current results that this
> wouldn't collate the characters with diacritical marks correctly. Is
> there a way to indicate that, for instance, A-grave would sort next
> to A?
Yes, you can incorporate a diacritic-normalizing filter into the
analyzer definition above: either ASCIIFoldingFilter or ISOLatin1AccentFilter.
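As a rough sketch of what that could look like: the type name "foldedSort" below is made up for illustration, and whether solr.ASCIIFoldingFilterFactory is available depends on the Solr version in use (solr.ISOLatin1AccentFilterFactory is the older alternative that could be swapped in on the same line).

```xml
<!-- "foldedSort" is a hypothetical name, not part of the example schema -->
<fieldType name="foldedSort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- keep the whole value as one token so there is a single sort term -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- fold accented characters (e.g. "à", "Á") to their ASCII equivalents -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

If this is combined with a PatternReplaceFilter that strips everything outside a-z, the folding filter would need to come before it; otherwise the accented characters are simply deleted rather than folded.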
> And, while I'm on the subject, I have to do the same thing with the
> Author field, but unfortunately, that is sometimes "First Last" and
> sometimes "Last, First". Is there any way to sort those by last
> name, or do I just have to encourage the index people to be more
> consistent?
Good luck with getting consistency in your domain! :)
But it certainly makes sense to request that from the data providers,
in at least some form that can be turned into the sortable value.
> I can think of a fairly simple algorithm, but am not sure where to
> implement it:
>
> - if the word "and" or "&" appears, just look at the left side of
> the field (in other words, sort by the first name that appears.)
> - if there is a comma, but it is part of ", jr." or some other
> common suffixes like that, ignore it.
> - otherwise, if there is no comma, sort by the last word, unless it
> is "jr", "sr", "III", etc., then sort by the word before that.
> - otherwise, sort by the first word.
Probably best to implement that in the indexing client code, but
simple transformations could be implemented using the
PatternReplaceFilter like above.
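For instance, the simplest part of that algorithm (no comma, so sort by the last word) might be approximated as below. This is an untested sketch: the type name "authorSort" is hypothetical, and the pattern deliberately ignores the "and"/"&" and "jr"/"sr"/"III" rules, which are probably easier to handle in the indexing client.

```xml
<!-- "authorSort" is a hypothetical name used only for this sketch -->
<fieldType name="authorSort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Matches only values that contain no comma and at least two words:
         "first last" becomes "last, first". Values already in
         "last, first" form, and single-word values, pass through unchanged. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^([^,]*)\s+([^,\s]+)$" replacement="$2, $1" replace="first"/>
  </analyzer>
</fieldType>
```

With this in place, "Elizabeth Barrett Browning" would be indexed for sorting as "browning, elizabeth barrett", while "Browning, Elizabeth Barrett" would only be lowercased.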
Erik
Re: strange sorting results: each word in field is sorted
Posted by Paul Rosen <pa...@performantsoftware.com>.
Erik Hatcher wrote:
>
> On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:
>> You can see the problem here (at least until it's fixed!):
>> http://nines.performantsoftware.com/search/saved?user=paul&name=poem
>
> Hi Paul - that project looks familiar! :)
Hi Erik! I should hope so! And I've gone a year without having to delve
into Solr much, since it has just plain worked.
Thanks for the speedy reply.
> I'm surprised you're not seeing an exception when trying to sort on
> title given this configuration. Sorting must be done on single valued
> indexed fields, that have at most a single term indexed per document. I
> recommend you use copyField to copy title to title_sort and configure a
> title_sort field as a "string" or a field type that analyzes only to a
> single term (like simply keyword tokenizing -> lower case filter).
>
> Erik
I want to double check this (since you probably remember how long it
takes to recreate the indexes). I think you're saying to add these two
lines, then re-index:
<field name="title_sort" type="string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>
Now, this is case-sensitive, right? So would this make it case-insensitive?
<fieldtype name="sort_string" class="solr.StrField" sortMissingLast="true">
<analyzer>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<field name="title_sort" type="sort_string" indexed="true" stored="true"/>
<copyField source="title" dest="title_sort"/>
Also, I'm guessing from seeing the current results that this wouldn't
collate the characters with diacritical marks correctly. Is there a way
to indicate that, for instance, A-grave would sort next to A?
And, while I'm on the subject, I have to do the same thing with the
Author field, but unfortunately, that is sometimes "First Last" and
sometimes "Last, First". Is there any way to sort those by last name, or
do I just have to encourage the index people to be more consistent?
I can think of a fairly simple algorithm, but am not sure where to
implement it:
- if the word "and" or "&" appears, just look at the left side of the
field (in other words, sort by the first name that appears.)
- if there is a comma, but it is part of ", jr." or some other common
suffixes like that, ignore it.
- otherwise, if there is no comma, sort by the last word, unless it is
"jr", "sr", "III", etc., then sort by the word before that.
- otherwise, sort by the first word.
That would get most of the cases.
Thanks,
Paul
Re: strange sorting results: each word in field is sorted
Posted by Erik Hatcher <er...@gmail.com>.
On Aug 19, 2009, at 2:45 PM, Paul Rosen wrote:
> You can see the problem here (at least until it's fixed!): http://nines.performantsoftware.com/search/saved?user=paul&name=poem
Hi Paul - that project looks familiar! :)
> If you sort by Title/Ascending, you get partially sorted results,
> but it seems to be using a random word to sort on instead of sorting
> on the entire title.
>
> I'm not sure what info would be useful to help debug. In my
> schema.xml file, I've clipped what seems to be the relevant part:
>
> <fieldtype name="text_lu" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StandardFilterFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> </analyzer>
> </fieldtype>
>
> <field name="title" type="text_lu" indexed="true" stored="true"
> multiValued="true"/>
I'm surprised you're not seeing an exception when trying to sort on
title given this configuration. Sorting must be done on single valued
indexed fields, that have at most a single term indexed per document.
I recommend you use copyField to copy title to title_sort and
configure a title_sort field as a "string" or a field type that
analyzes only to a single term (like simply keyword tokenizing ->
lower case filter).
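A minimal sketch of that arrangement follows. The type name "lowercase_sort" is made up for illustration, and the copyField line assumes each document carries only one title; with several titles per document, a single-valued destination would receive multiple values.

```xml
<!-- "lowercase_sort" is a hypothetical name; the analyzer reduces the
     whole field value to one lowercased term -->
<fieldType name="lowercase_sort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- single-valued field used only for ordering results -->
<field name="title_sort" type="lowercase_sort" indexed="true"
       stored="false" multiValued="false"/>
<copyField source="title" dest="title_sort"/>
```

After re-indexing, the sort would be requested on title_sort rather than title, for example with sort=title_sort asc.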
Erik