Posted to solr-user@lucene.apache.org by David Hastings <ha...@gmail.com> on 2020/07/15 19:06:50 UTC

sorting help

howdy,
I have a field that sorts fine for all other content, and I can't seem to debug
why it won't sort for me on this one chunk of it.
"sort":"alphatitle asc", "debugQuery":"on", "_":"1594733127740"}},
"response":{"numFound":3,"start":0,"docs":[
{ "title":"Money orders",
{ "title":"Finance, consolidation and rescheduling of debts",
{ "title":"Rights in former German Islands in Pacific", },

it's using a copyField from "title" to "alphatitle" that replaces all
punctuation:
pattern: ([^a-z])
replace: all
class: solr.PatternReplaceFilterFactory

and if I use just title it flips:

"title":"Finance, consolidation and rescheduling of debts"},
{ "title":"Rights in former German Islands in Pacific"},
{ "title":"Money orders"}]

and I'm banging my head trying to figure out what it is about this
content in particular that is not sorting the way I would expect.
don't suppose someone would be able to lead me to a good place to look?

Re: sorting help

Posted by Dave <ha...@gmail.com>.

That’s a good place to start. The idea was to make sure titles that start with a date don’t always end up at the front, and that each doc sorts on its actual title instead.
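
A minimal sketch of that idea in Python, assuming the alphatitle chain boils down to lowercase, trim, and dropping everything outside a-z. The titles and the `sort_key` helper are made up for illustration and are not from the actual index.

```python
import re


def sort_key(title: str) -> str:
    # Lowercase, trim, then drop everything that is not a-z
    # (digits, spaces and punctuation all go away).
    return re.sub(r"[^a-z]", "", title.lower().strip())


# Hypothetical titles, not taken from the index discussed in this thread.
titles = [
    "1987 Treaty of commerce",
    "Banking regulations",
    "1920-1925 Agricultural credits",
]

plain_order = sorted(titles)               # date-led titles come first
key_order = sorted(titles, key=sort_key)   # ordered by the letters only

print(plain_order)
print(key_order)
```

With the plain string sort both date-led titles land ahead of "Banking regulations"; with the key they are placed by "agricultural..." and "treaty..." instead.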

> On Jul 15, 2020, at 4:58 PM, Erick Erickson <er...@gmail.com> wrote:
> 
> Yeah, it’s always a question “how much is enough/too much”.
> 
> That looks reasonable for alphatitle, but what about title? Your original
> question was that the sorting changes depending on which field you 
> sort on. If your title field uses something that tokenizes or doesn’t
> include the same analysis chain (particularly the lowercasing
> and patternreplace) then I’d expect the order to change.
> 
> Best,
> Erick

Re: sorting help

Posted by Erick Erickson <er...@gmail.com>.

Yeah, it’s always a question “how much is enough/too much”.

That looks reasonable for alphatitle, but what about title? Your original
question was that the sorting changes depending on which field you 
sort on. If your title field uses something that tokenizes or doesn’t
include the same analysis chain (particularly the lowercasing
and patternreplace) then I’d expect the order to change.
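
To illustrate why two fields can disagree, here is a rough Python sketch. It assumes the alphatitle chain reduces to lowercase, trim, and keeping only a-z, and it uses a raw code-point comparison as a stand-in for a field without that analysis. It does not model how Solr sorts a tokenized field, and the titles are hypothetical.

```python
import re


def normalized_key(value: str) -> str:
    # Stand-in for the alphatitle chain: lowercase, trim, keep only a-z.
    return re.sub(r"[^a-z]", "", value.lower().strip())


# Hypothetical titles chosen so the difference is visible.
titles = ["apple tariffs", "Banking act", '"Zoning" laws']

raw_order = sorted(titles)                       # raw code-point order
norm_order = sorted(titles, key=normalized_key)  # order after normalization

print(raw_order)
print(norm_order)
```

In the raw order the quote character and the uppercase "B" both sort ahead of lowercase "a", so the same three values come out in a different sequence than they do after normalization.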

Best,
Erick

> On Jul 15, 2020, at 4:49 PM, David Hastings <ha...@gmail.com> wrote:
> 
> thanks, I'll check the admin. didn't want to send a big block of text, but:
> 
>   -
>      - Tokenizer:
>        org.apache.lucene.analysis.core.KeywordTokenizerFactory
>          class: solr.KeywordTokenizerFactory
>          luceneMatchVersion: 7.1.0
>      - Token Filters:
>        org.apache.lucene.analysis.core.LowerCaseFilterFactory
>          class: solr.LowerCaseFilterFactory
>          luceneMatchVersion: 7.1.0
>        org.apache.lucene.analysis.miscellaneous.TrimFilterFactory
>          class: solr.TrimFilterFactory
>          luceneMatchVersion: 7.1.0
>        org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory
>          pattern: ([^a-z])
>          replace: all
>          class: solr.PatternReplaceFilterFactory
>          replacement:
>          luceneMatchVersion: 7.1.0
>   - Query Analyzer:
>     <http://192.168.1.33:7300/solr/#/mega/analysis?analysis.fieldname=alphatitle>
>     org.apache.solr.analysis.TokenizerChain
>      - Tokenizer:
>        org.apache.lucene.analysis.core.KeywordTokenizerFactory
>          class: solr.KeywordTokenizerFactory
>          luceneMatchVersion: 7.1.0
>      - Token Filters:
>        org.apache.lucene.analysis.core.LowerCaseFilterFactory
>          class: solr.LowerCaseFilterFactory
>          luceneMatchVersion: 7.1.0
>        org.apache.lucene.analysis.miscellaneous.TrimFilterFactory
>          class: solr.TrimFilterFactory
>          luceneMatchVersion: 7.1.0
>        org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory
>          pattern: ([^a-z])
>          replace: all
>          class: solr.PatternReplaceFilterFactory
>          replacement:
>          luceneMatchVersion: 7.1.0

Re: sorting help

Posted by David Hastings <ha...@gmail.com>.

thanks, I'll check the admin. didn't want to send a big block of text, but:

   -
      - Tokenizer:
        org.apache.lucene.analysis.core.KeywordTokenizerFactory
          class: solr.KeywordTokenizerFactory
          luceneMatchVersion: 7.1.0
      - Token Filters:
        org.apache.lucene.analysis.core.LowerCaseFilterFactory
          class: solr.LowerCaseFilterFactory
          luceneMatchVersion: 7.1.0
        org.apache.lucene.analysis.miscellaneous.TrimFilterFactory
          class: solr.TrimFilterFactory
          luceneMatchVersion: 7.1.0
        org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory
          pattern: ([^a-z])
          replace: all
          class: solr.PatternReplaceFilterFactory
          replacement:
          luceneMatchVersion: 7.1.0
   - Query Analyzer:
     <http://192.168.1.33:7300/solr/#/mega/analysis?analysis.fieldname=alphatitle>
     org.apache.solr.analysis.TokenizerChain
      - Tokenizer:
        org.apache.lucene.analysis.core.KeywordTokenizerFactory
          class: solr.KeywordTokenizerFactory
          luceneMatchVersion: 7.1.0
      - Token Filters:
        org.apache.lucene.analysis.core.LowerCaseFilterFactory
          class: solr.LowerCaseFilterFactory
          luceneMatchVersion: 7.1.0
        org.apache.lucene.analysis.miscellaneous.TrimFilterFactory
          class: solr.TrimFilterFactory
          luceneMatchVersion: 7.1.0
        org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory
          pattern: ([^a-z])
          replace: all
          class: solr.PatternReplaceFilterFactory
          replacement:
          luceneMatchVersion: 7.1.0
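
For reference, a rough Python approximation of what that chain would do to the three titles from the original question. This is only a sketch: it assumes the empty `replacement` shown above means matches are deleted, it treats Python's `lower()` and `strip()` as equivalent to the Lucene filters for these ASCII titles, and the `alphatitle_key` helper is a made-up name.

```python
import re


def alphatitle_key(title: str) -> str:
    """Approximate the posted chain for a single field value."""
    token = title            # KeywordTokenizer: the whole value stays one token
    token = token.lower()    # LowerCaseFilter
    token = token.strip()    # TrimFilter
    # PatternReplaceFilter with pattern ([^a-z]), replace=all,
    # and an empty replacement (assumed from the dump above).
    return re.sub(r"[^a-z]", "", token)


titles = [
    "Money orders",
    "Finance, consolidation and rescheduling of debts",
    "Rights in former German Islands in Pacific",
]

expected_order = sorted(titles, key=alphatitle_key)

for t in expected_order:
    print(alphatitle_key(t), "<-", t)
```

If the chain behaves the way this sketch assumes, the keys would sort as Finance, Money, Rights, which is not the order in the response, so it may be worth comparing these keys with what the analysis page actually shows for the same three values.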


On Wed, Jul 15, 2020 at 4:47 PM Erick Erickson <er...@gmail.com>
wrote:

> I’d look two places:
>
> 1> try the admin/analysis page from the admin UI. In particular, look at
> what tokens actually get in the index.
>
> 2> again, the admin UI will let you choose the field (alphatitle and
> title) and see what the actual indexed tokens are.
>
> Both have the issue that I don’t know what tokenizer you are using. For
> sorting it had better be something like KeywordTokenizer. Anything that
> breaks up the input into separate tokens will produce surprises.
>
> And unless you have lowercaseFilter in front of your patternreplace,
> you’re removing uppercase characters.
>
> Best,
> Erick

Re: sorting help

Posted by Erick Erickson <er...@gmail.com>.

I’d look two places:

1> try the admin/analysis page from the admin UI. In particular, look at what tokens actually get in the index.

2> again, the admin UI will let you choose the field (alphatitle and title) and see what the actual indexed tokens are.

Both have the issue that I don’t know what tokenizer you are using. For sorting it had better be something like KeywordTokenizer. Anything that breaks up the input into separate tokens will produce surprises.

And unless you have lowercaseFilter in front of your patternreplace, you’re removing uppercase characters.
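
A small Python sketch of that ordering point, using the character class from the posted pattern. The two helper names are made up, and Python's `re` and `str.lower()` are only stand-ins for the Solr filters.

```python
import re

NON_LETTERS = re.compile(r"[^a-z]")  # same character class as the posted pattern


def strip_then_lower(title: str) -> str:
    # Pattern replace first: uppercase letters are outside a-z,
    # so they are removed along with spaces and punctuation.
    return NON_LETTERS.sub("", title).lower()


def lower_then_strip(title: str) -> str:
    # Lowercase first, then strip: every ASCII letter survives.
    return NON_LETTERS.sub("", title.lower())


title = "Money orders"
print(strip_then_lower(title))  # oneyorders
print(lower_then_strip(title))  # moneyorders
```

Losing the leading "M" moves the title from the m's to the o's, which is the kind of surprise the filter order can cause.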

Best,
Erick

> On Jul 15, 2020, at 3:06 PM, David Hastings <ha...@gmail.com> wrote:
> 
> howdy,
> I have a field that sorts fine for all other content, and I can't seem to debug
> why it won't sort for me on this one chunk of it.
> "sort":"alphatitle asc", "debugQuery":"on", "_":"1594733127740"}},
> "response":{"numFound":3,"start":0,"docs":[
> { "title":"Money orders",
> { "title":"Finance, consolidation and rescheduling of debts",
> { "title":"Rights in former German Islands in Pacific", },
> 
> it's using a copyField from "title" to "alphatitle" that replaces all
> punctuation:
> pattern: ([^a-z])
> replace: all
> class: solr.PatternReplaceFilterFactory
> 
> and if I use just title it flips:
> 
> "title":"Finance, consolidation and rescheduling of debts"},
> { "title":"Rights in former German Islands in Pacific"},
> { "title":"Money orders"}]
> 
> and im banging my head trying to figure out what it is about this
> content in particular that is not sorting the way I would expect.
> don't suppose someone would be able to lead me to a good place to look?