Posted to java-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2012/11/12 14:19:17 UTC

content disappears in the index

Hi list,
a user reported wrong sorting in our search service running on Solr.
While chasing this issue I traced it back through Lucene into the index.
I have a text field for sorting (stored,indexed,tokenized,omitNorms,sortMissingLast)
and three docs with author names.

If I trace at org.apache.lucene.document.Document.add(IndexableField) while
indexing I can see all three author names added as a field to each document.

After searching with *:* for the three docs and doing a sort the sorting is wrong
because one of the author names is reduced to the first char, all other chars are lost.

So having the author names (Alexander, Arslanagic, Brennmoen) indexed, the result
of sorting ascending is (Arslanagic, Alexander, Brennmoen) which is wrong.
But this happens because the author "Arslanagic" is reduced to "a" during indexing (???)
and if sorted "a" is before "alexander".

Currently I use 4.0 but have the same issue with 3.6.1.

Without tracing through tons of code:
- Which is the last breakpoint for debugging to see the docs right before they go into the index?
- Which is the first breakpoint for debugging to see the docs coming right out of the index?

Regards
Bernd

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: content disappears in the index

Posted by Erick Erickson <er...@gmail.com>.

Oddly I had the exact same thought. Although it's not obvious from the name
(and common usage) of trim-like functions that you'd also have a way to
specify maximum length (after trimming I'd assume).

And the other thought I had was that TrimFilter should optionally take a
list of characters to trim. Then I thought of regex, especially to specify
character classes like \w..... naaahhhhhh, we just went there......

But I think I'd prefer a separate filter, if for no other reason than that
by including a length in the trim filter, you implicitly disallow having
spaces at the beginning or end of your tokens. I don't have a use case for
wanting that, but there's no good reason I can think of to couple
these two different functions....

FWIW,
Erick


On Wed, Nov 14, 2012 at 2:05 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Hi Geoff,
> cool, that will eliminate possible regex pitfalls in schema.xml
>
> I was thinking about enhancing an existing filter as multi-purpose filter.
> E.g. TrimFilter, if maxLength is set then also limit the termAtt to
> maxLength.
> This will keep the number of available filters small, especially for
> simple tasks.
> Any thoughts from the core developers about this idea?
>
> Regards
> Bernd

Re: content disappears in the index

Posted by Bernd Fehling <be...@uni-bielefeld.de>.

Hi Geoff,
cool, that will eliminate possible regex pitfalls in schema.xml

I was thinking about enhancing an existing filter as multi-purpose filter.
E.g. TrimFilter, if maxLength is set then also limit the termAtt to maxLength.
This will keep the number of available filters small, especially for simple tasks.
Any thoughts from the core developers about this idea?

Regards
Bernd


Am 13.11.2012 17:56, schrieb Geoff Cooney:
> Hi,
> 
> I've been following this thread and happen to have a simple
> TruncatingFilter class I wrote for the same purpose.  I think this should
> do what you want:
> 
> 
> 
> import java.io.IOException;
> 
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> 
> public class TruncatingFilter extends TokenFilter {
>     private final CharTermAttribute termAtt =
> addAttribute(CharTermAttribute.class);
>     private final int maxLength;
> 
>     protected TruncatingFilter(TokenStream input, int maxLength) {
>         super(input);
>         this.maxLength = maxLength;
>     }
> 
>     @Override
>     public boolean incrementToken() throws IOException {
>         if (input.incrementToken()) {
>             if (termAtt.length() > maxLength) {
>                 termAtt.setLength(maxLength);
>             }
> 
>             return true;
>         } else {
>             return false;
>         }
>     }
> 
> }
> 
> Cheers,
> Geoff


Re: content disappears in the index

Posted by Geoff Cooney <co...@gmail.com>.

Hi,

I've been following this thread and happen to have a simple
TruncatingFilter class I wrote for the same purpose.  I think this should
do what you want:



import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TruncatingFilter extends TokenFilter {
    // Term text of the current token, shared with the rest of the analysis chain.
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    // Maximum number of chars to keep per token.
    private final int maxLength;

    protected TruncatingFilter(TokenStream input, int maxLength) {
        super(input);
        this.maxLength = maxLength;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Cut the term down in place; shorter terms pass through unchanged.
            if (termAtt.length() > maxLength) {
                termAtt.setLength(maxLength);
            }

            return true;
        } else {
            return false;
        }
    }

}
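
For anyone who wants to see the effect without Lucene on the classpath, the same "cap the length" step can be sketched in plain Java. This is only an illustration, not part of the filter: TruncateSketch and truncate are made-up names, and a StringBuilder stands in for CharTermAttribute here.

```java
class TruncateSketch {
    // Cap a term at maxLength chars; shorter terms are returned unchanged.
    // StringBuilder.setLength plays the role of termAtt.setLength in the filter.
    static String truncate(String term, int maxLength) {
        StringBuilder buf = new StringBuilder(term);
        if (buf.length() > maxLength) {
            buf.setLength(maxLength);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // A 32-char sort key from this thread, capped at 30 chars.
        System.out.println(truncate("arslanagicaidasiqvelandelisabeth", 30));
        // A short key passes through as-is.
        System.out.println(truncate("short", 30));
    }
}
```

Unlike a regex, there is no backtracking involved, so the result for a long term should always be exactly the first maxLength chars.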

Cheers,
Geoff


On Tue, Nov 13, 2012 at 7:54 AM, Erick Erickson <er...@gmail.com> wrote:

> There's nothing in Solr that I know of that does this. It would be a pretty
> easy custom filter to create though....
>
> FWIW,
> Erick

Re: content disappears in the index

Posted by Erick Erickson <er...@gmail.com>.

There's nothing in Solr that I know of that does this. It would be a pretty
easy custom filter to create though....

FWIW,
Erick


On Tue, Nov 13, 2012 at 7:02 AM, Robert Muir <rc...@gmail.com> wrote:

> On Mon, Nov 12, 2012 at 10:47 PM, Bernd Fehling
> <be...@uni-bielefeld.de> wrote:
> > By the way, why does TrimFilter option updateOffset defaults to false,
> > just keep it backwards compatible?
> >
>
> In my opinion this option should be removed.
>
> TokenFilters shouldn't muck with offsets, for a lot of reasons, but
> especially because its too late to interact with any charfilter.
>
> This is the tokenizer's job.

Re: content disappears in the index

Posted by Robert Muir <rc...@gmail.com>.

On Mon, Nov 12, 2012 at 10:47 PM, Bernd Fehling
<be...@uni-bielefeld.de> wrote:
> By the way, why does TrimFilter option updateOffset defaults to false,
> just keep it backwards compatible?
>

In my opinion this option should be removed.

TokenFilters shouldn't muck with offsets, for a lot of reasons, but
especially because it's too late to interact with any charfilter.

This is the tokenizer's job.


Re: content disappears in the index

Posted by Bernd Fehling <be...@uni-bielefeld.de>.

Hi Erick,

I like the fortune cookie :-)

I arrived at the same solution as you did, but with a short Java program
that tried different patterns, so trial and error ;-)

This brings me to the question: is there now (with 4.0) any filter that does
the job for me? I took a look at LengthFilter, but it has a different purpose,
and TrimFilter is also meant for something else.
By the way, why does the TrimFilter option updateOffset default to false?
Just to keep it backwards compatible?

Thanks for your help,
Bernd


Am 13.11.2012 02:16, schrieb Erick Erickson:
> Because your regex is wrong? (sorry, couldn't resist).
> 
> Regexes always give me indigestion. But if you look at your results, your
> regex isn't working in any case at all. The second group is being removed
> from the end of the string. I _think_ what's happening is that the longest
> possible string is being matched (which will usually be your second group).
> Then from what's left, your first group is being captured. If you look at
> what you have above, none of the matches is 31 characters long. I don't
> think you need the second group at all.
> 
> This works for me:
> <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30}).*"
>                                                      replacement="$1"
> replace="all"/>
> 
> This pattern works too: pattern="^(.{1,30}).*"
> 
> But like I said, I'm no expert with regex'es, I usually have to fumble
> around quite a bit to get what I want.
> 
> Found in a fortune cookie according to legend:
> "A programmer had a problem. He solved it with regular expressions. Now he
> has two problems".


Re: content disappears in the index

Posted by Erick Erickson <er...@gmail.com>.

Because your regex is wrong? (sorry, couldn't resist).

Regexes always give me indigestion. But if you look at your results, your
regex isn't working in any case at all. The second group is being removed
from the end of the string. I _think_ what's happening is that the longest
possible string is being matched (which will usually be your second group).
Then from what's left, your first group is being captured. If you look at
what you have above, none of the matches is 31 characters long. I don't
think you need the second group at all.

This works for me:
<filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30}).*"
        replacement="$1" replace="all"/>

This pattern works too: pattern="^(.{1,30}).*"

But like I said, I'm no expert with regexes; I usually have to fumble
around quite a bit to get what I want.

Found in a fortune cookie according to legend:
"A programmer had a problem. He solved it with regular expressions. Now he
has two problems".
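
For what it's worth, the difference between the two patterns can be reproduced with plain java.util.regex. This is a rough sketch, assuming the filter applies the pattern the way Matcher.replaceAll does; the class and helper names are made up for the demo. Because the second group in the original pattern needs at least 31 chars, the first group has to backtrack and give chars back, so for an input of length N between 32 and 60 only N - 31 chars survive.

```java
import java.util.regex.Pattern;

class TruncationRegexDemo {
    // Pattern from the original field type: both groups are mandatory.
    static final Pattern ORIGINAL = Pattern.compile("(.{1,30})(.{31,})");
    // Pattern suggested in this reply: the tail after group 1 may be empty.
    static final Pattern FIXED = Pattern.compile("(.{1,30}).*");

    // Replace every match with the contents of group 1.
    static String apply(Pattern p, String input) {
        return p.matcher(input).replaceAll("$1");
    }

    // Builds a string of n letters so input lengths are easy to control.
    static String letters(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append((char) ('a' + i % 26));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print the resulting length for a range of input lengths.
        for (int n : new int[] {20, 31, 32, 46, 60, 61, 100}) {
            String s = letters(n);
            System.out.println(n + " -> original: " + apply(ORIGINAL, s).length()
                    + ", fixed: " + apply(FIXED, s).length());
        }
    }
}
```

With the original pattern, a 32-char input shrinks to 1 char and a 46-char input to 15, which matches the "a" and "alexanderkvambj" results reported below. Inputs shorter than 32 chars do not match it at all, so a 31-char input is left untrimmed.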




On Mon, Nov 12, 2012 at 9:04 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Yes, it is the second PatternReplaceFilterFactory.
>
> the String "Arslanagic, Aida ; Siqveland, Elisabeth" is reduced to "a",
> whereas the other strings are:
> "Alexander, Kvam ; Bjørn, Nyland ; Bjørn, Reiten ; Øystein, Huse" -->
> "alexanderkvambj"
> "Brennmoen, Ingar ; Hauklien, Øystein ; Hedalen, Trond ; Kvam, Erik" -->
> "brennmoeningarhauk"
>
> Now this explains the sorting (shit in --> shit out).
>
> But why is the first string reduced to "a", wrong regular expression?
>
> Bernd
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************

Re: content disappears in the index

Posted by Bernd Fehling <be...@uni-bielefeld.de>.

Yes, it is the second PatternReplaceFilterFactory.

The string "Arslanagic, Aida ; Siqveland, Elisabeth" is reduced to "a",
whereas the other strings become:
"Alexander, Kvam ; Bjørn, Nyland ; Bjørn, Reiten ; Øystein, Huse" --> "alexanderkvambj"
"Brennmoen, Ingar ; Hauklien, Øystein ; Hedalen, Trond ; Kvam, Erik" --> "brennmoeningarhauk"

Now this explains the sorting (shit in --> shit out).
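
A tiny sketch of why these keys give the order the user saw. It uses plain Java string ordering as a stand-in for the index sort, which I would expect to agree for lowercase ASCII keys like these; SortKeyDemo and sortedAuthors are made-up names.

```java
import java.util.Map;
import java.util.TreeMap;

class SortKeyDemo {
    // Returns the author labels ordered by their analyzed sort keys.
    static String sortedAuthors() {
        // TreeMap keeps its entries ordered by key (natural String order).
        Map<String, String> byKey = new TreeMap<>();
        byKey.put("alexanderkvambj", "Alexander");
        byKey.put("a", "Arslanagic");
        byKey.put("brennmoeningarhauk", "Brennmoen");
        return String.join(", ", byKey.values());
    }

    public static void main(String[] args) {
        System.out.println(sortedAuthors());
    }
}
```

The key "a" is a prefix of "alexanderkvambj", so it sorts first and Arslanagic ends up ahead of Alexander.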

But why is the first string reduced to "a"? Is the regular expression wrong?

Bernd



Am 12.11.2012 14:51, schrieb Bernd Fehling:
> The field type is derived from the distributed alphaOnlySort as follows:
> 
> <fieldType name="alphaOnlySortLim" class="solr.TextField" sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x9F\u2000-\u206F\uFEFF\uFFF9-\uFFFD])"
>                                                      replacement="" replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30})(.{31,})"
>                                                      replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
> 
> It reduces long lists of author names (100 and more authors) to the first 30 chars
> for sorting and removes some illegal chars to keep sorting with utf8 solid.
> 
> Don't see any problems there.
> 
> Will check with admin/analysis page.
> 
> Bernd

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************


Re: content disappears in the index

Posted by Bernd Fehling <be...@uni-bielefeld.de>.

The field type is derived from the distributed alphaOnlySort as follows:

<fieldType name="alphaOnlySortLim" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="([\x00-\x2F\x3A-\x40\x5B-\x60\x7B-\x9F\u2000-\u206F\uFEFF\uFFF9-\uFFFD])"
                                                     replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.{1,30})(.{31,})"
                                                     replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

It reduces long lists of author names (100 or more authors) to the first 30 characters
for sorting and removes some illegal characters to keep UTF-8 sorting reliable.
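
What this chain does to a single value can be approximated outside Solr with plain Java. This is only a sketch under assumptions: String.toLowerCase and String.trim stand in for the lowercase and trim filters, Matcher.replaceAll stands in for PatternReplaceFilterFactory with replace="all", and the class and method names are made up.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class AlphaOnlySortLimSketch {
    // First pattern filter: strips punctuation, whitespace and control chars.
    static final Pattern STRIP = Pattern.compile(
        "([\\x00-\\x2F\\x3A-\\x40\\x5B-\\x60\\x7B-\\x9F\\u2000-\\u206F\\uFEFF\\uFFF9-\\uFFFD])");
    // Second pattern filter: intended to keep only the first 30 chars.
    static final Pattern LIMIT = Pattern.compile("(.{1,30})(.{31,})");

    // Runs the whole field value through the steps in schema order.
    static String analyze(String value) {
        String term = value.toLowerCase(Locale.ROOT).trim();
        term = STRIP.matcher(term).replaceAll("");
        return LIMIT.matcher(term).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] values = {
            "Arslanagic, Aida ; Siqveland, Elisabeth",
            "Alexander, Kvam ; Bj\u00f8rn, Nyland ; Bj\u00f8rn, Reiten ; \u00d8ystein, Huse",
            "Brennmoen, Ingar ; Hauklien, \u00d8ystein ; Hedalen, Trond ; Kvam, Erik"
        };
        for (String v : values) {
            System.out.println(v + " --> " + analyze(v));
        }
    }
}
```

Running the three author strings through it is a quick way to compare against what the admin/analysis page reports for each step.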

Don't see any problems there.

Will check with admin/analysis page.

Bernd


Am 12.11.2012 14:28, schrieb Erick Erickson:
> First, sorting on tokenized fields is undefined/unsupported. You _might_
> get away with it if the author field always reduces to one token, i.e. if
> you're always indexing only the last name.
> 
> I should say unsupported/undefined when more than one token is the result
> of analysis. You can do things like use the KeywordTokenizer followed by
> transformations on the _entire_ input field (lowercasing is popular for
> instance).
> 
> So somehow the analysis chain you have defined for this field grabs
> "Arslanagic"
> and translates it into "a". Synonyms? Stemming? Some "interesting" sequence?
> 
> The fastest way to look at that would be in Solr's admin/analysis page.
> Just put Arslanagic into the index box and you should see which of the
> steps does the translation. Although changing it to "a" is really weird,
> it's almost certainly something you've defined in the indexing analysis
> chain.
> 
> FWIW,
> Erick

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************


Re: content disappears in the index

Posted by Jack Krupansky <ja...@basetechnology.com>.

Maybe... the author names have middle or first initials? Like, maybe the 
"Arslanagic" dude has an "A" initial in his name, like "A. Arslanagic" or 
"Arslanagic, A.".
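
To make the effect concrete, here is a minimal plain-Java sketch. It does not use Lucene and does not model how Lucene picks a sort term from a multi-token field; it only assumes that the key ending up in the index for "Arslanagic" is the single letter "a" (the class and method names are made up for illustration). Under that assumption the reported ordering falls out directly:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; it does not use or model Lucene's sort internals.
class InitialSortDemo {

    // Sorts the stored author names by whatever key the index holds for each of them.
    static List<String> sortByIndexedKey(Map<String, String> storedToKey) {
        List<String> names = new ArrayList<>(storedToKey.keySet());
        names.sort(Comparator.comparing((String name) -> storedToKey.get(name)));
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> index = new LinkedHashMap<>();
        index.put("Alexander", "alexander");
        index.put("Arslanagic", "a");          // assumption: only the initial survives as sort key
        index.put("Brennmoen", "brennmoen");
        // "a" is a prefix of "alexander", so it compares as smaller and sorts first.
        System.out.println(sortByIndexedKey(index));
    }
}
```
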

In any case, "string" is the proper type for a sorted field, although it
would be nice if Lucene/Solr were more developer-friendly when this "mistake"
is made.

The relevant doc is:

"Sorting can be done on the "score" of the document, or on any 
multiValued="false" indexed="true" field provided that field is either 
non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a 
single Term (ie: uses the KeywordTokenizer)"
...
"The common situation for sorting on a field that you do want to be 
tokenized for searching is to use a <copyField> to clone your field. Sort on 
one, search on the other."

See:
http://wiki.apache.org/solr/CommonQueryParameters

For example, have an "author" field that is "text" and an "author_s" (or 
"author_sorted" or "author_string") field that you copy the name to:

    <copyField source="author" dest="author_s" />

Query on "author", but sort on "author_s".
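
As a rough in-memory picture of that pattern (not Solr code; the Doc class, the search method and the author values with initials are hypothetical), each document keeps a tokenized copy for matching and an untouched copy for ordering:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// In-memory stand-in for "search on one field, sort on the other"; not Solr code.
class CopyFieldSketch {

    static final class Doc {
        final Set<String> author;   // tokenized, lowercased copy used for matching
        final String authorS;       // unmodified copy used for sorting (mirrors "author_s")

        Doc(String rawAuthor) {
            String lower = rawAuthor.toLowerCase(Locale.ROOT);
            this.author = new HashSet<>(Arrays.asList(lower.split("\\W+")));
            this.authorS = rawAuthor;
        }
    }

    // Match against the tokenized copy, then order the hits by the string copy.
    static List<String> search(List<Doc> docs, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (d.author.contains(needle)) {
                hits.add(d.authorS);
            }
        }
        hits.sort(Comparator.naturalOrder());
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("Brennmoen, A."),    // hypothetical values with initials
                new Doc("Arslanagic, A."),
                new Doc("Alexander, B."));
        System.out.println(search(docs, "a"));
    }
}
```

The token "a" still matches at search time, but it never takes part in the ordering, because the sort only ever sees the whole untokenized value. The comparison here is case-sensitive, as it would be for a raw string field.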

-- Jack Krupansky



Re: content disappears in the index

Posted by Erick Erickson <er...@gmail.com>.

First, sorting on tokenized fields is undefined/unsupported. You _might_
get away with it if the author field always reduces to one token, i.e. if
you're always indexing only the last name.

I should say unsupported/undefined when more than one token is the result
of analysis. You can do things like use the KeywordTokenizer followed by
transformations on the _entire_ input field (lowercasing is popular for
instance).
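
The difference between the two styles of analysis can be sketched in plain Java. These are stand-ins written for illustration, not Lucene classes, and the field value is hypothetical: a word-style analysis yields several tokens per value, while a keyword-style analysis keeps the whole value as one token and only normalizes it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Plain-Java stand-ins for two analysis styles; these are not Lucene classes.
class AnalysisSketch {

    // Word-style analysis: split on anything that is not a letter or digit, then lowercase.
    static List<String> wordTokens(String value) {
        return Arrays.stream(value.split("[^\\p{L}\\p{N}]+"))
                .filter(token -> !token.isEmpty())
                .map(token -> token.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Keyword-style analysis: the entire value stays one token, trimmed and lowercased.
    static List<String> keywordToken(String value) {
        return List.of(value.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String author = "Doe, Jane";   // hypothetical field value
        System.out.println(wordTokens(author).size() + " tokens: " + wordTokens(author));
        System.out.println(keywordToken(author).size() + " token: " + keywordToken(author));
    }
}
```

With more than one token per document there is no single obvious value to sort by, which is why only the one-token form gives a well-defined order.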

So somehow the analysis chain you have defined for this field grabs
"Arslanagic" and translates it into "a". Synonyms? Stemming? Some
"interesting" sequence?

The fastest way to look at that would be in Solr's admin/analysis page.
Just put Arslanagic into the index box and you should see which of the
steps does the translation. Although changing it to "a" is really weird,
it's almost certainly something you've defined in the indexing analysis
chain.

FWIW,
Erick


RE: content disappears in the index

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hi,

Could the issue be tokenization? You write that the field is tokenized, but fields used for sorting should not be tokenized; they should be indexed as-is (e.g. as a StringField in Lucene 4.0). If the field holds more than one token per document, the sort order is undefined.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



