Posted to solr-user@lucene.apache.org by hschillig <mo...@live.com> on 2014/10/31 16:49:52 UTC

Only copy string up to certain character symbol?

So I have a title field that commonly looks like this:

Personal legal forms simplified : the ultimate guide to personal legal forms
/ Daniel Sitarz.

I made a copyField that is of type "title_only". I want to ONLY copy the
text "Personal legal forms simplified : the ultimate guide to personal legal
forms".. so everything before the "/" symbol. I have it like this in my
schema.xml:

<fieldType name="title_only" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="4"
                maxGramSize="15" side="front"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(\/.+?$)" replacement=""/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(\/.+?$)" replacement=""/>
    </analyzer>
</fieldType>

My regex seems to be off, though, as the field still holds the entire value
when I reindex and restart Solr. Thanks for any help!



--
View this message in context: http://lucene.472066.n3.nabble.com/Only-copy-string-up-to-certain-character-symbol-tp4166857.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Only copy string up to certain character symbol?

Posted by Erick Erickson <er...@gmail.com>.

In addition to Alexandre's comment, your index chain looks suspect:

        <filter class="solr.EdgeNGramFilterFactory" minGramSize="4"
                maxGramSize="15" side="front"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(\/.+?$)" replacement=""/>

So the pattern replace stuff happens on the grams, not the full input. You might
be better off with a

solr.PatternReplaceCharFilterFactory

which works on the entire input string before even tokenization is done.

That said, Alexandre's comment is spot on. If your evidence that the regex
is not being respected is that the document returns the whole input, that's
because stored="true" stores the raw input, which has nothing to do with the
analysis chain. The split to store the input happens before any kind of
analysis processing.
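
For illustration, here is a sketch of the same field type with the charFilter
declared first, so the declaration order matches the order of processing:
charFilter elements operate on the raw character stream ahead of the
tokenizer. It reuses the pattern and attributes from the original post
unchanged; it is not a tested configuration, and it only changes the indexed
terms, never the stored value.

```xml
<fieldType name="title_only" class="solr.TextField">
    <analyzer type="index">
        <!-- Strip everything from the "/" onward before tokenizing -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(\/.+?$)" replacement=""/>
        <!-- Keep the remaining text as a single token -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Build prefix grams from the already-trimmed title -->
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="4"
                maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(\/.+?$)" replacement=""/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI shows the output of each stage for a
sample value, which is a more reliable check than looking at the stored field
returned with a document.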

On Fri, Oct 31, 2014 at 9:33 AM, Alexandre Rafalovitch
<ar...@gmail.com> wrote:
> copyField can copy only part of the string, but the part is defined by
> character count. If you want to use regular expressions, you may be
> better off doing the copy in the UpdateRequestProcessor chain instead:
> http://www.solr-start.com/info/update-request-processors/#RegexReplaceProcessorFactory
>
> What you are doing (RegEx in the chain) only affects the "indexed"
> representation of the text, not the stored content. I suspect that's
> not what you want.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

Re: Only copy string up to certain character symbol?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.

copyField can copy only part of the string, but the part is defined by
character count. If you want to use regular expressions, you may be
better off doing the copy in the UpdateRequestProcessor chain instead:
http://www.solr-start.com/info/update-request-processors/#RegexReplaceProcessorFactory

What you are doing (RegEx in the chain) only affects the "indexed"
representation of the text, not the stored content. I suspect that's
not what you want.
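
A rough sketch of what such a chain could look like in solrconfig.xml follows.
The chain name "title-only", the source field name "title", the destination
field name "title_only", and the pattern are all assumptions made for
illustration (the original post only gives the field type name), so adjust
them to the real schema. The clone step takes the place of the copyField
directive for this field, and because the processors run before the document
is stored, the stored value is trimmed as well as the indexed one.

```xml
<updateRequestProcessorChain name="title-only">
    <!-- Copy the full title into the destination field (hypothetical names) -->
    <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">title</str>
        <str name="dest">title_only</str>
    </processor>
    <!-- Drop the first "/" and everything after it, plus preceding whitespace -->
    <processor class="solr.RegexReplaceProcessorFactory">
        <str name="fieldName">title_only</str>
        <str name="pattern">\s*/.*$</str>
        <str name="replacement"></str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only runs if it is selected, for example by passing
update.chain=title-only on the update request or by marking the chain as the
default. For the character-count variant, copyField accepts a maxChars
attribute, e.g. <copyField source="title" dest="title_only" maxChars="50"/>,
but that cuts at a fixed length rather than at the "/" symbol.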

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
