Posted to solr-user@lucene.apache.org by "Dan ." <ro...@gmail.com> on 2018/02/23 11:08:27 UTC

StandardTokenizer and splitting on mixedcase strings

Hi,

The StandardTokenizerFactory splits strings like 'JavaScript' into 'Java'
and 'Script', but then searches with 'javascript' do not match the document.

Is there a Solr way to prevent StandardTokenizer from splitting mixed-case
strings?

Cheers,
Dan

Re: StandardTokenizer and splitting on mixedcase strings

Posted by Erick Erickson <er...@gmail.com>.

Dan:

The admin UI analysis page is invaluable for understanding exactly
which element of your analysis chain does what. So when you restructure
your analysis chain, you can use it to check whether the input is
transformed the way you want.
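
The same analysis output can also be requested over HTTP through Solr's
field analysis handler. The sketch below only builds the request URL and
sends nothing. The host, the core name "mycore" and the field type
"text_general" are placeholder assumptions, not values from this thread.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, core, field_type, index_text, query_text):
    """Build a URL for Solr's /analysis/field handler (no request is sent)."""
    params = {
        "analysis.fieldtype": field_type,    # field type whose chain to inspect
        "analysis.fieldvalue": index_text,   # text run through the index-time chain
        "analysis.query": query_text,        # text run through the query-time chain
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


# Hypothetical host, core and field type; substitute your own.
url = field_analysis_url(
    "http://localhost:8983/solr", "mycore", "text_general",
    "JavaScript", "javascript",
)
print(url)
```

Opening the resulting URL against a running Solr instance should return the
per-component token stream for both the indexed value and the query.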

Best,
Erick

On Mon, Feb 26, 2018 at 7:21 AM, Shawn Heisey <ap...@elyograg.org> wrote:
> On 2/23/2018 10:55 AM, Rick Leir wrote:
>> Lowercase filter before the tokenizer?
>
> Unless somebody invents a lowercasing CharFilter, which I don't think
> exists currently, that's not possible.
>
> Groups of Solr analysis components always run in the following order:
>
> First CharFilter entries are run.
> Then the Tokenizer is run.
> Then Filter entries are run.
>
> Within each group, individual components run in the order they are
> configured, but the filters will always run after charfilters and the
> tokenizer.
>
> Thanks,
> Shawn
>

Re: StandardTokenizer and splitting on mixedcase strings

Posted by Shawn Heisey <ap...@elyograg.org>.

On 2/23/2018 10:55 AM, Rick Leir wrote:
> Lowercase filter before the tokenizer?

Unless somebody invents a lowercasing CharFilter, which I don't think
exists currently, that's not possible.

Groups of Solr analysis components always run in the following order:

First CharFilter entries are run.
Then the Tokenizer is run.
Then Filter entries are run.

Within each group, individual components run in the order they are
configured, but the filters will always run after charfilters and the
tokenizer.
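
The ordering described above can be modelled with a toy pipeline. This is
only an illustration in Python, not Solr or Lucene code: the function names
are made up, and simple_tokenizer is a crude stand-in that does not
reproduce StandardTokenizer's real segmentation rules.

```python
import re


def analyze(text, char_filters, tokenizer, token_filters):
    """Toy model of an analysis chain: char filters, then tokenizer, then filters."""
    for char_filter in char_filters:      # 1. operate on the raw string
        text = char_filter(text)
    tokens = tokenizer(text)              # 2. turn the string into tokens
    for token_filter in token_filters:    # 3. operate on the token list
        tokens = token_filter(tokens)
    return tokens


def simple_tokenizer(text):
    # Stand-in tokenizer (assumption): split on runs of word characters.
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    # Token filter: can only see tokens the tokenizer has already produced.
    return [token.lower() for token in tokens]


tokens = analyze("Learn JavaScript", [], simple_tokenizer, [lowercase_filter])
print(tokens)  # ['learn', 'javascript']
```

Because lowercase_filter takes a list of tokens rather than a string, it
cannot be placed ahead of the tokenizer in this model, which mirrors why a
lowercase filter cannot run before the tokenizer.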

Thanks,
Shawn

Re: StandardTokenizer and splitting on mixedcase strings

Posted by Rick Leir <rl...@leirtech.com>.

Dan,
Lowercase filter before the tokenizer?
Cheers -- Rick


-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: StandardTokenizer and splitting on mixedcase strings

Posted by Steve Rowe <sa...@gmail.com>.

Hi Dan,

StandardTokenizerFactory does not do this.

Maybe you have a filter in your analysis chain that does this? E.g. WordDelimiterFilterFactory can split tokens on case changes.
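
To illustrate why a case-change split followed by lowercasing breaks the match, here is a toy Python model. It is not the real filter: split_on_case_change is a made-up helper that mimics only the lower-to-upper split, and the whitespace tokenizer is a simplification. In Solr itself, WordDelimiterFilterFactory has a splitOnCaseChange attribute, and setting it to "0" should turn this behaviour off.

```python
import re


def split_on_case_change(tokens):
    # Hypothetical helper: split each token at a lowercase-to-uppercase boundary.
    out = []
    for token in tokens:
        out.extend(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split())
    return out


def lowercase(tokens):
    return [token.lower() for token in tokens]


def toy_chain(text):
    # Simplified chain (assumption): whitespace tokenizer, case split, lowercase.
    return lowercase(split_on_case_change(text.split()))


index_tokens = toy_chain("JavaScript")   # what ends up in the index
query_tokens = toy_chain("javascript")   # what the query looks for
print(index_tokens)  # ['java', 'script']
print(query_tokens)  # ['javascript']
print(set(index_tokens) & set(query_tokens))  # set(): no token in common
```

The indexed tokens and the query token never coincide, which matches the symptom reported in the original question.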

--
Steve
www.lucidworks.com
