Posted to solr-user@lucene.apache.org by Teresa McMains <te...@t14-consulting.com> on 2020/04/03 19:40:04 UTC

match string fields with embedded hyphens

Forgive me if this is unclear, I am very much new here.

I am working with a customer who needs to be able to query various account/customer ID fields which may or may not have embedded dashes.  But they want to be able to search by entering the dashes or not and by entering partial values or not.

So we may have an account or customer ID like

1234-56AB45

And they would like to retrieve this by searching for any of the following:
1234-56AB45     (full string match)
1234-56                (partial string match)
123456AB45        (full string but no dashes)
123456                  (partial string no dashes)

I've defined this field type in schema.xml as:

<!-- String replace field for account number searches -->
<fieldType name="TrimmedString" class="solr.TextField" omitNorms="true">
<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory" />

  <!-- Normalizes token text to upper case -->
  <filter class="solr.UpperCaseFilterFactory" />

  <!-- Removes anything that isn't a letter or digit -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
</analyzer>
</fieldType>

But the behavior I see is completely unexpected.
Full string match works fine on the customer's DEV environment but not in QA (which is running the same version of SOLR)
Partial string match works for some ID fields but not others
A Partial string match when the user does not enter the dashes just never works

I don't even know where to begin.  The behavior is not consistent enough to give me a sense.

So perhaps I will just ask - how would you define a fieldType which should ignore special characters like hyphens or underscores (or anything non-alphanumeric) and works for full string or partial string search?

Thank you.

Re: match string fields with embedded hyphens

Posted by Erick Erickson <er...@gmail.com>.

Look at what’s returned when you specify &debug=query. Particularly the parsed query. That’ll show you the results of parsing. My bet: you’ll see something unexpected...

Best,
Erick

> On Apr 8, 2020, at 17:59, Teresa McMains <te...@t14-consulting.com> wrote:
> 
> I am still really struggling with this.
> 
> Current field type as defined in schema.xml:
> 
> <!-- String replace field for account number searches -->
> <fieldType name="TrimmedString" class="solr.TextField" omitNorms="true"> 
> <analyzer> 
>  <!-- Removes anything that isn't a letter or digit --> 
>  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9])" replacement=""/>
>  <tokenizer class="solr.KeywordTokenizerFactory" />
> 
>  <!-- Normalizes token text to upper case -->
>  <filter class="solr.UpperCaseFilterFactory" /> 
> 
> </analyzer> 
> </fieldType>  
> 
> Two fields are defined using this field type:
> <field name="account_number" type="TrimmedString" indexed="true" stored="true" multivalued="false" required="false"/>
> <field name="transaction_key" type="TrimmedString" indexed="true" stored="true" multivalued="false" required="false"/>
> 
> A transaction key may look like: 107986541-85487JY_X4528745
> An account number may look like: 1258458-0659841
> 
> After making this change, I stopped solr, deleted the data directory, restarted solr and ran indexing for all data.
> 
> Transaction Key:
> Searching for:    107986541-85487JY_X4528745        Returns: 107986541-85487JY_X4528745 (good)
> Searching for: "107986541-85487JY_X4528745"        Returns: 107986541-85487JY_X4528745 (good)
> Searching for: 10798654185487JYX4528745        Returns: 107986541-85487JY_X4528745 (good)
> Searching for: "107986541-85487JY_X4528745"        Returns: 107986541-85487JY_X4528745 (good)
> Searching for: 107986541                Returns: 107986541-85487JY_X4528745 (unexpected)
> Searching for: 107986541*                Returns: MANY MANY hits that all start with 107986541 (unexpected)
> 
> Account Number: 
> Searching for: 1258458-0659841        Returns: NOTHING (bad)
> Searching for: "1258458-0659841"        Returns: 1258458-0659841 (good)
> Searching for: 12584580659841            Returns: 1258458-0659841 (good)
> Searching for: "12584580659841"        Returns: 1258458-0659841 (good)
> Searching for: 1258458-0659            Returns: 1258458-0659841 (good)
> Searching for: 1258458-0659*            Returns: NOTHING (bad)
> 
> So my questions are:
> 1) Why does searching for 107986541 And 107986541* For transaction_key return different results?
> 2) Why does searching for a full account number without quotes fail?
> 3) Why does specifying the wildcard character in the last account_number search return nothing?
> 
> Many many thanks.
> I'll get this some day,
> Teresa

RE: match string fields with embedded hyphens

Posted by Teresa McMains <te...@t14-consulting.com>.

I am still really struggling with this.

Current field type as defined in schema.xml:

<!-- String replace field for account number searches -->
<fieldType name="TrimmedString" class="solr.TextField" omitNorms="true"> 
<analyzer> 
  <!-- Removes anything that isn't a letter or digit --> 
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^A-Za-z0-9])" replacement=""/>
  <tokenizer class="solr.KeywordTokenizerFactory" />

  <!-- Normalizes token text to upper case -->
  <filter class="solr.UpperCaseFilterFactory" /> 

</analyzer> 
</fieldType>  

Two fields are defined using this field type:
<field name="account_number" type="TrimmedString" indexed="true" stored="true" multivalued="false" required="false"/>
<field name="transaction_key" type="TrimmedString" indexed="true" stored="true" multivalued="false" required="false"/>

A transaction key may look like: 107986541-85487JY_X4528745
An account number may look like: 1258458-0659841

After making this change, I stopped solr, deleted the data directory, restarted solr and ran indexing for all data.

Transaction Key:
Searching for:	107986541-85487JY_X4528745		Returns: 107986541-85487JY_X4528745 (good)
Searching for: "107986541-85487JY_X4528745"		Returns: 107986541-85487JY_X4528745 (good)
Searching for: 10798654185487JYX4528745		Returns: 107986541-85487JY_X4528745 (good)
Searching for: "107986541-85487JY_X4528745"		Returns: 107986541-85487JY_X4528745 (good)
Searching for: 107986541				Returns: 107986541-85487JY_X4528745 (unexpected)
Searching for: 107986541*				Returns: MANY MANY hits that all start with 107986541 (unexpected)

Account Number: 
Searching for: 1258458-0659841		Returns: NOTHING (bad)
Searching for: "1258458-0659841"		Returns: 1258458-0659841 (good)
Searching for: 12584580659841			Returns: 1258458-0659841 (good)
Searching for: "12584580659841"		Returns: 1258458-0659841 (good)
Searching for: 1258458-0659			Returns: 1258458-0659841 (good)
Searching for: 1258458-0659*			Returns: NOTHING (bad)

So my questions are:
1) Why does searching for 107986541 And 107986541* For transaction_key return different results?
2) Why does searching for a full account number without quotes fail?
3) Why does specifying the wildcard character in the last account_number search return nothing?

Many many thanks.
I'll get this some day,
Teresa


>  
> > On Apr 6, 2020, at 12:38 PM, Teresa McMains <te...@t14-consulting.com> wrote:
> > 
> > Erick, thank you so much for this.  I'm going to try to implement with PatternReplaceCharFilterFactory as you recommended.
> > What you mentioned about re-indexing from an empty state made sense to me (in terms of the observed behavior) but also surprised me.  If I select "Clean" on the reindex, does it *not* start from an empty state?
> > 
> > Thanks!!
> > Teresa

RE: match string fields with embedded hyphens

Posted by Teresa McMains <te...@t14-consulting.com>.

Erick, thank you so much for this.  I'm going to try to implement with PatternReplaceCharFilterFactory as you recommended.
What you mentioned about re-indexing from an empty state made sense to me (in terms of the observed behavior) but also surprised me.  If I select "Clean" on the reindex, does it *not* start from an empty state?

Thanks!!
Teresa

-----Original Message-----
From: Erick Erickson <er...@gmail.com> 
Sent: Friday, April 3, 2020 7:16 PM
To: solr-user@lucene.apache.org
Subject: Re: match string fields with embedded hyphens

First, thanks for taking the time to write up a clear problem statement. Putting in the field type is _really_ helpful.

By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*. The problem is that wildcards are tricky, and this trips everybody up at one time or another.

The quick background is that if there’s any possibility that the filter can produce multiple tokens for a single input token, that filter is skipped during analysis at _query_ time. Imagine that your replacement was a space rather than an empty string. Then 123--456 would become _two_ tokens in subsequent processing. Now anything you do is wrong sometime, somewhere. 

For instance, 123*456 would fail because it’d be looking for one token, which you wouldn’t expect. 12345* would also fail because there’s no single token like that. 123 would succeed (note no wildcard). You can see where this is going.

Which doesn’t help you solve your use-case. There are several options:

- use <charFilter class="solr.PatternReplaceCharFilterFactory"  pattern="[^A-Za-z0-9]" replacement=""/> instead of PatternReplaceFilterFactory. charFilters are applied to the raw input before analysis and don’t have the same problem with producing multiple tokens.

- WordDelimiter(Graph)FilterFactory is built for this kind of thing. There are a number of options, and this is one of the few filters that’s often different between index and query analysis chains. It can be tricky to understand all the interactions of the parameters though.

And as an aside, I don’t know how large your index is, but wildcards for one or two leading characters can get very expensive, i.e. 1*, 12* can get very costly. If you can require 3 or more leading characters there are rarely problems. You can also do a time/space tradeoff by including EdgeNgramFilterFactory in your chain at the cost of a larger index.

And finally (and this is a total nit), there’s no reason to specify lower-case characters in your existing pattern because the upper-case filter is first. You _will_ have to specify uppercase characters if you use the charfilter.

As for why production is different than QA, my guess is that you overlaid the schema changes on an _existing_ index. Most of the time, to get consistent results, you must re-index everything starting from an _empty_ index. This is a long and complicated explanation that I won’t go into here. In fact, I usually do one of two things:

1> define a new collection/core and index to that. If using SolrCloud, you can re-index and use collection aliasing to seamlessly switch.

2> stop Solr. Delete all the datadirs (the parent of tlog and index) associated with any of my replicas, restart with Solr and index. You may be able to get away with using delete-by-query to remove everything in your index then optimize (one of the very few times I’ll recommend optimizing), reloading your collection and indexing. The point is to get rid of all traces of anything generated from the old schema. 

Best,
Erick

Re: match string fields with embedded hyphens

Posted by Erick Erickson <er...@gmail.com>.

First, thanks for taking the time to write up a clear problem statement. Putting in the field type is _really_ helpful.

By “partial string match”, I’m assuming you’re using wildcards, i.e. 123*. The problem is that wildcards are tricky, and this trips everybody up at one time or another.

The quick background is that if there’s any possibility that the filter can produce multiple tokens for a single input token, that filter is skipped during analysis at _query_ time. Imagine that your replacement was a space rather than an empty string. Then 123--456 would become _two_ tokens in subsequent processing. Now anything you do is wrong sometime, somewhere. 

For instance, 123*456 would fail because it’d be looking for one token, which you wouldn’t expect. 12345* would also fail because there’s no single token like that. 123 would succeed (note no wildcard). You can see where this is going.

Which doesn’t help you solve your use-case. There are several options:

- use <charFilter class="solr.PatternReplaceCharFilterFactory"  pattern="[^A-Za-z0-9]" replacement=""/> instead of PatternReplaceFilterFactory. charFilters are applied to the raw input before analysis and don’t have the same problem with producing multiple tokens.

- WordDelimiter(Graph)FilterFactory is built for this kind of thing. There are a number of options, and this is one of the few filters that’s often different between index and query analysis chains. It can be tricky to understand all the interactions of the parameters though.

And as an aside, I don’t know how large your index is, but wildcards for one or two leading characters can get very expensive, i.e. 1*, 12* can get very costly. If you can require 3 or more leading characters there are rarely problems. You can also do a time/space tradeoff by including EdgeNgramFilterFactory in your chain at the cost of a larger index.
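
A minimal sketch of that EdgeNGram trade-off, combined with the charFilter option above. The field type name and the gram sizes (minGramSize="3", maxGramSize="32") are assumptions for illustration, and this has not been tested against your data. Prefixes are generated at index time only, so a plain query such as 1234-56 or 123456 can match a stored 1234-56AB45 without a trailing wildcard:

```xml
<!-- Hypothetical field type: the name and gram sizes are assumptions -->
<fieldType name="IdPrefix" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- Strip everything that isn't a letter or digit before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
    <!-- Index every leading prefix of the ID, from 3 to 32 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="32"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization as above, but no n-grams at query time -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^A-Za-z0-9]" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this layout maxGramSize needs to be at least as long as the longest normalized ID, otherwise a full-string query has no indexed term to match, and queries shorter than minGramSize match nothing.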

And finally (and this is a total nit), there’s no reason to specify lower-case characters in your existing pattern because the upper-case filter is first. You _will_ have to specify uppercase characters if you use the charfilter.

As for why production is different than QA, my guess is that you overlaid the schema changes on an _existing_ index. Most of the time, to get consistent results, you must re-index everything starting from an _empty_ index. This is a long and complicated explanation that I won’t go into here. In fact, I usually do one of two things:

1> define a new collection/core and index to that. If using SolrCloud, you can re-index and use collection aliasing to seamlessly switch.

2> stop Solr. Delete all the datadirs (the parent of tlog and index) associated with any of my replicas, restart with Solr and index. You may be able to get away with using delete-by-query to remove everything in your index then optimize (one of the very few times I’ll recommend optimizing), reloading your collection and indexing. The point is to get rid of all traces of anything generated from the old schema. 
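
For the delete-by-query route in option 2, the XML update messages would look roughly like the following. This is a sketch: each message is sent as its own request to the core's /update handler with an XML content type, and the core name and URL are site-specific, so they are not shown here.

```xml
<!-- 1. Remove every document in the index -->
<delete><query>*:*</query></delete>

<!-- 2. Commit so the deletion takes effect -->
<commit/>

<!-- 3. Merge segments so the deleted documents are physically purged -->
<optimize/>
```

After that, reload the core or collection and re-index. Stopping Solr and deleting the data directories remains the more thorough of the two approaches.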

Best,
Erick

Re: match string fields with embedded hyphens

Posted by Chris Hostetter <ho...@fucit.org>.

: I am working with a customer who needs to be able to query various
: account/customer ID fields which may or may not have embedded dashes.
: But they want to be able to search by entering the dashes or not and by
: entering partial values or not.
:
: So we may have an account or customer ID like
:
: 1234-56AB45
:
: And they would like to retrieve this by searching for any of the following:
: 1234-56AB45 (full string match)
: 1234-56 (partial string match)
: 123456AB45 (full string but no dashes)
: 123456 (partial string no dashes)

To answer your last question first...

: So perhaps I will just ask - how would you define a fieldType which
: should ignore special characters like hyphens or underscores (or
: anything non-alphanumeric) and works for full string or partial string
: search?

This is pretty much exactly what the "Word Delimiter Filter" was designed
for, and I encourage you to play with it and its various options and
see what happens...

https://lucene.apache.org/solr/guide/8_5/filter-descriptions.html#word-delimiter-graph-filter

You'd definitely need to enable some "non-default" options (like
"catenateNumbers=true") to ensure that you'd get indexed terms like
"123456" from input "1234-56AB45"

One thing that's not entirely clear from your question & input is how you
define "partial string" ... for example: are you expecting a query of "12"
to match your input document? Because WDF won't help with that.
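
As a starting point for experimenting, a field type along these lines
might look like the sketch below. The field type name and the specific
option values are assumptions, not a tested recommendation; run sample
IDs through the Analysis screen in the admin UI to see what actually
gets indexed before relying on it.

```xml
<!-- Hypothetical field type: the name and option values are assumptions -->
<fieldType name="IdDelimited" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on non-alphanumerics and letter/digit boundaries, and also
         index the joined forms (e.g. 1234 + 56 as 123456) -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.UpperCaseFilterFactory"/>
    <!-- Needed at index time after a graph-producing filter -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split the same way, but don't generate the joined forms again -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="0"/>
    <filter class="solr.UpperCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With catenateNumbers="1" the input "1234-56AB45" should yield the
indexed term "123456" described above, and catenateAll="1" should add
"123456AB45". Index and query chains differ on purpose here, which is
common for this filter.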

: But the behavior I see is completely unexpected. Full string match works
: fine on the customer's DEV environment but not in QA (which is running
: the same version of SOLR)

I guarantee you there is some difference between your DEV and QA
environments. Either in terms of the documents in the index, or the
schema THAT WAS USED WHEN INDEXING THE DOCS --
which might have been changed after the indexing happened, or
the "current" schema being used when the queries are getting
parsed, or the default request options in solrconfig.xml ... something is
absolutely different.

: Partial string match works for some ID fields but not others
: A Partial string match when the user does not enter the dashes just never works

I'm assuming these last 2 comments refer to behavior you see on *both*
your DEV and QA instances?

Depending on your definition of "partial string" (see the question i asked
above) then I _think_ the analyzer you have should work -- at least for
all the examples you've provided.

The missing piece of information is *how* you are querying: what query
parser you are using, what exactly the input looks like; and also: the
output: what does "never works" mean? ... does it match 0 docs? does it
match docs you don't expect?

It would help to see the exact request URLs you are trying, with
"debug=true&echoParams=all" added, and the full output of those requests.
That lets us check the header to confirm what default params might be
getting added, the query parser debug info to double check how your query
is being parsed, and the "explain" info to see why any docs that are
matching (unexpectedly) are there.

More tips on details that can be useful to include to "help us help
you"...

https://cwiki.apache.org/confluence/display/SOLR/UsingMailingLists

-Hoss
http://www.lucidworks.com/