Posted to solr-user@lucene.apache.org by Todd Long <lo...@gmail.com> on 2015/05/18 17:20:34 UTC

Wildcard/Regex Searching with Decimal Fields

I'm having some normalization issues when trying to search decimal fields
(i.e. TrieDoubleField copied to TextField).

1. Wildcard searching: I created a separate "TextField" field type (e.g.
filter_decimal) which filters whole numbers to have at least one decimal
place (i.e. dot zero) using the pattern replace filter. When I build the
query I remove any extraneous zeros in the decimal (e.g. 235.000 becomes
235.0) to make sure my wildcard search will match on the non-wildcard
decimal (hopefully that makes sense). I then build the wildcard query based
on the original input along with the extraneous zeros removed (see examples
below). Is this the best approach or does Solr allow me to go about this
another way?

e.g.
input: 2*5.000
query: filter_decimal:2*5.000* OR filter_decimal:2*5.0

e.g.
input: 235.
query: filter_decimal:235.*

2. Regex searching: When indexing decimal fields with a dot zero any regular
expressions that don't take that into account return no results (see example
below). The only way around this is by dropping the dot zero when indexing.
Of course, this now requires me to define another field type with an
appropriate pattern replace filter. I tried creating a query token filter
but by the time I get the term attribute I don't know if the search was a regular
expression or not. Any ideas on this? Is it best to just create another
field type that removes the dot zero?

e.g. /23[58]/ (will not match on 235.0)
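
For readers following along, the wildcard normalization described in item 1 can be sketched outside of Solr roughly as follows. This is only an illustration under stated assumptions: the function name build_wildcard_query is hypothetical, the input is assumed to be non-negative, and no escaping of special query characters is attempted.

```python
def build_wildcard_query(field: str, user_input: str) -> str:
    """Build a Solr query string for a decimal 'starts with' search (sketch)."""
    # Always search on the raw input with a trailing wildcard.
    prefix_clause = f"{field}:{user_input}*"
    whole, dot, frac = user_input.partition(".")
    # Only trim when the fractional part is plain digits (no wildcards in it).
    if dot and frac.isdigit():
        trimmed = frac.rstrip("0") or "0"  # keep at least one decimal place
        if trimmed != frac:
            # OR in the exact value with the extraneous zeros removed.
            return f"{prefix_clause} OR {field}:{whole}.{trimmed}"
    return prefix_clause


print(build_wildcard_query("filter_decimal", "2*5.000"))
# filter_decimal:2*5.000* OR filter_decimal:2*5.0
print(build_wildcard_query("filter_decimal", "235."))
# filter_decimal:235.*
```

The two printed queries correspond to the two examples given in item 1.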

Please let me know if I can provide any additional details. Thanks for the
help!



--
View this message in context: http://lucene.472066.n3.nabble.com/Wildcard-Regex-Searching-with-Decimal-Fields-tp4206015.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Erick Erickson <er...@gmail.com>.
This feels like an XY problem. Either you're working with numbers or
you're not. It's hard for me to imagine what purpose is served by a
query on numerical data that would match 2.5, 20.5, 299.5,
299999999999999.5 etc, much less regexes. That may just be my limited
imagination however.

You could simply inject "synonyms" without the .0 in the same field though.

Best,
Erick

On Mon, May 18, 2015 at 8:20 AM, Todd Long <lo...@gmail.com> wrote:
> I'm having some normalization issues when trying to search decimal fields
> (i.e. TrieDoubleField copied to TextField).
>
> 1. Wildcard searching: I created a separate "TextField" field type (e.g.
> filter_decimal) which filters whole numbers to have at least one decimal
> place (i.e. dot zero) using the pattern replace filter. When I build the
> query I remove any extraneous zeros in the decimal (e.g. 235.000 becomes
> 235.0) to make sure my wildcard search will match on the non-wildcard
> decimal (hopefully that makes sense). I then build the wildcard query based
> on the original input along with the extraneous zeros removed (see examples
> below). Is this the best approach or does Solr allow me to go about this
> another way?
>
> e.g.
> input: 2*5.000
> query: filter_decimal:2*5.000* OR filter_decimal:2*5.0
>
> e.g.
> input: 235.
> query: filter_decimal:235.*
>
> 2. Regex searching: When indexing decimal fields with a dot zero any regular
> expressions that don't take that into account return no results (see example
> below). The only way around this is by dropping the dot zero when indexing.
> Of course, this now requires me to define another field type with an
> appropriate pattern replace filter. I tried creating a query token filter
> but by the time I get the term attribute I don't know if the search was a regular
> expression or not. Any ideas on this? Is it best to just create another
> field type that removes the dot zero?
>
> e.g. /23[58]/ (will not match on 235.0)
>
> Please let me know if I can provide any additional details. Thanks for the
> help!

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Todd Long <lo...@gmail.com>.
Sounds good. Thank you for the synonym (definitely will work on this) and
padding suggestions.

- Todd

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Erick Erickson <er...@gmail.com>.
Then it seems like you can just index the raw strings as a string
field and suggest with that but fire the actual query against the
numeric type.....
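
A minimal sketch of that two-field idea, assuming hypothetical field names frequency_str (a string copy used for suggestions) and frequency (the numeric field), and non-negative values only:

```python
def suggest_query(prefix: str, string_field: str = "frequency_str") -> str:
    """'Starts with' query against the string copy, used only for autocomplete."""
    return f"{string_field}:{prefix}*"


def exact_query(selected: str, numeric_field: str = "frequency") -> str:
    """Query against the numeric field once the user has picked a suggestion."""
    # float() normalizes forms such as "235", "235.0" and "235.000" to one value.
    return f"{numeric_field}:{float(selected)}"


print(suggest_query("2"))    # frequency_str:2*
print(exact_query("235"))    # frequency:235.0
```

The string field only has to support prefix matching, while sorting, ranges and statistics stay on the numeric field.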

Best,
Erick

On Tue, May 19, 2015 at 3:25 PM, Todd Long <lo...@gmail.com> wrote:
> Erick Erickson wrote
>> But I _really_ have to go back to one of my original questions: What's
>> the use-case?
>
> The use-case is with autocompleting fields. The user might know a frequency
> starts with 2 so we want to limit those results (e.g. 2, 23, 214, etc.). We
> would still index/store the numeric-type but maintain an additional string
> index for autocompleting (and regular expressions). We can throw away the
> "contains" but will at least need the "starts with" behavior.
>
> - Todd

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Todd Long <lo...@gmail.com>.
Erick Erickson wrote
> But I _really_ have to go back to one of my original questions: What's
> the use-case?

The use-case is with autocompleting fields. The user might know a frequency
starts with 2 so we want to limit those results (e.g. 2, 23, 214, etc.). We
would still index/store the numeric-type but maintain an additional string
index for autocompleting (and regular expressions). We can throw away the
"contains" but will at least need the "starts with" behavior.

- Todd

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Erick Erickson <er...@gmail.com>.
No cleaner ways that spring to mind. Although you might get some mileage
out of normalizing _everything_ rather than indexing different forms.
Perhaps all numbers are stored left-padded with zeros to 16 places to the
left of the decimal point and right-padded 16 places to the right of the
decimal point. Which incidentally allows you to do range queries and other
numeric-type comparisons.
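
The padding idea can be sketched like this. It is an illustration only: pad_decimal is a hypothetical helper, and it assumes non-negative plain decimal strings with no sign or exponent.

```python
def pad_decimal(value: str, width: int = 16) -> str:
    """Zero-pad both sides of the decimal point to a fixed width (sketch)."""
    whole, _, frac = value.partition(".")
    if not whole.isdigit() or (frac and not frac.isdigit()):
        raise ValueError(f"not a plain non-negative decimal: {value!r}")
    if len(whole) > width or len(frac) > width:
        raise ValueError(f"too many digits for width {width}: {value!r}")
    # Left-pad the whole part, right-pad the fractional part.
    return f"{whole.zfill(width)}.{frac.ljust(width, '0')}"


values = ["23.5", "235", "2.05", "1000.25"]
# With fixed-width padding, string order and numeric order agree.
print(sorted(values, key=pad_decimal) == sorted(values, key=float))  # True
```

Because every term then has the same shape, 235 and 235.000 become the same indexed string, and string range queries behave like numeric ones for values that fit the chosen width.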

<rant>
But I _really_ have to go back to one of my original questions: What's the
use-case? You've outlined _how_ users would like to use regexes and
wildcards over numeric data, but not _why_. You've accepted as a given that
"contains" are necessary. Before investing any more time and effort, please,
please, please figure out whether this is just something somebody threw in
and is valueless or whether it's actually something that would provide value
_to the end user_.

This is where I really have to dig in my heels and have the product manager
explain, in very concrete terms, the _value_ the user gets out of this.
Don't get me wrong, there may be perfectly valid reasons. Just make sure
they're well thought out before straining to provide functionality that
implements a half-baked use-case that nobody then uses. Is this more
valuable than not being able to do any statistics like sum, average, etc?

When having this discussion, have the range queries in your back pocket and
see if anything that the PM brings up can't be satisfied by numeric searches
rather than string searches. Maybe even bring in a user and ask "is this
useful?".

I've just spent too much of my life implementing useless features to not
question something like this ;)

</rant>

Best,
Erick

On Tue, May 19, 2015 at 7:19 AM, Todd Long <lo...@gmail.com> wrote:
> I see what you're saying and that should do the trick. I could index 123 with
> an index synonym 123.0. Then my regex query "/123/" should hit along with a
> boolean query "123.0 OR 123.00*". Is there a cleaner approach to breaking
> apart the boolean query in this case? Right now, outside of Solr, I'm just
> looking for any extraneous zeros and wildcards to get the exact value (e.g.
> 123.0) and OR'ing that with the original user input.
>
> Thank you for your help.
>
> - Todd

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Todd Long <lo...@gmail.com>.
I see what you're saying and that should do the trick. I could index 123 with
an index synonym 123.0. Then my regex query "/123/" should hit along with a
boolean query "123.0 OR 123.00*". Is there a cleaner approach to breaking
apart the boolean query in this case? Right now, outside of Solr, I'm just
looking for any extraneous zeros and wildcards to get the exact value (e.g.
123.0) and OR'ing that with the original user input.
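
As background for why the extra indexed form helps: a Lucene regex query has to match an entire indexed term, not just part of it. The sketch below imitates that with Python's re.fullmatch. Python's regex dialect is not Lucene's, but the constructs used here are common to both, and the term lists are made-up sample data.

```python
import re


def matching_terms(pattern: str, indexed_terms: list) -> list:
    """Return the terms that the pattern matches in full (whole-term matching)."""
    return [term for term in indexed_terms if re.fullmatch(pattern, term)]


only_decimal = ["235.0", "238.0", "239.0"]
with_variants = ["235", "235.0", "238", "238.0", "239", "239.0"]

print(matching_terms(r"23[58]", only_decimal))          # []
print(matching_terms(r"23[58]", with_variants))         # ['235', '238']
print(matching_terms(r"23[58](\.0+)?", only_decimal))   # ['235.0', '238.0']
```

The last line shows the alternative of making the pattern itself tolerate the dot zero, which pushes the burden onto whoever writes the regex.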

Thank you for your help.

- Todd

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Erick Erickson <er...@gmail.com>.
I really have in mind an index-time filter, not necessarily a
query-time Filter. So at index time you have something like 123.4000.
You index synonyms 123.4 and 123 (or whatever) and now your _queries_
should "just work" since the forms you need are in the index already
and whether it's a regex or not is handled for you by the query
parsing process without any additional work on your part.
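
The variant generation such an index-time filter would perform might look roughly like the sketch below. This is plain Python rather than a Lucene TokenFilter, and numeric_variants is a hypothetical name. In a real filter the variants would be emitted at the same token position as the original, the way synonyms are. The sketch only adds forms equal in value to the original, so it does not add the truncated 123 for 123.4000.

```python
import re
from typing import List


def numeric_variants(token: str) -> List[str]:
    """Return the token plus the normalized forms to index alongside it (sketch)."""
    # Leave anything that is not a plain non-negative decimal untouched.
    if not re.fullmatch(r"\d+(\.\d+)?", token):
        return [token]
    whole, _, frac = token.partition(".")
    frac = frac.rstrip("0")
    if frac:
        candidates = [f"{whole}.{frac}"]       # e.g. 123.4000 -> 123.4
    else:
        candidates = [whole, f"{whole}.0"]     # e.g. 235.000 -> 235 and 235.0
    variants = [token]
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(numeric_variants("123.4000"))  # ['123.4000', '123.4']
print(numeric_variants("123"))       # ['123', '123.0']
```

With both 123 and 123.0 in the index, neither the regex nor the wildcard query needs to know which form the source data used.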

Of course I may be misunderstanding the problem....

Best,
Erick

On Mon, May 18, 2015 at 5:19 PM, Todd Long <lo...@gmail.com> wrote:
> Erick Erickson wrote
>> No, not using SynonymFilterFactory. Rather take that as a base for a custom
>> Filter that doesn't use any input file.
>
> OK, I just wanted to make sure I wasn't missing something that could be done
> with the SynonymFilterFactory itself. At one time, I started going down this
> path but I wasn't sure if I could access the indexed values using a "query"
> filter though I assume that is part of what SynonymFilterFactory is doing...
> I was able to create a custom filter, but I could only access the query
> input, from which I still couldn't tell what type of search was being
> done (i.e. regex or not). The regex query input did not include the
> surrounding forward slashes.
>
> - Todd

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Todd Long <lo...@gmail.com>.
Erick Erickson wrote
> No, not using SynonymFilterFactory. Rather take that as a base for a custom
> Filter that doesn't use any input file.

OK, I just wanted to make sure I wasn't missing something that could be done
with the SynonymFilterFactory itself. At one time, I started going down this
path but I wasn't sure if I could access the indexed values using a "query"
filter though I assume that is part of what SynonymFilterFactory is doing...
I was able to create a custom filter, but I could only access the query
input, from which I still couldn't tell what type of search was being
done (i.e. regex or not). The regex query input did not include the
surrounding forward slashes.

- Todd

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Erick Erickson <er...@gmail.com>.
No, not using SynonymFilterFactory. Rather take that as a base for a custom
Filter that doesn't use any input file. Rather it would normalize any
numeric tokens and inject as many variants on the spot as you desire.

Best,
Erick

On Mon, May 18, 2015 at 9:56 AM, Todd Long <lo...@gmail.com> wrote:
> Essentially, we have a grid of data (i.e. frequencies, baud rates, data
> rates, etc.) and we allow wildcard filtering on the various columns. As the
> user provides input, in a specific column, we simply filter the overall data
> by an implicit "starts with" query (i.e. 23 becomes 23*). In most cases,
> yes, a range search would suffice until you get to those "contains" queries.
> We are working with strings with the need to properly handle the decimal
> place. I don't know the exact use case where the "contains" query comes into
> play with the numerics but most likely it would have to do with "pattern"
> matching (i.e. knowing a certain sequence where 2*3 might be helpful).
>
> It's easy enough to normalize the user input and perform an OR search with
> the wildcard. I'm just trying to find a way to index the data once that
> allows me to handle the dot zero in both wildcard and regex searches. I
> guess it would be nice to index the numeric as a string without dot zero and
> when performing a search have the input hit against both the whole number
> and dot zero.
>
>
> Erick Erickson wrote
>> You could simply inject "synonyms" without the .0 in the same field
>> though.
>
> Using a SynonymFilterFactory? If so, can this be done dynamically as I won't
> know the "numeric" (I guess we can call them string) values.

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Todd Long <lo...@gmail.com>.
Essentially, we have a grid of data (i.e. frequencies, baud rates, data
rates, etc.) and we allow wildcard filtering on the various columns. As the
user provides input, in a specific column, we simply filter the overall data
by an implicit "starts with" query (i.e. 23 becomes 23*). In most cases,
yes, a range search would suffice until you get to those "contains" queries.
We are working with strings with the need to properly handle the decimal
place. I don't know the exact use case where the "contains" query comes into
play with the numerics but most likely it would have to do with "pattern"
matching (i.e. knowing a certain sequence where 2*3 might be helpful).

It's easy enough to normalize the user input and perform an OR search with
the wildcard. I'm just trying to find a way to index the data once that
allows me to handle the dot zero in both wildcard and regex searches. I
guess it would be nice to index the numeric as a string without dot zero and
when performing a search have the input hit against both the whole number
and dot zero.


Erick Erickson wrote
> You could simply inject "synonyms" without the .0 in the same field
> though.

Using a SynonymFilterFactory? If so, can this be done dynamically as I won't
know the "numeric" (I guess we can call them string) values.

Re: Wildcard/Regex Searching with Decimal Fields

Posted by Jack Krupansky <ja...@gmail.com>.
Maybe you should first disclose the nature of the business problem you are
trying to solve.

To be clear, patterns and wildcards are string processing operations, not
numeric operations. Usually one searches for ranges of numeric values. So,
again, what operation are you really trying to perform that is causing you
to resort to pattern matching and wildcards? I can't wait to hear!

I mean, if you simply want to match one of a set of numbers that are not in
a consecutive range, try the OR operator.
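
As a small illustration of the OR suggestion, a query matching any one of several non-consecutive values could be built like this (any_of_query and the field name frequency are hypothetical, and the values are assumed to need no escaping):

```python
def any_of_query(field: str, values: list) -> str:
    """Build a Solr query that matches any one of the given values."""
    return f"{field}:({' OR '.join(values)})"


print(any_of_query("frequency", ["235", "238", "1200"]))
# frequency:(235 OR 238 OR 1200)
```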

-- Jack Krupansky

On Mon, May 18, 2015 at 11:20 AM, Todd Long <lo...@gmail.com> wrote:

> I'm having some normalization issues when trying to search decimal fields
> (i.e. TrieDoubleField copied to TextField).
>
> 1. Wildcard searching: I created a separate "TextField" field type (e.g.
> filter_decimal) which filters whole numbers to have at least one decimal
> place (i.e. dot zero) using the pattern replace filter. When I build the
> query I remove any extraneous zeros in the decimal (e.g. 235.000 becomes
> 235.0) to make sure my wildcard search will match on the non-wildcard
> decimal (hopefully that makes sense). I then build the wildcard query based
> on the original input along with the extraneous zeros removed (see examples
> below). Is this the best approach or does Solr allow me to go about this
> another way?
>
> e.g.
> input: 2*5.000
> query: filter_decimal:2*5.000* OR filter_decimal:2*5.0
>
> e.g.
> input: 235.
> query: filter_decimal:235.*
>
> 2. Regex searching: When indexing decimal fields with a dot zero any
> regular
> expressions that don't take that into account return no results (see
> example
> below). The only way around this is by dropping the dot zero when indexing.
> Of course, this now requires me to define another field type with an
> appropriate pattern replace filter. I tried creating a query token filter
> but by the time I get the term attribute I don't know if the search was a
> regular
> expression or not. Any ideas on this? Is it best to just create another
> field type that removes the dot zero?
>
> e.g. /23[58]/ (will not match on 235.0)
>
> Please let me know if I can provide any additional details. Thanks for the
> help!