Posted to solr-user@lucene.apache.org by "Lochschmied, Alexander" <Al...@vishay.com> on 2012/08/10 10:07:33 UTC

Indexing wildcard patterns

Coming from a SQL database based search system, we already have a set of defined patterns associated with our searchable documents.

% matches zero or more characters
_ matches exactly one character

Example:
Doc 1: 'AB%CD', 'AB%CD%'
Doc 2: 'AB_CD'
...

Thus Doc 1 matches
ABXYZCD
ABCD
ABCDXYZ
...

Whereas Doc 2 matches only
ABXCD
ABYCD
ABZCD
...

This can be achieved in SQL WHERE statements using the LIKE operator.

Is there a (similar) way to do this in Solr?

Thanks,
Alexander

AW: Indexing wildcard patterns

Posted by "Lochschmied, Alexander" <Al...@vishay.com>.
Thank you Toke, your comments made a lot of sense to me. Luckily we do not have many patterns, so we decided to consider only the literal prefix of each pattern, up to its first wildcard. That way we no longer have to deal with patterns at all.
Alexander
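
The prefix reduction described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the names literal_prefix, doc_patterns and docs_for are hypothetical, the sample data is taken from the example in the original post, and nothing here reflects how the prefixes were actually indexed in Solr.

```python
def literal_prefix(pattern):
    """Return the part of a LIKE pattern before its first % or _ wildcard."""
    for i, ch in enumerate(pattern):
        if ch in "%_":
            return pattern[:i]
    return pattern  # no wildcard at all: the whole pattern is literal


# Hypothetical data, copied from the example in the original post.
doc_patterns = {
    "Doc 1": ["AB%CD", "AB%CD%"],
    "Doc 2": ["AB_CD"],
}

# Reduce every pattern to its literal prefix, dropping duplicates per document.
doc_prefixes = {
    doc: sorted({literal_prefix(p) for p in patterns})
    for doc, patterns in doc_patterns.items()
}


def docs_for(user_input):
    """Documents with at least one stored prefix that the input starts with."""
    return [
        doc
        for doc, prefixes in doc_prefixes.items()
        if any(user_input.startswith(prefix) for prefix in prefixes)
    ]
```

Note the trade-off: both example documents collapse to the prefix "AB", so the reduction is less selective than the full patterns, and a pattern that starts with a wildcard yields an empty prefix that matches every input.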


Re: Indexing wildcard patterns

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2012-08-10 at 10:07 +0200, Lochschmied, Alexander wrote:
> Coming from a SQL database based search system, we already have a set of defined patterns associated with our searchable documents.
> 
> % matches no or any number of characters
> _ matches one character
> 
> Example:
> Doc 1: 'AB%CD', 'AB%CD%'
> Doc 2: 'AB_CD'

As I understand it: You have a list of (simple) patterns and want to
find those that match a given input. When you do it in SQL, it
iterates over all patterns and applies them one at a time.

I am not aware of any mechanism in Lucene/Solr that provides this
functionality. Implementing a new Query type for this would be a
possibility, and speed could be somewhat optimized by compiling the
patterns only once; but as long as the underlying algorithm is "iterate
all patterns and see if they match", this will not scale very well.

Before speculating any further, it would be nice to know the scale of
your problem: How many unique patterns are we talking about? Is there
any "pattern to the patterns", such as specific lengths, maximum number
of substitutions or literal prefixes?
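
To make the "compile once, then iterate all patterns" idea concrete, here is a minimal Python sketch that works outside Lucene/Solr. It is not a Lucene Query implementation. The names like_to_regex and PatternMatcher are hypothetical, only the % and _ wildcards are handled (no ESCAPE clause), and matching is case-sensitive, which may differ from the collation of a given SQL database.

```python
import re


def like_to_regex(pattern):
    """Compile a SQL LIKE pattern (% and _ only) into an anchored regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # zero or more characters
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts), re.DOTALL)


class PatternMatcher:
    """Holds precompiled patterns per document and tests them one by one."""

    def __init__(self, doc_patterns):
        # Compile every pattern exactly once, up front.
        self._compiled = [
            (doc, [like_to_regex(p) for p in patterns])
            for doc, patterns in doc_patterns.items()
        ]

    def match(self, text):
        # Linear scan: cost grows with the total number of stored patterns.
        return [
            doc
            for doc, regexes in self._compiled
            if any(rx.fullmatch(text) for rx in regexes)
        ]


matcher = PatternMatcher({
    "Doc 1": ["AB%CD", "AB%CD%"],
    "Doc 2": ["AB_CD"],
})
```

Calling matcher.match("ABCDXYZ") tests every stored pattern in turn. That is fine for a small pattern set, but it is exactly the linear scan that does not scale well.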


Re: AW: Indexing wildcard patterns

Posted by Tomas Zerolo <to...@axelspringer.de>.
On Fri, Aug 10, 2012 at 12:38:46PM -0400, Jack Krupansky wrote:
> "Doc1 has the pattern "AB%CD%" associated with it (somehow?!)."
> 
> You need to clarify what you mean by that.

I'm not the OP, but I think (s)he means the patterns are in the
database and the string to match is given in the query. Perhaps
this inversion is a bit unusual, and most optimizers aren't
prepared for that, but still reasonable, IMHO.

> To be clear, Solr support for wildcards is a superset of the SQL
> LIKE operator, and the patterns used in the LIKE operator are NOT
> stored in the table data, but used at query time

I don't know about others, but PostgreSQL copes just fine:

 | tomas@rasputin:~$ psql template1
 | psql (9.1.2)
 | Type "help" for help.
 | 
 | template1=# create database test;
 | CREATE DATABASE
 | template1=# create table foo (
 | template1(#   pattern VARCHAR
 | template1(# );
 | CREATE TABLE
 | template1=# insert into foo values('%blah');
 | INSERT 0 1
 | template1=# insert into foo values('blah%');
 | INSERT 0 1
 | template1=# insert into foo values('%bloh%');
 | INSERT 0 1
 | template1=# select * from foo where 'blahblah' like pattern;
 |  pattern 
 | ---------
 |  %blah
 |  blah%
 | (2 rows)

Now don't ask whether the optimizer has a fair chance at this. Dunno
what happens when we have, say, 10^7 patterns... but the OP's pattern
set seems to be "reasonably small".
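
For readers without a PostgreSQL instance at hand, the same inverted LIKE can be tried with Python's built-in sqlite3 module. This is a sketch with SQLite swapped in for PostgreSQL, so details such as case sensitivity may differ (SQLite's LIKE is case-insensitive for ASCII by default). The helper name matching_patterns is hypothetical.

```python
import sqlite3

# In-memory database holding the same three patterns as the psql session.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (pattern TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?)",
    [("%blah",), ("blah%",), ("%bloh%",)],
)


def matching_patterns(text):
    """Return the stored patterns that the given text matches."""
    # The input string is the left operand of LIKE; the column holding
    # the pattern is the right operand, evaluated once per row.
    rows = conn.execute(
        "SELECT pattern FROM foo WHERE ? LIKE pattern ORDER BY rowid",
        (text,),
    )
    return [row[0] for row in rows]
```

As in the session above, matching_patterns("blahblah") selects the rows '%blah' and 'blah%' but not '%bloh%'.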

>                                                  - same with Solr.
> In SQL you do not "associate" patterns with table data, but rather
> you query data using a pattern.

I'd guess that the above trick might be doable in Solr as well, as
other posts in this thread seem to suggest. But I'm not that proficient
in Solr, that's why I'm lurking here ;-)

tomás
-- 
Tomás Zerolo
Axel Springer AG
Axel Springer media Systems
BILD Produktionssysteme
Axel-Springer-Straße 65
10888 Berlin
Tel.: +49 (30) 2591-72875
tomas.zerolo@axelspringer.de
www.axelspringer.de

Axel Springer AG, Sitz Berlin, Amtsgericht Charlottenburg, HRB 4998
Vorsitzender des Aufsichtsrats: Dr. Giuseppe Vita
Vorstand: Dr. Mathias Döpfner (Vorsitzender)
Jan Bayer, Ralph Büchi, Lothar Lanz, Dr. Andreas Wiele

Re: AW: AW: Indexing wildcard patterns

Posted by Jack Krupansky <ja...@basetechnology.com>.
Ah, okay, I see the usage now. In SQL the right operand of LIKE can be 
either a literal wildcard pattern or an expression which is evaluated 
per-row during the query. Solr/Lucene has the former, but not the latter. 
The wildcard pattern will be fixed at the start of the search.

-- Jack Krupansky


AW: AW: Indexing wildcard patterns

Posted by "Lochschmied, Alexander" <Al...@vishay.com>.
Here is what we do in SQL:

mysql> select * from _tbl;
+----+------------+
| id | field      |
+----+------------+
|  1 | plain text |
|  2 | wil_c%     |
+----+------------+
2 rows in set (0.14 sec)

mysql> SELECT * FROM _TBL WHERE 'wildcard' LIKE FIELD;
+----+--------+
| id | field  |
+----+--------+
|  2 | wil_c% |
+----+--------+
1 row in set (0.12 sec)

So the patterns are associated with the actual documents in the database. We use those fields as a means to manually customize some searches.

Thanks,
Alexander


Re: AW: Indexing wildcard patterns

Posted by Jack Krupansky <ja...@basetechnology.com>.
"Doc1 has the pattern "AB%CD%" associated with it (somehow?!)."

You need to clarify what you mean by that.

To be clear, Solr support for wildcards is a superset of the SQL LIKE 
operator, and the patterns used in the LIKE operator are NOT stored in the 
table data, but used at query time - same with Solr. In SQL you do not 
"associate" patterns with table data, but rather you query data using a 
pattern.

Step back and describe the problem you are trying to solve rather than 
prematurely jumping into a proposed solution.

So, if there is something you already do in SQL and now wish to do it in 
Solr, please tell us about it.

-- Jack Krupansky


Re: AW: Indexing wildcard patterns

Posted by Ahmet Arslan <io...@yahoo.com>.
> So in the example I provided below, a user might enter "
> ABCDXYZ " and I need Solr to return Doc1, as Doc1 has the
> pattern "AB%CD%" associated with it (somehow?!).

OK understood now. You can use Lucene's MemoryIndex for this.

http://lucene.apache.org/core/3_6_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html

AW: Indexing wildcard patterns

Posted by "Lochschmied, Alexander" <Al...@vishay.com>.
I thought my question might be confusing...

I know about Solr providing wildcards in queries, but my problem is different.

I have those patterns associated with my searchable documents before any actual search is done.
I need Solr to return the documents whose associated patterns match the input. The user does not enter a wildcard pattern; Solr must test the stored patterns automatically.

So in my original example, a user might enter "ABCDXYZ" and I need Solr to return Doc1, as Doc1 has the pattern "AB%CD%" associated with it (somehow?!).

Thanks,
Alexander


Re: Indexing wildcard patterns

Posted by Ahmet Arslan <io...@yahoo.com>.

--- On Fri, 8/10/12, Lochschmied, Alexander <Al...@vishay.com> wrote:

> From: Lochschmied, Alexander <Al...@vishay.com>
> Subject: Indexing wildcard patterns
> To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
> Date: Friday, August 10, 2012, 11:07 AM
> Coming from a SQL database based
> search system, we already have a set of defined patterns
> associated with our searchable documents.
> 
> % matches no or any number of characters
> _ matches one character
> 
> Example:
> Doc 1: 'AB%CD', 'AB%CD%'
> Doc 2: 'AB_CD'
> ...
> 
> Thus Doc 1 matches
> ABXYZCD
> ABCD
> ABCDXYZ
> ...
> 
> Whereas Doc 2 matches only
> ABXCD
> ABYCD
> ABZCD
> ...
> 
> This can be achieved in SQL WHERE statements using the LIKE
> operator.
> 
> Is there a (similar) way to this in Solr?

Yes, wildcard search in Solr:

* matches zero or more characters
? matches exactly one character

 http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Wildcard%20Searches
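
The correspondence above (% to *, _ to ?) can be sketched as a small translation helper in Python. The function name like_to_lucene_wildcard is hypothetical. Escaping literal * and ? with a backslash is an assumption about the query parser in use and should be checked against your Lucene/Solr version, and other special characters of the query syntax (spaces, colons and so on) are not handled. This only covers wildcards supplied at query time, not patterns stored with the documents.

```python
def like_to_lucene_wildcard(pattern):
    """Translate a SQL LIKE pattern into Lucene-style wildcard syntax."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append("*")        # zero or more characters
        elif ch == "_":
            out.append("?")        # exactly one character
        elif ch in "*?\\":
            # Assumption: a backslash escapes a literal wildcard character.
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)
```

For example, like_to_lucene_wildcard("AB%CD%") yields the query term AB*CD*.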