Posted to solr-user@lucene.apache.org by "Jana, Kumar Raja" <kj...@ptc.com> on 2009/01/15 14:23:08 UTC

Customizing Solr to handle Leading Wildcard queries

Hi,

Not being able to perform Leading Wildcard queries is a major handicap.
I want to be able to perform searches like *.pdf to fetch all PDF
documents from Solr.

I have found quite a few threads on this topic, and one of the suggested
solutions was to enable this feature by adding:

parser.setAllowLeadingWildcard(true); at Line 92 in QueryParsing.java

Unfortunately, this did not work. Maybe I was using a different
parser; I don't know how to configure the parsers to make this work.

Can someone please tell me the steps to customize Solr to enable this
feature?

Thanks,

Kumar

Re: Customizing Solr to handle Leading Wildcard queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Yeah, I think the begin/end chars are very helpful here.  But I like the suggestion of figuring out which words really need to support leading wildcards...although that's typically impossible to predict, since people are typically free to enter whatever queries they feel like.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Neal Richter <nr...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, January 28, 2009 3:10:29 AM
> Subject: Re: Customizing Solr to handle Leading Wildcard queries
> 
> Oh wait.. it looks like Otis' suggestion of "index n-grams with begin/end
> delim characters", relying on phrase-searching to link the chains
> of characters, is logically a better version of my previous email.
> 
> - Neal

Re: Customizing Solr to handle Leading Wildcard queries

Posted by Neal Richter <nr...@gmail.com>.

Oh wait.. it looks like Otis' suggestion of "index n-grams with begin/end
delim characters", relying on phrase-searching to link the chains
of characters, is logically a better version of my previous email.

- Neal

On Wed, Jan 28, 2009 at 1:04 AM, Neal Richter <nr...@gmail.com> wrote:
> leading wildcard search is called grep ;-)
>
> Ditto on the indexing reversed words suggestion.
>
> Can you create a second field in solr that contains /only/ the words
> from the fields you care to reverse?  Once you do that you could
> pre-process the query and look for leading wildcards and address those
> (after reversing the query) only against your special
> reverse-meta-data field.
>
> The *foo* case really is grep! You nearly by definition have to
> linearly scan the index unless some magic is added.
>
> Your options are to extend Otis' ngram suggestion and turn a word like
> "baffoonery"
> into:
>
> (stored in "meta field")
> baffoonery
> affoonery
> ffoonery
> foonery
> oonery
> onery
> nery
> ery
> ry
>
> Now you can take a query like "*foo*" and drop the leading wildcard
> and it will hit on 'foonery'.
>
> Make sense?  You are trading index size for not doing a linear scan
> like grep.  It's not advisable to do this for every word in your
> document set ;-)
>
> - Neal Richter

Re: Customizing Solr to handle Leading Wildcard queries

Posted by Neal Richter <nr...@gmail.com>.

leading wildcard search is called grep ;-)

Ditto on the indexing reversed words suggestion.

Can you create a second field in solr that contains /only/ the words
from the fields you care to reverse?  Once you do that you could
pre-process the query and look for leading wildcards and address those
(after reversing the query) only against your special
reverse-meta-data field.

The *foo* case really is grep! You nearly by definition have to
linearly scan the index unless some magic is added.

Your options are to extend Otis' ngram suggestion and turn a word like
"baffoonery"
into:

(stored in "meta field")
baffoonery
affoonery
ffoonery
foonery
oonery
onery
nery
ery
ry

Now you can take a query like "*foo*" and drop the leading wildcard
and it will hit on 'foonery'.

Make sense?  You are trading index size for not doing a linear scan
like grep.  It's not advisable to do this for every word in your
document set ;-)
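
The suffix expansion above can be sketched in a few lines of Python. This is
only an illustration, not Solr or Lucene code: the function names (suffixes,
build_suffix_terms, infix_search) and the min_len=2 cutoff are assumptions
chosen to match the list above, and the sorted list stands in for the term
dictionary of the "meta field".

```python
import bisect


def suffixes(word, min_len=2):
    """Return every suffix of `word` with at least `min_len` characters."""
    return [word[i:] for i in range(len(word) - min_len + 1)]


def build_suffix_terms(words, min_len=2):
    """Build a sorted list of (suffix, original word) pairs."""
    pairs = {(s, w) for w in words for s in suffixes(w, min_len)}
    return sorted(pairs)


def infix_search(terms, pattern):
    """Answer a *foo* style query as a prefix lookup on the suffix terms."""
    needle = pattern.strip("*")
    # All terms that start with `needle` are contiguous in sorted order,
    # so one binary search finds where the matching range begins.
    start = bisect.bisect_left(terms, (needle,))
    hits = set()
    for suffix, word in terms[start:]:
        if not suffix.startswith(needle):
            break
        hits.add(word)
    return hits


terms = build_suffix_terms(["baffoonery", "solr"])
print(infix_search(terms, "*foo*"))  # {'baffoonery'}
```

The lookup touches only the terms that share the prefix, which is the
index-size-for-speed trade described above.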

- Neal Richter

On Wed, Jan 28, 2009 at 12:19 AM, Jana, Kumar Raja <kj...@ptc.com> wrote:
> Hi,
>
> Thanks Otis, Newton and everyone else for the help on this issue.
>
> Most of the data I index are documents like PDFs, Word docs, OpenOffice
> documents, etc. I store the content of the document in a field called
> content and the remaining metadata of the document (name, id,
> created by, modified by, created on, etc.) in a copy field called
> metadata. I am not particularly interested in enabling leading wildcard
> characters in the content (although such a possibility would be a
> bonus). For the metadata field, I've tried implementing the suggestion
> to store reversed strings as well as the original strings.
> All leading wildcard queries like "*abc" are searched as "cba*" against
> the reversed metadata field. So far so good. Thank you :)
>
> But now I have run into the scenario where the query string is *abc* :(
> and the whole thing came crashing down again. I cannot ignore such queries.
> I would rather take the risk of Solr OOMing by enabling the leading
> wildcard query searches.
>
> Can someone please tell me the steps to turn on this feature in Lucene
> QueryParser? I am sure it will be helpful to many to document such a
> procedure on the Wiki or somewhere else. (I am definitely going to do
> that once I fix this. Too much trouble this seems to be)
> Also, which queryParser does Solr use by default?
>
> Thanks,
> Kumar

RE: Customizing Solr to handle Leading Wildcard queries

Posted by "Jana, Kumar Raja" <kj...@ptc.com>.

Hi,

Thanks Otis, Newton and everyone else for the help on this issue.

Most of the data I index are documents like PDFs, Word docs, OpenOffice
documents, etc. I store the content of the document in a field called
content and the remaining metadata of the document (name, id,
created by, modified by, created on, etc.) in a copy field called
metadata. I am not particularly interested in enabling leading wildcard
characters in the content (although such a possibility would be a
bonus). For the metadata field, I've tried implementing the suggestion
to store reversed strings as well as the original strings.
All leading wildcard queries like "*abc" are searched as "cba*" against
the reversed metadata field. So far so good. Thank you :)

But now I have run into the scenario where the query string is *abc* :(
and the whole thing came crashing down again. I cannot ignore such queries.
I would rather take the risk of Solr OOMing by enabling the leading
wildcard query searches.

Can someone please tell me the steps to turn on this feature in Lucene
QueryParser? I am sure it will be helpful to many to document such a
procedure on the Wiki or somewhere else. (I am definitely going to do
that once I fix this. Too much trouble this seems to be)
Also, which queryParser does Solr use by default? 

Thanks,
Kumar

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Thursday, January 15, 2009 10:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Customizing Solr to handle Leading Wildcard queries

Hi ramuK,

I believe you can turn that "on" via the Lucene QueryParser, but of
course such searches will be slo(oo)w.  You can also index reversed
tokens (e.g. *kumar --> rakum*) or you could index n-grams with
begin/end delim characters (e.g. kumar -> ^ k u m a r $, *kumar -> "k u
m a r $")

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Customizing Solr to handle Leading Wildcard queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi ramuK,

I believe you can turn that "on" via the Lucene QueryParser, but of course such searches will be slo(oo)w.  You can also index reversed tokens (e.g. *kumar --> rakum*) or you could index n-grams with begin/end delim characters (e.g. kumar -> ^ k u m a r $, *kumar -> "k u m a r $")
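
The begin/end delimiter idea can be illustrated with a small Python sketch.
It is not Lucene code: char_tokens, wildcard_to_phrase and phrase_matches are
hypothetical names, the phrase match is a plain list scan standing in for a
phrase query, and only wildcards at the very start or end of the pattern are
handled.

```python
def char_tokens(word):
    """Index-time view of a word: one token per character, plus delimiters."""
    return ["^"] + list(word) + ["$"]


def wildcard_to_phrase(pattern):
    """Turn a pattern such as *kumar, kumar* or *uma* into a token phrase.

    A missing leading '*' anchors the phrase with '^'; a missing trailing
    '*' anchors it with '$'.
    """
    tokens = list(pattern.strip("*"))
    if not pattern.startswith("*"):
        tokens.insert(0, "^")
    if not pattern.endswith("*"):
        tokens.append("$")
    return tokens


def phrase_matches(doc_tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `doc_tokens`."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase
               for i in range(len(doc_tokens) - n + 1))


doc = char_tokens("kumar")                                # ^ k u m a r $
print(wildcard_to_phrase("*kumar"))                       # ['k', 'u', 'm', 'a', 'r', '$']
print(phrase_matches(doc, wildcard_to_phrase("*uma*")))   # True
```

Under these assumptions the same mechanism covers leading, trailing and
two-sided wildcards, at the cost of a much larger index.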


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: "Jana, Kumar Raja" <kj...@ptc.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, January 15, 2009 9:49:24 AM
> Subject: RE: Customizing Solr to handle Leading Wildcard queries
> 
> Hi Erik,
> 
> Thanks for the quick reply.
> I want to enable leading wildcard query searches in general. The case
> mentioned in the earlier mail is just one of the many instances I use
> this feature.
> 
> -Kumar

Re: Customizing Solr to handle Leading Wildcard queries

Posted by Glen Newton <gl...@gmail.com>.

If we are talking about short single-term fields (like a file field that
has a single term like "foo.pdf"), then do what the DBMS b-tree indexes did
a long time ago: for every field on which you want leading wildcards, also
insert the value in reverse order. So field file:"foo.pdf" is also stored
and indexed as reverseField:"fdp.oof". Now when someone does a leading
wildcard search, like *oo.pdf, you reverse the query and run it against
reverseField: fdp.oo*

I believe some of the DBMSs kept a separate reverse b-tree to handle
leading wildcard queries.

And obviously this technique is harder to put in place for arbitrary
sections of text that have to be parsed. But a special parser could be
written to handle this as well.
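
A rough Python sketch of the query rewrite described above, for the
single-term case only. The names reverse_term and rewrite_leading_wildcard
are hypothetical; the field names file and reverseField are taken from the
example, and anything other than a single leading "*" is passed through
untouched.

```python
def reverse_term(term):
    """Value to index in the reversed field, e.g. foo.pdf -> fdp.oof."""
    return term[::-1]


def rewrite_leading_wildcard(field, pattern, reversed_field):
    """Rewrite field:*suffix into reversed_field:xiffus* when possible.

    Returns a (field name, pattern) pair. Patterns with wildcards anywhere
    other than the first character are returned unchanged.
    """
    rest = pattern[1:]
    if pattern.startswith("*") and rest and "*" not in rest and "?" not in rest:
        return reversed_field, reverse_term(rest) + "*"
    return field, pattern


print(reverse_term("foo.pdf"))
# fdp.oof
print(rewrite_leading_wildcard("file", "*oo.pdf", "reverseField"))
# ('reverseField', 'fdp.oo*')
```

A pattern such as *abc* falls through unchanged here, which is exactly the
case the reversed field alone cannot serve.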

-glen
http://zzzoot.blogspot.com/

2009/1/15 Jana, Kumar Raja <kj...@ptc.com>:
> Hi Erik,
>
> Thanks for the quick reply.
> I want to enable leading wildcard query searches in general. The case
> mentioned in the earlier mail is just one of the many instances I use
> this feature.
>
> -Kumar

RE: Customizing Solr to handle Leading Wildcard queries

Posted by "Jana, Kumar Raja" <kj...@ptc.com>.

Hi Erik,

Thanks for the quick reply.
I want to enable leading wildcard query searches in general. The case
mentioned in the earlier mail is just one of the many instances I use
this feature.

-Kumar

-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com] 
Sent: Thursday, January 15, 2009 7:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Customizing Solr to handle Leading Wildcard queries

On Jan 15, 2009, at 8:23 AM, Jana, Kumar Raja wrote:
> Not being able to perform Leading Wildcard queries is a major  
> handicap.
> I want to be able to perform searches like *.pdf to fetch all pdf
> documents from Solr.

For this particular case, I recommend indexing the document type as a  
separate field.  Something like type:pdf (or use a MIME type string).   
Then you can do a very direct and fast query to search or facet by  
document types.

	Erik

Re: Customizing Solr to handle Leading Wildcard queries

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 15, 2009, at 8:23 AM, Jana, Kumar Raja wrote:
> Not being able to perform Leading Wildcard queries is a major  
> handicap.
> I want to be able to perform searches like *.pdf to fetch all pdf
> documents from Solr.

For this particular case, I recommend indexing the document type as a  
separate field.  Something like type:pdf (or use a MIME type string).   
Then you can do a very direct and fast query to search or facet by  
document types.
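
As a small sketch of this suggestion, the document type could be derived
from the file name at indexing time. The helper names and the dictionary
layout below are assumptions for illustration, not a Solr API; only the
field name type comes from the example above.

```python
import os


def document_type(filename):
    """Return the lower-cased file extension without the dot, or ''."""
    extension = os.path.splitext(filename)[1]
    return extension[1:].lower()


def to_index_fields(doc_id, filename):
    """Fields to send to the index, with the type kept as its own field."""
    return {"id": doc_id, "name": filename, "type": document_type(filename)}


print(to_index_fields("42", "Report.PDF"))
# {'id': '42', 'name': 'Report.PDF', 'type': 'pdf'}
```

A query such as type:pdf then needs no wildcard at all.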

	Erik