Posted to solr-user@lucene.apache.org by Ben <be...@autonomic.net> on 2009/06/30 22:03:32 UTC

Excluding characters from a wildcard query

Passing in a regular expression like "[^_]*_[^_]*" (i.e. matching 
anything with an underscore in the string) using some code like:

...
parameters.add("fq", "vector:[^_]*_[^_]*");
...

seems to cause problems for SOLR, I assume because of the [ or ^ character.

Can somebody please advise how to handle character exclusion in such 
searches?

Any help or pointers are much appreciated!

Thanks

Ben

Re: Excluding characters from a wildcard query

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm not sure if you can do prefix queries with the fq parameter. You will
: need to use the 'q' parameter for that.

fq supports anything q supports ... with the QParser and local params 
options it can be any syntax you want (as long as there is a QParser for 
it)


-Hoss
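
To illustrate the local params point above, here is a minimal sketch of a filter query that picks a QParser explicitly. It assumes a Solr version that ships the "prefix" QParser; the class and method names are hypothetical, and the field name "vector" is taken from the original post.

```java
// Sketch only: builds an fq string that selects a QParser through local params.
final class FilterQueryExamples {

    // "{!prefix f=<field>}<value>" asks Solr's prefix QParser to match terms
    // in <field> that start with <value>. The helper name is hypothetical.
    static String prefixFilter(String field, String prefix) {
        return "{!prefix f=" + field + "}" + prefix;
    }

    public static void main(String[] args) {
        // The resulting string would be passed as the "fq" parameter value.
        System.out.println(prefixFilter("vector", "A1_"));
    }
}
```

The returned string can be handed to the same call shown in the original post, i.e. parameters.add("fq", ...).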

Re: Excluding characters from a wildcard query

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Wed, Jul 1, 2009 at 5:07 PM, Ben <be...@autonomic.net> wrote:

> my brain was switched off.  I'm using SOLRJ, which means I'll need to
> specify multiple :
>
> addMultipleFields(solrDoc, "vector", "vectorvalue", 1.0f);
>
> for each value to be added to the multiValuedField.
>
> Then, with luck, the simple wildcard query will be executed over each
> individual value when looking for matches, meaning the simple query syntax
> can be made adequate to do what's needed.
>
>
I'm not sure if you can do prefix queries with the fq parameter. You will
need to use the 'q' parameter for that.

You may also want to look at the regex query support in lucene (contrib
package). I don't think that is supported out of the box in Solr yet.
-- 
Regards,
Shalin Shekhar Mangar.

Re: Excluding characters from a wildcard query

Posted by Ben <be...@autonomic.net>.

my brain was switched off.  I'm using SOLRJ, which means I'll need to 
specify multiple :

addMultipleFields(solrDoc, "vector", "vectorvalue", 1.0f);

for each value to be added to the multiValuedField.

Then, with luck, the simple wildcard query will be executed over each 
individual value when looking for matches, meaning the simple query 
syntax can be made adequate to do what's needed.

Many thanks Uwe.

B

Re: Excluding characters from a wildcard query

Posted by Uwe Klosa <uw...@gmail.com>.

2009/7/1 Ben <be...@autonomic.net>

> I'm not quite sure I understand exactly what you mean.
> The string I'm processing could have many tens of thousands of values... I
> hope you aren't implying I'd need to split it into many tens of thousands of
> "columns".


No, that is not what I meant. It will be one field (column) with tens of
thousands of values.


>
>
> If you're saying what I think you're saying, you're saying that I should
> leave whitespaces between the individual parts of the string, pass in the
> string into a "multiValued" field and have SOLR internally treat each "word"
> as an individual entity?
> Thanks for your help with this...


I said nothing about whitespaces. I don't know how you update your solr
documents. Are you using XML or Solrj?

Uwe

Re: Excluding characters from a wildcard query

Posted by Ben <be...@autonomic.net>.

I'm not quite sure I understand exactly what you mean.
The string I'm processing could have many tens of thousands of values... 
I hope you aren't implying I'd need to split it into many tens of 
thousands of "columns".

If you're saying what I think you're saying, you're saying that I should 
leave whitespaces between the individual parts of the string, pass in 
the string into a "multiValued" field and have SOLR internally treat 
each "word" as an individual entity? 

Thanks for your help with this...

Ben

Re: Excluding characters from a wildcard query

Posted by Uwe Klosa <uw...@gmail.com>.

To get the desired effect I described, you have to do the split before you
send the document to Solr. I'm not aware of an analyzer that can split one
field value into several field values. The analyzers and tokenizers only
create tokens from field values, in many different ways.

As I see it you have to do some preprocessing yourself.

Uwe
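
A minimal sketch of the preprocessing described above, assuming the comma-separated input format from Ben's example. The class and method names are hypothetical, and the SolrJ call appears only in a comment so the snippet runs without Solr on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: splits the raw string into one value per vector before indexing.
final class VectorSplitter {

    // Splits on commas, trims each part, and skips empty parts.
    static List<String> splitVectors(String raw) {
        List<String> values = new ArrayList<>();
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        for (String value : splitVectors("A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3")) {
            // With SolrJ, each value would be added to the multivalued field
            // separately, e.g. solrDoc.addField("vector", value);
            System.out.println(value);
        }
    }
}
```

Each value then becomes its own entry in the multivalued field, so a wildcard such as A1_* is evaluated against the individual vectors rather than the whole string.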

Re: Excluding characters from a wildcard query

Posted by Ben <be...@autonomic.net>.

Is there a way in the Schema to specify that the comma should be used to 
split the values up? 
e.g. Can I specify my "vector" field as multivalue and also specify some 
sort of tokeniser to automatically split on commas?

Ben


Re: Excluding characters from a wildcard query

Posted by Uwe Klosa <uw...@gmail.com>.

You should split the strings at the comma yourself and store the values in a
multivalued field. Then wildcard searches like A1_* are not a problem. I don't
know much about facets, but if they work on multivalued fields there
should be no problem at all.

Uwe

Re: Excluding characters from a wildcard query

Posted by Ben <be...@autonomic.net>.

Yes, I had done that... however, I'm beginning to see now that what I am 
doing is called a "wildcard query" which is going via Lucene's queryparser.
Lucene's query parser doesn't support the regexp idea of character 
exclusion ... i.e. I'm not trying to match "[", I'm trying to express 
"Match as many characters as possible, which are not underscores" with [^_]*

Perhaps I'm going about my whole problem in an ineffective way, but I'm 
not sure how I can sensibly describe what I'm doing without it becoming 
a long document.

The only other approach I can think of is to change what I'm indexing 
but I'm not sure how to achieve that.
I've tried explaining it once, and obviously failed, so I'll try again.

I'm given a string containing many vectors (where each dimension is 
separated by an underscore, and each vector is separated by a comma) e.g.

A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3

I want my facet query to tell me if, within one of the vectors within 
that string, there is a match for dimensions I'm interested in. Of the 
four dimensions in this example, I may choose to fix an arbitrary number 
of them with values, and the rest with wildcards e.g. I might look for a 
facet containing Ox_*_*_* so one of the vectors in the string must have 
its first dimension matching "Ox" and I don't care about the rest.

***Is there a way to break down this string on the commas so that I can 
apply a normal wildcard query and SOLR applies it to each 
individually?*** That would solve all my problems :
e.g.
The string is internally represented in lucene/solr as
A1_B1_C1_D1
A2_B2_C2_D2
A3_B3_C3_D3

where it tries to match the wildcard query on each in turn?

Thanks for your help, I'm deeply confused about this at the moment...

Ben
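
The matching semantics asked for above can be sketched in plain Java. This is a client-side illustration only, not something Solr or Lucene executes: it assumes each "*" stands for exactly one dimension (any run of non-underscore characters), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: per-vector matching where "*" never crosses an underscore.
final class DimensionWildcard {

    // Converts e.g. "Ox_*_*_*" into a regex; "*" becomes "[^_]*" and every
    // other character is matched literally.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append("[^_]*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // True if at least one individual vector matches the whole pattern.
    static boolean anyMatch(Pattern pattern, List<String> vectors) {
        for (String vector : vectors) {
            if (pattern.matcher(vector).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> vectors = Arrays.asList("A1_B1_C1_D1", "Ox_B2_C2_D2");
        System.out.println(anyMatch(compile("Ox_*_*_*"), vectors));
    }
}
```

Because the pattern is applied to each vector separately, a match can never span two vectors, which is the behaviour the question is after.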

Re: Excluding characters from a wildcard query

Posted by Uwe Klosa <uw...@gmail.com>.

You have to escape all special characters. Even [ to \[

Have a look here http://lucene.apache.org/java/2_4_0/queryparsersyntax.html

Uwe
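
A minimal sketch of what escaping does, hand-rolled so it runs without Lucene on the classpath. The class name and the exact character list are assumptions based on the query syntax page linked above; as far as I know, Lucene's QueryParser.escape and SolrJ's ClientUtils.escapeQueryChars do this job and would be preferable in real code.

```java
// Sketch only: prefixes each query-syntax character with a backslash.
final class QueryEscaper {

    // Single characters treated as special here (an assumption, see lead-in).
    private static final String SPECIAL = "+-!(){}[]^\"~*?:\\&|";

    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Every bracket, caret and asterisk is escaped, including the "[".
        System.out.println("vector:" + escape("[^_]*_[^_]*"));
    }
}
```

Note that escaping turns these characters into literals: an escaped * is no longer a wildcard, and an escaped [^_] is not a character class, so this avoids the parse error but does not give regex behaviour. In the escaped attempt quoted in this thread the "[" itself is left unescaped, which would explain why the parser still expects a range query.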

Re: Excluding characters from a wildcard query

Posted by Ben <be...@autonomic.net>.

I only just noticed that this is an exception being thrown by the 
lucene.queryParser. Should I be mailing on the lucene list, or is it ok 
here?

I'm beginning to wonder if the "fq" can handle the type of character 
exclusion I'm trying in the RegExp.
Escaping the string also doesn't work  :

Cannot parse 'vector:_\*[\^_\]\*_[\^_\]\*_[\^_\]\*': Encountered "]" at 
line 1, column 15.
Was expecting one of:
    "TO" ...
    <RANGEIN_QUOTED> ...
    <RANGEIN_GOOP> ...
   

Re: Excluding characters from a wildcard query - More Info - Is this difficult, or am I being ignored because it's too obvious to merit an answer?

Posted by Ben <be...@autonomic.net>.

Ben wrote:
> The exception SOLR raises is :
>
> org.apache.lucene.queryParser.ParseException: Cannot parse 
> 'vector:_*[^_]*_[^_]*_[^_]*': Encountered "]" at line 1, column 12.
> Was expecting one of:
>    "TO" ...
>    <RANGEIN_QUOTED> ...
>    <RANGEIN_GOOP> ...
>  
> Ben wrote:
>> Passing in a RegularExpression like "[^_]*_[^_]*" (e.g. matching 
>> anything with an underscore in the string) using some code like :
>>
>> ...
>> parameters.add("fq", "vector:[^_]*_[^_]*");
>> ...
>>
>> seems to cause problems for SOLR, I assume because of the [ or ^ 
>> character.
>>
>> Can somebody please advise how to handle character exclusion in such 
>> searches?
>>
>> Any help or pointers are much appreciated!
>>
>> Thanks
>>
>> Ben
>

Re: Excluding characters from a wildcard query - More Info

Posted by Ben <be...@autonomic.net>.

The exception SOLR raises is :

org.apache.lucene.queryParser.ParseException: Cannot parse 
'vector:_*[^_]*_[^_]*_[^_]*': Encountered "]" at line 1, column 12.
Was expecting one of:
    "TO" ...
    <RANGEIN_QUOTED> ...
    <RANGEIN_GOOP> ...
   
