Posted to solr-user@lucene.apache.org by Ben <be...@autonomic.net> on 2009/06/30 22:03:32 UTC
Excluding characters from a wildcard query
Passing in a regular expression like "[^_]*_[^_]*" (i.e. matching
anything with an underscore in the string) using some code like :
...
parameters.add("fq", "vector:[^_]*_[^_]*");
...
seems to cause problems for SOLR, I assume because of the [ or ^ character.
Can somebody please advise how to handle character exclusion in such
searches?
Any help or pointers are much appreciated!
Thanks
Ben
Re: Excluding characters from a wildcard query
Posted by Chris Hostetter <ho...@fucit.org>.
: I'm not sure if you can do prefix queries with the fq parameter. You will
: need to use the 'q' parameter for that.
fq supports anything q supports ... with the QParser and local params
options it can be any syntax you want (as long as there is a QParser for
it)
-Hoss
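Editorial sketch of the local params idea mentioned above, in Java because the thread uses SolrJ. It assumes the Solr version in use ships the "prefix" QParser and local params syntax; the class and method names (PrefixFilter, prefixFilter) are hypothetical, and only the field name "vector" comes from the thread.

```java
final class PrefixFilter {
    /**
     * Builds a filter query string that asks Solr to parse the value with
     * the "prefix" QParser instead of the default Lucene query parser.
     * Assumption: field and prefix contain no '}' or whitespace.
     */
    static String prefixFilter(String field, String prefix) {
        return "{!prefix f=" + field + "}" + prefix;
    }

    public static void main(String[] args) {
        // The resulting string could be passed as the "fq" parameter,
        // e.g. parameters.add("fq", prefixFilter("vector", "Ox_"));
        System.out.println(prefixFilter("vector", "Ox_"));
    }
}
```

With this parser the value after the closing brace is taken as a raw prefix, so it should not need the query-parser escaping discussed elsewhere in the thread.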
Re: Excluding characters from a wildcard query
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Jul 1, 2009 at 5:07 PM, Ben <be...@autonomic.net> wrote:
> my brain was switched off. I'm using SOLRJ, which means I'll need to
> specify multiple :
>
> addMultipleFields(solrDoc, "vector", "vectorvalue", 1.0f);
>
> for each value to be added to the multiValuedField.
>
> Then, with luck, the simple wildcard query will be executed over each
> individual value when looking for matches, meaning the simple query syntax
> can be made adequate to do what's needed.
>
>
I'm not sure if you can do prefix queries with the fq parameter. You will
need to use the 'q' parameter for that.
You may also want to look at the regex query support in lucene (contrib
package). I don't think that is supported out of the box in Solr yet.
--
Regards,
Shalin Shekhar Mangar.
Re: Excluding characters from a wildcard query
Posted by Ben <be...@autonomic.net>.
my brain was switched off. I'm using SOLRJ, which means I'll need to
specify multiple :
addMultipleFields(solrDoc, "vector", "vectorvalue", 1.0f);
for each value to be added to the multiValuedField.
Then, with luck, the simple wildcard query will be executed over each
individual value when looking for matches, meaning the simple query
syntax can be made adequate to do what's needed.
Many thanks Uwe.
B
Re: Excluding characters from a wildcard query
Posted by Uwe Klosa <uw...@gmail.com>.
2009/7/1 Ben <be...@autonomic.net>
> I'm not quite sure I understand exactly what you mean.
> The string I'm processing could have many tens of thousands of values... I
> hope you aren't implying I'd need to split it into many tens of thousands of
> "columns".
No, that is not what I meant. It will be one field (column) with tens of
thousands of values.
>
>
> If you're saying what I think you're saying, you're saying that I should
> leave whitespaces between the individual parts of the string, pass in the
> string into a "multiValued" field and have SOLR internally treat each "word"
> as an individual entity?
> Thanks for your help with this...
I said nothing about whitespaces. I don't know how you update your solr
documents. Are you using XML or Solrj?
Uwe
Re: Excluding characters from a wildcard query
Posted by Ben <be...@autonomic.net>.
I'm not quite sure I understand exactly what you mean.
The string I'm processing could have many tens of thousands of values...
I hope you aren't implying I'd need to split it into many tens of
thousands of "columns".
If you're saying what I think you're saying, you're saying that I should
leave whitespaces between the individual parts of the string, pass in
the string into a "multiValued" field and have SOLR internally treat
each "word" as an individual entity?
Thanks for your help with this...
Ben
Re: Excluding characters from a wildcard query
Posted by Uwe Klosa <uw...@gmail.com>.
To get the desired effect I described, you have to do the split before you
send the document to Solr. I'm not aware of an analyzer that can split one
field value into several field values. The analyzers and tokenizers do
create tokens from field values in many different ways.
As I see it you have to do some preprocessing yourself.
Uwe
Re: Excluding characters from a wildcard query
Posted by Ben <be...@autonomic.net>.
Is there a way in the Schema to specify that the comma should be used to
split the values up?
e.g. Can I specify my "vector" field as multivalue and also specify some
sort of tokeniser to automatically split on commas?
Ben
Re: Excluding characters from a wildcard query
Posted by Uwe Klosa <uw...@gmail.com>.
You should split the strings at the comma yourself and store the values in a
multivalued field. Then wildcard searches like A1_* are not a problem. I don't
know much about facets, but if they work on multivalued fields, that
should be no problem at all.
Uwe
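Editorial sketch of the client-side split suggested above, in Java because the thread uses SolrJ. The class and method names (VectorSplitter, splitVectors) are hypothetical; the sample string and the field name "vector" come from the thread.

```java
import java.util.ArrayList;
import java.util.List;

final class VectorSplitter {
    /** Splits a comma-separated string of vectors into one value per vector. */
    static List<String> splitVectors(String raw) {
        List<String> values = new ArrayList<>();
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            // Skip empty pieces, e.g. from an empty input or a doubled comma.
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        for (String v : splitVectors("A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3")) {
            // Assumption: with SolrJ each value would be added on its own,
            // e.g. doc.addField("vector", v) on a SolrInputDocument.
            System.out.println(v);
        }
    }
}
```

Each element of the returned list would then become one value of the multivalued field, so a simple wildcard or prefix query is evaluated against individual vectors rather than the whole comma-joined string.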
Re: Excluding characters from a wildcard query
Posted by Ben <be...@autonomic.net>.
Yes, I had done that... however, I'm beginning to see now that what I am
doing is called a "wildcard query" which is going via Lucene's queryparser.
Lucene's query parser does not support the regexp idea of character
exclusion ... i.e. I'm not trying to match "[" I'm trying to express
"Match as many characters as possible, which are not underscores" with [^_]*
Perhaps I'm going about my whole problem in an ineffective way, but I'm
not sure how I can sensibly describe what I'm doing without it becoming
a long document.
The only other approach I can think of is to change what I'm indexing
but I'm not sure how to achieve that.
I've tried explaining it once, and obviously failed, so I'll try again.
I'm given a string containing many vectors (where each dimension is
separated by an underscore, and each vector is separated by a comma) e.g.
A1_B1_C1_D1,A2_B2_C2_D2,A3_B3_C3_D3
I want my facet query to tell me if, within one of the vectors within
that string, there is a match for dimensions I'm interested in. Of the
four dimensions in this example, I may choose to fix an arbitrary number
of them with values, and the rest with wildcards e.g. I might look for a
facet containing Ox_*_*_* so one of the vectors in the string must have
its first dimension matching "Ox" and I don't care about the rest.
***Is there a way to break down this string on the commas so that I can
apply a normal wildcard query and SOLR applies it to each
individually?*** That would solve all my problems :
e.g.
The string is internally represented in lucene/solr as
A1_B1_C1_D1
A2_B2_C2_D2
A3_B3_C3_D3
where it tries to match the wildcard query on each in turn?
Thanks for your help, I'm deeply confused about this at the moment...
Ben
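Editorial sketch of the matching behaviour described above: a pattern such as Ox_*_*_* is checked against each vector in turn, with every * confined to a single dimension (the [^_]* idea). This is plain client-side Java for illustration only, not something Solr or Lucene does internally; the names DimensionWildcard, compile and anyVectorMatches are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class DimensionWildcard {
    /** Turns "A2_*_*_*" into a regex in which each * cannot cross an underscore. */
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        String[] dims = wildcard.split("_", -1);
        for (int i = 0; i < dims.length; i++) {
            if (i > 0) {
                regex.append('_');
            }
            // A lone * means "any value for this dimension"; anything else is literal.
            regex.append(dims[i].equals("*") ? "[^_]*" : Pattern.quote(dims[i]));
        }
        return Pattern.compile(regex.toString());
    }

    /** True if at least one vector matches the pattern in full. */
    static boolean anyVectorMatches(String wildcard, List<String> vectors) {
        Pattern p = compile(wildcard);
        for (String v : vectors) {
            if (p.matcher(v).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> vectors = Arrays.asList("A1_B1_C1_D1", "A2_B2_C2_D2", "A3_B3_C3_D3");
        System.out.println(anyVectorMatches("A2_*_*_*", vectors));
        System.out.println(anyVectorMatches("Ox_*_*_*", vectors));
    }
}
```

Note one difference from a Lucene wildcard query: there a * can also match underscores, so a pattern like A1_* would match a whole four-dimension vector, whereas this sketch requires one pattern element per dimension.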
Re: Excluding characters from a wildcard query
Posted by Uwe Klosa <uw...@gmail.com>.
You have to escape all special characters. Even [ to \[
Have a look here http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
Uwe
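Editorial sketch of the escaping suggested above. It is a hand-rolled approximation based on the special characters listed on the query parser syntax page; the class name QueryEscaper and the exact character set are assumptions, and Lucene's own QueryParser.escape or SolrJ's ClientUtils.escapeQueryChars would normally be preferable where available.

```java
final class QueryEscaper {
    // Characters the Lucene query parser treats specially (assumed set,
    // taken from the syntax documentation: + - && || ! ( ) { } [ ] ^ " ~ * ? : \).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    /** Prefixes every special character with a backslash. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the pattern from the thread with [ ^ ] and * all escaped.
        System.out.println(escape("[^_]*_[^_]*"));
    }
}
```

Two caveats follow from the thread. Escaping makes [, ^, ] and * literal characters, so the escaped string no longer expresses character exclusion or a wildcard. Also, in the escaped attempt quoted in the thread the opening [ appears to be left unescaped, which may be why the parser still reports that it was expecting "TO".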
Re: Excluding characters from a wildcard query
Posted by Ben <be...@autonomic.net>.
I only just noticed that this is an exception being thrown by the
lucene.queryParser. Should I be mailing on the lucene list, or is it ok
here?
I'm beginning to wonder if the "fq" can handle the type of character
exclusion I'm trying in the RegExp.
Escaping the string also doesn't work :
Cannot parse 'vector:_\*[\^_\]\*_[\^_\]\*_[\^_\]\*': Encountered "]" at
line 1, column 15.
Was expecting one of:
"TO" ...
<RANGEIN_QUOTED> ...
<RANGEIN_GOOP> ...
Re: Excluding characters from a wildcard query - More Info - Is this difficult, or am I being ignored because it's too obvious to merit an answer?
Posted by Ben <be...@autonomic.net>.
Ben wrote:
> The exception SOLR raises is :
>
> org.apache.lucene.queryParser.ParseException: Cannot parse
> 'vector:_*[^_]*_[^_]*_[^_]*': Encountered "]" at line 1, column 12.
> Was expecting one of:
> "TO" ...
> <RANGEIN_QUOTED> ...
> <RANGEIN_GOOP> ...
>
> Ben wrote:
>> Passing in a RegularExpression like "[^_]*_[^_]*" (e.g. matching
>> anything with an underscore in the string) using some code like :
>>
>> ...
>> parameters.add("fq", "vector:[^_]*_[^_]*");
>> ...
>>
>> seems to cause problems for SOLR, I assume because of the [ or ^
>> character.
>>
>> Can somebody please advise how to handle character exclusion in such
>> searches?
>>
>> Any help or pointers are much appreciated!
>>
>> Thanks
>>
>> Ben
>
Re: Excluding characters from a wildcard query - More Info
Posted by Ben <be...@autonomic.net>.
The exception SOLR raises is :
org.apache.lucene.queryParser.ParseException: Cannot parse
'vector:_*[^_]*_[^_]*_[^_]*': Encountered "]" at line 1, column 12.
Was expecting one of:
"TO" ...
<RANGEIN_QUOTED> ...
<RANGEIN_GOOP> ...