Posted to solr-user@lucene.apache.org by "kobe.free.world@gmail.com" <ko...@gmail.com> on 2013/05/17 13:42:43 UTC

Searching for terms having embedded white spaces like "word1 word2"

Hi Guys,

I have a field defined with the following custom data type,

<fieldType name="cust_str" class="solr.TextField" positionIncrementGap="100"
           sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This field has values like "SAN MIGUEL", "SAN JUAN", "SAN DIEGO", etc. I want
to perform "Starts With" and "Contains" searches on these values, and I
query Solr as follows:

-Starts With: field:SAN M*
-Contains: field:*SAN M*

However, Solr does not return the correct results because of the white space.
What changes do I need to make so that these searches work for values with
embedded white space?



--
View this message in context: http://lucene.472066.n3.nabble.com/Searching-for-terms-having-embedded-white-spaces-like-word1-word2-tp4064170.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Searching for terms having embedded white spaces like "word1 word2"

Posted by Jack Krupansky <ja...@basetechnology.com>.
Ideally, such a text search should be done using tokenized text and span
queries. You might be able to do it with the "surround" query parser, and you
should be able to do it with the LucidWorks Search query parser:

"this is" BEFORE:1 ("good" OR "excellent")

But since you are using the keyword tokenizer, which keeps the embedded white
space, you should be able to write a Lucene regex query that matches the same
thing against the raw text, something like [untested!]:

/this\\s+is\\s+(\\w+\\s+)?(good|excellent)/

That would be "contains".

Starts with:

/^this\\s+is\\s+(\\w+\\s+)?(good|excellent)/

Ends with:

/this\\s+is\\s+(\\w+\\s+)?(good|excellent)$/

Exact match:

/^this\\s+is\\s+(\\w+\\s+)?(good|excellent)$/

Caveat:
BUT... such character-level regex matching is NOT guaranteed to be speedy 
and really should only be used for relatively small datasets.
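
One thing worth checking before relying on the anchors above: as far as I know, Lucene matches a regex query against the whole indexed term and does not treat ^ and $ as anchors, so the four modes may need to be written with leading and trailing .* instead. The Python sketch below only illustrates that idea. It uses re.fullmatch as a stand-in for whole-term matching, and the names CORE, PATTERNS and matches, as well as the third sample term, are made up for the example. Please verify the actual regex syntax against your Lucene version.

```python
import re

# Core phrase: "this is", an optional single word, then "good" or "excellent".
# [a-z]+ is used instead of \w+ to avoid relying on shorthand character classes.
CORE = r"this is ([a-z]+ )?(good|excellent)"

# Assumption: the engine matches the pattern against the WHOLE term,
# so "contains" and friends are expressed with explicit .* padding.
PATTERNS = {
    "contains": f".*{CORE}.*",
    "starts_with": f"{CORE}.*",
    "ends_with": f".*{CORE}",
    "exact": CORE,
}


def matches(mode: str, term: str) -> bool:
    """Return True if the lowercased term matches the pattern for this mode."""
    # re.fullmatch emulates whole-term matching (an assumption about Lucene).
    return re.fullmatch(PATTERNS[mode], term) is not None


# The first two terms come from the thread (lowercased); the third is hypothetical.
terms = ["this is good", "this is also good", "well, this is excellent indeed"]

for term in terms:
    hits = [mode for mode in PATTERNS if matches(mode, term)]
    print(f"{term!r}: {hits}")
```

With these patterns the first two terms match in all four modes, while the third matches only "contains".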

-- Jack Krupansky

-----Original Message----- 
From: kobe.free.world@gmail.com
Sent: Saturday, May 18, 2013 6:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Searching for terms having embedded white spaces like "word1 
word2"

Thank you so very much, Jack, for your prompt reply. Your solution worked for
us.

I have another issue with querying fields that have values of the sort
<string>This is good</string><string>This is also good</string><string>This
is excellent</string>. I want to perform "StartsWith" as well as "Contains"
searches on this field. The field definition is as follows:

<fieldType name="cust_str" class="solr.TextField"
           positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Please suggest how to perform the above-mentioned searches.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Searching-for-terms-having-embedded-white-spaces-like-word1-word2-tp4064170p4064355.html
Sent from the Solr - User mailing list archive at Nabble.com. 


Re: Searching for terms having embedded white spaces like "word1 word2"

Posted by Jack Krupansky <ja...@basetechnology.com>.
Is this really a text field where you want to search for tokenized keywords?
Or is it a string field where you strictly want to deal with equality of the
entire string or explicit wildcards for substring matches, as you've shown?
You haven't told us your full requirements for this field.

The standard tokenizer breaks the input into individual tokens or keywords. 
Yes, you can use wildcards on those tokens, but only on one token at a time, 
not two as you have shown.
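
To make the one-token-at-a-time point concrete, here is a rough Python sketch. It does not call Solr. The two analyze functions are crude stand-ins for the standard and keyword tokenizers followed by lowercasing, fnmatch plays the role of the wildcard matcher, and all function names are invented for the illustration.

```python
from fnmatch import fnmatchcase

# Sample field values taken from the original question.
values = ["SAN MIGUEL", "SAN JUAN", "SAN DIEGO"]


def standard_like_terms(value: str) -> list[str]:
    # Crude stand-in for StandardTokenizer + LowerCaseFilter:
    # the value is split into separate terms at white space.
    return value.lower().split()


def keyword_like_terms(value: str) -> list[str]:
    # Stand-in for KeywordTokenizer + LowerCaseFilter:
    # the whole value is kept as a single term, spaces included.
    return [value.lower()]


def wildcard_hits(pattern: str, analyze) -> list[str]:
    # A wildcard pattern is compared against each indexed term on its own,
    # never against two neighbouring terms at once.
    return [v for v in values if any(fnmatchcase(t, pattern) for t in analyze(v))]


print(wildcard_hits("san m*", standard_like_terms))  # no single term contains a space
print(wildcard_hits("san m*", keyword_like_terms))   # the whole value is one term
print(wildcard_hits("m*", standard_like_terms))      # wildcard on one token works
```

In this toy model the pattern with the embedded space finds nothing among the split terms, but it does find "SAN MIGUEL" once the value is kept as one term.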

You may want to consider two fields, such as cust and cust_str. The former
would be tokenized, for example with the standard tokenizer, and would allow
keyword search, while the latter would be a single string or a single token.
Either make the latter a true string type, or use a TextField with the keyword
tokenizer, which preserves whitespace and special characters. You probably
shouldn't use the stop filter for the second field.

You'll have to explicitly escape the spaces in your queries using a
backslash. You can't enclose the query in quotes, since that would disable
the wildcard.
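
If the queries are built in client code, the escaping can be done with a small helper. This is only a sketch in Python: the function names are hypothetical, it escapes plain spaces and nothing else, and other query-parser special characters would need their own handling.

```python
def escape_spaces(term: str) -> str:
    """Put a backslash in front of every space so the term stays in one piece."""
    return term.replace(" ", "\\ ")


def starts_with(field: str, prefix: str) -> str:
    """Build a 'Starts With' wildcard query such as field:SAN\\ M*."""
    return f"{field}:{escape_spaces(prefix)}*"


def contains(field: str, fragment: str) -> str:
    """Build a 'Contains' wildcard query with a leading and trailing wildcard."""
    return f"{field}:*{escape_spaces(fragment)}*"


print(starts_with("field", "SAN M"))  # field:SAN\ M*
print(contains("field", "SAN M"))     # field:*SAN\ M*
```

The unquoted, escaped form keeps the wildcard active while still sending "SAN M" to the parser as a single term.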

You could also use regex queries on that field:

/.*san.m.*/

-- Jack Krupansky