Posted to solr-user@lucene.apache.org by "haya.axelrod" <ha...@gmail.com> on 2013/12/22 20:11:18 UTC

Solr - Match whole word only in text fields

I have a text field that can contain very long values (like text files). I
want to create a field type for it (text, not string) that behaves
like "Match whole word only" in Notepad++, except that whitespace
should not be the only delimiter. If I have:

myName=aaa bbb

I would like it to match the search strings "aaa", "bbb", "aaa
bbb", "myName=aaa bbb" and "myName", but not "aa", "ame=a" or "a bb".
Another example is:

<myName>aaa bbb</myName>

Can I do this somehow?

What should be my field type definition?

The text can contain any character. Before searching, I escape the search
string using ClientUtils:
http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/util/ClientUtils.html

Thanks



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Match-whole-word-only-in-text-fields-tp4107795.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr - Match whole word only in text fields

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Haya,

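Yes, you are correct: "myName=aaa bbb" will produce the index terms "myName", "aaa" and "bbb" (LowerCaseFilterFactory lowercases them, so "myName" is actually stored as "myname"). You can verify this on the Analysis page of the Solr admin UI, where you can test your analyzer by entering sample text.

The quoted query "myName aaa" is parsed as a phrase query and will match with the above settings, because the two terms are adjacent in the index.
The quoted query "myName bbb" won't match, because "aaa" sits between the two terms.

For matching terms that are near each other but not adjacent, see proximity searches:
http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches

It is best to give it a try.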

Ahmet
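
The phrase behaviour described above can be illustrated with a small plain-Python sketch. This is not Solr or Lucene code: `analyze` and `phrase_match` are hypothetical helpers that only approximate the text_char_norm analyzer and a positional phrase query, and the `slop` handling is a simplified, in-order-only model of Lucene's proximity search.

```python
def analyze(text):
    """Approximate text_char_norm: map ':' and '=' to spaces, split, lowercase."""
    for ch in ":=":
        text = text.replace(ch, " ")
    return text.lower().split()


def phrase_match(document, phrase, slop=0):
    """True if the phrase terms occur in order in the document, with at most
    `slop` skipped positions in total between them (0 means strictly adjacent)."""
    doc = analyze(document)
    terms = analyze(phrase)
    if not terms:
        return False

    def rest_matches(term_idx, prev_pos, budget):
        # All phrase terms placed: the phrase matched.
        if term_idx == len(terms):
            return True
        # The next term may sit up to `budget` positions beyond the adjacent slot.
        last = min(len(doc), prev_pos + 2 + budget)
        for pos in range(prev_pos + 1, last):
            skipped = pos - prev_pos - 1
            if doc[pos] == terms[term_idx] and rest_matches(term_idx + 1, pos, budget - skipped):
                return True
        return False

    return any(
        doc[start] == terms[0] and rest_matches(1, start, slop)
        for start in range(len(doc))
    )


doc = "myName=aaa bbb"
print(phrase_match(doc, "myName aaa"))          # True: adjacent terms
print(phrase_match(doc, "myName bbb"))          # False: "aaa" sits in between
print(phrase_match(doc, "myName bbb", slop=1))  # True: one skipped position allowed
print(phrase_match(doc, "a bb"))                # False: partial words are not terms
```

The point of the sketch is that whole-word matching comes from comparing complete terms, while rejecting "myName bbb" comes from comparing term positions.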



Re: Solr - Match whole word only in text fields

Posted by Kydryavtsev Andrey <we...@yandex.ru>.
Hi everybody!

Ahmet, do I understand correctly: if I use this text_char_norm field type, then for the input "myName=aaa bbb" the indexed terms will be "myName", "aaa" and "bbb"? So I'll match a query like "myName" or "bbb", but not "myName aaa". I could apply the same type to the query value, so that "myName aaa" is split into ("myName" && "aaa"), and then it will work. But this approach gives a false positive match for "myName bbb". How do you think I can handle this?

One approach is to use KeywordTokenizer+ShingleFilter in this field type instead of WhitespaceTokenizerFactory, so that tokens like "myName", "myName aaa", "myName aaa bbb", "aaa", "aaa bbb" and "bbb" are indexed, but that significantly increases the index size for long values.
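
To get a feel for the index growth mentioned above, here is a rough plain-Python sketch of shingling (word n-grams). It is an approximation, not Solr code: `analyze` and `shingles` are hypothetical helpers. It also assumes the text is split on whitespace before shingling, because KeywordTokenizer emits the whole value as a single token and would leave a shingle filter nothing to combine.

```python
def analyze(text):
    """Approximate text_char_norm: map ':' and '=' to spaces, split, lowercase."""
    for ch in ":=":
        text = text.replace(ch, " ")
    return text.lower().split()


def shingles(tokens, max_size):
    """Return all word n-grams of 1..max_size consecutive tokens, space-joined."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


print(shingles(analyze("myName=aaa bbb"), 3))
# ['myname', 'myname aaa', 'myname aaa bbb', 'aaa', 'aaa bbb', 'bbb']

# Term count for a 1000-word value at different maximum shingle sizes.
long_value = ["w%d" % i for i in range(1000)]
for max_size in (1, 3, 5):
    print(max_size, len(shingles(long_value, max_size)))
# 1 1000
# 3 2997
# 5 4990
```

With a bounded maximum shingle size the term count grows roughly linearly with that size, but a phrase longer than the maximum can no longer be matched as a single term.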


Re: Solr - Match whole word only in text fields

Posted by Ahmet Arslan <io...@yahoo.com>.
Hi Haya,

With MappingCharFilter you have full control over the set of characters you want to split on.

In mappings.txt you would have:

":" => " "
"=" => " "

Use the following field type and see if it suits your needs. Update mappings.txt as required.

    <fieldType name="text_char_norm" class="solr.TextField" positionIncrementGap="100" >
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
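
As a rough illustration of what this analysis chain does to the examples from the question, here is a plain-Python approximation. It is not Solr code: `analyze` and `matches_term` are hypothetical helpers, and the entries for "<", ">" and "/" in `MAPPINGS` are assumed additions to mappings.txt that the XML-style example would need.

```python
# Assumed contents of mappings.txt: the two rules shown above plus
# hypothetical rules for "<", ">" and "/" to cover the XML-style example.
MAPPINGS = {":": " ", "=": " ", "<": " ", ">": " ", "/": " "}


def analyze(text, mappings=MAPPINGS):
    """Return the list of index terms this chain would roughly produce."""
    # MappingCharFilter step: rewrite characters before tokenizing.
    filtered = "".join(mappings.get(ch, ch) for ch in text)
    # WhitespaceTokenizer step: split on runs of whitespace.
    tokens = filtered.split()
    # LowerCaseFilter step.
    return [tok.lower() for tok in tokens]


def matches_term(document, word):
    """True if `word` is a complete index term of `document`."""
    return word.lower() in analyze(document)


print(analyze("myName=aaa bbb"))            # ['myname', 'aaa', 'bbb']
print(analyze("<myName>aaa bbb</myName>"))  # ['myname', 'aaa', 'bbb', 'myname']
print(matches_term("myName=aaa bbb", "aaa"))  # True
print(matches_term("myName=aaa bbb", "aa"))   # False: "aa" is not a whole term
```

Any character that should act as a delimiter has to appear in the mapping; everything else stays part of the surrounding word.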




