Posted to solr-user@lucene.apache.org by "Van Tassell, Kristian" <kr...@siemens.com> on 2013/09/30 19:50:57 UTC

Searching on (hyphenated/capitalized) word issue

I have a search term "multi-CAD" being issued against tokenized text. The problem is that a search for "multicad" returns no results; you only get hits if you add the hyphen ("multi-cad") or type "multiCAD" (omitting the hyphen, but spelling it with the correct capitalization).



However, for the similar but unhyphenated word AutoCAD, you can type "autocad" and get hits for AutoCAD, as you would expect. You can type "auto-cad" and get the same results.

The query seems to get parsed as separate words (resulting in hits) for multi-CAD, multiCAD, autocad, auto-cad and AUTOCAD, but not for multicad. In other words, the search terms become "multi cad" and "auto cad" in every case except when the term is "multicad".

I'm guessing this may be due in part to "auto" being a more common word prefix, but I may be wrong. Can anyone provide some clarity (and maybe point me towards a potential solution)?

Thanks in advance!


Kristian Van Tassell
Siemens Industry Sector
Siemens Product Lifecycle Management Software Inc.
5939 Rice Creek Parkway
Shoreview, MN  55126 United States
Tel.      :+1 (651) 855-6194
Fax      :+1 (651) 855-6280
kristian.vantassell@siemens.com
www.siemens.com/plm

Re: Searching on (hyphenated/capitalized) word issue

Posted by Upayavira <uv...@odoko.co.uk>.

It depends whether multicad is a special case, or whether you want micr
to match the term "microsoft".

If it is a special case, you can use synonyms, so that multi and
multicad are considered the same term.
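
A minimal sketch of the synonym approach, assuming the special case is handled at index time only. The synonyms.txt entry shown in the comment is hypothetical (it is not taken from the poster's file), and multi-word synonyms tend to behave more predictably at index time than at query time:

```xml
<!-- Hypothetical line in synonyms.txt (an assumption, not the poster's file):
       multicad, multi cad
     With expand="true", a document containing either form is indexed under both. -->
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Existing documents would need to be reindexed before a new synonym entry has any effect on them.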

If it isn't a special case, then ngrams could work - your document would
be indexed with:

mul
mult
multi
multic
multica
multicad

all indexed at the same term position, allowing for any of those to
match. Of course, that will make your index much larger.
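
The list above corresponds to edge n-grams (prefixes) of the token. A sketch of a field type that could produce them, assuming a hypothetical name text_prefix and gram sizes chosen only to mirror the list (3 to 8 characters):

```xml
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emits mul, mult, multi, ... up to 8 characters for each token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="8"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-gram filter here, so the query term is matched as typed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Tokens longer than maxGramSize are indexed only by their prefixes, so in practice maxGramSize would probably be set to cover the longest term you expect to search for.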

As Erick says, use the admin/analysis page to play with your analysis
chains and see what they do to different inputs.

Upayavira

On Wed, Oct 9, 2013, at 09:30 PM, Erick Erickson wrote:
> The admin/analysis page is definitely your friend. On the
> surface, [catenateWords="1"] in WordDelimiterFilterFactory (WDFF)
> should mash the split-up bits of multiCAD into multicad, and you
> should be fine.
> 
> I suspect that StandardTokenizerFactory is somehow getting
> into the mix here. Under any circumstance, the admin/analysis
> page should help.
> 
> StandardTokenizerFactory, on a quick test, does split up
> multi-cad into separate tokens that then do NOT get
> concatenated...
> 
> That doesn't explain not getting hits on multiCAD though when
> you search for multicad.
> 
> Best,
> Erick

Re: Searching on (hyphenated/capitalized) word issue

Posted by Erick Erickson <er...@gmail.com>.

The admin/analysis page is definitely your friend. On the
surface, [catenateWords="1"] in WordDelimiterFilterFactory (WDFF)
should mash the split-up bits of multiCAD into multicad, and you
should be fine.

I suspect that StandardTokenizerFactory is somehow getting
into the mix here. Under any circumstance, the admin/analysis
page should help.

StandardTokenizerFactory, on a quick test, does split up
multi-cad into separate tokens that then do NOT get
concatenated...

That doesn't explain not getting hits on multiCAD though when
you search for multicad.
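
One way to test that suspicion, as a sketch rather than a confirmed fix: swap in WhitespaceTokenizerFactory so that "Multi-CAD" reaches WordDelimiterFilterFactory as a single token, which lets catenateWords="1" produce the joined form. The filter settings are copied from the configuration posted in this thread; the tokenizer swap is the only change, and whether it suits the rest of the data is an assumption (punctuation attached to words would then be handled by WDFF rather than by the tokenizer):

```xml
<analyzer>
  <!-- keeps "Multi-CAD" as one token so WDFF can split and catenate it -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Running "Multi-CAD", "multiCAD" and "multicad" through the admin/analysis page with this chain should show whether all three end up sharing a "multicad" token.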

Best,
Erick


On Wed, Oct 9, 2013 at 10:45 AM, Furkan KAMACI <fu...@gmail.com> wrote:
> If you index the word "multicad" and want it returned when you search for
> "multi", you can use an n-gram filter. However, you should weigh the pros
> and cons: with n-grams a search for "multi" can find "multicad", but your
> index will be much bigger.
>
> I suggest you look here:
> http://docs.lucidworks.com/display/solr/Tokenizers

Re: Searching on (hyphenated/capitalized) word issue

Posted by Furkan KAMACI <fu...@gmail.com>.

If you index the word "multicad" and want it returned when you search for
"multi", you can use an n-gram filter. However, you should weigh the pros
and cons: with n-grams a search for "multi" can find "multicad", but your
index will be much bigger.

I suggest you look here:
http://docs.lucidworks.com/display/solr/Tokenizers



2013/10/9 Van Tassell, Kristian <kr...@siemens.com>

> Thank you Upayavira.
>
> I'm trying to figure out what will make Solr stem on "multi" in the word
> "multicad" so that any attempt to search on "multicad", "Multi-CAD" or
> "multiCAD" will return results. The WordDelimiterFilterFactory helps with
> the case of multi followed by a dash or a capital letter, but I'm not sure
> how to get Solr to tokenize the word "multi". Should I look at ngram
> configurations? Or is there a filter which promotes (rather than protects)
> words from being stemmed? (in other words, I could configure in a txt file
> that "multi" should be stemmed.)
>
> Just to reiterate, I am not getting any results when I search for the word
> "multicad", even though it appears many times in the text as "multiCAD" and
> "Multi-CAD".
>
> Here is my configuration:
>
> <analyzer>
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>           words="stopwords_en.txt" enablePositionIncrements="true"/>
>   <filter class="solr.SynonymFilterFactory"
>           synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>   <filter class="solr.WordDelimiterFilterFactory"
>           generateWordParts="1" generateNumberParts="1" catenateWords="1"
>           catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>   <filter class="solr.LowerCaseFilterFactory"/>
>   <filter class="solr.SnowballPorterFilterFactory"
>           language="English" protected="protwords.txt"/>
> </analyzer>

RE: Searching on (hyphenated/capitalized) word issue

Posted by "Van Tassell, Kristian" <kr...@siemens.com>.

Thank you Upayavira.

I'm trying to figure out what will make Solr stem on "multi" in the word "multicad" so that any attempt to search on "multicad", "Multi-CAD" or "multiCAD" will return results. The WordDelimiterFilterFactory helps when "multi" is followed by a dash or a capital letter, but I'm not sure how to get Solr to split off "multi" as its own token otherwise. Should I look at ngram configurations? Or is there a filter which promotes (rather than protects) words for stemming? (In other words, one where I could configure in a txt file that "multi" should be stemmed.)

Just to reiterate, I am not getting any results when I search for the word "multicad", even though it appears many times in the text as "multiCAD" and "Multi-CAD".

Here is my configuration:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords_en.txt" enablePositionIncrements="true"/>
  <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory"
          language="English" protected="protwords.txt"/>
</analyzer>

-----Original Message-----
From: Upayavira [mailto:uv@odoko.co.uk] 
Sent: Monday, September 30, 2013 1:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Searching on (hyphenated/capitalized) word issue

You need to look at your analysis chain. The stuff you're talking about there is all configurable.

There are different tokenisers available to split your fields differently; then you might use the WordDelimiterFilterFactory to split existing tokens further (e.g. WiFi might become "wi", "fi" and "WiFi"). So really, you need to craft your own analysis chain to fit the kind of data you are working with.

Upayavira


Re: Searching on (hyphenated/capitalized) word issue

Posted by Upayavira <uv...@odoko.co.uk>.

You need to look at your analysis chain. The stuff you're talking about
there is all configurable.

There are different tokenisers available to split your fields differently;
then you might use the WordDelimiterFilterFactory to split existing
tokens further (e.g. WiFi might become "wi", "fi" and "WiFi"). So
really, you need to craft your own analysis chain to fit the kind of
data you are working with.
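
As a rough illustration of the WiFi example, the filter could be configured along these lines. This is a sketch; the attribute values are assumptions chosen to produce the split parts plus the original token, not settings taken from anyone's schema:

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- splitOnCaseChange splits "WiFi" into "Wi" and "Fi";
       preserveOriginal also keeps the unsplit "WiFi" -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" splitOnCaseChange="1" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Because of the trailing LowerCaseFilterFactory, the tokens that end up in the index with this chain would all be lowercase.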

Upayavira
