Posted to solr-user@lucene.apache.org by realw5 <dr...@improvementdirect.com> on 2007/05/31 00:01:59 UTC

SOLR Indexing/Querying

Hey Guys,
I need some guidance regarding a problem we are having with our Solr
index. Below is a list of terms our customers search for that are failing
or not returning the complete set. The second half of each entry is the
product id/keyword we want it to match.

Can you give me some direction on how this can be done with index/query
analyzers (or let me know if it can't be done)? Any help is much appreciated!

Dan

---------------------------

Keyword Typed In / We want it to find

D3555 / 3555LHP
D460160-BN / D460160
D460160BN / D460160
Dd454557 / D454557
84200ORB / 84200
84200-ORB / 84200
T13420-SCH / T13420
t14240-ss / t14240
-- 
View this message in context: http://www.nabble.com/SOLR-Indexing-Querying-tf3843221.html#a10883456
Sent from the Solr - User mailing list archive at Nabble.com.

Re: AW: SOLR Indexing/Querying

Posted by Walter Underwood <wu...@netflix.com>.

I solved something similar to this by creating a "stemmer" for part
numbers. Variations like "-BN" on the end can be treated as inflections
in the part number language, similar to plurals in English.

I used a set of regexes to match and transform, in some cases generating
multiple "root" part numbers. With the per-field analyzers in Solr, this
would work much better.
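
A minimal sketch of the regex "stemmer" idea, written in Python rather than as a Solr analyzer. The rule list, the finish codes (BN, ORB, SCH, SS, LHP, guessed from the sample list in this thread) and the names part_number_roots and same_part are assumptions for illustration, not the original Ultraseek implementation.

```python
import re

# Hypothetical rules: each one strips one kind of part-number "inflection".
RULES = [
    # Trailing finish code, with or without a hyphen: D460160-BN, 84200ORB
    (re.compile(r"^([A-Z]?\d+)-?(?:BN|ORB|SCH|SS|LHP)$"), r"\1"),
    # Doubled leading letter: DD454557 -> D454557
    (re.compile(r"^([A-Z])\1(\d+)$"), r"\1\2"),
    # Optional leading letter: D3555 -> 3555
    (re.compile(r"^[A-Z](\d+)$"), r"\1"),
]

def part_number_roots(raw):
    """Return the input (uppercased) plus every root produced by a matching rule."""
    text = raw.strip().upper()
    roots = {text}
    for pattern, replacement in RULES:
        if pattern.match(text):
            roots.add(pattern.sub(replacement, text))
    return roots

def same_part(query, product):
    """Two part numbers match if they share at least one root."""
    return bool(part_number_roots(query) & part_number_roots(product))

print(part_number_roots("D460160-BN"))   # contains 'D460160-BN' and 'D460160'
print(same_part("D3555", "3555LHP"))     # True: both reduce to the root '3555'
```

Applying the same function at index time and at query time is what makes the roots line up; a rule that generates more than one root simply adds more chances to match, at the cost of some precision.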

I'll make another search for the presentation that covers this. It was
at our Ultraseek Users Group Meeting in 1999.

wunder

On 5/31/07 11:46 AM, "Chris Hostetter" <ho...@fucit.org> wrote:

> 
> : It looks a lot like using Solr's standard "WordDelimiterFilter" (see the
> : sample schema.xml) does what you need.
> 
> WordDelimiterFilter will only get you so far.  It can split the indexed
> text of "3555LHP" into tokens "3555" and "LHP"; and the user entered
> "D3555" into the tokens "D" and "3555" -- but because those tokens
> originated as part of a single chunk of input text, the QueryParser will
> turn them into a phrase query, which will not match on the single token
> "3555" ... the "D" just isn't there.
> 
> I can't think of any way to achieve what you want "out of the box". I think
> you'd need a custom RequestHandler that uses your own query parser which
> uses boolean queries instead of PhraseQueries.

Re: AW: SOLR Indexing/Querying

Posted by Chris Hostetter <ho...@fucit.org>.

: It looks a lot like using Solr's standard "WordDelimiterFilter" (see the
: sample schema.xml) does what you need.

WordDelimiterFilter will only get you so far.  It can split the indexed
text of "3555LHP" into tokens "3555" and "LHP"; and the user entered
"D3555" into the tokens "D" and "3555" -- but because those tokens
originated as part of a single chunk of input text, the QueryParser will
turn them into a phrase query, which will not match on the single token
"3555" ... the "D" just isn't there.

I can't think of any way to achieve what you want "out of the box". I think
you'd need a custom RequestHandler that uses your own query parser which
uses boolean queries instead of PhraseQueries.
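
The difference between the two query types can be illustrated with a small Python sketch. It does not use Lucene at all: split_tokens is only a rough stand-in for WordDelimiterFilter (with lowercasing folded in for brevity), and phrase_match and boolean_or_match are hypothetical helpers that mimic what a phrase query and a boolean OR query require of the indexed tokens.

```python
import re

def split_tokens(text):
    """Rough stand-in for WordDelimiterFilter: lowercase runs of letters or digits."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower())

def phrase_match(query_tokens, doc_tokens):
    """True only if the query tokens occur in the document consecutively, in order."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

def boolean_or_match(query_tokens, doc_tokens):
    """True if at least one query token occurs anywhere in the document."""
    return bool(set(query_tokens) & set(doc_tokens))

indexed = split_tokens("3555LHP")   # ['3555', 'lhp']
query = split_tokens("D3555")       # ['d', '3555']

print(phrase_match(query, indexed))      # False: no 'd' precedes '3555' in the index
print(boolean_or_match(query, indexed))  # True: the shared token '3555' is enough
```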






-Hoss

AW: SOLR Indexing/Querying

Posted by "Burkamp, Christian" <C....@Ceyoniq.com>.

Hi there,

It looks a lot like using Solr's standard "WordDelimiterFilter" (see the sample schema.xml) does what you need.
It splits on alphabetic-to-numeric boundaries and on the various kinds of intra-word delimiters like "-", "_" or ".". You can decide whether the parts are also joined back together in addition to the split tokens; control this with the parameters "catenateWords", "catenateNumbers" and "catenateAll".
Good documentation on this topic can be found on the wiki:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
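
To make the splitting and catenation behaviour concrete, here is a rough Python approximation. It is not the Solr filter itself: the function name word_delimiter and its keyword arguments are made up to mirror the parameter names above, and details of the real filter such as token positions and case-change splitting are ignored.

```python
import re
from itertools import groupby

def word_delimiter(text, catenate_words=False, catenate_numbers=False,
                   catenate_all=False):
    """Split on letter/digit boundaries and delimiters, optionally re-joining parts."""
    # Any character that is not a letter or digit ("-", "_", ".") acts as a delimiter.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", text)
    tokens = list(parts)
    # Join adjacent parts of the same kind (all letters or all digits).
    for is_digit, run in groupby(parts, key=str.isdigit):
        run = list(run)
        wanted = catenate_numbers if is_digit else catenate_words
        if wanted and len(run) > 1:
            tokens.append("".join(run))
    # Join every part into one token, unless that token was already produced.
    if catenate_all and len(parts) > 1:
        joined = "".join(parts)
        if joined not in tokens:
            tokens.append(joined)
    return tokens

print(word_delimiter("T13420-SCH"))                     # ['T', '13420', 'SCH']
print(word_delimiter("T13420-SCH", catenate_all=True))  # ['T', '13420', 'SCH', 'T13420SCH']
print(word_delimiter("wi-fi", catenate_words=True))     # ['wi', 'fi', 'wifi']
```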

-- Christian

-----Original Message-----
From: Frans Flippo [mailto:frans.flippo@gmail.com] 
Sent: Thursday, May 31, 2007 11:27
To: solr-user@lucene.apache.org
Subject: Re: SOLR Indexing/Querying

I think if you add a field that has an analyzer that creates tokens on alpha/digit/punctuation boundaries, that should go a long way. Use that both at index and search time.

For example:
* 3555LHP  becomes "3555" "LHP"
  Searching for D3555 becomes "D" OR "3555", so it matches on token "3555" from 3555LHP.

* t14240 becomes "t" "14240"
  Searching for t14240-ss  becomes "t" OR "14240" OR "ss", matching "14240" from "t14240".

Similarly for your other examples.

If this proves to be too broad, you may need to define some stricter rules, but you could use this for starters.

I think you will have to write your own analyzer, as it doesn't look like any of the analyzers available in Solr/Lucene do exactly what you need. But that's relatively straightforward. Just start with the code from one of the existing Analyzers (e.g. KeywordAnalyzer).

Good luck,
Frans


Re: SOLR Indexing/Querying

Posted by Frans Flippo <fr...@gmail.com>.

I think if you add a field that has an analyzer that creates tokens on
alpha/digit/punctuation boundaries, that should go a long way. Use that both
at index and search time.

For example:
* 3555LHP  becomes "3555" "LHP"
  Searching for D3555 becomes "D" OR "3555", so it matches on token "3555"
from 3555LHP.

* t14240 becomes "t" "14240"
  Searching for t14240-ss  becomes "t" OR "14240" OR "ss", matching "14240"
from "t14240".

Similarly for your other examples.

If this proves to be too broad, you may need to define some stricter rules,
but you could use this for starters.
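
A small Python sketch of this OR-style matching, run against the product ids from the original post. The catalog list and the names boundary_tokens and search are assumptions for illustration; a real setup would do this with a Solr analyzer applied at both index and query time.

```python
import re

# Product ids taken from the "We want it to find" column of the original post.
CATALOG = ["3555LHP", "D460160", "D454557", "84200", "T13420", "t14240"]

def boundary_tokens(text):
    """Lowercased runs of letters or digits; punctuation only separates tokens."""
    return set(re.findall(r"[a-z]+|[0-9]+", text.lower()))

def search(query, catalog=CATALOG):
    """OR search: return (product, shared-token count) pairs, best match first."""
    q = boundary_tokens(query)
    hits = [(p, len(q & boundary_tokens(p))) for p in catalog]
    hits = [(p, n) for p, n in hits if n > 0]
    return sorted(hits, key=lambda hit: -hit[1])

print(search("t14240-ss"))  # [('t14240', 2), ('T13420', 1)]
print(search("84200-ORB"))  # [('84200', 1)]
```

The second hit for "t14240-ss" shows how broad this can get: T13420 matches only because both share the single-letter token "t". Ranking by the number of shared tokens, or ignoring one-letter tokens, are two possible ways to tighten it.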

I think you will have to write your own analyzer, as it doesn't look like
any of the analyzers available in Solr/Lucene do exactly what you need. But
that's relatively straightforward. Just start with the code from one of the
existing Analyzers (e.g. KeywordAnalyzer).

Good luck,
Frans
