Posted to java-user@lucene.apache.org by Yonik Seeley <ys...@gmail.com> on 2005/08/16 00:16:11 UTC

intra-word delimiters

Does anyone have solutions for handling intraword delimiters (case
changes, non-alphanumeric chars, and alpha-numeric transitions)?

If the source text is Wi-Fi, we want to be able to match the following
user queries:

wi fi
wifi
wi-fi
wi+fi
WiFi

One way is to index "wi", "fi", and "wifi".
However, indexing all combinations of subwords gets a bit messy when
the number of subwords gets larger.  I need to handle product names,
serial numbers, SKUs, etc.
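
A minimal sketch of the "index wi, fi, and wifi" idea, assuming ASCII text. The names split_subwords and index_terms are hypothetical helpers, not part of Lucene; the regular expression splits at non-alphanumeric characters, case changes, and letter/digit transitions.

```python
import re

# One subword is: a run of capitals not followed by a lowercase letter
# (e.g. "SD"), an optionally capitalized lowercase run (e.g. "Wi"),
# or a run of digits. Anything else acts as a delimiter. ASCII only.
_SUBWORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_subwords(token):
    """Split one token into lowercase subwords at intra-word delimiters."""
    return [m.group(0).lower() for m in _SUBWORD.finditer(token)]


def index_terms(token):
    """Return the subwords plus their catenation (hypothetical index terms)."""
    parts = split_subwords(token)
    if len(parts) < 2:
        return parts
    return parts + ["".join(parts)]


terms = index_terms("Wi-Fi")  # ['wi', 'fi', 'wifi']
for query in ["wi fi", "wifi", "wi-fi", "wi+fi", "WiFi"]:
    query_terms = [t for word in query.split() for t in split_subwords(word)]
    print(query, "->", query_terms, all(t in terms for t in query_terms))
```

With only two subwords this stays small. It says nothing yet about term positions, which is where longer names become messy.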

Another example:
Source Text contains "Canon Powershot SD500 7MP Digital Elph"

And I want to be able to match the following user queries:
Power Shot SD 500
CanonPowerShotSD500
SD 500 7 MP digitalelph
Canon-Powershot-SD 500

Any ideas?

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: intra-word delimiters

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Aug 15, 2005, at 8:53 PM, Marvin Humphrey wrote:

> Create a phrase query that, when it encounters ab => { tokenlength => 2 },
> knows to look for something at position 3.

Fencepost error!  That should have been "position 2".

Not that correcting the error makes the algo any more practical.  ;)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

---------------------------------------------------------------------

Re: intra-word delimiters

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Aug 15, 2005, at 7:47 PM, Yonik Seeley wrote:

> That was the plan, but step (4) really seems problematic.
>
> - term expansion this way can lead to a lot of false matches
> - phrase queries with many bordering words break
> - setting term positions such that phrase queries work on all combos
> of subwords is non-trivial.

Tag every term with its length in tokens.  :)

Index at these positions.

Pos0: a ab abc abcd
Pos1: b bc bcd
Pos2: c cd
Pos3: d

Create a phrase query that, when it encounters ab => { tokenlength => 2 },
knows to look for something at position 3.
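
A rough sketch of this scheme, under the assumption that the index can store a token length next to each term. The names positioned_terms and phrase_matches are hypothetical and nothing here is Lucene API. The matcher advances by the matched term's length, so after "ab" (length 2) at position 0 it looks at position 0 + 2 = 2, which is the value given in the fencepost correction posted as a follow-up.

```python
def positioned_terms(tokens):
    """Map each start position to (catenated term, length in tokens) pairs."""
    return {
        start: [("".join(tokens[start:end]), end - start)
                for end in range(start + 1, len(tokens) + 1)]
        for start in range(len(tokens))
    }


def phrase_matches(index, phrase):
    """True if the phrase terms tile a run of positions, using token lengths."""
    for start in index:
        pos = start
        for term in phrase:
            lengths = [n for t, n in index.get(pos, []) if t == term]
            if not lengths:
                break
            pos += lengths[0]  # jump past every token this term covers
        else:
            return True
    return False


index = positioned_terms(["a", "b", "c", "d"])
for pos, entries in index.items():
    print("Pos%d:" % pos, " ".join(term for term, _ in entries))

print(phrase_matches(index, ["ab", "cd"]))  # True
print(phrase_matches(index, ["a", "c"]))    # False
```

Because every term records how many tokens it spans, out-of-order combinations such as "a c" are rejected, at the cost of a custom query and n(n+1)/2 indexed terms per word.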

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------

Re: intra-word delimiters

Posted by Yonik Seeley <ys...@gmail.com>.

That was the plan, but step (4) really seems problematic.

- term expansion this way can lead to a lot of false matches
- phrase queries with many bordering words break
- setting term positions such that phrase queries work on all combos
of subwords is non-trivial.

It seems like a better approach might be a new query type that can
handle things like this.

As an example, consider a-b-c-d (4 subwords)... one way of indexing
the tokens would be:

Pos0: a
Pos1: b,  ab,  a
Pos2: c,  bc,  abc,  cd
Pos3: d,  abcd

There are only 10 unique tokens, n(n+1)/2 for n = 4, but I needed to
index 11 in order to satisfy all possible phrase queries of catenated
subwords.  Notice how many other things will now match though (ac, aab,
aababcabcd, etc).
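
To make the false matches concrete, here is a small sketch that checks phrases against the position table exactly as posted above. It assumes ordinary phrase semantics (term i must sit at position start + i) and reads "ac", "aab", and "aababcabcd" as the phrases "a c", "a ab", and "a ab abc abcd"; that reading and the name phrase_matches are assumptions for illustration.

```python
# Position table for a-b-c-d, transcribed as posted.
TABLE = {
    0: {"a"},
    1: {"b", "ab", "a"},
    2: {"c", "bc", "abc", "cd"},
    3: {"d", "abcd"},
}


def phrase_matches(table, phrase):
    """Plain phrase query: every term at consecutive positions."""
    return any(
        all(term in table.get(start + i, set()) for i, term in enumerate(phrase))
        for start in table
    )


intended = [["a", "b", "c", "d"], ["ab", "cd"], ["a", "bc", "d"], ["abc", "d"]]
unintended = [["a", "c"], ["a", "ab"], ["a", "ab", "abc", "abcd"]]

for phrase in intended + unintended:
    print(" ".join(phrase), "->", phrase_matches(TABLE, phrase))
```

All seven phrases report a match under these assumptions, which is the loss of ordering information described here: the positions alone cannot tell "ab cd" from "a c".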

In addition, any algorithm I come up with to generate those term
positions uses even more terms than the hand-generated one above.

By using index expansion in this manner, we have lost info about the
original ordering.  A new type of fuzzy phrase query seems like it
might be able to do a better job in many circumstances.
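
One hypothetical shape such a query could take, sketched outside Lucene and not taken from this thread: keep only the plain subwords in order, and at query time accept a document if the catenated query equals the catenation of some run of adjacent document subwords. The names subwords and fuzzy_phrase_match are made up for illustration, and the normalization assumes ASCII text.

```python
import re


def subwords(text):
    """Lowercase, then split at non-alphanumerics and letter/digit boundaries."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    text = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", text)
    return text.split()


def fuzzy_phrase_match(doc_text, query_text):
    """True if the catenated query equals a run of adjacent doc subwords."""
    doc = subwords(doc_text)
    target = "".join(subwords(query_text))
    if not target:
        return False
    for start in range(len(doc)):
        joined = ""
        for token in doc[start:]:
            joined += token
            if joined == target:
                return True
            if not target.startswith(joined):
                break  # this run can no longer grow into the target
    return False


doc = "Canon Powershot SD500 7MP Digital Elph"
for query in ["Power Shot SD 500", "CanonPowerShotSD500",
              "SD 500 7 MP digitalelph", "Canon-Powershot-SD 500",
              "Canon Elph"]:
    print(query, "->", fuzzy_phrase_match(doc, query))
```

In this sketch the four example queries match and "Canon Elph" does not, because the original ordering is kept. A real query type would still need to work against postings rather than raw document text.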

-Yonik

On 8/15/05, Marvin Humphrey <ma...@rectangular.com> wrote:
> 1) Lowercase.
> 2) Convert non-alphanumeric characters to spaces.
> 3) Introduce a space at every boundary between a letter and a number.
> 4) Concatenate all 1, 2, 3 ... n term combinations and index them.
> 5) Don't stem.

---------------------------------------------------------------------

Re: intra-word delimiters

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Aug 15, 2005, at 3:16 PM, Yonik Seeley wrote:

> Another example:
> Source Text contains "Canon Powershot SD500 7MP Digital Elph"
>
> And I want to be able to match the following user queries:
> Power Shot SD 500
> CanonPowerShotSD500
> SD 500 7 MP digitalelph
> Canon-Powershot-SD 500
>
> Any ideas?

How about this?

1) Lowercase.
2) Convert non-alphanumeric characters to spaces.
3) Introduce a space at every boundary between a letter and a number.
4) Concatenate all 1, 2, 3 ... n term combinations and index them.
5) Don't stem.
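
The steps above can be sketched roughly as follows. This assumes ASCII text and reads "combinations" in step 4 as runs of adjacent terms; both are assumptions, and normalize and catenations are hypothetical names rather than Lucene classes.

```python
import re


def normalize(text):
    """Steps 1-3: lowercase, drop punctuation, split letters from digits."""
    text = text.lower()                                   # step 1
    text = re.sub(r"[^a-z0-9]+", " ", text)               # step 2
    text = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", text)  # step 3
    return text.split()


def catenations(tokens):
    """Step 4: every run of 1..n adjacent tokens, joined into one term."""
    return ["".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens) + 1)]


tokens = normalize("Canon Powershot SD500 7MP Digital Elph")
terms = catenations(tokens)   # step 5: no stemming is applied anywhere
print(tokens)
print(len(terms), "terms, e.g.", terms[:4])
```

For n tokens this produces n(n+1)/2 terms, 36 for the eight tokens here. Because step 1 lowercases before any splitting, "Powershot" stays a single token, so a query term such as "power" is not among the indexed terms unless the query side also catenates.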

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

---------------------------------------------------------------------