Posted to java-user@lucene.apache.org by JesL <je...@comcast.net> on 2009/07/16 15:04:42 UTC

Search in non-linguistic text

Hello,
Are there any suggestions / best practices for using Lucene for searching
non-linguistic text?  What I mean by non-linguistic is that it's not English
or any other language, but rather product codes.  This is presenting some
interesting challenges.  Among them is the need for pretty lax wildcard
searches.  For example, ABC should match on ABCD, but so should BCD.  Also,
it needs to be agnostic to special characters.  So, ABC/D should match ABCD
as well as ABC-D or "ABC D".

As I write an analyzer to handle these cases, I seem to be pretty quickly
degrading into a "like '%blah%'" search, with rules to treat all special
characters as single-character, optional wildcards.  I'm concerned that the
performance of this will be disappointing, though.
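The fallback described above can be sketched as follows; this is an illustrative stand-in for the analyzer behaviour, not actual Lucene code, and the class and method names are hypothetical:

```java
import java.util.Locale;

public class ContainsFallback {
    // Sketch of the "like '%blah%'" behaviour: strip special characters
    // from both code and query, then test containment, so ABC/D matches
    // ABCD, ABC-D and "ABC D", and BCD matches ABCD.
    public static boolean matches(String code, String query) {
        String normCode = code.replaceAll("[^A-Za-z0-9]", "").toUpperCase(Locale.ROOT);
        String normQuery = query.replaceAll("[^A-Za-z0-9]", "").toUpperCase(Locale.ROOT);
        return normCode.contains(normQuery);
    }
}
```

Containment scans every term, which is exactly the performance worry raised here.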

Any help would be much appreciated.  Thanks!

- Jes
-- 
View this message in context: http://www.nabble.com/Search-in-non-linguistic-text-tp24515936p24515936.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Search in non-linguistic text

Posted by Robert Muir <rc...@gmail.com>.
Take a look at WordDelimiterFilter from Solr (you can use it in your
Lucene app too).
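WordDelimiterFilter's core behaviour, splitting on intra-word delimiters and also emitting the concatenated parts, can be sketched outside Lucene like this; the class below is an illustrative stand-in, not the Solr filter itself:

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSplit {
    // Split a code on non-alphanumeric delimiters and also emit the
    // concatenated form, so "ABC/D" can match "ABC-D", "ABC D", or "ABCD".
    public static List<String> tokens(String code) {
        List<String> out = new ArrayList<>();
        StringBuilder joined = new StringBuilder();
        for (String part : code.split("[^A-Za-z0-9]+")) {
            if (part.isEmpty()) continue;
            out.add(part);
            joined.append(part);
        }
        if (joined.length() > 0) out.add(joined.toString());
        return out;
    }
}
```

Indexing and querying through the same splitting rule is what makes the special characters "agnostic".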




-- 
Robert Muir
rcmuir@gmail.com



Re: Search in non-linguistic text

Posted by Anshum <an...@gmail.com>.
Hi Jes,
Good to see you here. You could try something like an n-gram analyzer.
You'd have to explore it, though; I'm assuming it would be helpful for you.
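For illustration, here is roughly what a character n-gram analyzer would emit for a product code. In Lucene, NGramTokenizer plays this role; the self-contained sketch below uses names of my own choosing:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class Ngrams {
    // Emit all character n-grams of sizes min..max, after stripping
    // non-alphanumerics, so "ABC" is an indexed term of both "ABCD"
    // and "ABC/D".
    public static Set<String> ngrams(String text, int min, int max) {
        String s = text.replaceAll("[^A-Za-z0-9]", "");
        Set<String> grams = new LinkedHashSet<>();
        for (int n = min; n <= max; n++)
            for (int i = 0; i + n <= s.length(); i++)
                grams.add(s.substring(i, i + n));
        return grams;
    }
}
```

With n-grams indexed, a "contains" query like BCD becomes an ordinary term lookup instead of a leading-wildcard scan.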

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............



RE: Search in non-linguistic text

Posted by Digy <di...@gmail.com>.
Another approach could be splitting the text into characters and returning
each char as a token (in a custom analyzer).

For example, for the document [some text] the tokens would be
[s] [o] [m] [e] [t] [e] [x] [t], and searches such as [ome] or [ex] would
get hits.

Sample code written in C# is below:
http://people.apache.org/~digy/SingleCharAnalyzer.cs
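The mechanics of this can be sketched in plain Java as well; the class below only emulates the idea (per-character tokens with positions, matched as a phrase of consecutive tokens) and is not the linked C# analyzer:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SingleCharPhrase {
    // Every character of the document becomes a token with a position;
    // the query is matched as a phrase of consecutive single-char tokens.
    public static boolean matches(String doc, String query) {
        if (query.isEmpty()) return false;
        Map<Character, List<Integer>> postings = new HashMap<>();
        for (int i = 0; i < doc.length(); i++)
            postings.computeIfAbsent(doc.charAt(i), c -> new ArrayList<>()).add(i);
        outer:
        for (int start : postings.getOrDefault(query.charAt(0), List.of())) {
            for (int j = 1; j < query.length(); j++)
                if (!postings.getOrDefault(query.charAt(j), List.of()).contains(start + j))
                    continue outer;
            return true;
        }
        return false;
    }
}
```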


DIGY



Re: Search in non-linguistic text

Posted by Matthew Hall <mh...@informatics.jax.org>.
Assuming your dataset isn't incredibly large, I think you could cheat
here and optimize your data for searching.

Am I correct in assuming that BC should also match on ABCD?

If so, then your current thoughts on the problems you face are correct:
everything you do will turn into a contains search, which is not the best
performance you have ever seen.

However, knowing this, you can manipulate your data in such a way that
you get around that limitation and turn everything into a prefix
(or postfix) search, if you so prefer.

So here's what you do:

When you are indexing the term ABCD, you are actually going to add
several documents into the index (or into various special-purpose
indexes, if you prefer, but more on that later on).

Let's say you want to turn everything into a prefix search under the covers.

In the index you would store the following values, all of which point at
the document "ABCD":

'ABCD'
'BCD'
'CD'
'D'

Then, when you do your search for the term "BC", you will really be
searching on "BC*", which will produce a match to the second document.
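The suffix expansion above can be sketched as a plain helper; this is illustrative and not tied to any Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixExpand {
    // Emit every suffix of the code, so a contains-style query like "BC"
    // can instead be run as the much cheaper prefix query "BC*".
    public static List<String> suffixes(String code) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < code.length(); i++)
            out.add(code.substring(i));
        return out;
    }
}
```

Each emitted suffix is indexed as its own searchable value pointing back at the original code.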

Now, Lucene documents can be considered giant data-holding objects: you
can and SHOULD have fields in the document that are not used at search
time but ARE used at display-generation time (or whatever layer feeds
your display, if you are going in a more OO fashion).

Now, this technique isn't without its drawbacks, of course: you will see
an increase in your index size, but unless you are playing around with
some VERY large datasets, that really shouldn't matter.

Now, if I were the one implementing this, I would probably make at least
two indexes, one for exact, punctuation-relevant data.  The other index
would contain the data I've described above, with one important
difference: any and all punctuation (including whitespace) is removed,
and all of the letters in your codes are collapsed down into a single
word.  That way you can perform two searches and ensure that exact,
punctuation-relevant matches appear higher in your results list than
non-punctuation-relevant ones.

Anyhow, that's pretty much it in a nutshell.  I think this technique
should work for you, once you have decided how to structure your indexes.

JesL wrote:
> Hello,
> Are there any suggestions / best practices for using Lucene for searching
> non-linguistic text?  What I mean by non-linguistic is that it's not English
> or any other language, but rather product codes.  This is presenting some
> interesting challenges.  Among them are the need for pretty lax wildcard
> searches.  For example, ABC should match on ABCD, but so should BCD.  Also,
> it needs to be agnostic to special characters.  So, ABC/D should match ABCD
> as well as ABC-D or "ABC D".
>
> As I write an analyzer to handle these cases, I seem to be pretty quickly
> degrading into a "like '%blah%' search, with rules to treat all special
> characters as single-character, optional wildcards.  I'm concerned that the
> performance of this will be disappointing, though.
>
> Any help would be much appreciated.  Thanks!
>
> - Jes
>   


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012




Re: Ugh

Posted by Matthew Hall <mh...@informatics.jax.org>.
  
They are upgrading our mail servers here, so if you are seeing many,
MANY duplicates of things I posted, I'm really sorry about that. T_T

Matt

-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org


