Posted to java-user@lucene.apache.org by "Aigner, Thomas" <TA...@WescoDist.com> on 2005/06/28 19:18:37 UTC

Indexing punctuation

Hello all,

	I am VERY new to Lucene and we are trying out Lucene to see if
it will accomplish the vast majority of our search functions.

	I have a question about a good way to index some of our product
description codes.  We have description codes like 21-MA-GAB, and others
with different punctuation.  Our users need to be able to search for
"21 MA GAB" or "21-MA_GAB" or "21MAGAB".  Is the best way to accomplish
this to create synonyms for the three variants whenever a code contains
punctuation? I know I can strip punctuation from the index, but what
about matching the parts run together or separated by spaces?

Thanks all in advance,
Tom


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing punctuation

Posted by Chris D <br...@gmail.com>.
On 6/28/05, Aigner, Thomas <TA...@wescodist.com> wrote:
> Hello all,
> 
>         I am VERY new to Lucene and we are trying out Lucene to see if
> it will accomplish the vast majority of our search functions.
> 
>         I have a question about a good way to index some of our product
> description codes.  We have description codes like 21-MA-GAB and other
> punctuation.  Our users need to be able to search for "21 MA GAB" or
> "21-MA_GAB" or "21MAGAB".  Is the best way to accomplish this by
> creating synonyms for the 3 different ways when punctuation is in parts
> to search for? I know I can stop punctuation in the index but what about
> grouping the information together or with spaces?
> 
> Thanks all in advance,
> Tom

There are a couple of ways to do this, and I'm not sure which would be
best. (I'm also fairly new to Lucene.)

You can create a grammar that recognizes your product codes (see
StandardAnalyzer code for examples on how to do that) then use a
custom filter to normalize everything.

Forgive my poor lex, but the general idea is a token rule along these
lines, with each separator optional:

| <CODE: <NUM><NUM> (["-","_"," "])? (<ALPHA>)+ (["-","_"," "])? (<ALPHA>)+ >

Then in the filter, normalize the token by stripping out all of the
punctuation. A regex is shown here just for reference; something
faster would also work.

   if (type == CODE_TYPE) {
      // strip hyphens, underscores and spaces so every variant indexes as "21MAGAB"
      return new org.apache.lucene.analysis.Token(
            text.replaceAll("[-_ ]", ""), t.startOffset(), t.endOffset(), type);
   } ... 

See StandardAnalyzer; it has a lot of code that does what you need,
which you can copy, paste and edit.
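
As a rough illustration of the normalization step described above, here
is a minimal sketch in plain Java with no Lucene dependency. The class
name CodeNormalizer, the method names, and the assumed code shape (two
digits followed by two alphabetic groups) are hypothetical and only
mirror the 21-MA-GAB example. In a real analyzer the same logic would
live inside a custom TokenFilter and would have to run at both index
time and query time.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class CodeNormalizer {
    // Separators that may appear between the parts of a code.
    private static final Pattern SEPARATORS = Pattern.compile("[-_\\s]+");

    // Assumed code shape: two digits, then two alphabetic groups,
    // each optionally preceded by a single separator.
    private static final Pattern CODE =
        Pattern.compile("\\d{2}[-_\\s]?[A-Za-z]+[-_\\s]?[A-Za-z]+");

    // True when the whole string matches the assumed code shape.
    static boolean looksLikeCode(String text) {
        return CODE.matcher(text).matches();
    }

    // Remove all separators and upper-case the result.
    static String normalize(String text) {
        return SEPARATORS.matcher(text).replaceAll("").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] inputs = {"21-MA-GAB", "21 MA GAB", "21-MA_GAB", "21MAGAB"};
        for (String s : inputs) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

All four inputs come out as 21MAGAB, so a single indexed term covers
every spelling the users might type.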

You could also do synonyms but that seems like it would be more overhead.
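
To make the overhead of the synonym route concrete, here is a small
sketch that expands one punctuated code into the variants that would
have to be indexed as synonyms. The class CodeVariants and its method
are hypothetical names, and the separator set is an assumption taken
from the examples in this thread.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class CodeVariants {
    // Split a code on its separators and re-join the parts with each
    // separator in turn (hyphen, underscore, space, nothing).
    static Set<String> variants(String code) {
        String[] parts = code.split("[-_\\s]+");
        Set<String> out = new LinkedHashSet<>();
        for (String sep : new String[] {"-", "_", " ", ""}) {
            out.add(String.join(sep, parts));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("21-MA-GAB"));
    }
}
```

This yields four terms per code, and it still misses mixed forms such
as 21-MA_GAB; covering those means one variant per combination of
separators. It also only works when the source code contains
separators to split on. Normalizing to a single form avoids both
problems.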

If you think of a better way, let me know, I have to do something similar.

Cheers,
Chris



Re: Indexing punctuation

Posted by Ken Krugler <kk...@transpac.com>.
>I do a vaguely similar thing;  I have to strip accents from 
>characters such as e-acute out of both my input data and my incoming 
>search queries to put them into a standard form.  I do this with a 
>custom TokenFilter subclass.  I have an analyzer that includes this 
>filter along with some of the standard ones (LowerCaseFilter, etc). 
>I run the same analyzer on indexing and searching, which has been 
>discussed in other posts.

For a hard-core approach to this problem, you could try converting 
all text to Unicode first, then use the ICU package to create a level 
0 "sort key" (the C API is ucol_getSortKey). This will be a string 
suitable for comparison to determine weak equality, but you can also 
just index it as a regular token.

There are some subtle issues with locale-specific behavior of the sort 
key generation step, where you could guess at the right locale to use 
for the conversion, but in general that shouldn't matter.

Two other issues are code/data size (ICU can be big) and the 
performance hit while indexing documents.
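
The sort-key idea can be sketched without ICU by using the JDK's own
java.text.Collator at primary strength. This is a stand-in for the ICU
call, not the same implementation, and the keys the two produce are
not interchangeable. The class SortKeyToken, its method, and the
choice to hex-encode the key bytes so they can be indexed as an
ordinary token are all assumptions made for illustration.

```java
import java.text.Collator;
import java.util.Locale;

class SortKeyToken {
    // Build a primary-strength collation key for the text and hex-encode
    // it so it can be stored as a plain string token.
    static String sortKeyToken(String text, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // PRIMARY strength ignores accent and case differences.
        collator.setStrength(Collator.PRIMARY);
        byte[] key = collator.getCollationKey(text).toByteArray();
        StringBuilder hex = new StringBuilder(key.length * 2);
        for (byte b : key) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String a = sortKeyToken("r\u00e9sum\u00e9", Locale.ENGLISH);
        String b = sortKeyToken("Resume", Locale.ENGLISH);
        System.out.println(a.equals(b));
    }
}
```

The same key must be generated for query terms as for indexed terms,
with the same locale, or the tokens will not match.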

-- Ken





-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200



Re: Indexing punctuation

Posted by Peter Pimley <pe...@semantico.com>.
I'm not sure how useful this reply is, but hey ;)

<aol>me too!</aol>

I do a vaguely similar thing;  I have to strip accents from characters 
such as e-acute out of both my input data and my incoming search queries 
to put them into a standard form.  I do this with a custom TokenFilter 
subclass.  I have an analyzer that includes this filter along with some 
of the standard ones (LowerCaseFilter, etc).  I run the same analyzer on 
indexing and searching, which has been discussed in other posts.

My point is that I'm happy with this approach and I'd recommend you do a 
similar thing, at least as a first attempt.
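
For reference, the accent-stripping step can be sketched in plain Java
as below. This is not the filter described above, whose code is not
shown in this thread; it is a minimal illustration that assumes a JDK
with java.text.Normalizer (Java 6 or later), and the class name
AccentFolder is hypothetical. The same function would be applied to
both the indexed text and the incoming query.

```java
import java.text.Normalizer;
import java.util.Locale;

class AccentFolder {
    // Decompose accented characters into base letter + combining mark,
    // drop the combining marks, then lower-case the result.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, written as an escape to avoid encoding issues.
        System.out.println(fold("Caf\u00e9"));
        System.out.println(fold("r\u00e9sum\u00e9"));
    }
}
```

Running it prints cafe and resume.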

Cheers,
Peter Pimley




