Posted to java-user@lucene.apache.org by Alex Soto <le...@gmail.com> on 2008/06/24 17:48:13 UTC

searching for C++

Hello:

I have a problem where I need to search for the term "C++".
If I use StandardAnalyzer, the "+" characters are removed and the
search runs on just the term "c", which is not what I intend.
However, I still need StandardAnalyzer for the other benefits it provides.
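
The effect described above can be reproduced with a small stand-in. This is
a minimal sketch in plain Java, not Lucene's StandardAnalyzer: the class
PunctuationDroppingDemo and its tokenize method are hypothetical names, and
the rule (lowercase, keep only runs of letters and digits) is an assumption
that merely mimics the reported behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified stand-in for an analyzer that lowercases text and keeps only
// runs of letters and digits. NOT Lucene's StandardAnalyzer (assumption).
class PunctuationDroppingDemo {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char ch : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isLetterOrDigit(ch)) {
                current.append(ch);
            } else if (current.length() > 0) {
                // Any other character ends the current token and is dropped.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "C++" and "C" collapse to the same single-character term.
        System.out.println(tokenize("C++"));             // [c]
        System.out.println(tokenize("C"));               // [c]
        System.out.println(tokenize("C++ programming")); // [c, programming]
    }
}
```

Under such a rule a query for "C++" cannot be told apart from a query for
"C", which is the problem in question.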

I think I need to write a specialized tokenizer (and an accompanying
analyzer) that lets the "+" characters pass through.
I would take the JFlex-based one that is provided, modify it, and add it to
my project.

My question is:

Is there any simpler way to accomplish the same?


Best regards,
Alex Soto
lexsoto@gmail.com

-
Amicus Plato, sed magis amica veritas.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: searching for C++

Posted by Alex Soto <le...@gmail.com>.
Thanks everyone. I appreciate the help.

I think I will write my own tokenizer, because I do not have a
predefined list of words with symbols.
I will modify the grammar by defining a SYMBOL token, as John suggested,
and redefining ALPHANUM to include it.
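
The grammar change can be approximated outside JFlex to see what tokens it
would produce. The following is only a sketch under stated assumptions: it
uses java.util.regex instead of a JFlex grammar, the class name
SymbolAwareTokenizerSketch is hypothetical, and the choice of "+" and "#" as
SYMBOL characters is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex approximation of a grammar with a SYMBOL class folded into ALPHANUM.
// Not the JFlex grammar shipped with Lucene (assumption).
class SymbolAwareTokenizerSketch {

    // Hypothetical SYMBOL class: which characters belong here is a guess.
    private static final String SYMBOL = "[+#]";
    private static final String LETTER_OR_DIGIT = "[\\p{L}\\p{N}]";

    // ALPHANUM redefined: letters/digits optionally followed by symbols.
    private static final Pattern ALPHANUM =
            Pattern.compile(LETTER_OR_DIGIT + "+" + SYMBOL + "*");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = ALPHANUM.matcher(text.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I write C++ and C# code"));
        // [i, write, c++, and, c#, code]
    }
}
```

One consequence of this particular rule is that "a+b" would come out as
"a+" and "b", so a real grammar would need to decide how to treat symbols
that sit between two words.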

Regards,
Alex Soto



On Tue, Jun 24, 2008 at 12:12 PM, N. Hira <nh...@cognocys.com> wrote:
> This isn't ideal, but if you have a defined list of such terms, you may find
> it easier to filter these terms out into a separate field for indexing.
>
> -h





Re: searching for C++

Posted by "N. Hira" <nh...@cognocys.com>.
This isn't ideal, but if you have a defined list of such terms, you
may find it easier to filter these terms out into a separate field
for indexing.
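
A rough sketch of that idea in plain Java follows. Everything named here is
hypothetical: the class SpecialTermExtractorSketch, the SPECIAL_TERMS list,
and the whitespace-based matching are assumptions for illustration, and the
step that actually adds the result to a second index field is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Pulls known symbol-bearing terms out of text so that they can be indexed
// separately from the normally analyzed content.
class SpecialTermExtractorSketch {

    // Hypothetical, hand-maintained list of terms that must survive intact.
    static final Set<String> SPECIAL_TERMS =
            new LinkedHashSet<>(Arrays.asList("c++", "c#", ".net"));

    static List<String> extractSpecialTerms(String text) {
        List<String> found = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Drop trailing sentence punctuation such as "," or ".".
            String cleaned = word.replaceAll("[,.;:!?]+$", "");
            if (SPECIAL_TERMS.contains(cleaned)) {
                found.add(cleaned);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extractSpecialTerms("We use C++, C# and .NET."));
        // [c++, c#, .net]
    }
}
```

The returned terms would presumably go into a separate field that is not
tokenized, and queries for such terms would be directed at that field.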

-h
----------------------------------------------------------------------
Hira, N.R.
Solutions Architect
Cognocys, Inc.
(773) 251-7453

On 24-Jun-2008, at 11:03 AM, John Byrne wrote:

> I don't think there is a simpler way. I think you will have to
> modify the tokenizer. Once you go beyond basic human-readable text,
> you always end up having to do that. I have modified the JavaCC
> version of StandardTokenizer to allow symbols to pass through,
> but I've never used the JFlex version; I don't know anything about
> JFlex, I'm afraid!
>
> A good strategy might be to make a new type of lexical token called
> "SYMBOL" and try to catch as many symbols as you can think of; then
> maybe create new token types which are ALPHANUM types that can have
> prefixed or suffixed symbols.
>
> That way, you'll be able to catch things like "c++" in a
> TokenFilter, and you can choose to pass it through as a single
> token, or split it up into two tokens, or whatever you want.
>
> Hope that helps.
>
> Regards,
> JB


Re: searching for C++

Posted by John Byrne <jo...@propylon.com>.
I don't think there is a simpler way. I think you will have to modify
the tokenizer. Once you go beyond basic human-readable text, you always
end up having to do that. I have modified the JavaCC version of
StandardTokenizer to allow symbols to pass through, but I've never
used the JFlex version; I don't know anything about JFlex, I'm afraid!

A good strategy might be to make a new type of lexical token called
"SYMBOL" and try to catch as many symbols as you can think of; then
maybe create new token types which are ALPHANUM types that can have
prefixed or suffixed symbols.

That way, you'll be able to catch things like "c++" in a TokenFilter, 
and you can choose to pass it through as a single token, or split it up 
into two tokens, or whatever you want.
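
The filtering step might look roughly like the sketch below. It is a plain
Java list transformation, not an implementation of Lucene's TokenFilter
class; the name SymbolFilterSketch, the keepWhole flag, and the choice of
"+" and "#" as symbols are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Post-processes tokens that may carry trailing symbols, e.g. "c++".
class SymbolFilterSketch {

    // Hypothetical symbol set; a real grammar would define its own.
    static boolean isSymbol(char ch) {
        return ch == '+' || ch == '#';
    }

    // keepWhole = true passes "c++" through as one token;
    // keepWhole = false splits it into "c" and "++".
    static List<String> filter(List<String> tokens, boolean keepWhole) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            int end = token.length();
            // Walk back over the trailing run of symbol characters.
            while (end > 0 && isSymbol(token.charAt(end - 1))) {
                end--;
            }
            boolean hasSuffix = end > 0 && end < token.length();
            if (keepWhole || !hasSuffix) {
                out.add(token);
            } else {
                out.add(token.substring(0, end));
                out.add(token.substring(end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("c++", "java");
        System.out.println(filter(tokens, true));  // [c++, java]
        System.out.println(filter(tokens, false)); // [c, ++, java]
    }
}
```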

Hope that helps.

Regards,
JB
