Posted to java-user@lucene.apache.org by Donna L Gresh <gr...@us.ibm.com> on 2008/03/04 20:22:24 UTC

C++ as token in StandardAnalyzer?

I saw some discussion in the archives some time ago about the fact that 
C++ is tokenized as C in the StandardAnalyzer; this seems to still be the 
case; I was wondering if there is a simple way for me to get the behavior 
I want for C++ (that it is tokenized as C++) in particular, and perhaps 
for other more idiosyncratic terms I may have in my own application.
Thanks
Donna



Re: C++ as token in StandardAnalyzer?

Posted by Erick Erickson <er...@gmail.com>.
 Almost by definition, you have to write your own analyzer. This may be as
simple as chaining another filter into one of the regular analyzers or as
complex as defining your own grammar.

As far as I know, there's no "keep word" list. But that would be an
interesting addition. That is, a variety of analyzer that you not only
passed a list of stop words to, but also passed a list of "keep words",
or words that should NOT be massaged at all. I can imagine that this
would get pretty tricky for, say, StandardAnalyzer, but something like
this in the chain of WhitespaceTokenizer >> LowerCaseFilter >>
KeepwordFilter might be useful...
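
The chain described above could be sketched roughly as follows. This is
plain Java rather than Lucene API code: KeepWordSketch, KEEP_WORDS and
analyze are hypothetical names, and treating "massaging" as stripping
punctuation is an assumption made only for illustration. Keep words are
still lowercased here, since the lowercase step sits before the keep-word
step in the chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class KeepWordSketch {
    // Hypothetical keep-word list: tokens that must not be altered further.
    static final Set<String> KEEP_WORDS =
            new HashSet<>(Arrays.asList("c++", "c#", "asp.net"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on whitespace only.
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields one empty string; skip it
            }
            // Step 2: lowercase every token.
            String lower = raw.toLowerCase(Locale.ROOT);
            if (KEEP_WORDS.contains(lower)) {
                // Step 3a: keep words pass through untouched.
                tokens.add(lower);
            } else {
                // Step 3b (assumed): strip everything except letters and digits.
                String stripped = lower.replaceAll("[^\\p{L}\\p{N}]", "");
                if (!stripped.isEmpty()) {
                    tokens.add(stripped);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [i, use, c++, and, c#, daily, mostly]
        System.out.println(analyze("I use C++ and C# daily, mostly."));
    }
}
```

In a real analyzer the keep-word check would presumably be a token filter
placed after the lowercasing filter, but that class does not exist in
Lucene as far as this thread knows, so it is left as a sketch.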

All this right off the top of my head without much thought, but....

Best
Erick


RE: C++ as token in StandardAnalyzer?

Posted by Tom Conlon <to...@2ls.com>.
Hi Donna - See previous post below that may help. Tom
////////////////////////////////////////////////////////
Hi,

In case this is of help to others:

Crux of problem: 
I wanted numbers and characters such as # and + to be considered.

Solution:
implement a LowercaseWhitespaceAnalyzer and a
LowercaseWhitespaceTokenizer.

i.e.
IndexWriter writer = new IndexWriter(INDEX_DIR, new
LowercaseWhitespaceAnalyzer(), true);
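
A minimal sketch of the lowercase-plus-whitespace idea, in plain Java so
that it runs without Lucene on the classpath. The class and method names
are hypothetical and are not the code from the original post; in Lucene
the same two rules (what counts as a token character, and how to
normalize it) would live in a tokenizer subclass, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class LowercaseWhitespaceSketch {
    // Rule 1: anything that is not whitespace belongs to a token,
    // so characters such as '+', '#', '.' and digits are preserved.
    static boolean isTokenChar(char c) {
        return !Character.isWhitespace(c);
    }

    // Rule 2: each token character is lowercased.
    static char normalize(char c) {
        return Character.toLowerCase(c);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(normalize(c));
            } else if (current.length() > 0) {
                // Whitespace ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [moss, 2007, c++, c#, asp.net]
        System.out.println(tokenize("MOSS 2007 C++ C# ASP.NET"));
    }
}
```

The same analyzer has to be used at index time and at query time,
otherwise a query term such as c++ will not match what was indexed.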

Tom
=======================================================================
Diagnostics:

StandardAnalyzer
----------------
Enter Querystring: (C++ AND C#)      Searching for: +c +c
Enter Querystring: (C\+\+ AND C\#)   Searching for: +c +c
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" "sharepoint 2007") asp.net

SimpleAnalyzer
--------------
Enter Querystring: C++ Searching for: c
Enter Querystring: C#  Searching for: c
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss or sharepoint) and "asp net"

WhitespaceAnalyzer
------------------
Enter Querystring: (C++ AND C#)  Searching for: +C++ +C#
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: ("moss 2007" or "sharepoint 2007") and asp.net

KeywordAnalyzer
---------------
Enter Querystring: (C++ AND C#) Searching for: +C++ +C#
Enter Querystring: ("moss 2007" or "sharepoint 2007") and "asp.net"
Searching for: (moss 2007 or sharepoint 2007) and asp.net

StopAnalyzer
------------
Enter Querystring: (C\++ AND C\#)  Searching for: +c +c
Enter Querystring: ("MOSS 2007" or "SHAREPOINT 2007") and "ASP.NET"
Searching for: (moss sharepoint) "asp net"
 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
 
