Posted to solr-user@lucene.apache.org by Michael _ <so...@gmail.com> on 2009/08/06 17:38:42 UTC

Preserving "C++" and other weird tokens

Hi everyone,
I'm indexing several documents that contain words that the StandardTokenizer
cannot detect as tokens.  These are words like
  C#
  .NET
  C++
which are important for users to be able to search for, but get treated as
"C", "NET", and "C".

How can I create a list of words that should be understood to be indivisible
tokens?  Is my only option somehow stringing together a lot of
PatternTokenizers?  I'd love to do something like <tokenizer
class="StandardTokenizer" tokenwhitelist=".NET C++ C#" />.

Thanks in advance!

Re: Preserving "C++" and other weird tokens

Posted by solrcoder <so...@gmail.com>.
Ach, sorry I didn't find this before posting! - Michael


Yonik Seeley-2 wrote:
> 
> http://search.lucidimagination.com/search/document/2d325f6178afc00a/how_to_search_for_c
> 
> -Yonik
> http://www.lucidimagination.com
> 



Re: Preserving "C++" and other weird tokens

Posted by Yonik Seeley <yo...@lucidimagination.com>.
http://search.lucidimagination.com/search/document/2d325f6178afc00a/how_to_search_for_c

-Yonik
http://www.lucidimagination.com




Re: Preserving "C++" and other weird tokens

Posted by Michael _ <so...@gmail.com>.
By the way, in case it wasn't clear: I'm not particularly tied to using the
StandardTokenizer.  Any tokenizer would be fine, if it did a reasonable job
of splitting up the input text while preserving special cases.

I'm also not averse to passing in a list of regexes if I have to, but I
suspect that would duplicate much of the work already done by the parser
inside the Tokenizer.
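
To make the whitelist idea concrete, here is a minimal sketch in Python rather
than Solr configuration. It is not Solr's or Lucene's API: the names PROTECTED,
build_pattern and tokenize are made up for illustration, and the fallback rule
(runs of word characters) is only a rough stand-in for what a real tokenizer
does. The point is that a single regex with the protected tokens tried first
avoids chaining one pattern per special case.

```python
import re

# Hypothetical whitelist of tokens that must survive tokenization intact.
PROTECTED = [".NET", "C++", "C#"]


def build_pattern(protected):
    """Build one regex that tries the protected tokens before plain words."""
    # Longest first, so a longer protected token wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(protected, key=len, reverse=True)
    escaped = "|".join(re.escape(tok) for tok in ordered)
    # A protected token only counts when it is not glued to a neighbouring
    # word character; anything else falls back to a run of word characters.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)|\w+" % escaped)


def tokenize(text, pattern=build_pattern(PROTECTED)):
    """Return the tokens found in text, keeping protected tokens whole."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(tokenize("I use C++, C# and .NET daily"))
```

The match is case-sensitive as written; passing re.IGNORECASE to re.compile
would be one way to relax that, at the cost of also protecting "c++" and
".net" in running text.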

Thanks,
Michael