Posted to solr-user@lucene.apache.org by Michael _ <so...@gmail.com> on 2009/08/06 17:38:42 UTC
Preserving "C++" and other weird tokens
Hi everyone,
I'm indexing several documents that contain words that the StandardTokenizer
cannot detect as tokens. These are words like
C#
.NET
C++
which are important for users to be able to search for, but get treated as
"C", "NET", and "C".
How can I create a list of words that should be understood to be indivisible
tokens? Is my only option somehow stringing together a lot of
PatternTokenizers? I'd love to do something like <tokenizer
class="StandardTokenizer" tokenwhitelist=".NET C++ C#" />.
Thanks in advance!
Re: Preserving "C++" and other weird tokens
Posted by solrcoder <so...@gmail.com>.
Ach, sorry I didn't find this before posting! - Michael
Yonik Seeley-2 wrote:
>
> http://search.lucidimagination.com/search/document/2d325f6178afc00a/how_to_search_for_c
>
> -Yonik
> http://www.lucidimagination.com
>
Re: Preserving "C++" and other weird tokens
Posted by Yonik Seeley <yo...@lucidimagination.com>.
http://search.lucidimagination.com/search/document/2d325f6178afc00a/how_to_search_for_c
-Yonik
http://www.lucidimagination.com
Re: Preserving "C++" and other weird tokens
Posted by Michael _ <so...@gmail.com>.
By the way, in case it wasn't clear: I'm not particularly tied to using the
StandardTokenizer. Any tokenizer would be fine, if it did a reasonable job
of splitting up the input text while preserving special cases.
I'm also not averse to passing in a list of regexes if I have to, but I
suspect that would duplicate much of the work already done by the parser
inside the Tokenizer.
Thanks,
Michael
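For illustration only, here is a minimal sketch of the "list of indivisible tokens" idea from the question, written in plain Python rather than against any Solr or Lucene API. The names PROTECTED, build_pattern, and tokenize are hypothetical, and the fallback rule (runs of word characters) is an assumption that only roughly approximates what a real tokenizer does. Protected tokens are tried first at each position, so the fallback never gets a chance to split them.

```python
import re

# Hypothetical whitelist of tokens that must survive tokenization intact.
PROTECTED = [".NET", "C++", "C#"]


def build_pattern(protected):
    """Build one regex that tries protected tokens before the generic rule."""
    # Longest first, so a longer protected token wins over a shorter one
    # that shares the same prefix.
    ordered = sorted(protected, key=len, reverse=True)
    protected_alt = "|".join(re.escape(tok) for tok in ordered)
    # A protected token must not be followed by a word character, so
    # ".NETwork" is not matched as ".NET". Everything else falls back to
    # plain runs of word characters (an assumed, simplistic default).
    return re.compile(r"(?:%s)(?!\w)|\w+" % protected_alt)


def tokenize(text, pattern=build_pattern(PROTECTED)):
    """Return the tokens found in text, keeping protected tokens whole."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(tokenize("I know C++, C# and .NET well."))
    # ['I', 'know', 'C++', 'C#', 'and', '.NET', 'well']
```

The sketch is case-sensitive and does no lowercasing or stemming. In a real index the same protected list would have to be honored at query time as well, otherwise a search for "C++" would still be reduced to "C".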