Posted to solr-user@lucene.apache.org by Abin Mathew <ab...@toostep.com> on 2010/02/03 06:25:26 UTC

C++ being filtered (please help)

Hi,
I have a field which may take the form "C++,PHP & MySql,C#".
Now I want to tokenize it on commas, white space, and other word
delimiting characters only, not on the plus sign, so that the result after
tokenization is
C++
PHP
MySql
C#

But the result I am getting is
c
php
mysql
c
Please give me some pointers as to which analyzer and tokenizer to use.

Thank you
Abin Mathew

Re: C++ being filtered (please help)

Posted by Chris Hostetter <ho...@fucit.org>.
: > now i want to tokenize it based on comma or white space and
: > other word
: > delimiting characters only. Not on the plus sign. so that
: > result after
: > tokenization should be
	...
: > But the result I am getting is

...you haven't told us what type of analyzer settings you are currently
using, so it's completely impossible to give you specific advice on what
to do -- the problem may not be your current tokenizer at all, it might be
some TokenFilter that is being applied after tokenization.

: <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt" /> 
: <tokenizer class="solr.WhitespaceTokenizerFactory" /> 
: <filter class="solr.LowerCaseFilterFactory" /> 	

that seems like somewhat overkill for the problem of "i want to tokenize on
an explicit list of characters" ... using the PatternTokenizerFactory (in
place of the MappingCharFilterFactory and the WhitespaceTokenizerFactory)
would probably be a little more straightforward.
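
For illustration, a minimal sketch of what that could look like in schema.xml. The field type name "text_split" and the exact character list in the pattern are assumptions, not something from the original config; adjust the character class to whatever you actually want to split on. Note that the ampersand has to be written as &amp; inside the XML attribute.

```xml
<!-- "text_split" is a hypothetical name; the pattern below is an assumed delimiter list -->
<fieldType name="text_split" class="solr.TextField">
  <analyzer>
    <!-- group="-1" means the pattern is used to split the input:
         break on runs of whitespace, commas, and ampersands;
         '+' and '#' are not in the character class, so they stay in the tokens -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,&amp;]+" group="-1" />
    <!-- optional: remove this filter if the original case should be kept -->
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

With the lowercase filter in place, "C++,PHP & MySql,C#" should come out as c++, php, mysql, c#. The analysis page in the Solr admin UI is the easiest way to confirm this against your own schema.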

-Hoss


Re: C++ being filtered (please help)

Posted by Ahmet Arslan <io...@yahoo.com>.
> I have a field which may take the form "C++,PHP &
> MySql,C#"
> now i want to tokenize it based on comma or white space and
> other word
> delimiting characters only. Not on the plus sign. so that
> result after
> tokenization should be
> C++
> PHP
> MySql
> C#
> 
> But the result I am getting is
> c
> php
> mysql
> c
> Please give me some pointers as to which analyzer and
> tokenizer to use
> 

You can use this analyzer:

<analyzer>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt" />
  <tokenizer class="solr.WhitespaceTokenizerFactory" />
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

With this mappings.txt file:
"," => " "

You can add more entries to mappings.txt for any other characters that you want to break words at.
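
Two things to watch with the mappings.txt approach above, shown as a sketch. First, the whitespace tokenizer will emit the "&" in "PHP & MySql" as a token of its own unless it is also mapped to a space. Second, LowerCaseFilterFactory lowercases every token, so you get c++ and mysql rather than C++ and MySql; leave it out if the case matters. The field type name "text_skills" here is hypothetical, and the extra "&" mapping is an assumption about what you want to split on.

```xml
<!-- "text_skills" is a hypothetical name -->
<fieldType name="text_skills" class="solr.TextField">
  <analyzer>
    <!-- assumed contents of mappings.txt, one rule per line:
           "," => " "
           "&" => " "
         each listed character is rewritten to a space before tokenization -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt" />
    <!-- splits on whitespace only, so '+' and '#' stay attached to their tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <!-- no LowerCaseFilterFactory here, so tokens keep their original case -->
  </analyzer>
</fieldType>
```

With those two mapping rules, "C++,PHP & MySql,C#" should tokenize to C++, PHP, MySql, C#. Keep in mind that without lowercasing, queries become case sensitive, so a search for "mysql" would not match "MySql".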