Posted to solr-user@lucene.apache.org by Devon Baumgarten <db...@nationalcorp.com> on 2011/12/12 22:51:55 UTC

Removing whitespace

Hello,

I am having trouble finding how to remove/ignore whitespace when indexing. The only answer I have found suggested that it is necessary to write my own tokenizer. Is this true? I want to remove whitespace and special characters from the phrase and create N-grams from the result.

Ultimately, the effect I am after is that searching "bobdole" would match "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way... can anyone lend some assistance?

Thanks!

Dev B


RE: Removing whitespace

Posted by Devon Baumgarten <db...@nationalcorp.com>.
Thanks Alireza, Steven and Koji for the quick responses!

I'll read up on those and give it a shot.

Devon Baumgarten

-----Original Message-----
From: Alireza Salimi [mailto:alireza.salimi@gmail.com] 
Sent: Monday, December 12, 2011 4:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Removing whitespace

That sounds like a strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section; it may help.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories



On Mon, Dec 12, 2011 at 4:51 PM, Devon Baumgarten <
dbaumgarten@nationalcorp.com> wrote:

> Hello,
>
> I am having trouble finding how to remove/ignore whitespace when indexing.
> The only answer I have found suggested that it is necessary to write my own
> tokenizer. Is this true? I want to remove whitespace and special characters
> from the phrase and create N-grams from the result.
>
> Ultimately, the effect I am after is that searching "bobdole" would match
> "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way...
> can anyone lend some assistance?
>
> Thanks!
>
> Dev B
>
>


-- 
Alireza Salimi
Java EE Developer

Re: Removing whitespace

Posted by Alireza Salimi <al...@gmail.com>.
That sounds like a strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section; it may help.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories



On Mon, Dec 12, 2011 at 4:51 PM, Devon Baumgarten <
dbaumgarten@nationalcorp.com> wrote:

> Hello,
>
> I am having trouble finding how to remove/ignore whitespace when indexing.
> The only answer I have found suggested that it is necessary to write my own
> tokenizer. Is this true? I want to remove whitespace and special characters
> from the phrase and create N-grams from the result.
>
> Ultimately, the effect I am after is that searching "bobdole" would match
> "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way...
> can anyone lend some assistance?
>
> Thanks!
>
> Dev B
>
>


-- 
Alireza Salimi
Java EE Developer

RE: Removing whitespace

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Devon,

Something like this should work for you (untested!):

<analyzer>
  <!-- Remove non-"word" characters; only underscores, letters & numbers allowed -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
</analyzer>
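
To see why this chain should give the matching described in the question, here is a minimal Python sketch that imitates each stage outside of Solr. It is only an approximation for illustration: the function name analyze and its parameters are made up here, Python's \W is Unicode-aware whereas Java's is ASCII-only by default, and the order in which Solr emits grams may differ.

```python
import re


def analyze(text, min_gram=2, max_gram=2):
    """Rough stand-in for the analyzer chain above; not Solr's actual code."""
    # PatternReplaceCharFilterFactory: delete every run of non-word characters
    # (whitespace and punctuation), keeping letters, digits and underscores.
    stripped = re.sub(r"\W+", "", text)
    # KeywordTokenizerFactory keeps the whole input as a single token;
    # LowerCaseFilterFactory then lowercases it.
    token = stripped.lower()
    # NGramFilterFactory: emit every substring of length min_gram..max_gram.
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    for phrase in ("bobdole", "Bob Dole", "Bo B. Dole", "Bobdo"):
        print(phrase, "->", analyze(phrase))
```

Under these assumptions "bobdole", "Bob Dole" and "Bo B. Dole" all reduce to the same grams, while "Bobdo" yields a subset of them, so it would match only partially.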

Steve

> -----Original Message-----
> From: Devon Baumgarten [mailto:dbaumgarten@nationalcorp.com]
> Sent: Monday, December 12, 2011 4:52 PM
> To: 'solr-user@lucene.apache.org'
> Subject: Removing whitespace
> 
> Hello,
> 
> I am having trouble finding how to remove/ignore whitespace when indexing.
> The only answer I have found suggested that it is necessary to write my
> own tokenizer. Is this true? I want to remove whitespace and special
> characters from the phrase and create N-grams from the result.
> 
> Ultimately, the effect I am after is that searching "bobdole" would match
> "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better
> way... can anyone lend some assistance?
> 
> Thanks!
> 
> Dev B


Re: Removing whitespace

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(11/12/13 6:51), Devon Baumgarten wrote:
> Hello,
>
> I am having trouble finding how to remove/ignore whitespace when indexing. The only answer I have found suggested that it is necessary to write my own tokenizer. Is this true? I want to remove whitespace and special characters from the phrase and create N-grams from the result.

How about using one of the existing CharFilters?

https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html

https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/MappingCharFilterFactory.html

koji
-- 
Check out "Query Log Visualizer" for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/