Posted to solr-user@lucene.apache.org by Devon Baumgarten <db...@nationalcorp.com> on 2011/12/12 22:51:55 UTC
Removing whitespace
Hello,
I am having trouble finding how to remove/ignore whitespace when indexing. The only answer I have found suggested that it is necessary to write my own tokenizer. Is this true? I want to remove whitespace and special characters from the phrase and create N-grams from the result.
Ultimately, the effect I am after is that searching "bobdole" would match "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way... can anyone lend some assistance?
Thanks!
Dev B
Re: Removing whitespace
Posted by Alireza Salimi <al...@gmail.com>.
That sounds like a strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section; it may help.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
--
Alireza Salimi
Java EE Developer
RE: Removing whitespace
Posted by Steven A Rowe <sa...@syr.edu>.
Hi Devon,
Something like this should work for you (untested!):
<analyzer>
  <!-- Remove non-"word" characters; only underscores, letters & numbers allowed -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
</analyzer>
Steve
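A possible variation on the analyzer above, offered as an untested sketch rather than anything proposed in the thread: with minGramSize and maxGramSize both set to 2, only two-character grams are produced. If the goal is for the whole squashed query "bobdole" to match as a single term, one option is to emit a wider range of grams at index time and skip the n-gram step at query time. The field type name text_squashed_ngram and the maxGramSize value of 15 are assumptions chosen for illustration.

<fieldType name="text_squashed_ngram" class="solr.TextField">
  <!-- Index side: strip non-word characters, lowercase, then emit grams of 2 to 15 characters -->
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <!-- Query side: same cleanup, but keep the squashed query as one term (no n-grams) -->
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this setup, "Bob Dole" and "Bo B. Dole" would both be reduced to "bobdole" before n-gramming, so a query for "bobdole" should find them. "Bobdo" is shorter than the query, so it would only match if the query side were n-grammed as well.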
RE: Removing whitespace
Posted by Devon Baumgarten <db...@nationalcorp.com>.
Thanks Alireza, Steven and Koji for the quick responses!
I'll read up on those and give it a shot.
Devon Baumgarten
Re: Removing whitespace
Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
(11/12/13 6:51), Devon Baumgarten wrote:
> Hello,
>
> I am having trouble finding how to remove/ignore whitespace when indexing. The only answer I have found suggested that it is necessary to write my own tokenizer. Is this true? I want to remove whitespace and special characters from the phrase and create N-grams from the result.
How about using one of the existing CharFilters?
https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html
https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/MappingCharFilterFactory.html
koji
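For the second of these, a minimal untested sketch: MappingCharFilterFactory reads its rules from a mapping file in the Solr conf directory, one "source" => "target" rule per line, so each character to drop has to be listed explicitly. The file name mapping-strip.txt and its entries are hypothetical, and the sketch assumes an empty target is accepted as "delete this character".

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-strip.txt"/>

and in mapping-strip.txt:

  # map a space and a period to nothing
  " " => ""
  "." => ""

Because every character needs its own rule, PatternReplaceCharFilterFactory with a pattern such as \W+ is probably the more compact choice for stripping all whitespace and punctuation.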
--
Check out "Query Log Visualizer" for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/