Posted to solr-user@lucene.apache.org by Mark Mandel <ma...@gmail.com> on 2011/06/09 01:07:15 UTC

Tokenising based on known words?

Not sure if this is possible, but figured I would ask the question.

Basically, we have some users who do some pretty ridiculous things ;o)

Rather than writing "red jacket", they write "redjacket", which obviously
returns no results.

Is there any way, with Solr, to go hunting for known words (maybe if there
are no results) within the word set? Or even tokenise based on known words in
the index?

Last time I played with spell check suggestions, it didn't seem to handle
this very well, but I've yet to try it again on 3.2.0 (just upgraded from
1.4.1).

Any help/thoughts appreciated, as they do this alllll the time.

Mark

-- 
E: mark.mandel@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com

Re: Tokenising based on known words?

Posted by lee carroll <le...@googlemail.com>.
We've played with HyphenationCompoundWordTokenFilterFactory. It works
better than maintaining a word dictionary to split on (although we ended
up not using it, for reasons I can't recall).

see

http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenationCompoundWordTokenFilterFactory.html
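
For comparison, the dictionary-based alternative mentioned above (splitting a
run-together token against a list of known words) can be sketched in a few
lines of Python. This is not Solr code and it is not how
HyphenationCompoundWordTokenFilterFactory works internally; the function name
split_known_words and the KNOWN_WORDS vocabulary are made up for illustration.

```python
from functools import lru_cache


def split_known_words(token, vocabulary):
    """Split `token` into a list of known words, or return None if impossible."""
    vocab = frozenset(word.lower() for word in vocabulary)
    token = token.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Returns a tuple of known words that exactly covers token[start:],
        # or None when no such cover exists.
        if start == len(token):
            return ()
        # Try longer pieces first so "jacket" is preferred over "jack".
        for end in range(len(token), start, -1):
            piece = token[start:end]
            if piece in vocab:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None

    result = solve(0)
    return list(result) if result is not None else None


# Hypothetical vocabulary; in practice it would have to come from the index.
KNOWN_WORDS = {"red", "jacket", "blue", "jack"}

print(split_known_words("redjacket", KNOWN_WORDS))   # ['red', 'jacket']
print(split_known_words("redjackets", KNOWN_WORDS))  # None
```

The second call shows the maintenance cost of this approach: any word form
missing from the list ("jackets" here) makes the split fail.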



On 9 June 2011 06:42, Gora Mohanty <go...@mimirtech.com> wrote:
> On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel <ma...@gmail.com> wrote:
>> Not sure if this is possible, but figured I would ask the question.
>>
>> Basically, we have some users who do some pretty ridiculous things ;o)
>>
>> Rather than writing "red jacket", they write "redjacket", which obviously
>> returns no results.
> [...]
>
> Have you tried using synonyms,
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> It seems like they should fit your use case.
>
> Regards,
> Gora
>

Re: Tokenising based on known words?

Posted by Mark Mandel <ma...@gmail.com>.
Thanks for the feedback! This definitely gives me some options to work on!

Mark

On Thu, Jun 9, 2011 at 11:21 PM, Steven A Rowe <sa...@syr.edu> wrote:

> Hi Mark,
>
> Are you familiar with shingles aka token n-grams?
>
>
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html
>
> Use the empty string for the tokenSeparator to get wordstogether style
> tokens in your index.
>
> I think you'll want to apply this filter only at index-time, since the
> users will supply the shingles all by themselves :).
>
> Steve
>
> > -----Original Message-----
> > From: Mark Mandel [mailto:mark.mandel@gmail.com]
> > Sent: Thursday, June 09, 2011 8:37 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tokenising based on known words?
> >
> > Synonyms really wouldn't work for every possible combination of words in
> > our index.
> >
> > Thanks for the idea though.
> >
> > Mark
> >
> > On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty <go...@mimirtech.com> wrote:
> >
> > > On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel <ma...@gmail.com> wrote:
> > > > Not sure if this is possible, but figured I would ask the question.
> > > >
> > > > Basically, we have some users who do some pretty ridiculous things ;o)
> > > >
> > > > Rather than writing "red jacket", they write "redjacket", which obviously
> > > > returns no results.
> > > [...]
> > >
> > > Have you tried using synonyms,
> > > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> > > It seems like they should fit your use case.
> > >
> > > Regards,
> > > Gora
> > >




RE: Tokenising based on known words?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Mark,

Are you familiar with shingles aka token n-grams?

http://lucene.apache.org/solr/api/org/apache/solr/analysis/ShingleFilterFactory.html

Use the empty string for the tokenSeparator to get wordstogether style tokens in your index. 

I think you'll want to apply this filter only at index-time, since the users will supply the shingles all by themselves :).

Steve
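
The effect of shingling with an empty separator can be illustrated outside
Solr with a short Python sketch. It only imitates the idea described above
(joining adjacent tokens into "wordstogether" tokens at index time); the
function name and its parameters are my own and are not the attributes of
ShingleFilterFactory.

```python
def shingles(tokens, max_size=2, separator=""):
    """Return each token followed by the joined runs of 2..max_size tokens
    that start at that position."""
    out = []
    for i, token in enumerate(tokens):
        out.append(token)  # keep the single word so normal queries still match
        for size in range(2, max_size + 1):
            if i + size <= len(tokens):
                out.append(separator.join(tokens[i:i + size]))
    return out


# Index-time view of a document field (whitespace tokenisation assumed).
indexed = shingles("red jacket with hood".split())
print(indexed)
# ['red', 'redjacket', 'jacket', 'jacketwith', 'with', 'withhood', 'hood']

# A user typing the run-together form now finds the document.
print("redjacket" in indexed)  # True
```

Because the joined forms are generated only when indexing, the query side
needs no extra processing; the trade-off is a larger index, since every
adjacent pair of words becomes an additional term.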

> -----Original Message-----
> From: Mark Mandel [mailto:mark.mandel@gmail.com]
> Sent: Thursday, June 09, 2011 8:37 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenising based on known words?
> 
> Synonyms really wouldn't work for every possible combination of words in
> our index.
> 
> Thanks for the idea though.
> 
> Mark
> 
> On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty <go...@mimirtech.com> wrote:
> 
> > On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel <ma...@gmail.com> wrote:
> > > Not sure if this is possible, but figured I would ask the question.
> > >
> > > Basically, we have some users who do some pretty ridiculous things ;o)
> > >
> > > Rather than writing "red jacket", they write "redjacket", which obviously
> > > returns no results.
> > [...]
> >
> > Have you tried using synonyms,
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> > It seems like they should fit your use case.
> >
> > Regards,
> > Gora
> >

Re: Tokenising based on known words?

Posted by Mark Mandel <ma...@gmail.com>.
Synonyms really wouldn't work for every possible combination of words in our
index.

Thanks for the idea though.

Mark
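
A rough back-of-the-envelope sketch of why hand-written synonyms scale badly
here, assuming one synonym rule per ordered combination of distinct words
that users might run together. The function name and the vocabulary sizes
are illustrative only, not figures from the actual index.

```python
import math


def synonym_rules_needed(vocabulary_size, words_joined=2):
    """Count ordered combinations of distinct words that could be run together."""
    return math.perm(vocabulary_size, words_joined)


for size in (100, 1_000, 10_000):
    print(size, synonym_rules_needed(size))
# 100 9900
# 1000 999000
# 10000 99990000
```

Even a modest vocabulary needs millions of two-word rules, which is why an
approach that derives the joined forms automatically is more practical.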

On Thu, Jun 9, 2011 at 3:42 PM, Gora Mohanty <go...@mimirtech.com> wrote:

> On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel <ma...@gmail.com> wrote:
> > Not sure if this is possible, but figured I would ask the question.
> >
> > Basically, we have some users who do some pretty ridiculous things ;o)
> >
> > Rather than writing "red jacket", they write "redjacket", which obviously
> > returns no results.
> [...]
>
> Have you tried using synonyms,
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> It seems like they should fit your use case.
>
> Regards,
> Gora
>




Re: Tokenising based on known words?

Posted by Gora Mohanty <go...@mimirtech.com>.
On Thu, Jun 9, 2011 at 4:37 AM, Mark Mandel <ma...@gmail.com> wrote:
> Not sure if this is possible, but figured I would ask the question.
>
> Basically, we have some users who do some pretty ridiculous things ;o)
>
> Rather than writing "red jacket", they write "redjacket", which obviously
> returns no results.
[...]

Have you tried using synonyms,
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
It seems like they should fit your use case.

Regards,
Gora