Posted to solr-user@lucene.apache.org by Sebastian M <mi...@yahoo.com> on 2011/01/11 17:22:01 UTC

default RegexFragmenter

Hello,

I'm investigating an issue where spellcheck queries are tokenized without
being explicitly told to do so, resulting in suggestions such as
"www.www.product4sale.com.com" for queries such as
"www.product4sale.com".

The default RegexFragmenter (name="regex") uses the regular
expression:

[-\w ,/\n\"']{20,200}

I understand parts of it, but I'm not sure about the - sign at the start, or
the slash midway through it.
I would like to tailor this regular expression so that query terms such as
"www.product4sale.com" are not broken up at the periods, but are kept as
they are.

Any suggestions or answers are highly appreciated!

Sebastian
-- 
View this message in context: http://lucene.472066.n3.nabble.com/default-RegexFragmenter-tp2235106p2235106.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: default RegexFragmenter

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Sebastian,

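If I remember my regular expressions, that - and / are really just that:
literal characters.  The stuff inside square brackets is a character class,
meaning "any one of the characters between [ and ]".  A - placed first in the
class is a literal hyphen rather than a range operator, so - and / are just
two of the allowed characters, along with word characters (\w), space, comma,
newline, and the two quote characters.  The {20,200} after the class means
"between 20 and 200 such characters in a row".  Note that the period is not
in the class.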
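
A small sketch may make the character-class reading concrete. It uses Python's
re module rather than Solr itself, so it only illustrates the regex, not the
fragmenter's behaviour. The names in_class and WITH_PERIOD, and the sample
text, are hypothetical. WITH_PERIOD is an assumed variant with a literal
period added to the class, and whether the fragmenter is what actually splits
the spellcheck terms is not established in this thread.

```python
import re

# The pattern quoted in the question, unchanged.
DEFAULT_PATTERN = r"""[-\w ,/\n\"']{20,200}"""

# Hypothetical variant for illustration only: same class plus a literal '.'.
WITH_PERIOD = r"""[-\w ,/\n\"'.]{20,200}"""


def in_class(pattern, ch):
    """Return True if the single character ch is in the pattern's character class."""
    # Keep only the [...] part, dropping the {20,200} quantifier.
    char_class = pattern[: pattern.index("]") + 1]
    return re.fullmatch(char_class, ch) is not None


# '-' and '/' are ordinary members of the class; '.' is not.
hyphen_ok = in_class(DEFAULT_PATTERN, "-")
slash_ok = in_class(DEFAULT_PATTERN, "/")
period_ok = in_class(DEFAULT_PATTERN, ".")

# Made-up sample text containing the term from the question.
text = "see www.product4sale.com for a full list of items, prices and so on"

# Runs of 20 to 200 class characters; a '.' ends a run under the default pattern.
default_fragments = re.findall(DEFAULT_PATTERN, text)
period_fragments = re.findall(WITH_PERIOD, text)

print(hyphen_ok, slash_ok, period_ok)
print(default_fragments)
print(period_fragments)
```

With the default pattern the periods end each run, so only the text after
".com" is long enough to match. With the period added to the class, the whole
sample string matches as a single run.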

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


