Posted to java-user@lucene.apache.org by SBS <jt...@uow.edu.au> on 2011/08/17 00:15:49 UTC

Overriding default handling of '/' and '-'

Our document base includes terms which are in fact codes that may contain
dashes and slashes such as "M1234/5" and "12345-00".  Presently Lucene
appears to be breaking up these codes at the slashes and dashes, and
searches are therefore not working properly.  Instead of matching an exact
code of "12345-00", Lucene matches any text containing either "12345" or
"00", which is not desirable.

Is there a way to change this default behaviour (a filter perhaps)?  The
situation is complicated by the fact that the content also includes normal
text where processing of the slashes and dashes in this manner is probably
expected and desirable.  I guess if I turn off this default behaviour then I
will lose it for normal words but that is probably acceptable and
unavoidable.

Thanks,

-sbs

--
View this message in context: http://lucene.472066.n3.nabble.com/Overriding-default-handling-of-and-tp3259987p3259987.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Overriding default handling of '/' and '-'

Posted by Mihai Caraman <ca...@gmail.com>.

QueryParser is to blame, so avoid using it. As you said, filtering alone is
enough. That is how I did it: the incoming query arrived already split in
two, one part that needed full-text analysis and a second part that I
applied as an exact-match filter (I suppose this applies to you too).
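
A rough sketch of that two-part approach, in plain Java rather than Lucene's
own classes. The Doc shape, the field names and the crude analyze() method are
all assumptions made for illustration; only the idea (analyze the text part,
compare the code part verbatim) comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; none of these types are Lucene classes.
final class TwoPartQuerySketch {

    // Hypothetical document shape: free text plus a code stored verbatim.
    static final class Doc {
        final String body;
        final String code;

        Doc(String body, String code) {
            this.body = body;
            this.code = code;
        }
    }

    // Crude stand-in for full-text analysis: lowercase, split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // textQuery is analyzed; exactCode is never analyzed and must match as-is.
    static List<Doc> search(List<Doc> docs, String textQuery, String exactCode) {
        List<String> wanted = analyze(textQuery);
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (!d.code.equals(exactCode)) {
                continue; // exact-match filter on the code
            }
            if (analyze(d.body).containsAll(wanted)) {
                hits.add(d); // every analyzed query term occurs in the body
            }
        }
        return hits;
    }
}
```

Because the code part bypasses analysis entirely, a query for "12345" does not
match a document whose code is "12345-00".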

Re: Overriding default handling of '/' and '-'

Posted by Ian Lea <ia...@gmail.com>.

What analyzer are you using?  You could build your own including
MappingCharFilter to replace / and - with something that didn't cause
splits.  You could also get clever and insert the translated value in
the token stream as well as the original which might give you the best
of both worlds.
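
A minimal sketch of the mapping idea, written in plain Java so it stands alone;
it does not call MappingCharFilter or any other Lucene class. The "_" stand-in
character, the class name and the toy tokenizer are assumptions for
illustration, and the sketch assumes "_" never occurs in the real content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration of the idea only; this does not use Lucene's classes.
final class CodeTokenSketch {

    // Hypothetical stand-in character, assumed never to occur in the real content.
    static final String JOINER = "_";

    // "Char filter" step: map '/' and '-' to a character the tokenizer keeps.
    static String mapSeparators(String text) {
        return text.replace("/", JOINER).replace("-", JOINER);
    }

    // "Tokenizer" step: split on anything that is not a letter, digit or JOINER,
    // then emit the joined form followed by its individual parts.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String mapped = mapSeparators(text).toLowerCase(Locale.ROOT);
        for (String chunk : mapped.split("[^\\p{L}\\p{N}" + JOINER + "]+")) {
            if (chunk.replace(JOINER, "").isEmpty()) {
                continue; // empty chunk or a stray separator on its own
            }
            tokens.add(chunk); // e.g. "12345_00", matches the whole code
            if (chunk.contains(JOINER)) {
                for (String part : chunk.split(JOINER)) {
                    if (!part.isEmpty()) {
                        tokens.add(part); // e.g. "12345" and "00", as before
                    }
                }
            }
        }
        return tokens;
    }
}
```

Calling CodeTokenSketch.tokenize("Code 12345-00 is well-known") yields code,
12345_00, 12345, 00, is, well_known, well, known. An exact code search would
then look only for the joined form, while ordinary words remain findable by
their parts.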

If the codes were in their own field in your index you could use
KeywordAnalyzer for that field.
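
To make the per-field idea concrete, here is a small plain-Java sketch. It
mimics what a keyword-style analyzer does (the whole value becomes one token)
next to a crude general-purpose one; it is not Lucene's KeywordAnalyzer, and
the field name "code" is a made-up example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; these methods are not Lucene APIs.
final class PerFieldSketch {

    // Keyword-style analysis: the entire field value is a single token.
    static List<String> keyword(String value) {
        return List.of(value);
    }

    // Crude stand-in for a general-purpose analyzer:
    // lowercase, then split on anything that is not a letter or digit.
    static List<String> fullText(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Pick the analysis by field name; "code" is a hypothetical field name.
    static List<String> analyze(String field, String value) {
        return field.equals("code") ? keyword(value) : fullText(value);
    }
}
```

With this, analyze("code", "12345-00") keeps the code whole, while
analyze("body", "part 12345-00") still splits it into part, 12345 and 00.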

Whatever you do, don't forget to use the same analyzer at index and
search time, unless you are getting very clever.

Lucene in Action 2nd edition has useful info and code samples on
analysis chains, and much else besides.


--
Ian.

