Posted to dev@lucene.apache.org by Lukas Zapletal <lz...@root.cz> on 2003/01/31 21:27:25 UTC

Escaping bug \( and ? or *

Hello all,

Let's say we have the indexed text "Test (1) and test (2)".

Now search for: \(1\)

Everything is OK, so let's search for: \(?\)

Nothing found! It's the same with \" and maybe other escaped characters.

Is this a bug? Is it already solved in the CVS? If not, how can we fix it?

Thanks for the help! Lucene rocks.

PS: Has anybody compiled Lucene with GCJ? If so, were there any performance results?

-- 
Lukas Zapletal      [lzap@root.cz]
http://www.tanecni-olomouc.cz/lzap




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Escaping bug \( and ? or *

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Saturday 08 February 2003 02:14, Lukas Zapletal wrote:
> > Tatu Saloranta wrote:
> >> I think the problem is that the analyzer you used for indexing strips
> >> out parentheses. So the text actually indexed would look something like
> >> "test 1 test 2" (assuming 'and' is a stop word and is removed). Thus there's
> >> no token matching the term "(1)" or "(2)".
> >> The same goes for most other punctuation characters: they are routinely
> >> stripped by the analyzer, as they are usually not very useful for searching.
> >>
> >> To make it work the way you want, you need to modify the analyzer to
> >> include parentheses, perhaps so that they are kept only if
> >> they enclose just a single alphanumeric token (otherwise
> >> "(1 and 2)" would be tokenized to "(1" and "2)", which is probably
> >> not what you want).
>
> Well, this doesn't work. Check Bugzilla for the example: ESCAPING BUG
> \(abc\) and \(a*c\) in v1.2
>
> Can anyone help me with it?

I hope I'm not wrong this time, but isn't it the case that prefix/wildcard query
terms do not currently go through an analyzer? So searching for
\(abc\) would still search for "abc" (the analyzer runs after the query tokenizer
has parsed the main query structure and produced the term "(abc)"; the analyzer's
tokenizer then removes the parentheses), but searching for \(a*c\) would actually
search for (a*c). And the indexer likely hasn't included parentheses in the
indexed content?
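
To make the difference concrete, here is a minimal sketch in Python. It is a toy model of the behaviour described above, not Lucene's QueryParser: the names analyze, parse_term and search, the one-word stop list, and the regex-based matching are all assumptions made for illustration.

```python
import re

STOP_WORDS = {"and"}  # hypothetical stop-word list, for illustration only

def analyze(text):
    """Toy analyzer: lowercase, keep runs of letters/digits, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def parse_term(raw):
    """Split one query term into its literal text and, if it contains an
    unescaped * or ?, a compiled pattern (otherwise None)."""
    literal, regex, has_wildcard = [], [], False
    chars = iter(raw)
    for ch in chars:
        if ch == "\\":                 # escaped character: take the next one literally
            ch = next(chars, "")
            regex.append(re.escape(ch))
        elif ch in "*?":               # unescaped wildcard
            has_wildcard = True
            regex.append(".*" if ch == "*" else ".")
        else:
            regex.append(re.escape(ch))
        literal.append(ch)
    pattern = re.compile("".join(regex)) if has_wildcard else None
    return "".join(literal), pattern

def search(index_tokens, raw_term):
    text, pattern = parse_term(raw_term)
    if pattern is not None:
        # Wildcard term: compared with the index as-is, never analyzed.
        return [t for t in index_tokens if pattern.fullmatch(t)]
    # Plain term: analyzed first, so punctuation is stripped.
    wanted = analyze(text)
    return [t for t in index_tokens if t in wanted]

index = analyze("Test (1) and test (2)")
print(search(index, r"\(1\)"))   # ['1']
print(search(index, r"\(?\)"))   # []
```

In this model \(1\) matches because the analyzer strips the parentheses from the query term as well, while \(?\) keeps them and therefore cannot match any indexed token.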

If there were a way to define an analyzer for QueryParser to use for prefix
queries, this could be solved. This analyzer would need to be specialized,
however, to account for the * and ? characters, since they must not be removed
(even though removing them is normally the right thing to do).
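
As a rough illustration of what such a specialized step could do, here is a small Python sketch. It is only a toy under stated assumptions, not an existing Lucene facility: normalize_wildcard_term and wildcard_search are hypothetical names, the term is assumed to be already unescaped, and escaped wildcards (a literal \* or \?) are not handled.

```python
import re
from fnmatch import fnmatchcase

def normalize_wildcard_term(term):
    """Lowercase the term and drop everything except letters, digits, * and ?."""
    return re.sub(r"[^a-z0-9*?]", "", term.lower())

def wildcard_search(index_tokens, term):
    # Normalize the query term the way the index was normalized,
    # but leave the wildcard characters in place.
    pattern = normalize_wildcard_term(term)
    return [t for t in index_tokens if fnmatchcase(t, pattern)]

# What a punctuation-stripping analyzer might have indexed for "Test (abc)".
index = ["test", "abc"]
print(normalize_wildcard_term("(a*c)"))   # a*c
print(wildcard_search(index, "(a*c)"))    # ['abc']
```

With the parentheses removed from the wildcard term, it is compared against the same kind of tokens the index actually holds.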

-+ Tatu +-




Re: Escaping bug \( and ? or *

Posted by Lukas Zapletal <lz...@root.cz>.
> Tatu Saloranta wrote:
>
>> I think the problem is that the analyzer you used for indexing strips
>> out parentheses. So the text actually indexed would look something like
>> "test 1 test 2" (assuming 'and' is a stop word and is removed). Thus there's
>> no token matching the term "(1)" or "(2)".
>> The same goes for most other punctuation characters: they are routinely
>> stripped by the analyzer, as they are usually not very useful for searching.
>>
>> To make it work the way you want, you need to modify the analyzer to
>> include parentheses, perhaps so that they are kept only if
>> they enclose just a single alphanumeric token (otherwise
>> "(1 and 2)" would be tokenized to "(1" and "2)", which is probably
>> not what you want).
>
Well, this doesn't work. Check Bugzilla for the example: ESCAPING BUG
\(abc\) and \(a*c\) in v1.2

Can anyone help me with it?

-- 
Lukas Zapletal      [lzap@root.cz]
http://www.tanecni-olomouc.cz/lzap






Re: Escaping bug \( and ? or *

Posted by Lukas Zapletal <lz...@root.cz>.
Tatu Saloranta wrote:

>I think the problem is that the analyzer you used for indexing strips out
>parentheses. So the text actually indexed would look something like
>"test 1 test 2" (assuming 'and' is a stop word and is removed). Thus there's
>no token matching the term "(1)" or "(2)".
>The same goes for most other punctuation characters: they are routinely
>stripped by the analyzer, as they are usually not very useful for searching.
>
>To make it work the way you want, you need to modify the analyzer to
>include parentheses, perhaps so that they are kept only if
>they enclose just a single alphanumeric token (otherwise
>"(1 and 2)" would be tokenized to "(1" and "2)", which is probably
>not what you want).
>
Well, I think this is not true.

I use this analyzer for queries as well, so the parentheses and other
punctuation are also stripped when I build the query.

This is MAYBE a bug. PLEASE TEST THE CODE.

-- 
Lukas Zapletal      [lzap@root.cz]
http://www.tanecni-olomouc.cz/lzap






Re: Escaping bug \( and ? or *

Posted by Lukas Zapletal <lz...@root.cz>.
Tatu Saloranta wrote:

>To make it work the way you want, you need to modify the analyzer to
>include parentheses, perhaps so that they are kept only if
>they enclose just a single alphanumeric token (otherwise
>"(1 and 2)" would be tokenized to "(1" and "2)", which is probably
>not what you want).
>
AAAH, and I was creating the JUnit test...

Could somebody please reject the record in Bugzilla? I'm sorry.

Thanks Tatu! Have a nice day.

--
Respect to all the space pilots whose lives ended on board Columbia.

-- 
Lukas Zapletal      [lzap@root.cz]
http://www.tanecni-olomouc.cz/lzap






Re: Escaping bug \( and ? or *

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Friday 31 January 2003 13:27, Lukas Zapletal wrote:
> Hello all,
>
> Let's say we have the indexed text "Test (1) and test (2)".
>
> Now search for: \(1\)
>
> Everything is OK, so let's search for: \(?\)
>
> Nothing found! It's the same with \" and maybe other escaped characters.
>
> Is this a bug? Is it already solved in the CVS? If not, how can we fix it?

I think the problem is that the analyzer you used for indexing strips out
parentheses. So the text actually indexed would look something like
"test 1 test 2" (assuming 'and' is a stop word and is removed). Thus there's
no token matching the term "(1)" or "(2)".
The same goes for most other punctuation characters: they are routinely
stripped by the analyzer, as they are usually not very useful for searching.

To make it work the way you want, you need to modify the analyzer to
include parentheses, perhaps so that they are kept only if
they enclose just a single alphanumeric token (otherwise
"(1 and 2)" would be tokenized to "(1" and "2)", which is probably
not what you want).
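
To illustrate the first point, here is a small Python sketch of what a typical punctuation-stripping analyzer does to the example text. It is a toy model, not Lucene's analyzer API: the analyze function and the stop-word list are assumptions made for illustration.

```python
import re

# Hypothetical stop-word list; real analyzers ship their own.
STOP_WORDS = {"and", "or", "the", "a"}

def analyze(text):
    """Lowercase, split on anything that is not a letter or digit, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

indexed = analyze("Test (1) and test (2)")
print(indexed)            # ['test', '1', 'test', '2']
print("(1)" in indexed)   # False
print("1" in indexed)     # True
```

The parentheses never reach the index, so a term that still contains them has nothing to match.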

-+ Tatu +-




Re: Escaping bug \( and ? or *

Posted by Lukas Zapletal <lz...@root.cz>.
Lukas Zapletal wrote:

> Hello all,
>
> Let's say we have the indexed text "Test (1) and test (2)".
>
> Now search for: \(1\)
>
> Everything is OK, so let's search for: \(?\)
>
> Nothing found! It's the same with \" and maybe other escaped characters.
>
> Is this a bug? Is it already solved in the CVS? If not, how can we fix
> it?
>
> Thanks for the help! Lucene rocks.
>
> PS: Has anybody compiled Lucene with GCJ? If so, were there any
> performance results?
>
Attaching a JUnit test...

-- 
Lukas Zapletal      [lzap@root.cz]
http://www.tanecni-olomouc.cz/lzap