Posted to java-user@lucene.apache.org by saisantoshi <sa...@gmail.com> on 2013/10/20 18:47:21 UTC

Handling special characters in Lucene 4.0

I have created strings like the ones below:

&&searchtext
+sampletext

When I search for them using *&&** or *+**, I get no results.

I am using the QueryParser.escape(String s) method to handle the special
characters, but it does not appear to have any effect.

Also, when I search something like this:

title:search*

it works and returns the search result

but when I search like the following, it won't work:
title:*&&**

( No Result)

Is the above a valid search criterion? If not, can someone suggest what
would be an appropriate one?

It seems that StandardAnalyzer strips out all the special characters before
searching, which would explain why searches without special characters
work.

Thanks,
Sai.




--
View this message in context: http://lucene.472066.n3.nabble.com/Handling-special-characters-in-Lucene-4-0-tp4096674.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Handling special characters in Lucene 4.0

Posted by Jack Krupansky <ja...@basetechnology.com>.

Yes, other special (punctuation) characters will be preserved by the white
space analyzer, but they must be escaped in query strings. You will have to
escape them manually with a backslash, since the QueryParser.escape method
also escapes the asterisk, which would disable the wildcard query.

-- Jack Krupansky
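
A minimal sketch of the manual escaping described above, in plain Java with
no Lucene dependency. The class and method names (QueryEscapeSketch,
escapeKeepingAsterisk) are hypothetical, and the set of escaped characters is
assumed from the "Escaping Special Characters" list quoted elsewhere in this
thread, with each & and | escaped individually. The asterisk is deliberately
left alone so that it can still act as a wildcard.

```java
final class QueryEscapeSketch {
    // Query syntax characters to escape: \ + - ! ( ) : ^ [ ] " { } ~ ? | & /
    // Assumption: '*' is intentionally absent so wildcard queries keep working.
    private static final String TO_ESCAPE = "\\+-!():^[]\"{}~?|&/";

    // Puts a backslash in front of every character listed in TO_ESCAPE.
    static String escapeKeepingAsterisk(String input) {
        StringBuilder out = new StringBuilder(input.length() + 8);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (TO_ESCAPE.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeKeepingAsterisk("&&search*")); // \&\&search*
        System.out.println(escapeKeepingAsterisk("+sample*"));  // \+sample*
    }
}
```

The escaped string would then be passed to the query parser in place of the
output of QueryParser.escape. This only helps if the analyzer in use keeps
the punctuation in the indexed terms.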

-----Original Message----- 
From: saisantoshi
Sent: Sunday, October 20, 2013 7:43 PM
To: java-user@lucene.apache.org
Subject: Re: Handling special characters in Lucene 4.0

What about other characters, like & and ' (quote)? We have a requirement
that a text can start with 'sampletext', and when I search with '* it does
not return any results, but when I search with sample*, it does return the
result.

Thanks,
Ranjith,




Re: Handling special characters in Lucene 4.0

Posted by Jack Krupansky <ja...@basetechnology.com>.

Right, the "Escaping Special Characters" section is simply about escaping
query operators like "&&" (which means "AND") and "+" (which means "AND" or
"MUST"). Yes, the white space analyzer could be used, or a custom analyzer
that uses the white space tokenizer and then also uses a filter to strip out
any punctuation characters that you don't want to keep (e.g., period, comma,
semicolon, parentheses, etc.).

The query parser itself knows nothing about what your chosen analyzer does.
But the query parser does specially interpret the special characters that
the escape method mentions.

-- Jack Krupansky
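
As a rough illustration of the custom-analyzer idea above (split on white
space, then strip only the punctuation you don't want to keep), here is a
plain-Java stand-in. It does not use Lucene's Analyzer or TokenFilter
classes; PunctuationFilterSketch, analyze, and the UNWANTED character set
are hypothetical names chosen for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class PunctuationFilterSketch {
    // Punctuation this hypothetical filter removes; anything else,
    // such as & + and ', stays in the token.
    private static final String UNWANTED = ".,;()";

    // Step 1: split on white space. Step 2: drop unwanted characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            StringBuilder kept = new StringBuilder();
            for (int i = 0; i < raw.length(); i++) {
                char c = raw.charAt(i);
                if (UNWANTED.indexOf(c) < 0) {
                    kept.append(c);
                }
            }
            // Skip tokens that consisted only of unwanted punctuation.
            if (kept.length() > 0) {
                tokens.add(kept.toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [&&searchtext, +sampletext]
        System.out.println(analyze("&&searchtext, (+sampletext);"));
    }
}
```

The same processing would have to be applied at both index time and query
time, so that the terms in the index and the terms in the query line up.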

-----Original Message----- 
From: saisantoshi
Sent: Sunday, October 20, 2013 7:12 PM
To: java-user@lucene.apache.org
Subject: Re: Handling special characters in Lucene 4.0

Thanks.

So, if I understand correctly, StandardAnalyzer won't work for the query
below, as it strips out the special characters and searches only on
searchText (in this case).

queryText = *&&searchText*

If we want to do a search like "*&&**", then we need to use
WhitespaceAnalyzer. Please let me know if my understanding is correct.

Also, I am unsure about the following passage from the Lucene docs. Does it
not apply to StandardAnalyzer, then? It does not mention that it won't work
with StandardAnalyzer.

/*
Escaping Special Characters

Lucene supports escaping special characters that are part of the query
syntax. The current list special characters are

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

To escape these character use the \ before the character. For example to
search for (1+1):2 use the query:

\(1\+1\)\:2

*/

Thanks,
Sai.





Re: Handling special characters in Lucene 4.0

Posted by Benson Margulies <be...@basistech.com>.

It might be helpful if you would explain, at a higher level, what you
are trying to accomplish. Where do these things come from? What
higher-level problem are you trying to solve?


Re: Handling special characters in Lucene 4.0

Posted by Jack Krupansky <ja...@basetechnology.com>.

The standard analyzer should remove those ampersands and pluses, so the core 
alpha terms should be matched. You would need to use the white space 
analyzer or a custom analyzer to preserve such special characters.

Please give a specific indexed text string and a specific query that fails 
against it.

Also, QueryParser.escape escapes asterisks as well, so they won't perform a
wildcard query. And then the standard analyzer will remove the asterisks, as
it does with most punctuation. If you switch to an analyzer that preserves
special characters, you can then manually escape special characters with a
backslash and leave the asterisk unescaped to perform a wildcard query.

-- Jack Krupansky
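
To see why a prefix such as && finds nothing once the analyzer has dropped
the punctuation, here is a toy comparison in plain Java. The two
term-producing methods are simplified stand-ins, not Lucene's
StandardAnalyzer or white space analyzer, and all names in the sketch are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrefixMatchSketch {
    // Stand-in for an analyzer that keeps punctuation: split on white space.
    static List<String> whitespaceTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) terms.add(part);
        }
        return terms;
    }

    // Stand-in for an analyzer that drops punctuation: keep only runs of
    // letters and digits, lowercased.
    static List<String> alphanumericTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) terms.add(part);
        }
        return terms;
    }

    // A prefix (wildcard) lookup can only match terms that were indexed.
    static boolean anyTermStartsWith(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String indexed = "&&searchtext +sampletext";
        // Punctuation dropped: terms are searchtext and sampletext.
        System.out.println(anyTermStartsWith(alphanumericTerms(indexed), "&&"));     // false
        System.out.println(anyTermStartsWith(alphanumericTerms(indexed), "search")); // true
        // Punctuation kept: the term &&searchtext exists, so the prefix matches.
        System.out.println(anyTermStartsWith(whitespaceTerms(indexed), "&&"));       // true
    }
}
```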

-----Original Message----- 
From: saisantoshi
Sent: Sunday, October 20, 2013 6:02 PM
To: java-user@lucene.apache.org
Subject: Re: Handling special characters in Lucene 4.0

StandardAnalyzer both at index and search time. We use the default one and
don't have any custom analyzers.

Thanks,
Sai




Re: Handling special characters in Lucene 4.0

Posted by Jack Krupansky <ja...@basetechnology.com>.

Maybe you are not using the same analyzer at index and query time. Even 
though you are correctly escaping the special query syntax characters, 
either the query analyzer is removing them or your index analyzer removed 
them. What analyzer are you using at index time? And, what analyzer are you 
using at query time?

-- Jack Krupansky
