Posted to java-user@lucene.apache.org by ra...@barclays.com on 2013/11/15 21:21:45 UTC

WhitespaceAnalyzer vs StandardAnalyzer

Hi,

I implemented my Lucene solution using StandardAnalyzer for both indexing and searching. While testing, I noticed that special characters such as hyphens and forward slashes are dropped by this analyzer.

In plain English, the requirement is to search for individual words; in Lucene terms, whitespace should be the only token delimiter. Also, no part of the text should be modified or omitted.

For example: ModelNumber: ABC/x:123
Here there should be only 2 tokens, "ModelNumber:" and "ABC/x:123".

Based on what I read about WhitespaceAnalyzer, it sounds as though it can do exactly what I am looking for. Before I make this big decision, I also wanted to run this by you folks to check if there are any side-effects of switching the Analyzer - keeping in mind my requirements.

Any suggestions as always would be greatly appreciated.

Regards,
Raghu

_______________________________________________

This message is for information purposes only, it is not a recommendation, advice, offer or solicitation to buy or sell a product or service nor an official confirmation of any transaction. It is directed at persons who are professionals and is not intended for retail customer use. Intended for recipient only. This message is subject to the terms at: www.barclays.com/emaildisclaimer.

For important disclosures, please see: www.barclays.com/salesandtradingdisclaimer regarding market commentary from Barclays Sales and/or Trading, who are active market participants; and in respect of Barclays Research, including disclosures relating to specific issuers, please see http://publicresearch.barclays.com.

_______________________________________________

Re: WhitespaceAnalyzer vs StandardAnalyzer

Posted by VIGNESH S <vi...@gmail.com>.

Hi,

WhitespaceAnalyzer would be ideal for your requirement.


-- 
Thanks and Regards
Vignesh Srinivasan
9739135640

Re: WhitespaceAnalyzer vs StandardAnalyzer

Posted by Erick Erickson <er...@gmail.com>.

What are you getting from looking at your admin/analysis page? That should
help a lot. Otherwise you haven't provided much info about what's failing,
for instance debug=query output to see what gets through your parser.

You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick



RE: WhitespaceAnalyzer vs StandardAnalyzer

Posted by ra...@barclays.com.

Hi All,

Could anyone please advise whether it is possible to perform leading and/or trailing wildcard searches using WhitespaceAnalyzer?

As noted below, WhitespaceAnalyzer works well for my use case, but I need to support wildcard searches. Initial results suggest that it isn't possible.

BTW, I escape all the special characters and then tried appending "*" at the end. It didn't help.

Please advise. It is really urgent. I appreciate any possible help!

Regards,
Raghu
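
One possible reading of the attempt described above, sketched in plain Java: if the "*" is part of the text passed to escape(), it comes out as a literal "\*" and no longer acts as a wildcard, so it has to be appended after escaping. The class WildcardQuerySketch and its helper methods are hypothetical, and escape() here is only a stand-in modeled on the special-character list in the query parser syntax documentation, not Lucene's own QueryParser.escape(). The sketch assumes a single term with no whitespace. Per that same documentation, wildcards apply within single terms rather than inside quoted phrases, and a leading "*" is rejected unless the parser is configured to allow it. It may also be worth checking with query.toString() whether the parser lowercases wildcard terms, which would miss upper-cased tokens; that last point is an assumption, not something confirmed in this thread.

```java
final class WildcardQuerySketch {

    // Characters treated as special by the classic query syntax (assumed list).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Stand-in for QueryParser.escape(): backslash-prefix each special character.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Trailing wildcard: escape the term first, then append an unescaped '*'.
    static String prefixQuery(String term) {
        return escape(term) + "*";
    }

    // Leading and trailing wildcard; the parser must be set up to accept a leading '*'.
    static String containsQuery(String term) {
        return "*" + escape(term) + "*";
    }

    public static void main(String[] args) {
        // '*' escaped together with the term: it becomes a literal character.
        System.out.println(escape("ABC/x:1*"));
        // '*' appended after escaping: it stays a wildcard.
        System.out.println(prefixQuery("ABC/x:1"));
        System.out.println(containsQuery("ABC/x:1"));
    }
}
```

The resulting string would then be handed to parser.parse() without surrounding double quotes.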


RE: WhitespaceAnalyzer vs StandardAnalyzer

Posted by ra...@barclays.com.

The solution clicked for me as soon as I sent the email :)

The problem was that I was enclosing the search text in double quotes (for a PhraseQuery) before providing it to QueryParser. Double quotes are among Lucene's special characters, so I guess they were being escaped along with everything else. I have now changed the code as follows.

Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_43);
QueryParser parser = new QueryParser(Version.LUCENE_43, "CONTENTS", analyzer); 
query = parser.parse("\"" + QueryParser.escape(strTxtSearchString.toUpperCase()) + "\"");

Regards,
Raghu
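
The ordering issue described above can be illustrated in plain Java. This is only a sketch: PhraseEscapeSketch and its methods are made-up names, and escape() is a stand-in modeled on the special-character list in the query parser syntax documentation, not Lucene's QueryParser.escape() itself. It shows that quotes added before escaping turn into literal characters, while quotes added after escaping remain phrase delimiters.

```java
import java.util.Locale;

final class PhraseEscapeSketch {

    // Characters treated as special by the classic query syntax (assumed list).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Stand-in for QueryParser.escape(): backslash-prefix each special character.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Escape the user's text first, then wrap it in unescaped phrase quotes.
    static String phraseQuery(String userText) {
        return "\"" + escape(userText.toUpperCase(Locale.ROOT)) + "\"";
    }

    public static void main(String[] args) {
        // Quotes added before escaping: prints \"MODERN CORPORATION\"
        System.out.println(escape("\"modern corporation\"".toUpperCase(Locale.ROOT)));
        // Quotes added after escaping: prints "MODERN CORPORATION"
        System.out.println(phraseQuery("modern corporation"));
    }
}
```

In the first case the parser would see two whitespace-separated terms that happen to contain quote characters, which matches the BooleanQuery reported earlier in the thread.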


RE: WhitespaceAnalyzer vs StandardAnalyzer

Posted by ra...@barclays.com.

Thank you very much, Erick.

WhitespaceAnalyzer is going pretty well. I am now trying to search for values with special characters that need escaping for Lucene, but I am facing some issues.

I have used the QueryParser.escape() method in the past with StandardAnalyzer and it worked fine. But now with WhitespaceAnalyzer, the final query is being altered once I use the escape() method. Below is an example.

***Code***
Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_43);
QueryParser parser = new QueryParser(Version.LUCENE_43, "CONTENTS", analyzer);
query = parser.parse(QueryParser.escape(strTxtSearchString.toUpperCase()));

***Result***
Raw Search string passed: modern corporation
It is provided to Lucene as: "modern corporation" for PhraseQuery

Type of query: BooleanQuery
query.toString: CONTENTS:"MODERN CONTENTS:CORPORATION"

whereas I am expecting:

Type of query: PhraseQuery
query.toString: CONTENTS:"MODERN CORPORATION"

Please let me know if I am doing anything wrong. As a last resort, I am planning to escape the special characters manually by preceding them with a "\".

Regards,
Raghu


Re: WhitespaceAnalyzer vs StandardAnalyzer

Posted by Erick Erickson <er...@gmail.com>.

Well, your example will work exactly as you want. And if your input is
strictly controlled, that's fine. But if you're putting in text, for
instance, punctuation will be part of the token. I.e. in the sentence just
before this one, "token" would not be found, but "token." would.

The admin/analysis page is your friend :).

You might want to consider following with a LowerCaseFilterFactory here
unless you want your searches to be case sensitive.

And do watch querying in this case. You need to escape things like the
colon and other special characters, see:
http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Escaping Special Characters

Best,
Erick
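
A rough way to see the punctuation point is to mimic whitespace-only tokenization in plain Java. This is only a sketch: the class WhitespaceTokenSketch and its tokenize method are made up for illustration and use String.split rather than Lucene's WhitespaceAnalyzer, so edge cases (for example unusual Unicode whitespace) may differ from what Lucene does.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceTokenSketch {

    // Splits on runs of whitespace only; every other character stays in the token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The model-number example from the original question yields two tokens.
        System.out.println(tokenize("ModelNumber: ABC/x:123"));

        // Trailing punctuation stays attached, so "token" and "token." differ.
        List<String> sentence = tokenize("punctuation will be part of the token.");
        System.out.println(sentence.contains("token"));   // false
        System.out.println(sentence.contains("token."));  // true
    }
}
```

No case folding is applied here either, so "ABC" and "abc" would be distinct tokens, which is the reason for the lowercasing suggestion above.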

