Posted to java-user@lucene.apache.org by Michael Barbarelli <mb...@gmail.com> on 2007/07/12 19:23:45 UTC

Customizing Stop Word List?

Hello to All,

I'm having a problem with Lucene where certain words that I would like to be
included in the query are actually being omitted from it, and I think that is
because Lucene recognizes them as stop words.  This is the case with five
terms in particular.  They look like English grammar particles, but they are
actually ISO country codes.

"IT"     (Italy)
"IN"     (India)
"BE"   (Belgium)
"NO"   (Norway)
"AT"    (Austria)

Scenario:
-------------
The user submits ISO country codes as part of the Lucene query to be matched
against a field in the Lucene index that also contains ISO country codes.
In most cases this works fine, because the majority of ISO country codes do
not resemble grammar particles.  The following are okay, for example.

GB  (Great Britain)
FR  (France)
NL  (Netherlands)

The following are stripped from queries, as listed above.

 "IT"     (Italy)
"IN"     (India)
"BE"   (Belgium)
"NO"   (Norway)
"AT"    (Austria)


So far, I have attempted to fix this problem by defining my own list of stop
words and passing that array to the StandardAnalyzer used for both indexing
and searching.  That didn't work.  Would a per-field analyzer work in this
case?
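
To illustrate what I mean by a per-field analyzer, something along these
lines is what I have in mind, using PerFieldAnalyzerWrapper from Lucene.Net
(a rough sketch, not my actual code; company_name and country_iso are the
field names my application uses, and the stop list here is abbreviated):

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    // Sketch: StandardAnalyzer (with a custom stop list) for free-text
    // fields, and WhitespaceAnalyzer for the country code field so that
    // codes such as "IT" or "IN" are never treated as stop words.
    string[] myStopWords = { "a", "and", "the" };   // abbreviated example list
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer(myStopWords));
    analyzer.AddAnalyzer("country_iso", new WhitespaceAnalyzer());

    // The same wrapper would be handed to both indexing and query parsing.
    Query query = QueryParser.Parse("country_iso:BE widgets",
                                    "company_name", analyzer);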

Any ideas?  Many thanks in advance for your help.

Re: Customizing Stop Word List?

Posted by Michael Barbarelli <mb...@gmail.com>.
Please disregard my previous request for assistance.  I've fixed the bug I
was struggling with, and it actually had nothing to do with the analyzer in
question.

Thanks very much.


Re: Customizing Stop Word List?

Posted by Michael Barbarelli <mb...@gmail.com>.
Here's the sample code.  Incidentally, this is in C#; I am using Lucene.Net,
but I am assuming this problem could apply to any version or port, and that
this is a question best exposed to the collective wisdom of the Java user
group.

First, the default stop word list (note that it contains the ISO country
code equivalents "at", "be", "in" and "no"):

public string[] DEFAULT_STOP_WORDS = { "a", "and", "are", "as", "at", "be",
"but", "by", "for", "if", "in", "into", "is", "no", "not", "of", "on", "or",
"s", "such", "t", "that", "the", "their", "then", "there", "these", "they",
"this", "to", "was", "will", "with", "inc","incorporated","co.","ltd","ltd."
};
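
For comparison, the analyzer's own built-in stop list can be printed
directly.  This is a sketch rather than code from my application, and it
assumes the STOP_WORDS field that the StandardAnalyzer of this vintage
exposes; "at", "be", "in", "it" and "no" all appear in that built-in list,
which is why those codes vanish when the default analyzer is used.

    using System;
    using Lucene.Net.Analysis.Standard;

    // Dump the analyzer's built-in stop words (assumes the STOP_WORDS field
    // is available); "at", "be", "in", "it" and "no" are all in this list.
    foreach (string stopWord in StandardAnalyzer.STOP_WORDS)
    {
        Console.WriteLine(stopWord);
    }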

Next, I create an array containing the stop words, but with the ISO country
code equivalents omitted:

public string[] MY_STOP_WORDS = { "a", "and", "are", "as", "but", "by",
"for", "if", "in", "into", "is", "no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these", "they", "this", "to",
"was", "will", "with", "inc", "incorporated", "co.", "ltd", "ltd." };

Next, I create the query and submit the search, passing the MY_STOP_WORDS
array to the StandardAnalyzer:

Query query = QueryParser.Parse(strQuery, "company_name", new
StandardAnalyzer(MY_STOP_WORDS));

Hits hits = searcher.Search(query);
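
The index side isn't shown in this post, but the same custom list is passed
to the analyzer used for indexing.  Roughly, and with a placeholder path and
writer setup rather than my real code:

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;

    // Assumed sketch: give the IndexWriter the same custom stop list so that
    // index-time and query-time analysis agree.  "C:\\my_index" is a
    // placeholder path, not my real location.
    IndexWriter writer =
        new IndexWriter("C:\\my_index", new StandardAnalyzer(MY_STOP_WORDS), true);

    // ... doc.Add(...) calls as shown further below ...
    writer.AddDocument(doc);
    writer.Optimize();
    writer.Close();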

Note that the default field for the query object is company_name. However,
multi-field queries will be submitted to the query object in the variable
"strQuery".

For example,

+(company_name:widgets ^10~ international ^5~ incorporated~ )
+(country_iso:US)

There is a bit of logic elsewhere in my application that constructs this
syntax based on field names and values submitted via the UI. However, if one
of those country code values is "AT", "BE", "IT", "IN", etc., then the query
is erroneously constructed as follows. Note that the country code is
missing.



+(company_name:belgium ^10~ telecom ^5~ ) +(country_iso:)
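
The clause-building code itself isn't included here, but a defensive version
of it would presumably skip empty values so that a dangling field prefix
never reaches the parser.  A hypothetical sketch, not my actual code:

    using System.Text;

    // Hypothetical helper: only emit a clause when the field has a non-empty
    // value, so the query string never contains "country_iso:" with nothing
    // after it.
    void AppendRequiredClause(StringBuilder query, string field, string value)
    {
        if (value == null || value.Trim().Length == 0)
            return;   // skip empty values entirely
        query.Append(" +(").Append(field).Append(":").Append(value.Trim()).Append(")");
    }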



Note that the country_iso value is empty. If a query is submitted to the
search object in this way, then I receive the following exception at
runtime:

Lucene.Net.QueryParsers.ParseException was unhandled by user code

Message="Encountered \")\" at line 1, column 60.\r\nWas expecting one
of:\r\n \"(\" ...\r\n <QUOTED> ...\r\n <TERM> ...\r\n <PREFIXTERM> ...\r\n
<WILDTERM> ...\r\n \"[\" ...\r\n \"{\" ...\r\n <NUMBER> ...\r\n "

Source="Lucene.Net"

StackTrace:

at Lucene.Net.QueryParsers.QueryParser.jj_consume_token(Int32 kind)

at Lucene.Net.QueryParsers.QueryParser.Clause(String field)

at Lucene.Net.QueryParsers.QueryParser.Query(String field)



And finally, here is how I am creating my index:



doc.Add(Field.Keyword("rec_id", entity_id.Trim()));

doc.Add(Field.Text("aaa", ob10_account_id.Trim()));

doc.Add(Field.Text("company_name", entity_name.Trim()));

doc.Add(Field.Text("VAT_reg", VAT_reg.Trim()));

doc.Add(Field.Text("account_type_description",
account_type_description.Trim()));

doc.Add(Field.Text("account_type", account_type.Trim()));

doc.Add(Field.Text("add_line1", add_line1.Trim()));

doc.Add(Field.Text("add_line2", add_line2.Trim()));

doc.Add(Field.Text("add_line3", add_line3.Trim()));

doc.Add(Field.Text("add_line4", add_line4.Trim()));

doc.Add(Field.Text("add_line5", add_line5.Trim()));

doc.Add(Field.Text("add_line6", add_line6.Trim()));

doc.Add(Field.Keyword("country_iso", country_iso.Trim()));

doc.Add(Field.Text("country_name", country_name.Trim()));

doc.Add(Field.Text("entity_status_desc", entity_status_desc.Trim()));

doc.Add(Field.Text("acct_status_desc", acct_status_desc.Trim()));

doc.Add(Field.Text("firstname", firstname.Trim()));

doc.Add(Field.Text("lastname", lastname.Trim()));



writer.AddDocument(doc);
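
Incidentally, since country_iso is added with Field.Keyword (stored and
indexed, but not tokenized), it could also be queried without going through
an analyzer at all.  A rough sketch, reusing the searcher from above:

    using Lucene.Net.Index;
    using Lucene.Net.Search;

    // Sketch: Field.Keyword values are indexed as single, untokenized terms,
    // so a TermQuery matches them exactly and no stop-word filtering is
    // involved.  This could be combined with the parsed company_name query
    // via a BooleanQuery.
    Query countryQuery = new TermQuery(new Term("country_iso", "BE"));
    Hits hits = searcher.Search(countryQuery);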



Have I submitted my custom stop words incorrectly? Should I somehow use a
per-field analyzer for the country_iso field? If so, which?

Thanks so much in advance for your help.

Re: Customizing Stop Word List?

Posted by Michael Barbarelli <mb...@gmail.com>.
Hello Hoss.

Cheers for your response.  Much appreciated.

"typically the act of writing this sample code helps you spot where you
may be doing something wrong in your application"

Fair point.  Unfortunately, I won't be able to post any sample code until I
return to my home office; I will post it as soon as possible.


"you'll have to be a little more specific about "That didn't work." "

Well, after supplying my own list of stop words as an argument to both
analyzers (indexing and searching), I noticed that the ISO country codes in
question (AT, BE, IN, etc.) were still being stripped from the query syntax,
as though my own list of stop words had never taken effect.  I
would very much appreciate it if you could stand by for my sample code.
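
In the meantime, the kind of self-contained check I have in mind is roughly
the following sketch (not code from the application): run a few of the codes
through a StandardAnalyzer built with a custom stop list and print whichever
tokens survive.

    using System;
    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Standard;

    // Sketch: with the custom list below (which omits the country codes),
    // all five test tokens are printed; with new StandardAnalyzer() and its
    // default stop words, only "gb" survives.
    string[] customStops = { "the", "and", "of" };
    Analyzer analyzer = new StandardAnalyzer(customStops);
    TokenStream stream =
        analyzer.TokenStream("country_iso", new StringReader("at be in no gb"));
    Token token;
    while ((token = stream.Next()) != null)
    {
        Console.WriteLine(token.TermText());
    }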

Kindest Regards,
Mike

Re: Customizing Stop Word List?

Posted by Chris Hostetter <ho...@fucit.org>.
: So far, I have attempted to fix this problem by defining my own list of stop
: words and passing that array onto a standard analyzer used for both indexing
: and searching.  That didn't work.  Would a per-field analyzer work in this

that is the correct way to change your stop word set ... you'll have to be
a little more specific about "That didn't work." to get any more specific
suggestions on your problem ... i.e., can you send a small bit of
self-contained sample code that shows it not working for you?

(typically the act of writing this sample code helps you spot where you
may be doing something wrong in your application)


-Hoss

