Posted to java-user@lucene.apache.org by Shahak Nagiel <sn...@yahoo.com> on 2013/05/21 16:06:14 UTC
Case insensitive StringField?
It appears that StringField instances are treated as literals, even though my analyzer lower-cases (on both write and read sides). So, for example, I can match with a term query (e.g. "NEW YORK"), but only if the case matches. If I use a QueryParser (or MultiFieldQueryParser), it never works because these query values are lowercased and don't match.
I've found that using a TextField instead works, presumably because it's tokenized and processed correctly by the write analyzer. However, I would prefer that queries match against the entire/exact phrase ("NEW YORK"), rather than among the tokens ("NEW" or "YORK").
What's the solution here?
Thanks in advance.
RE: Case insensitive StringField?
Posted by Michael Ryan <mr...@moreover.com>.
Here's what we use for this (Solr schema syntax):
<fieldType name="caseInsensitiveString" class="solr.TextField" indexed="true" stored="true" omitNorms="true" sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="someField" type="caseInsensitiveString" omitTermFreqAndPositions="true"/>
As far as I know, StringField does not use analyzers at all - they'll just be ignored.
KeywordTokenizerFactory does the "exact phrase" bit, and LowerCaseFilterFactory does the lowercasing.
-Michael
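To make the effect of that chain concrete, here is a minimal plain-Java model that uses only the standard library, with no Lucene or Solr classes. The class `KeywordLowercaseModel` and both of its methods are hypothetical names invented for this illustration: `keywordAnalyze` stands in for a keyword tokenizer followed by a lowercase filter, and `whitespaceAnalyze` stands in for a tokenizer that splits on whitespace. It only models which terms would end up in the index; it is not how Lucene implements analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model class for illustration; not part of Lucene or Solr.
final class KeywordLowercaseModel {

    // Models keyword tokenizer + lowercase filter: the whole value becomes one term.
    static List<String> keywordAnalyze(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value.toLowerCase(Locale.ROOT));
        return terms;
    }

    // Models a whitespace-splitting tokenizer + lowercase filter: one term per word.
    static List<String> whitespaceAnalyze(String value) {
        List<String> terms = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(keywordAnalyze("NEW YORK"));    // prints [new york]
        System.out.println(whitespaceAnalyze("NEW YORK")); // prints [new, york]
    }
}
```

With the single term "new york" in the index, a query only matches if it is reduced to exactly that term as well, which is why the same analysis has to be applied on the query side.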
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Case insensitive StringField?
Posted by Jack Krupansky <ja...@basetechnology.com>.
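Yes, it is, and it always will: the query parser splits the query text on whitespace before the analyzer sees any of it, so each word is analyzed separately. But you can escape the space with a backslash:
    Query q = qp.parse("new\\ york");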
-- Jack Krupansky
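If the value comes from user input rather than a literal, the escaping suggested above could be applied by a small helper before the string is handed to the parser. The following is a sketch using only the standard library; `QuerySpaceEscaper` and `escapeWhitespace` are hypothetical names, not Lucene API. It assumes the input is raw, unescaped text, and it deliberately leaves every other query-syntax character untouched.

```java
// Hypothetical helper for illustration; not part of Lucene.
final class QuerySpaceEscaper {

    // Trims the input, then prefixes each remaining whitespace character
    // with a backslash so a whitespace-splitting parser sees one term.
    static String escapeWhitespace(String raw) {
        String trimmed = raw.trim();
        StringBuilder out = new StringBuilder(trimmed.length() + 4);
        for (int i = 0; i < trimmed.length(); i++) {
            char c = trimmed.charAt(i);
            if (Character.isWhitespace(c)) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "new york" becomes the query text new\ york
        System.out.println(escapeWhitespace("new york")); // prints new\ york
    }
}
```

Input that already contains backslashes would end up double-escaped, so a real implementation would need to decide how to treat those.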
Re: Case insensitive StringField?
Posted by Shahak Nagiel <sn...@yahoo.com>.
Jack / Michael: Thanks, but the query parser still seems to be tokenizing the query?
public class StringPhraseAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Emit the whole field value as a single token
        Tokenizer tok = new KeywordTokenizer(reader);
        // Lower-case that token, then trim surrounding whitespace
        TokenFilter filter = new LowerCaseFilter(Version.LUCENE_41, tok);
        filter = new TrimFilter(filter, true);
        return new TokenStreamComponents(tok, filter);
    }
}
...
Analyzer analyzer = new StringPhraseAnalyzer();
// using this analyzer, add document to index with city TextField (value "NEW YORK")
QueryParser qp = new QueryParser(Version.LUCENE_41, "city", analyzer);
Query q = qp.parse("new york");
System.out.println("Query: " + q);
results in...
Query: city:new city:york    // I expected "city:new york"
...and no matches. Is a QueryParser the wrong way to generate the query for this type of analyzer?
Thanks again!
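One way to see why the output has two clauses is to model the order of operations: the query text is split on whitespace first, and only then is each piece run through the analyzer. The sketch below is a plain-Java model of that order using only the standard library. `ParseOrderModel` and its methods are hypothetical and greatly simplified; they are not the real QueryParser, and `analyze` merely imitates the lowercase-and-trim behaviour of the analyzer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified model; not the real Lucene QueryParser.
final class ParseOrderModel {

    // Step 1: split on whitespace that is not escaped with a backslash.
    // An escaped character is kept literally and the backslash is dropped.
    static List<String> splitOnUnescapedWhitespace(String query) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\' && i + 1 < query.length()) {
                current.append(query.charAt(++i));
            } else if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    chunks.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    // Step 2: per-chunk analysis, imitating keyword + lowercase + trim.
    static String analyze(String chunk) {
        return chunk.trim().toLowerCase(Locale.ROOT);
    }

    // Builds one "field:term" clause per chunk.
    static List<String> parse(String field, String query) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : splitOnUnescapedWhitespace(query)) {
            clauses.add(field + ":" + analyze(chunk));
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("city", "new york"));   // prints [city:new, city:york]
        System.out.println(parse("city", "new\\ york")); // prints [city:new york]
    }
}
```

In this model the keyword-style analyzer never gets the chance to see "new york" as a whole unless the space is escaped, which matches the two-clause query printed above.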
Re: Case insensitive StringField?
Posted by Jack Krupansky <ja...@basetechnology.com>.
To be clear, analysis is not supported on StringField (or any non-tokenized
field). But the good news is that by using the keyword tokenizer
(KeywordTokenizer) on a TextField, you can get the same effect.
That will preserve the entire input as a single token. You may want to
include filters to trim exterior white space and normalize interior white
space.
-- Jack Krupansky
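As a rough illustration of the whitespace handling suggested above, the following plain-Java sketch shows the normalization a value would need so that it compares equal as a single token. It uses only the standard library; `WhitespaceNormalizer` and `normalize` are hypothetical names, and in Lucene the same steps would be carried out by token filters in the analyzer rather than by a helper like this.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Lucene.
final class WhitespaceNormalizer {

    // Trims exterior whitespace, collapses each interior run of whitespace
    // to a single space, and lower-cases the result.
    static String normalize(String value) {
        return value.trim()
                    .replaceAll("\\s+", " ")
                    .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("  NEW   York \t")); // prints new york
    }
}
```

The same normalization has to be applied to the indexed value and to the query value; otherwise the two single tokens will not be identical.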