Posted to java-user@lucene.apache.org by lu...@nitwit.de on 2004/04/02 17:00:54 UTC
Re: Zero hits for queries ending with a number
On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
> Field.Keyword is suitable for storing data like Url. Give that a try.
I just tried this a minute ago and found that I cannot use wildcards with
Keywords: url:www.yahoo.*
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Zero hits for queries ending with a number
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Extremely well said, Tatu!
On Apr 3, 2004, at 11:24 AM, Tatu Saloranta wrote:
> On Saturday 03 April 2004 08:34, lucene@nitwit.de wrote:
>> On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
>>> No objections that error messages and such could be made clearer.
>>> Patches welcome! Care to submit better error message handling in
>>> this
>>> case? Or perhaps allow lower-case "to"?
>>
>> I think the best would be if Lucene would simply have a
>> setCaseSensitive(boolean).
>>
>> IMHO it's in any case a bad idea to make searches case-sensitive (per
>> default).
>
> I'd have to disagree. I think that search engine core should not have to
> bother with details of character sets, such as lower-casing. Rules for
> lower/upper/initial/mixed case for all Unicode-languages are rather
> involved... and if you tried to do that, next thing would be whether
> accentuation and umlaut marks should matter or not (which is language
> dependent). That's why to me the natural way to go is to do direct
> comparison, with no case handling in the core, when executing queries.
> This does not prevent anyone from implementing such functionality (see
> below).
>
> I think the architecture and design of Lucene core is delightfully simple.
> One can easily create case-independent functionality by using proper
> analyzers, and (for the most part) configuring QueryParser. I would agree,
> however, that QueryParser is a "victim of its success"; it's too often used
> in situations where one really should create a proper GUI that builds the
> query. Backend code can then mangle input as it sees fit, and build query
> objects.
> QueryParser is more natural for quick-n-dirty scenarios, where one just has
> to slap something together quickly, or if one only has a textual interface
> to deal with. It's a nice thing to have, but it has its limitations; there's
> no way to create one parser that's perfect for every use(r).
>
> What could be done would be to make sure all examples / demo web apps
> implement case-insensitive indexing and searching, since that is often what
> is needed.
>
> -+ Tatu +-
>
>>
>>> But, also, folks need to really step back and practice basic
>>> troubleshooting skills. I asked you if that string was what you passed
>>> to the QueryParser and you said yes, when in fact it was not. And you
>>
>> I forgot that I did lower-case it. In fact I even output it in its original
>> state but lower-case it just before I pass it to lucene. That lower-casing
>> is what I would call a hack and hence it's no surprise that I forgot it :-)
>>
>> Timo
>>
Re: Zero hits for queries ending with a number
Posted by Tatu Saloranta <ta...@hypermall.net>.
On Saturday 03 April 2004 08:34, lucene@nitwit.de wrote:
> On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
> > No objections that error messages and such could be made clearer.
> > Patches welcome! Care to submit better error message handling in this
> > case? Or perhaps allow lower-case "to"?
>
> I think the best would be if Lucene would simply have a
> setCaseSensitive(boolean).
>
> IMHO it's in any case a bad idea to make searches case-sensitive (per
> default).
I'd have to disagree. I think that search engine core should not have to
bother with details of character sets, such as lower-casing. Rules for
lower/upper/initial/mixed case for all Unicode-languages are rather
involved... and if you tried to do that, next thing would be whether
accentuation and umlaut marks should matter or not (which is language
dependent). That's why to me the natural way to go is to do direct
comparison, with no case handling in the core, when executing queries. This
does not prevent anyone from implementing such functionality (see below).
I think the architecture and design of Lucene core is delightfully simple. One
can easily create case-independent functionality by using proper analyzers,
and (for the most part) configuring QueryParser. I would agree, however, that
QueryParser is a "victim of its success"; it's too often used in situations
where one really should create a proper GUI that builds the query. Backend code
can then mangle input as it sees fit, and build query objects.
QueryParser is more natural for quick-n-dirty scenarios, where one just has to
slap something together quickly, or if one only has a textual interface to deal
with. It's a nice thing to have, but it has its limitations; there's no way to
create one parser that's perfect for every use(r).
What could be done would be to make sure all examples / demo web apps
implement case-insensitive indexing and searching, since that is often what
is needed.
-+ Tatu +-
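A minimal sketch of the point about keeping case rules out of the core: the application chooses one normalization policy and applies it both when indexing and when querying. TermNormalizer is a hypothetical helper written for this illustration, not a Lucene class; in Lucene the same role is played by whatever analyzer you configure.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper (not Lucene API): one normalization policy for indexing and querying. */
final class TermNormalizer {
    private final Locale locale;
    private final boolean foldAccents;

    TermNormalizer(Locale locale, boolean foldAccents) {
        this.locale = locale;
        this.foldAccents = foldAccents;
    }

    /** Lower-cases with the configured locale and optionally strips accent marks. */
    String normalize(String term) {
        String result = term.toLowerCase(locale);
        if (foldAccents) {
            // Decompose characters (NFD), then drop the combining marks.
            result = Normalizer.normalize(result, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        }
        return result;
    }

    public static void main(String[] args) {
        TermNormalizer plain = new TermNormalizer(Locale.ROOT, false);
        TermNormalizer folding = new TermNormalizer(Locale.ROOT, true);
        TermNormalizer turkish = new TermNormalizer(Locale.forLanguageTag("tr"), false);

        // The literal below starts with U+00DC, capital U with diaeresis.
        String word = "\u00dcber";
        System.out.println(plain.normalize(word));      // umlaut kept
        System.out.println(folding.normalize(word));    // umlaut folded away
        System.out.println(turkish.normalize("TITLE")); // Turkish lower-casing of 'I' differs
    }
}
```

Whether the umlaut should be folded, and which locale's lower-casing applies, is a per-application decision, which is the argument for leaving it to analyzers rather than a single setCaseSensitive(boolean) switch.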
>
> > But, also, folks need to really step back and practice basic
> > troubleshooting skills. I asked you if that string was what you passed
> > to the QueryParser and you said yes, when in fact it was not. And you
>
> I forgot that I did lower-case it. In fact I even output it in its original
> state but lower-case it just before I pass it to lucene. That lower-casing
> is what I would call a hack and hence it's no surprise that I forgot it :-)
>
> Timo
>
Re: Zero hits for queries ending with a number
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 3, 2004, at 10:34 AM, lucene@nitwit.de wrote:
> I forgot that I did lower-case it. In fact I even output it in its original
> state but lower-case it just before I pass it to lucene. That lower-casing
> is what I would call a hack and hence it's no surprise that I forgot it :-)
But why even lowercase? That is what an analyzer typically does anyway
(look at the output from AnalysisDemo to see).
Note that there is a switch on QueryParser (MultiFieldQueryParser is
lacking in this respect, another reason not to use it) that lowercases
wildcard terms automatically:
setLowercaseWildcardTerms(true). Wildcard terms are not analyzed by
QueryParser, so this was added to account for it.
Erik
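To illustrate why wildcard terms need that switch, here is a toy model in plain Java. It is not Lucene code: WildcardCaseDemo, index and prefixSearch are hypothetical names, and the "analyzer" is reduced to lower-casing. Indexed tokens go through the analyzer; the wildcard term does not, so its case has to be handled separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy term dictionary; class and method names are hypothetical, not Lucene API. */
final class WildcardCaseDemo {
    private final SortedSet<String> terms = new TreeSet<>();

    /** Index-time path: every token passes through the "analyzer" (here: lower-casing). */
    void index(String token) {
        terms.add(token.toLowerCase(Locale.ROOT));
    }

    /** Query-time path for a trailing-wildcard term such as "Yahoo*"; the term is not analyzed. */
    List<String> prefixSearch(String wildcardTerm, boolean lowercaseWildcardTerms) {
        String prefix = wildcardTerm.substring(0, wildcardTerm.length() - 1); // drop the '*'
        if (lowercaseWildcardTerms) {
            prefix = prefix.toLowerCase(Locale.ROOT);
        }
        List<String> hits = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order, so stop at the first miss.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        WildcardCaseDemo demo = new WildcardCaseDemo();
        demo.index("Yahoo");
        demo.index("YahooGroups");
        demo.index("Lucene");
        System.out.println(demo.prefixSearch("Yahoo*", false)); // prefix kept as typed
        System.out.println(demo.prefixSearch("Yahoo*", true));  // prefix lower-cased first
    }
}
```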
Re: Zero hits for queries ending with a number
Posted by lu...@nitwit.de.
On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
> No objections that error messages and such could be made clearer.
> Patches welcome! Care to submit better error message handling in this
> case? Or perhaps allow lower-case "to"?
I think the best would be if Lucene would simply have a
setCaseSensitive(boolean).
IMHO it's in any case a bad idea to make searches case-sensitive (per
default).
> But, also, folks need to really step back and practice basic
> troubleshooting skills. I asked you if that string was what you passed
> to the QueryParser and you said yes, when in fact it was not. And you
I forgot that I did lower-case it. In fact I even output it in its original
state but lower-case it just before I pass it to lucene. That lower-casing is
what I would call a hack and hence it's no surprise that I forgot it :-)
Timo
Re: Zero hits for queries ending with a number
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 3, 2004, at 9:59 AM, lucene@nitwit.de wrote:
> On Saturday 03 April 2004 15:19, Erik Hatcher wrote:
>> date:[20030101 TO 20030202]
>
> I found the/my bug.
>
> Since Lucene is case-sensitive, I do lower-case all queries for user's
> convenience. The ParseException is thrown because the "TO" becomes
> "to".
>
> Well, I really think Lucene needs to daff such stumbling blocks
> aside...
No objections that error messages and such could be made clearer.
Patches welcome! Care to submit better error message handling in this
case? Or perhaps allow lower-case "to"?
But, also, folks need to really step back and practice basic
troubleshooting skills. I asked you if that string was what you passed
to the QueryParser and you said yes, when in fact it was not. And you
slowly fed more details of your scenario (MFQP, some German
SnowballAnalyzer variant). Reduce the variables in the equation and
narrow things down until it works and then incrementally add
complexity. I cannot encourage folks enough to try some JUnit
test-driven *learning* by exploring various scenarios.
Erik
Re: Zero hits for queries ending with a number
Posted by lu...@nitwit.de.
On Saturday 03 April 2004 15:19, Erik Hatcher wrote:
> date:[20030101 TO 20030202]
I found the/my bug.
Since Lucene is case-sensitive, I do lower-case all queries for user's
convenience. The ParseException is thrown because the "TO" becomes "to".
Well, I really think Lucene needs to daff such stumbling blocks aside...
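If the query text really must be lower-cased before parsing, one way to avoid this particular trap is to leave the upper-case operator keywords alone. The sketch below is plain Java with a hypothetical helper name (QueryLowercaser); it splits on single spaces only and is not a query parser, so treat it as an illustration of the idea rather than a robust solution.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-processing helper, not part of Lucene. */
final class QueryLowercaser {
    // Keywords the query syntax expects in upper case.
    private static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("TO", "AND", "OR", "NOT"));

    /** Lower-cases every space-separated token except upper-case operator keywords. */
    static String lowercaseKeepingOperators(String query) {
        String[] tokens = query.split(" ", -1); // -1 keeps empty tokens, so spacing survives
        for (int i = 0; i < tokens.length; i++) {
            if (!OPERATORS.contains(tokens[i])) {
                tokens[i] = tokens[i].toLowerCase(Locale.ROOT);
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(lowercaseKeepingOperators("Date:[20030101 TO 20030202]"));
        System.out.println(lowercaseKeepingOperators("Lucene AND url:WWW.Yahoo.*"));
    }
}
```

Two caveats: field names get lower-cased too, which only works if the indexed field names are lower-case, and an upper-case word such as "OR" typed as ordinary text is still treated as an operator.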
Re: Zero hits for queries ending with a number
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Ok, we're getting somewhere now.
So, where is the exception you encountered when using this utility
code?! (i.e. it didn't throw an exception, so something is different
in how your code uses it).
I tried this:
Query query = MultiFieldQueryParser.parse(
    "date:[20030101 TO 20030202]",
    new String[] { "id", "title", "summary", "contents", "date" },
    new GermanAnalyzer());
System.out.println("query = " + query.toString());
And it worked fine (only duplicated the query for each field). No
exception at all. Of course I'm guessing on your analyzer since you
didn't provide that detail (although it shouldn't matter in the
exception you experienced).
On Apr 3, 2004, at 6:06 AM, lucene@nitwit.de wrote:
> SnowballAnalyzer("German2"):
>
> Analzying "http://www.yahoo.com/foo/bar.html"
> org.apache.lucene.analysis.snowball.SnowballAnalyzer:
> [http] [www.yahoo.com] [foo] [bar.html]
So this is the analyzer you want to use, right?
Wildcards should work on "www.yahoo.*"
What is the "German2" stemmer for Snowball?
You've introduced a lot of variables to your equation here:
MultiFieldQueryParser and a non-standard Snowball stemmer. I had to pull
each of these details out of you, and each one is critical to
understanding the problem.
>> analyzer you are using, and also do the same on .toString of the query
>> you parsed. Those two pieces of info will tell all.
>
> "url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo*
> url:www.yahoo*
> url:www.yahoo*"
>
> Well, I actually use a MultiFieldQueryParser, that's probably why the term
> does appear so often. Strange parser, it should be clear that an explicit
> "url:xyz" should only look in the url field, shouldn't it?
Do you really need to query on multiple fields? Why not just use the
plain QueryParser? If you need an aggregate field, create one at index
time. QueryParsing is problematic enough, but adding in MFQP makes it
even more complicated.
Which Analyzer are you using for indexing? This same SnowballAnalyzer
with "German2" stemmer?
Erik
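The repeated "url:www.yahoo*" clauses are consistent with a parser that runs the same query string once per field, each time with that field as the default. The following plain-Java model is an assumption inferred from the toString output quoted in this thread, not MultiFieldQueryParser's source; the class and method names are hypothetical and only single-term queries are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-field query expansion; not Lucene code. */
final class MultiFieldExpansionDemo {
    /** Stand-in for parsing a single-term query with a default field. */
    static String parseWithDefaultField(String query, String defaultField) {
        // An explicit "field:" prefix wins over the default field.
        return query.indexOf(':') >= 0 ? query : defaultField + ":" + query;
    }

    /** Parses the same query once per field and joins the resulting clauses. */
    static String expand(String query, String[] fields) {
        List<String> clauses = new ArrayList<>();
        for (String field : fields) {
            clauses.add(parseWithDefaultField(query, field));
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String[] fields = { "id", "title", "url" };
        System.out.println(expand("yahoo", fields));          // one clause per field
        System.out.println(expand("url:www.yahoo*", fields)); // explicit field: same clause repeated
    }
}
```

Under this model an explicit field does restrict the search to that field; the clause is merely duplicated, which is harmless but noisy.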
Re: Zero hits for queries ending with a number
Posted by lu...@nitwit.de.
On Saturday 03 April 2004 11:48, Erik Hatcher wrote:
> Provide us the results of running your url through that, using the same
SnowballAnalyzer("German2"):
Analzying "http://www.yahoo.com/foo/bar.html"
org.apache.lucene.analysis.WhitespaceAnalyzer:
[http://www.yahoo.com/foo/bar.html]
org.apache.lucene.analysis.SimpleAnalyzer:
[http] [www] [yahoo] [com] [foo] [bar] [html]
org.apache.lucene.analysis.StopAnalyzer:
[http] [www] [yahoo] [com] [foo] [bar] [html]
org.apache.lucene.analysis.standard.StandardAnalyzer:
[http] [www.yahoo.com] [foo] [bar.html]
org.apache.lucene.analysis.snowball.SnowballAnalyzer:
[http] [www.yahoo.com] [foo] [bar.html]
> analyzer you are using, and also do the same on .toString of the query
> you parsed. Those two pieces of info will tell all.
"url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo* url:www.yahoo*
url:www.yahoo*"
Well, I actually use a MultiFieldQueryParser, that's probably why the term
does appear so often. Strange parser, it should be clear that an explicit
"url:xyz" should only look in the url field, shouldn't it?
Timo
Re: Zero hits for queries ending with a number
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 3, 2004, at 3:19 AM, lucene@nitwit.de wrote:
>> You *can* use wildcards with keywords (in fact, a keyword really has
>> no
>> meaning once indexed - everything is a "term" at that point).
>
> Well, I just tried. I also was surprised actually - but it just
> didn't work.
>
> I can use wildcards for
>
> doc.add(Field.Text("url", row.getString("url")));
>
> but I cannot for
>
> doc.add(Field.Keyword("url", row.getString("url")));
>
>> - create a utility (I've posted one on the list in the past) that
>> shows what your analyzer is doing graphically.
>
> Interesting. Can you give me subject/date of that posting?
AnalysisDemo in this article:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
Provide us the results of running your url through that, using the same
analyzer you are using, and also do the same on .toString of the query
you parsed. Those two pieces of info will tell all.
Erik
Re: Zero hits for queries ending with a number
Posted by lu...@nitwit.de.
On Friday 02 April 2004 23:48, Erik Hatcher wrote:
> On Apr 2, 2004, at 10:00 AM, lucene@nitwit.de wrote:
> > On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
> >> Field.Keyword is suitable for storing data like Url. Give that a try.
> >
> > I just tried this a minute ago and found that I cannot use wildcards
> > with
> > Keywords: url:www.yahoo.*
>
> You *can* use wildcards with keywords (in fact, a keyword really has no
> meaning once indexed - everything is a "term" at that point).
Well, I just tried. I also was surprised actually - but it just didn't work.
I can use wildcards for
doc.add(Field.Text("url", row.getString("url")));
but I cannot for
doc.add(Field.Keyword("url", row.getString("url")));
> - create a utility (I've posted one on the list in the past) that
> shows what your analyzer is doing graphically.
Interesting. Can you give me subject/date of that posting?
Timo
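One possible explanation for the difference, offered as an assumption that this thread does not confirm: a keyword field is indexed as a single untokenized term, so if the stored value is the full URL including "http://", no term begins with "www.yahoo." and a prefix match has nothing to find. The toy model below is plain Java with hypothetical names; its tokenizer is a crude stand-in that simply splits on ':' and '/'.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy comparison of untokenized and tokenized indexing; not Lucene code. */
final class KeywordVsTextDemo {
    /** Keyword-style indexing: the whole value becomes one term, unchanged. */
    static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    /** Text-style indexing: crude stand-in for an analyzer, splitting on ':' and '/'. */
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[:/]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** A trailing-wildcard query matches only if some term starts with the prefix. */
    static boolean anyTermStartsWith(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://www.yahoo.com/foo/bar.html";
        System.out.println(anyTermStartsWith(keywordTerms(url), "www.yahoo."));   // single term starts with "http://"
        System.out.println(anyTermStartsWith(tokenizedTerms(url), "www.yahoo.")); // "www.yahoo.com" is its own term
    }
}
```

If that is what is happening, a query such as url:http\://www.yahoo.* (prefix taken from the start of the stored value) would be the keyword-field equivalent; checking the actual terms with Luke would settle it.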
Re: Zero hits for queries ending with a number
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 2, 2004, at 10:00 AM, lucene@nitwit.de wrote:
> On Saturday 13 March 2004 11:06, Otis Gospodnetic wrote:
>> Field.Keyword is suitable for storing data like Url. Give that a try.
>
> I just tried this a minute ago and found that I cannot use wildcards
> with
> Keywords: url:www.yahoo.*
You *can* use wildcards with keywords (in fact, a keyword really has no
meaning once indexed - everything is a "term" at that point).
99% of the issues people have with things like this end up being
Analyzer/QueryParser related.
A few quick pieces of advice:
- use Luke to see what is inside your index and understand what it
looks like from the inside.
- create a utility (I've posted one on the list in the past) that
shows what your analyzer is doing graphically.
- use Query.toString to output what QueryParser did to your query
expression.
Armed with the above bits of trivia, you have the information to
troubleshoot the situation first-hand.
Erik
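The second piece of advice, a utility that shows what an analyzer does, can be sketched without Lucene at all. The two tokenizers below are plain-Java stand-ins (hypothetical names, not the Lucene analyzers) for a whitespace-only split and a letters-only split with lower-casing; a real utility would run the text through the Analyzer you actually index with and print its tokens in the same bracketed form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy token viewer; the tokenizers are stand-ins, not Lucene analyzers. */
final class AnalysisSketch {
    /** Splits on whitespace only, leaving case and punctuation untouched. */
    static List<String> whitespaceTokens(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    /** Lower-cases, then splits on anything that is not a letter. */
    static List<String> letterTokens(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    /** Renders tokens as "[a] [b] [c]". */
    static String display(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        for (String token : tokens) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append('[').append(token).append(']');
        }
        return out.toString();
    }

    private static List<String> nonEmpty(String[] parts) {
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "http://www.yahoo.com/foo/bar.html";
        System.out.println("whitespace: " + display(whitespaceTokens(text)));
        System.out.println("letters:    " + display(letterTokens(text)));
    }
}
```

Comparing this kind of output with Query.toString for the parsed query shows quickly whether the terms being searched for exist in the index at all.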