Posted to java-user@lucene.apache.org by Allen Atamer <aa...@casebank.com> on 2004/12/01 00:01:37 UTC

RE: literal search in quotes on non-tokenized field

Erik,


> -----Original Message-----
> > Here's a log of the parsed query before going to the searcher:
> >
> > Parsed query: (Build:"origi") for the first search
> > Parsed query: (Build:origi) for the second search
> 
> What do you mean by "parsed", since below you say you're not using
> QueryParser/Analyzer.


Sorry, that's leftover log text. The relevant lines of code are:

BooleanQuery totalQuery = new BooleanQuery();

// ... logic to build totalQuery ...

log.debug("Parsed query: " + totalQuery.toString());
dbSearchHits = searcher.search(totalQuery);


> > Right now we're not using a query parser / analyzer system to build the
> > query. We're building the query up.
> > The query mentioned above is a TermQuery object
> 
> Let me hopefully clarify what you've said.... you've indexed (I'm not
> using quotes on purpose) origi, but you're doing a TermQuery on "origi"
> (with the quotes) and expecting it to match?
> 
> It doesn't work that way.  A TermQuery must match *exactly* what was
> indexed (either directly as a Keyword, or as tokens emitted from the
> analyzer).  Since you're building the query up yourself from, I'm
> assuming, user input, you may need to pre-process what the user entered
> to get the right term to query on.  Only the term origi would match.

Yeah but it doesn't. The exact text in the database is ORIGI. Keyword
doesn't work if you supply more than one word. In fact we're doing it wrong.
Fields with a small number of terms should not be indexed as keyword, but
tokenized. I'm going to change the indexing strategy to only use keyword
when there's one and only one keyword in the data itself. Fields with two to
three words will be tokenized with the NoTokenizingTokenizer that was posted
earlier, and fields with four or more words will be tokenized with
MyTokenizer.

All we need to do for searching keyword fields is remove the double quotes
to be consistent with searching in a tokenized field. Then use QueryParser
to parse the tokenized fields with the appropriate parser for the field.
This should solve the problem.
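
As a rough illustration of the pre-processing described above, here is a
minimal sketch in plain Java. It is not Lucene code: the class and method
names (KeywordTermNormalizer, normalize) are hypothetical, and the
lowercasing step assumes the field values were lowercased at index time.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: cleans up user input before it
// is used as the text of an exact-match term on a keyword field.
class KeywordTermNormalizer {

    /** Trims, strips one pair of surrounding double quotes, and lowercases. */
    static String normalize(String userInput) {
        String s = userInput.trim();
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1).trim();
        }
        // Assumption: the indexed values were lowercased at index time.
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"ORIGI\""));
        System.out.println(normalize("origi"));
    }
}
```

The normalized string would then be the value handed to the term query for
the keyword field, so that quoted and unquoted input end up as the same term.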

Thanks


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: literal search in quotes on non-tokenized field

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Nov 30, 2004, at 6:01 PM, Allen Atamer wrote:
>> It doesn't work that way.  A TermQuery must match *exactly* what was
>> indexed (either directly as a Keyword, or as tokens emitted from the
>> analyzer).  Since you're building the query up yourself from, I'm
>> assuming, user input, you may need to pre-process what the user 
>> entered
>> to get the right term to query on.  Only the term origi would match.
>
> Yeah but it doesn't. The exact text in the database is ORIGI.

But you lowercased what you indexed (in the code you sent).

>  Keyword
> doesn't work if you supply more than one word.

Depends on what you mean by "doesn't work".  It works as expected.
Keyword fields are not tokenized, so a TermQuery on one has to use
exactly the value you supplied.  But it sounds like you've got a handle
on the situation now.
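
The exact-match behaviour can be sketched with a toy stand-in for a single
indexed field. This is plain Java, not the Lucene API; ExactTermMatchDemo
and its methods are hypothetical names, and the whitespace-splitting,
lowercasing "tokenizer" is only an assumption made for the illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of the terms stored for one field; not Lucene code.
class ExactTermMatchDemo {
    private final Set<String> terms = new HashSet<>();

    /** Keyword-style indexing: the whole value becomes one term, unchanged. */
    void addKeyword(String value) {
        terms.add(value);
    }

    /** Tokenized-style indexing: each whitespace-separated word, lowercased. */
    void addTokenized(String value) {
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
    }

    /** A term lookup succeeds only if the exact string was stored as a term. */
    boolean termMatches(String term) {
        return terms.contains(term);
    }

    public static void main(String[] args) {
        ExactTermMatchDemo keywordField = new ExactTermMatchDemo();
        keywordField.addKeyword("origi build");
        System.out.println(keywordField.termMatches("origi"));
        System.out.println(keywordField.termMatches("origi build"));
    }
}
```

In this model a keyword value of two words can only be found by a term
carrying both words exactly as stored, while the tokenized variant is
found by either word on its own.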

	Erik

