Posted to java-user@lucene.apache.org by Mark Woon <mo...@helix.stanford.edu> on 2003/08/26 06:53:46 UTC

Newbie Questions

Hi all...

I've been playing with Lucene for a couple of days now and I have a couple 
of questions I'm hoping someone can help me with.  I've created a Lucene 
index with data from a database that's spread across several different 
fields, and I want to set up a web page where users can search the index.  
Ideally, all searches should be as Google-like as possible.  In Lucene 
terms, I guess this means the query should be fuzzy.  For example, if 
someone searches for "cancer" then I'd like to get back all results 
containing any form of the word cancer ("cancerous", "breast cancer", etc.).

So far, I seem to be having two problems:

1) How can I search all fields at the same time?  The QueryParser seems 
to only search one specific field.

2) How can I automatically default all searches into fuzzy mode?  I 
don't want my users to have to know that they must add a "~" at the end 
of all their terms.

Thanks,
-Mark




Re: Newbie Questions

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Tuesday, August 26, 2003, at 02:51  PM, Mark Woon wrote:
> Ah, I've been testing out something similar to the latter.  I've been 
> adding multiple values on the same key.  Won't this have the same 
> effect?  I've been assuming that if I do
>
> doc.add(Field.Keyword("content", "value1"));
> doc.add(Field.Keyword("content", "value2"));
>
> and then search the "content" field for either value, I'd get a 
> hit, and it seems to work.  This way, I figure I'd be able to 
> differentiate between values that I want tokenized and values that I 
> don't.
>
> Is there a difference between this and building a StringBuffer 
> containing all the values and storing that as a single field-value?

There is a big difference between using Field.Text and Field.Keyword, 
yes.  It all depends on how you want things tokenized (or not).  
Field.Keyword does not tokenize (via the Analyzer), but Field.Text does.
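For example, here's a rough, untested sketch against the 1.x API (the field 
names are made up, and RAMDirectory is just a throwaway in-memory index):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class KeywordVsText {

    static int[] hitCounts() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Keyword: indexed as ONE untokenized term, "Breast Cancer".
        doc.add(Field.Keyword("category", "Breast Cancer"));
        // Text: run through the Analyzer, so it becomes the terms
        // "breast" and "cancer".
        doc.add(Field.Text("title", "Breast Cancer"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        // Matches: the analyzed field contains the term "cancer".
        int a = searcher.search(new TermQuery(new Term("title", "cancer"))).length();
        // No match: the keyword field holds only the exact term "Breast Cancer".
        int b = searcher.search(new TermQuery(new Term("category", "cancer"))).length();
        searcher.close();
        return new int[] { a, b };
    }

    public static void main(String[] args) throws Exception {
        int[] c = hitCounts();
        System.out.println(c[0] + " " + c[1]); // 1 0
    }
}
```

The keyword value can still be found, but only by querying the whole string 
exactly: new TermQuery(new Term("category", "Breast Cancer")).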

	Erik


RE: Newbie Questions

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hi Mark.

Sorry, it's really rc1 that's out. But if you go to the CVS server, you'll
find the rc2-dev version.

Multiple calls to Document.add with the same field name mean that "their
text is treated as though appended for the purposes of search" (API doc).
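A quick sketch of that appended behaviour (untested, against the 1.x API,
with a throwaway RAMDirectory index):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class MultiValueField {

    static int[] hitCounts() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

        Document doc = new Document();
        // Two add() calls with the same field name: for searching, the
        // terms behave as if the two texts had been appended.
        doc.add(Field.Text("contents", "first value"));
        doc.add(Field.Text("contents", "second value"));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        int a = searcher.search(new TermQuery(new Term("contents", "first"))).length();
        int b = searcher.search(new TermQuery(new Term("contents", "second"))).length();
        searcher.close();
        return new int[] { a, b }; // one hit each: either value finds the doc
    }

    public static void main(String[] args) throws Exception {
        int[] c = hitCounts();
        System.out.println(c[0] + " " + c[1]);
    }
}
```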

Can you try out whether there's a difference between the cases you mention? I
don't know, but I'd be interested as well ;-).

Gregor




-----Original Message-----
From: Mark Woon [mailto:morpheus@helix.stanford.edu]
Sent: Tuesday, August 26, 2003 8:52 PM
To: Lucene Users List
Subject: Re: Newbie Questions


Gregor Heinrich wrote:

> ad 1: MultiFieldQueryParser is what you might want: you can specify the
> fields to run the query on. Alternatively, the practice of duplicating
> the
> contents of all separate fields in question into one additional merged
> field
> has been suggested, which enables you to use QueryParser itself.
>

Ah, I've been testing out something similar to the latter.  I've been
adding multiple values on the same key.  Won't this have the same
effect?  I've been assuming that if I do

doc.add(Field.Keyword("content", "value1"));
doc.add(Field.Keyword("content", "value2"));

and then search the "content" field for either value, I'd get a hit,
and it seems to work.  This way, I figure I'd be able to differentiate
between values that I want tokenized and values that I don't.

Is there a difference between this and building a StringBuffer
containing all the values and storing that as a single field-value?


> ad 2: Depending on the Analyzer you use, the query is normalised, i.e.,
> stemmed (remove suffixes from words) and stopword-filtered (remove highly
> frequent words). Have a look at StandardAnalyzer.tokenStream(...) to
> see how
> the different filters work. In the analysis package the 1.3rc2 Lucene
> distribution has a Porter stemming algorithm: PorterStemmer.
>

There's an rc2 out?  Where??  I just checked the Lucene website and only
see rc1.


Thanks everyone for all the quick responses!

-Mark



Re: Newbie Questions

Posted by Mark Woon <mo...@helix.stanford.edu>.
Gregor Heinrich wrote:

> ad 1: MultiFieldQueryParser is what you might want: you can specify the
> fields to run the query on. Alternatively, the practice of duplicating 
> the
> contents of all separate fields in question into one additional merged 
> field
> has been suggested, which enables you to use QueryParser itself.
>

Ah, I've been testing out something similar to the latter.  I've been 
adding multiple values on the same key.  Won't this have the same 
effect?  I've been assuming that if I do

doc.add(Field.Keyword("content", "value1"));
doc.add(Field.Keyword("content", "value2"));

and then search the "content" field for either value, I'd get a hit, 
and it seems to work.  This way, I figure I'd be able to differentiate 
between values that I want tokenized and values that I don't.

Is there a difference between this and building a StringBuffer 
containing all the values and storing that as a single field-value?


> ad 2: Depending on the Analyzer you use, the query is normalised, i.e.,
> stemmed (remove suffixes from words) and stopword-filtered (remove highly
> frequent words). Have a look at StandardAnalyzer.tokenStream(...) to 
> see how
> the different filters work. In the analysis package the 1.3rc2 Lucene
> distribution has a Porter stemming algorithm: PorterStemmer.
>

There's an rc2 out?  Where??  I just checked the Lucene website and only 
see rc1.


Thanks everyone for all the quick responses!

-Mark


RE: Newbie Questions

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hi Mark,

short answers to your questions:

ad 1: MultiFieldQueryParser is what you might want: you can specify the
fields to run the query on. Alternatively, the practice of duplicating the
contents of all separate fields in question into one additional merged field
has been suggested, which enables you to use QueryParser itself.
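For instance (an untested sketch; the field names are placeholders for your
own):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

public class MultiFieldDemo {

    // Expand one user query string across several fields at once,
    // instead of QueryParser's single default field.
    static Query buildQuery(String userQuery) throws Exception {
        String[] fields = { "title", "body", "notes" };
        return MultiFieldQueryParser.parse(userQuery, fields, new StandardAnalyzer());
    }

    public static void main(String[] args) throws Exception {
        // Produces the equivalent of: title:cancer body:cancer notes:cancer
        System.out.println(buildQuery("cancer"));
    }
}
```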

ad 2: Depending on the Analyzer you use, the query is normalised, i.e.,
stemmed (remove suffixes from words) and stopword-filtered (remove highly
frequent words). Have a look at StandardAnalyzer.tokenStream(...) to see how
the different filters work. In the analysis package the 1.3rc2 Lucene
distribution has a Porter stemming algorithm: PorterStemmer.
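A bare-bones stemming analyzer might look like this (untested sketch; a real
one would probably also add a StopFilter). The important part is to use the
SAME analyzer at index time and at query time, so both sides reduce words to
the same stems:

```java
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Split on letters and lowercase, then Porter-stem each token.
public class StemmingAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
    }

    public static void main(String[] args) throws Exception {
        TokenStream ts = new StemmingAnalyzer()
                .tokenStream("contents", new StringReader("Cancerous cancers"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText()); // each stems to "cancer"
        }
    }
}
```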

Have fun,

Gregor

-----Original Message-----
From: Mark Woon [mailto:morpheus@helix.stanford.edu]
Sent: Tuesday, August 26, 2003 6:54 AM
To: lucene-user@jakarta.apache.org
Subject: Newbie Questions


Hi all...

I've been playing with Lucene for a couple of days now and I have a couple
of questions I'm hoping someone can help me with.  I've created a Lucene
index with data from a database that's spread across several different
fields, and I want to set up a web page where users can search the index.
Ideally, all searches should be as Google-like as possible.  In Lucene
terms, I guess this means the query should be fuzzy.  For example, if
someone searches for "cancer" then I'd like to get back all results
containing any form of the word cancer ("cancerous", "breast cancer", etc.).

So far, I seem to be having two problems:

1) How can I search all fields at the same time?  The QueryParser seems
to only search one specific field.

2) How can I automatically default all searches into fuzzy mode?  I
don't want my users to have to know that they must add a "~" at the end
of all their terms.

Thanks,
-Mark




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



RE: Newbie Questions

Posted by Aviran Mordo <am...@infosciences.com>.
1. You need to use MultiFieldQueryParser.
2. I think you should use PorterStemFilter instead of a fuzzy query:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PorterStemFilter.html

-----Original Message-----
From: Mark Woon [mailto:morpheus@helix.stanford.edu] 
Sent: Tuesday, August 26, 2003 12:54 AM
To: lucene-user@jakarta.apache.org
Subject: Newbie Questions


Hi all...

I've been playing with Lucene for a couple of days now and I have a couple 
of questions I'm hoping someone can help me with.  I've created a Lucene 
index with data from a database that's spread across several different 
fields, and I want to set up a web page where users can search the index.  
Ideally, all searches should be as Google-like as possible.  In Lucene 
terms, I guess this means the query should be fuzzy.  For example, if 
someone searches for "cancer" then I'd like to get back all results 
containing any form of the word cancer ("cancerous", "breast cancer", etc.).

So far, I seem to be having two problems:

1) How can I search all fields at the same time?  The QueryParser seems 
to only search one specific field.

2) How can I automatically default all searches into fuzzy mode?  I 
don't want my users to have to know that they must add a "~" at the end 
of all their terms.

Thanks,
-Mark







Re: Newbie Questions

Posted by Erik Hatcher <li...@ehatchersolutions.com>.
On Tuesday, August 26, 2003, at 12:53  AM, Mark Woon wrote:
> 1) How can I search all fields at the same time?  The QueryParser 
> seems to only search one specific field.

The common thing I've done and seen others do is glue all the fields 
together into a master searchable field named something like "contents" 
or "keywords" (be sure to put a space between the pieces of text so 
they can be tokenized properly).
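The gluing itself is only a few lines; here's a rough sketch (the helper 
and field names are made up):

```java
public class GlueFields {

    // Merge the values of the separate fields into one big string,
    // with a space between each piece so the analyzer sees distinct
    // words instead of two values run together.
    public static String glue(String[] values) {
        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                buf.append(' ');
            }
            buf.append(values[i]);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String[] cols = { "Breast Cancer", "oncology", "Stanford" };
        // You'd index the result as one extra catch-all field, e.g.:
        //   doc.add(Field.UnStored("contents", glue(cols)));
        System.out.println(glue(cols)); // Breast Cancer oncology Stanford
    }
}
```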

> 2) How can I automatically default all searches into fuzzy mode?  I 
> don't want my users to have to know that they must add a "~" at the 
> end of all their terms.

Your description of searches for "cancer" finding "cancerous" isn't 
really what the fuzzy query is about.  What you're after, I think, is 
more the stemming algorithms used during the analysis phase.  Have a 
look at the SnowballAnalyzer in the Lucene sandbox.  There is a little 
bit about it in the article I wrote for java.net: 
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html - it 
definitely sounds like more work in the analysis phase is what you're 
after.

	Erik