Posted to java-user@lucene.apache.org by Rahul D Thakare <ra...@rediffmail.com> on 2005/07/13 14:18:44 UTC

Wild card and multiple keyword search

  
Hi, 

 We are using doc.add(Field.Text("keywords", keywords)); to add the keywords to the document, where keywords is a comma-separated string of keywords.
Lucene seems to tokenize multi-word keywords such as MAIN BOARD into separate terms (i.e. MAIN and BOARD); tokenization happens on both commas and spaces. So if we search for "MAIN BOARD", documents with keywords like "MAIN LOGIC", "MAIN PARTS", etc. also show up.

If one searches for "MAIN BOARD", we want to get only the documents that have "MAIN BOARD". How can we do this?

To achieve this we used doc.add(Field.Keyword("keywords", keywords)); instead.
When searching we cannot use the standard analyzer, because it splits any keyword that contains a space, so we wrote a KeywordAnalyzer (which basically returns the whole input as one single token), as given below.

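import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

/**
 * Tokenizes the entire stream as a single token
 */
public class KeywordAnalyzer extends Analyzer
{
    public TokenStream tokenStream(String fieldName, final Reader reader)
    {
        return new TokenStream() {
            // set once the single token has been handed out
            private boolean done;
            // scratch space for reading from the Reader in chunks
            private final char[] chunk = new char[1024];

            public Token next() throws IOException
            {
                if (!done)
                {
                    done = true;
                    // collect everything the reader has to offer
                    StringBuffer buffer = new StringBuffer();
                    int length = 0;
                    while (true)
                    {
                        length = reader.read(chunk);
                        if (length == -1) break;

                        buffer.append(chunk, 0, length);
                    }
                    String text = buffer.toString();
                    // the token is upper-cased, so it only matches if the
                    // indexed keyword values are upper case as well
                    return new Token(text.toUpperCase(), 0, text.length());
                }
                // a second call signals the end of the stream
                return null;
            }
        };
    }
}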

This solves the problem described above, but now we are not able to do wildcard searches like MAIN*, etc.

We need both pieces of functionality, i.e.
1. If a user searches for MAIN BOARD, they should get only documents that contain MAIN BOARD, and not MAIN LOGIC, MAIN, MAIN PART, etc.
2. The user should be able to do wildcard searches like MAIN*, etc. and get the desired documents.

Please let us know how we should do the indexing, and which analyzer to use for searching.

thanks
Rahul...

Re: Wild card and multiple keyword search

Posted by Ian Lea <ia...@gmail.com>.
Sounds to me like all you need is to AND rather than OR your search terms.

	QueryParser qp = new QueryParser("keywords", analyzer);
	qp.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
	Query q = qp.parse(words);

where analyzer is just the standard one.

Or search for +MAIN +BOARD.  Or MAIN AND BOARD.
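
To make the difference concrete, here is a minimal plain-Java sketch (no Lucene involved) of OR versus AND matching over a tokenized keywords field. The class name AndOrSketch, its methods and the sample keyword strings are all made up for illustration, and the tokenizer is only a rough stand-in that splits on commas and whitespace; it is not the standard analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AndOrSketch {
    // Rough stand-in for analysis: split on commas and whitespace, drop empties.
    static Set<String> tokens(String keywords) {
        Set<String> out = new HashSet<String>();
        for (String t : keywords.split("[,\\s]+")) {
            if (t.length() > 0) out.add(t);
        }
        return out;
    }

    // OR semantics: at least one query word occurs somewhere in the field.
    static boolean matchesAny(String keywords, String... words) {
        Set<String> toks = tokens(keywords);
        for (String w : words) {
            if (toks.contains(w)) return true;
        }
        return false;
    }

    // AND semantics: every query word occurs somewhere in the field.
    static boolean matchesAll(String keywords, String... words) {
        return tokens(keywords).containsAll(Arrays.asList(words));
    }

    public static void main(String[] args) {
        String docA = "MAIN BOARD, POWER SUPPLY";
        String docB = "MAIN LOGIC, MAIN PARTS";
        System.out.println(matchesAny(docB, "MAIN", "BOARD")); // true: OR lets docB through
        System.out.println(matchesAll(docB, "MAIN", "BOARD")); // false: BOARD is missing
        System.out.println(matchesAll(docA, "MAIN", "BOARD")); // true
    }
}
```

Note that AND only requires both words to appear somewhere in the field: in this model a document whose keywords were "MAIN LOGIC, CIRCUIT BOARD" would also satisfy +MAIN +BOARD.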


--
Ian.



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Wild card and multiple keyword search

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 13, 2005, at 8:18 AM, Rahul D Thakare wrote:
>  We are using doc.add(Field.Text("keywords",keywords)); to add the  
> keywords to the document, where keywords is comma separated  
> keywords string.

If the text is already comma separated and that is the level at which
you want things tokenized, then simply do something like this (untested
pseudo-code):

     String[] values = keywords.split(",");
     for (int i = 0; i < values.length; i++) {
         // trim() drops any whitespace around the commas; each value is
         // then indexed as a single untokenized term
         doc.add(Field.Keyword("keywords", values[i].trim()));
     }

> Lucene seems to tokenize the keywords with multiple words like(MAIN  
> BOARD) as different keywords(ie as MAIN and BOARD). Tokenization is  
> based on comma and space...So if we search for "MAIN BOARD",  
> documents having keywords like "MAIN LOGIC", "MAIN PARTS", etc also  
> show up
>
> If one searches for "MAIN BOARD", we want get only the documents  
> have "MAIN BOARD".  How to do this ?

The question back to you is do you want searches for simply "MAIN" to
find both "MAIN LOGIC" and "MAIN PARTS"?  Or should it return no
documents since it's not an exact match?

Using the above code, "MAIN" would find neither of those and the  
query would have to be exact.  I see below you've clarified this  
requirement...
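
As a rough illustration of that behaviour, the following plain-Java sketch (not Lucene code) models a field where each comma-separated keyword is stored as one untokenized term. KeywordTermsSketch, terms, exact and prefix are hypothetical names. The prefix method is a simplified stand-in for a trailing-* wildcard, under the assumption that the wildcard is compared against the stored term as-is.

```java
import java.util.ArrayList;
import java.util.List;

public class KeywordTermsSketch {
    // One term per comma-separated keyword, trimmed but otherwise untouched.
    static List<String> terms(String keywords) {
        List<String> out = new ArrayList<String>();
        for (String k : keywords.split(",")) {
            String t = k.trim();
            if (t.length() > 0) out.add(t);
        }
        return out;
    }

    // Exact lookup: the query must equal a whole stored term.
    static boolean exact(String keywords, String query) {
        return terms(keywords).contains(query);
    }

    // Simplified trailing-wildcard lookup: "MAIN*" is modelled as the prefix "MAIN".
    static boolean prefix(String keywords, String prefix) {
        for (String t : terms(keywords)) {
            if (t.startsWith(prefix)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "MAIN LOGIC, MAIN PARTS";
        System.out.println(exact(doc, "MAIN"));       // false: no term is exactly MAIN
        System.out.println(exact(doc, "MAIN LOGIC")); // true
        System.out.println(exact(doc, "MAIN BOARD")); // false
        System.out.println(prefix(doc, "MAIN"));      // true: both terms start with MAIN
    }
}
```

In this model a prefix such as MAIN still reaches "MAIN LOGIC" and "MAIN PARTS" because the comparison runs against the whole stored keyword, while an exact query for MAIN alone finds nothing. Whether a wildcard typed into QueryParser reaches the index in that form depends on how the query string is parsed and analyzed, which this sketch does not cover.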

> To achieve this we used doc.add(Field.Keyword("keywords",  
> keywords)); and while searching
> we cannot use standard analyzer, while searching, as divides the  
> keywords if we search keywords having space... so we wrote an  
> KeywordAnalyser(KeywordAnalyzer is basically returns only one  
> single token) as given below.

There is a KeywordAnalyzer now in the contrib/analyzers codebase, and
it will ship with the next version of Lucene (or you could build it
yourself and use it).  There are also a couple of variants of the
KeywordAnalyzer in the Lucene in Action code (www.lucenebook.com).

> Which solve the above said problem, but we are not able to the wild  
> card searchs like MAIN*, etc.
>
> We need both the functionality ie.
> 1.  if user searches for MAIN BOARD, should get only documents that  
> contain MAIN BOARD and not MAIN LOGIC, MAIN, MAIN PART etc.
> 2. User should be able to do the wild card search like MAIN*, etc  
> and get the desired documents.
>
> Please let us know, how we should do the indexing ? and which  
> analyzer to use to do the search ?

There are many ways to go about this sort of thing, and I apologize  
for being short on time and not able to explain them all fully.  One  
option is to keep the tokenization using a traditional analyzer so  
that it separates by whitespace, but when a user queries it turns  
into a PhraseQuery.  If you really mean for wildcards to be single  
words in the field (in other words, users don't need to query on MA*)  
then the space separated tokenization would work fine here as well.
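
The phrase idea can be sketched in plain Java as follows (again not Lucene code, and not how PhraseQuery is implemented). The field is split into positioned tokens, and a phrase matches only when its words occur at consecutive positions. PhraseSketch and its methods are hypothetical names, and the tokenizer is a crude comma-and-whitespace split that ignores keyword boundaries.

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseSketch {
    // Crude tokenizer: split on commas and whitespace, keep token order.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<String>();
        for (String t : text.split("[,\\s]+")) {
            if (t.length() > 0) out.add(t);
        }
        return out;
    }

    // True if the phrase's tokens appear at consecutive positions in the field.
    static boolean containsPhrase(String field, String phrase) {
        List<String> f = tokens(field);
        List<String> p = tokens(phrase);
        if (p.isEmpty()) return false;
        for (int start = 0; start + p.size() <= f.size(); start++) {
            if (f.subList(start, start + p.size()).equals(p)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("MAIN BOARD, POWER SUPPLY", "MAIN BOARD")); // true
        System.out.println(containsPhrase("MAIN LOGIC, MAIN PARTS", "MAIN BOARD"));   // false
        System.out.println(containsPhrase("MAIN LOGIC, MAIN PARTS", "MAIN"));         // true
    }
}
```

A single word such as MAIN is just a one-token phrase here, which is why whole-word searches work without any wildcard. Because this model ignores keyword boundaries, the phrase "LOGIC MAIN" would also match the second document.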

It is important to think through the analysis process as well as the
search interface issues (the interface must be given thorough
consideration and treated as a first class citizen when discussing
implementations), especially when wildcard and range queries come
up.  How to deal with wildcards and ranges efficiently has been a hot
topic recently.  In your example, if by "MAIN*" you intend for the
word MAIN to be a unique token, and the user would choose a full word
to search on and merely wants to find it within a larger field, then
wildcards are not necessary.

     Erik

