Posted to java-user@lucene.apache.org by Magnus Johansson <ma...@technohuman.com> on 2003/03/11 11:05:29 UTC

QueryParser and compound words

Hello

I have written an Analyzer for Swedish. Compound words are common in
Swedish, therefore my Analyzer tries to split compound words
into their parts. For example the Swedish word fotbollsmatch (football
game) is split into fotboll and match.

However when I use my Analyzer with the QueryParser the query
fotbollsmatch is changed into "fotboll match" (notice the quotes)
when what I really want is the query fotboll match (with no quotes).
Is this possible? The splitting of compound words is
of no real use if I can't get rid of the quotes.

I have attached some sample code that illustrates the problem
(using a dummy Analyzer that splits words longer than five
characters into two)

/magnus



------------------------------------------------------------------

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

import java.io.Reader;
import java.io.IOException;

public class TestAnalyzer extends Analyzer {
     public TokenStream tokenStream(String s, Reader reader) {
         return new SplitStream(new StandardTokenizer(reader));
     }

     public static void main(String[] args) throws Exception {
         QueryParser qp = new QueryParser("fieldname",
             new TestAnalyzer());
         Query q = qp.parse("queryparser");
         System.out.println("Query: " + q.toString("fieldname"));
         System.out.println("Correct: query parser");
     }
}

class SplitStream extends TokenStream {
     private static final int SPLIT_SIZE = 5;
     private TokenStream tstream;
     private String buffer = null;
     private int start, end;

     public SplitStream(TokenStream tstream) { this.tstream = tstream; }

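     // Returns the next token. A token longer than SPLIT_SIZE is emitted
     // as two tokens: the head now, the buffered tail on the next call.
     public Token next() throws IOException {
         if (buffer == null) {
             Token tok = tstream.next();
             if (tok == null) {
                 return null;  // underlying stream is exhausted
             } else if (tok.termText().length() > SPLIT_SIZE) {
                 // Remember the tail and its offsets for the next call.
                 buffer = tok.termText().substring(SPLIT_SIZE);
                 start = tok.startOffset() + SPLIT_SIZE;
                 end = tok.endOffset();
                 return new Token(
                     tok.termText().substring(0, SPLIT_SIZE),
                     tok.startOffset(),
                     tok.startOffset() + SPLIT_SIZE);
             } else {
                 return tok;
             }
         } else {
             // Emit the buffered tail of the previously split token.
             Token t = new Token(buffer, start, end);
             buffer = null;
             return t;
         }
     }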
}

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: QueryParser and compound words

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Thursday 13 March 2003 00:52, Magnus Johansson wrote:
> Tatu Saloranta wrote:
...
> >But the same happens during indexing; fotbollsmatch should be properly
> >split and stemmed to "fotboll" and "match" terms, right?
>
> Yes but the word fotbollsmatch was never indexed in this example. Only
> the word fotboll.
> I want a query for fotbollsmatch to match a document containing the word
> fotboll.

Ok I think I finally understand what you meant. :-)

So, basically, in your case you would prefer the query:

fotbollsmatch

to expand to (after stemming etc):

fotboll match

and not

"fotboll match"

So that matching just one of the words would be enough for a hit (either
"either of" or "just first word" or "just last word").
It would be possible to implement this functionality by subclassing the
default QueryParser and modifying its behaviour slightly.

In QueryParser you should be able to override the default handling for terms,
so that whenever you get just a single token (in this case "fotbollsmatch")
that expands to multiple Terms, you do not construct a PhraseQuery, but
just a BooleanQuery with TermQueries (look at getFieldQuery(); it handles
basic search terms). You may need to use simple heuristics for figuring out
when you have white space(s) that indicate "normal" phrases, which probably
should still be handled using PhraseQuery.
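
A minimal sketch of the decision described above, written without the Lucene
API so that it runs on its own. All names in it (CompoundQuerySketch,
COMPOUNDS, analyze, fieldQuery) are hypothetical, the compound dictionary is a
stand-in for the real analyzer, and queries are modelled as printable strings
rather than PhraseQuery/BooleanQuery objects. An actual getFieldQuery()
override would presumably apply the same rule to real query objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the phrase-versus-terms decision; not Lucene code.
class CompoundQuerySketch {

    // Assumed compound dictionary standing in for the Swedish analyzer.
    private static final Map<String, List<String>> COMPOUNDS = new HashMap<>();
    static {
        COMPOUNDS.put("fotbollsmatch", Arrays.asList("fotboll", "match"));
    }

    // Stand-in for analysis: lower-case, split on whitespace, split compounds.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            terms.addAll(COMPOUNDS.getOrDefault(word, Collections.singletonList(word)));
        }
        return terms;
    }

    // clauseText is the text of one query clause: a bare word, or the
    // contents of a phrase the user quoted explicitly.
    static String fieldQuery(String clauseText) {
        List<String> terms = analyze(clauseText);
        if (terms.isEmpty()) {
            return "";
        }
        if (terms.size() == 1) {
            return terms.get(0);
        }
        // Heuristic from the message above: white space in the raw text
        // means the user asked for a phrase, so keep phrase semantics.
        boolean userTypedPhrase = clauseText.trim().split("\\s+").length > 1;
        if (userTypedPhrase) {
            return "\"" + String.join(" ", terms) + "\"";
        }
        // A single word that the analyzer expanded: plain optional terms.
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        System.out.println(fieldQuery("fotbollsmatch"));
        System.out.println(fieldQuery("fotboll match"));
    }
}
```

With this rule the bare word fotbollsmatch becomes the two optional terms
fotboll match, while text that already contained a space keeps its quotes.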

Of course this is all assuming you still do want that functionality. :-)
And if you do, it would be a good idea to contribute the patch back in case
someone else finds it useful later on (I think many non-English languages
have the concept of compound words; German and Finnish at least do).

-+ Tatu +-




Re: QueryParser and compound words

Posted by Magnus Johansson <ma...@technohuman.com>.
Tatu Saloranta wrote:

>On Wednesday 12 March 2003 01:19, Magnus Johansson wrote:
>
>>Well, the problem arises when a user enters a query with a compound word
>>and the compound word itself is not indexed, only one of its parts.
>
>Yes, but neither is the compound word itself ever used in the query, assuming
>the same analyser is used (like it always should be)?
>
>>For example the index contains a document with the following word:
>>fotboll (football).
>>
>>Let's say the user searches for fotbollsmatch (football game). The word
>>is split into fotboll and match and the phrase "fotboll match" is
>>searched for.
>>The user finds no matching document.
>
>But the same happens during indexing; fotbollsmatch should be properly
>split and stemmed to "fotboll" and "match" terms, right?
>
Yes but the word fotbollsmatch was never indexed in this example. Only 
the word fotboll.
I want a query for fotbollsmatch to match a document containing the word 
fotboll.

>>Comparing this to English, the user would have found a document, however
>>scored slightly lower than a document containing both the words football
>>and game.
>>
>>I agree with you that this might not be a problem. The user could be
>>instructed to reformulate his query. However the behaviour for an English
>>index and
>
>I actually think that if the user has to be aware of internal stemming and
>reformulate the query, this would be a bit of a problem. :-)
>But I'm not 100% sure the search string would differ from the indexed string,
>assuming the same base token (unprocessed token, i.e. "fotbollsmatch") was
>both contained in the document and searched for using QueryParser.
>
>>a Swedish index would be different.
>
>I think that in general behaviour is heavily dependent on the analyser
>(tokenizer + stemmer) being used, so it's probably different between most
>languages.
>
I think I'll accept how it works now. It is perhaps unlikely that the
user would query the index using a compound word and expect documents
containing only one of its parts in the result.
The more I think about it the more difficult it becomes to come up with
a realistic example of why the behaviour would need to be changed.

Thank you for your feedback

/magnus





Re: QueryParser and compound words

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Wednesday 12 March 2003 01:19, Magnus Johansson wrote:
> Well, the problem arises when a user enters a query with a compound word
> and the compound word itself is not indexed, only one of its parts.

Yes, but neither is the compound word itself ever used in the query, assuming
the same analyser is used (like it always should be)?

> For example the index contains a document with the following word:
> fotboll (football).
>
> Let's say the user searches for fotbollsmatch (football game). The word
> is split into fotboll and match and the phrase "fotboll match" is
> searched for.
> The user finds no matching document.

But the same happens during indexing; fotbollsmatch should be properly
split and stemmed to "fotboll" and "match" terms, right?

> Comparing this to English, the user would have found a document, however
> scored slightly lower than a document containing both the words football
> and game.
>
> I agree with you that this might not be a problem. The user could be
> instructed to reformulate his query. However the behaviour for an English
> index and

I actually think that if the user has to be aware of internal stemming and
reformulate the query, this would be a bit of a problem. :-)
But I'm not 100% sure the search string would differ from the indexed string,
assuming the same base token (unprocessed token, i.e. "fotbollsmatch") was
both contained in the document and searched for using QueryParser.

> a Swedish index would be different.

I think that in general behaviour is heavily dependent on the analyser
(tokenizer + stemmer) being used, so it's probably different between most
languages.

-+ Tatu +-




Re: QueryParser and compound words

Posted by Magnus Johansson <ma...@technohuman.com>.
Well, the problem arises when a user enters a query with a compound word
and the compound word itself is not indexed, only one of its parts.

For example the index contains a document with the following word:
fotboll (football).

Let's say the user searches for fotbollsmatch (football game). The word
is split into fotboll and match and the phrase "fotboll match" is
searched for.
The user finds no matching document.

Comparing this to English, the user would have found a document, however
scored slightly lower than a document containing both the words football
and game.

I agree with you that this might not be a problem. The user could be
instructed to reformulate his query. However the behaviour for an English
index and a Swedish index would be different.
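
The difference described above can be sketched with a toy model in which a
document is just a list of analyzed terms. This is only an illustration of
phrase versus any-term matching, not Lucene's matching or scoring; the class
and method names (PhraseVsOrSketch, matchesPhrase, countMatchingTerms) are
made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of phrase matching versus any-term matching; not Lucene code.
class PhraseVsOrSketch {

    // True if the query terms occur consecutively, in order, in the document.
    static boolean matchesPhrase(List<String> doc, List<String> terms) {
        if (terms.isEmpty()) {
            return false;
        }
        for (int i = 0; i + terms.size() <= doc.size(); i++) {
            if (doc.subList(i, i + terms.size()).equals(terms)) {
                return true;
            }
        }
        return false;
    }

    // Number of query terms found in the document; 0 means no hit.
    // A crude stand-in for "more matching terms scores higher".
    static int countMatchingTerms(List<String> doc, List<String> terms) {
        int count = 0;
        for (String term : terms) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The query fotbollsmatch after the analyzer has split it.
        List<String> query = Arrays.asList("fotboll", "match");
        List<String> onlyFotboll = Arrays.asList("fotboll");
        List<String> bothWords = Arrays.asList("fotboll", "match");

        System.out.println("phrase, only fotboll: " + matchesPhrase(onlyFotboll, query));
        System.out.println("any term, only fotboll: " + countMatchingTerms(onlyFotboll, query));
        System.out.println("phrase, both words: " + matchesPhrase(bothWords, query));
        System.out.println("any term, both words: " + countMatchingTerms(bothWords, query));
    }
}
```

In this model the phrase form misses the document that contains only fotboll,
while the any-term form finds it with a lower count than the document that
contains both words.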

/magnus

Tatu Saloranta wrote:

>On Tuesday 11 March 2003 03:05, Magnus Johansson wrote:
>
>>Hello
>>
>>I have written an Analyzer for Swedish. Compound words are common in
>>Swedish, therefore my Analyzer tries to split compound words
>>into their parts. For example the Swedish word fotbollsmatch (football
>>game) is split into fotboll and match.
>
>(same applies to many other languages so this is a common problem I think).
>
>However... I'm not sure why you consider this a problem? The reason quotes
>are added is that since a single token (as parsed by QueryParser) expands to
>multiple terms, it becomes a PhraseQuery. The same happens (or should happen)
>during indexing, so the end result should match the word both in the "normal"
>case (word is correctly spelled as a compound word) and when the word is
>(incorrectly) spelled with spaces?
>As to quotes; they are only shown when converting the query to a String;
>internally there are no quotes to be matched.
>
>-+ Tatu +-




Re: QueryParser and compound words

Posted by Tatu Saloranta <ta...@hypermall.net>.
On Tuesday 11 March 2003 03:05, Magnus Johansson wrote:
> Hello
>
> I have written an Analyzer for Swedish. Compound words are common in
> Swedish, therefore my Analyzer tries to split compound words
> into their parts. For example the Swedish word fotbollsmatch (football
> game) is split into fotboll and match.

(same applies to many other languages so this is a common problem I think).

However... I'm not sure why you consider this a problem? The reason quotes
are added is that since a single token (as parsed by QueryParser) expands to
multiple terms, it becomes a PhraseQuery. The same happens (or should happen)
during indexing, so the end result should match the word both in the "normal"
case (word is correctly spelled as a compound word) and when the word is
(incorrectly) spelled with spaces?
As to quotes; they are only shown when converting the query to a String;
internally there are no quotes to be matched.

-+ Tatu +-

