Posted to java-user@lucene.apache.org by karl wettin <ka...@gmail.com> on 2006/07/24 03:52:36 UTC

dash-words

I want to filter words with a dash in them.

["x-men"]
["xmen"]
["x", "men"]

All of the above should be synonyms. The problem is that ["x", "men"]
requires a distance between the terms and thus also matches "x-men men".
Or does it? How about storing ["x", "men"] as the first term and then
setting a negative position increment?
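
To make the stacking idea concrete, here is a minimal sketch in plain Java
with no Lucene dependency. The class name DashWordPositions and the expand
helper are made up for illustration, and positions are 1-based. One known
constraint: Lucene's Token.setPositionIncrement rejects negative values, so
the usual way to stack a synonym is an increment of 0, which puts it at the
same position as the previous token.

```java
import java.util.ArrayList;
import java.util.List;

final class DashWordPositions {

    /** Returns "term@position" entries for a whitespace-separated text. */
    static List<String> expand(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (String word : text.trim().toLowerCase().split("\\s+")) {
            List<String> parts = new ArrayList<>();
            for (String p : word.split("-")) {
                if (!p.isEmpty()) {
                    parts.add(p);
                }
            }
            for (String part : parts) {
                pos++;                              // increment 1: next position
                out.add(part + "@" + pos);
            }
            if (parts.size() > 1) {
                // increment 0: the joined form shares the position of the last part
                out.add(String.join("", parts) + "@" + pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the@1, x@2, men@3, xmen@3, story@4]
        System.out.println(expand("The x-men story"));
    }
}
```

Because "xmen" sits only on the position of "men", a phrase that expects it
directly after the preceding word is off by one.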

I could just try that prior to posting, :) but chose not to, as I have a
feeling many of you have already implemented something like this and that
there are multiple solutions to the problem.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: dash-words

Posted by Martin Braun <mb...@uni-hd.de>.

Yonik Seeley wrote:
> On 7/23/06, karl wettin <ka...@gmail.com> wrote:
>> I want to filter words with a dash in them.
>>
>> ["x-men"]
>> ["xmen"]
>> ["x", "men"]
>>
>> All of above should be synonyms. The problem is ["x", "men"] requiring a
>> distance between the terms and thus also matching "x-men men".
> 
> WordDelimiterFilter from Solr does this:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
> 

I can recommend this too. I use it and it works fine! I just apply a
LowerCaseFilter afterwards to avoid the downside:
"if source text is "powershot" then a query of "PowerShot" won't match!"


> 
> It also has the false match problem you mention... "x xmen" would
> match a document with x-men, although this hasn't been a problem in
> practice.
> 
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
> 


-- 
Universitaetsbibliothek Heidelberg   Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg   Fax: +49 6221 54-2623

Re: dash-words

Posted by Yonik Seeley <ys...@gmail.com>.

On 7/24/06, karl wettin <ka...@gmail.com> wrote:
> > WordDelimiterFilter from Solr does this
>
> > It also has the false match problem you mention...
>
> Will it affect a phrase query?

Yes... adding some slop to phrase queries is the best way to deal with that.
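
A rough model of why slop helps, as a plain-Java sketch. It is a
simplification and not Lucene's actual scorer: a phrase matches when one
position can be picked per term so that the spread of (document position
minus offset in the phrase) is at most the slop. The class and method names
are made up for illustration.

```java
import java.util.List;
import java.util.Map;

final class SloppyPhraseSketch {

    /** Slop 0 behaves like an exact phrase; larger values tolerate shifted terms. */
    static boolean matches(Map<String, List<Integer>> index, List<String> phrase, int slop) {
        return search(index, phrase, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop);
    }

    private static boolean search(Map<String, List<Integer>> index, List<String> phrase,
                                  int i, int min, int max, int slop) {
        if (i == phrase.size()) {
            return true;
        }
        for (int pos : index.getOrDefault(phrase.get(i), List.of())) {
            int shifted = pos - i;               // where the phrase would have to start
            int lo = Math.min(min, shifted);
            int hi = Math.max(max, shifted);
            // the spread can only grow, so a branch over the limit is abandoned early
            if (hi - lo <= slop && search(index, phrase, i + 1, lo, hi, slop)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Positions for "The x-men story" with the joined form stacked on "men".
        Map<String, List<Integer>> index = Map.of(
                "the", List.of(1), "x", List.of(2), "men", List.of(3),
                "xmen", List.of(3), "story", List.of(4));
        List<String> phrase = List.of("the", "xmen", "story");
        System.out.println(matches(index, phrase, 0)); // false
        System.out.println(matches(index, phrase, 1)); // true
    }
}
```

In this model the query "x xmen" matches even with slop 0, which is the
false match mentioned above.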

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: dash-words

Posted by Yonik Seeley <ys...@gmail.com>.

On 7/25/06, karl wettin <ka...@gmail.com> wrote:
> On Mon, 2006-07-24 at 21:16 -0400, Yonik Seeley wrote:
>
> > > > I can't figure out what the parameters do. ;)
> >
> > Hopefully the wiki link I gave before will explain the parameters.
>
> Oh, I so totally missed that. Do you want me to java-doc it up and send
> you the patch?

Sure!

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: dash-words

Posted by karl wettin <ka...@gmail.com>.

On Mon, 2006-07-24 at 21:16 -0400, Yonik Seeley wrote:

> > > I can't figure out what the parameters do. ;)
> 
> Hopefully the wiki link I gave before will explain the parameters. 

Oh, I so totally missed that. Do you want me to java-doc it up and send
you the patch?


Re: dash-words

Posted by Yonik Seeley <ys...@gmail.com>.

> > I can't figure out what the parameters do. ;)

Hopefully the wiki link I gave before will explain the parameters.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: dash-words

Posted by karl wettin <ka...@gmail.com>.

On Tue, 2006-07-25 at 11:42 -0400, Yonik Seeley wrote:
> > > Yes, it will fail without slop... I don't think there is a 
> > > practical way around that. 

It would of course be much easier if Lucene supported multiple token
dimensions instead of position increment only. 

> the x-men are here
>     x men
>     xmen

I briefly talked about this on the dev-list (some months ago) when
position boosts were up for discussion. I don't know how much work this
really is, but it would be such a nice feature.
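
What I mean, roughly, as a plain-Java sketch. This is not an existing Lucene
structure; the Edge type and the matches method are hypothetical. Each token
records where it starts and where it ends, so "xmen" can cover the same span
as "x" followed by "men", and a phrase matches only if its terms chain end
to start.

```java
import java.util.List;

final class TokenLatticeSketch {

    /** A token spanning the slots from..to (end exclusive). Hypothetical type. */
    static final class Edge {
        final String term;
        final int from;
        final int to;

        Edge(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** True if the phrase terms form an unbroken chain of edges. */
    static boolean matches(List<Edge> lattice, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (Edge e : lattice) {
            if (e.term.equals(phrase.get(0)) && follow(lattice, phrase, 1, e.to)) {
                return true;
            }
        }
        return false;
    }

    private static boolean follow(List<Edge> lattice, List<String> phrase, int i, int at) {
        if (i == phrase.size()) {
            return true;
        }
        for (Edge e : lattice) {
            // the next token must start exactly where the previous one ended
            if (e.from == at && e.term.equals(phrase.get(i))
                    && follow(lattice, phrase, i + 1, e.to)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Edge> doc = List.of(
                new Edge("the", 0, 1),
                new Edge("x-men", 1, 3), new Edge("xmen", 1, 3),
                new Edge("x", 1, 2), new Edge("men", 2, 3),
                new Edge("are", 3, 4), new Edge("here", 4, 5));
        System.out.println(matches(doc, List.of("the", "xmen", "are")));      // true
        System.out.println(matches(doc, List.of("the", "x", "men", "are")));  // true
        System.out.println(matches(doc, List.of("x", "xmen")));               // false
    }
}
```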

Re: dash-words

Posted by Chris Hostetter <ho...@fucit.org>.

: with a query like this +arbeiterjugend +west-berlin I get no results.
:
: org.apache.lucene.queryParser.QueryParser.parse makes this query (with
: WordDelimiterFilter) with Default QueryParser.AND_OPERATOR:
:
: +titel:arbeiterjugend +titel:"west (berlin westberlin)"
:
: with +arbeiterjugend +westberlin I get the result.
:
: It seems that the synonyms don't work with the query. How do you solve
: this in Solr? Do I have to build a TermQuery?

First off, when using WordDelimiterFilter it's generally a good idea to
use a slightly different configuration of the Filter in your indexing
analyzer than in your query analyzer -- this is discussed a bit in the
wiki...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
http://wiki.apache.org/solr/SolrRelevancyCookbook#head-353fcfa33e5c4a0a5959aa3d8d33c5a3a61f2683

...this can help avoid situations like the one you describe.

but in general, what you are running into is a basic constraint of the
way Analyzers work: they can produce tokens with a "zero gap", indicating
that a token occupies the same spot as the previous token, but there is no
way for the analyzer to indicate that a sequence of 1 or more tokens
occupies the same space as another sequence of 1 or more tokens.  so when
QueryParser asks the analyzer to make a token stream out of "west-berlin",
the analyzer has no way to return a token stream that can easily be
recognized as [ [[west] [berlin]] or [westberlin] ].
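
a small plain-Java sketch of that flattening, with no Lucene dependency.
the class and method names are made up, and position gaps larger than 1 are
treated like 1 for simplicity. it groups tokens by position the way a zero
gap implies and prints the result in the same shape as the parsed query
quoted above.

```java
import java.util.ArrayList;
import java.util.List;

final class FlattenedQuerySketch {

    /** Groups terms by position increment and renders them like a parsed phrase. */
    static String render(String[] terms, int[] increments) {
        List<List<String>> slots = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (increments[i] > 0 || slots.isEmpty()) {
                slots.add(new ArrayList<>());            // a positive gap opens a new slot
            }
            slots.get(slots.size() - 1).add(terms[i]);   // a zero gap stacks on the last slot
        }
        StringBuilder sb = new StringBuilder("\"");
        for (int s = 0; s < slots.size(); s++) {
            List<String> slot = slots.get(s);
            if (s > 0) {
                sb.append(' ');
            }
            sb.append(slot.size() == 1 ? slot.get(0) : "(" + String.join(" ", slot) + ")");
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // prints "west (berlin westberlin)"
        System.out.println(render(new String[] {"west", "berlin", "westberlin"},
                                  new int[] {1, 1, 0}));
    }
}
```

the flattened form always requires "west" in the first slot, so a document
that only contains "westberlin" cannot satisfy it.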

this does in fact prove to be a large problem when dealing with "multi
word synonyms" (also discussed in the wiki mentioned above) but can
generally be dealt with in the WordDelimiterFilter.


-Hoss


Re: dash-words

Posted by Martin Braun <mb...@uni-hd.de>.

Hi Yonik,

>> So a Phrase search to "The xmen story" will fail. With a slop of 1 the
>> doc will be found.
>>
>> But when generating the query I won't know when to use a slop. So adding
>> slops isn't a nice solution.
> 
> If you can't tolerate slop, this is a problem.

I use the WordDelimiterFilter now without slop, because in other cases
it's an improvement. But I (or rather my app) have now stumbled over a
non-phrase query:

Suppose I am searching for this title (sorry for the German example):

"lage der arbeiterjugend in westberlin" (indexed with
WordDelimiterFilter + lowercase)

with a query like this +arbeiterjugend +west-berlin I get no results.

org.apache.lucene.queryParser.QueryParser.parse makes this query (with
WordDelimiterFilter) with Default QueryParser.AND_OPERATOR:

+titel:arbeiterjugend +titel:"west (berlin westberlin)"

with +arbeiterjugend +westberlin I get the result.

It seems that the synonyms don't work with the query. How do you solve
this in Solr? Do I have to build a TermQuery?

thanks in advance,

martin







Re: dash-words

Posted by Yonik Seeley <ys...@gmail.com>.

On 7/25/06, Martin Braun <mb...@uni-hd.de> wrote:
> Hi Yonik,
>
> >> I can't figure out what the parameters do. ;)
> >
> > Yes, it will fail without slop... I don't think there is a practical
> > way around that.
>
> I am trying to analyze your WordDelimiterFilter.
>
> If I have x-men, after analyzing (with catenateAll) I get this:
>
>
>  Analzying "The x-men story"
>         de.unihd.ub.ftsearch.WordIndexAnalyzer:
>                 [the] [x] [men] [xmen] [story]
>
>
> 1: [the:0->3:word]
> 2: [x:4->5:word]
> 3: [men:6->9:word] [xmen:4->9:word]
> 4: [story:10->15:word]
>
> 1: [the]
> 2: [x]
> 3: [men] [xmen]
> 4: [story]
>
>
> So a Phrase search to "The xmen story" will fail. With a slop of 1 the
> doc will be found.
>
> But when generating the query I won't know when to use a slop. So adding
> slops isn't a nice solution.

If you can't tolerate slop, this is a problem.

The only 100% solution that I could think of to this problem is to
re-index the entire stream (with a very large position gap in between)
for each variant.

"the x men story"
"the xmen story"

Problems:
1) combinatorial explosion very quickly (not practical at all)
2) messes up idfs pretty badly
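
A quick plain-Java sketch of how fast the variants multiply. The class and
method names are made up for illustration, and each dashed word is given
only two forms (all parts split, or all parts joined), which understates
the real count for words with several dashes.

```java
import java.util.ArrayList;
import java.util.List;

final class VariantStreamsSketch {

    /** Every way of writing the text when each dashed word is either split or joined. */
    static List<String> variants(String text) {
        List<String> out = new ArrayList<>();
        out.add("");
        for (String word : text.trim().split("\\s+")) {
            List<String> forms = word.contains("-")
                    ? List.of(word.replace("-", " "), word.replace("-", ""))
                    : List.of(word);
            List<String> next = new ArrayList<>();
            for (String prefix : out) {
                for (String form : forms) {
                    next.add(prefix.isEmpty() ? form : prefix + " " + form);
                }
            }
            out = next;   // the list doubles for every dashed word
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the x men story, the xmen story]
        System.out.println(variants("the x-men story"));
        // prints 8
        System.out.println(variants("a-b c-d e-f").size());
    }
}
```

With n dashed words in a field this gives 2^n streams to index, so ten such
words already mean 1024 copies.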

Phrase slop is the easiest workaround, especially if you wanted slop anyway.

> Would it be a solution to put the concatenated synonym at both
> positions? Or are there any drawbacks to this?

I considered that too... but it increases false matches, and it still
doesn't fix many phrase queries.

> 1: [the]
> 2: [x] [xmen]
> 3: [men] [xmen]
> 4: [story]

While "the xmen" and "xmen story" will now both match, "the xmen
story" will still fail to match.
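
A plain-Java sketch to check this by hand. It models an exact phrase only
(term i must sit at position start + i) and is not Lucene code; the class
and method names are made up for illustration.

```java
import java.util.List;
import java.util.Set;

final class StackedPhraseSketch {

    /** Exact phrase match over a document given as one set of terms per position. */
    static boolean matches(List<Set<String>> doc, String phrase) {
        String[] terms = phrase.trim().split("\\s+");
        for (int start = 0; start + terms.length <= doc.size(); start++) {
            int i = 0;
            while (i < terms.length && doc.get(start + i).contains(terms[i])) {
                i++;
            }
            if (i == terms.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "xmen" stored at both the position of "x" and the position of "men"
        List<Set<String>> doc = List.of(
                Set.of("the"), Set.of("x", "xmen"), Set.of("men", "xmen"), Set.of("story"));
        System.out.println(matches(doc, "the xmen"));        // true
        System.out.println(matches(doc, "xmen story"));      // true
        System.out.println(matches(doc, "the xmen story"));  // false
        System.out.println(matches(doc, "xmen xmen"));       // true, a new false match
    }
}
```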

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: dash-words

Posted by Martin Braun <mb...@uni-hd.de>.

Hi Yonik,

>> I can't figure out what the parameters do. ;)
> 
> Yes, it will fail without slop... I don't think there is a practical
> way around that.

I am trying to analyze your WordDelimiterFilter.

If I have x-men, after analyzing (with catenateAll) I get this:


 Analzying "The x-men story"
	de.unihd.ub.ftsearch.WordIndexAnalyzer:
		[the] [x] [men] [xmen] [story]


1: [the:0->3:word]
2: [x:4->5:word]
3: [men:6->9:word] [xmen:4->9:word]
4: [story:10->15:word]

1: [the]
2: [x]
3: [men] [xmen]
4: [story]


So a phrase search for "The xmen story" will fail. With a slop of 1 the
doc will be found.

But when generating the query I won't know when to use slop. So adding
slop isn't a nice solution.

Would it be a solution to put the concatenated synonym at both
positions? Or are there any drawbacks to this?

1: [the]
2: [x] [xmen]
3: [men] [xmen]
4: [story]

thanks,
martin




Re: dash-words

Posted by Yonik Seeley <ys...@gmail.com>.

On 7/24/06, karl wettin <ka...@gmail.com> wrote:
> On Mon, 2006-07-24 at 15:17 +0200, karl wettin wrote:
> > On Mon, 2006-07-24 at 15:15 +0200, karl wettin wrote:
> > > Yes, it affects PhraseQuery. Only "the x men are" will match.
> >
> > I'm stupid. Forget about it. I should of course analyze the query too.
>
> But still it fails on xmen. Could it have something to do with this:
>    WordDelimiterFilter(ts, 1,1,0,0,0);
>
> ?
>
> I can't figure out what the parameters do. ;)

Yes, it will fail without slop... I don't think there is a practical
way around that.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: dash-words

Posted by karl wettin <ka...@gmail.com>.

On Mon, 2006-07-24 at 15:17 +0200, karl wettin wrote:
> On Mon, 2006-07-24 at 15:15 +0200, karl wettin wrote:
> > Yes, it affects PhraseQuery. Only "the x men are" will match.
> 
> I'm stupid. Forget about it. I should of course analyze the query too.

But still it fails on xmen. Could it have something to do with this:
   WordDelimiterFilter(ts, 1,1,0,0,0);

?

I can't figure out what the parameters do. ;)


The new code:

package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.queryParser.QueryParser;

import java.io.Reader;
import java.util.HashSet;


public class TestWordDelimiterFilter {

    public static void main(String[] args) throws Exception {
        final String field = "field";

        Directory dir = new RAMDirectory();
        Analyzer a = new Analyzer();

        IndexWriter w = new IndexWriter(dir, a, true);
        Document d = new Document();
        d.add(new Field(field, "the x-men are here", Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
        w.addDocument(d);
        w.close();

        IndexSearcher is = new IndexSearcher(dir);
        // QueryParser runs the same analyzer over the query text.
        QueryParser qp = new QueryParser(field, a);
        System.out.println(is.search(qp.parse("\"the x men are\"")).length());
        System.out.println(is.search(qp.parse("\"the xmen are\"")).length()); // expected 1, got 0.
        System.out.println(is.search(qp.parse("\"the x-men are\"")).length());


        is.close();
        dir.close();

    }

    public static class Analyzer extends org.apache.lucene.analysis.Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardAnalyzer(new HashSet()).tokenStream(fieldName, reader);
            ts = new WordDelimiterFilter(ts, 1,1,0,0,0);
            return ts;
        }
    }

}



Re: dash-words

Posted by karl wettin <ka...@gmail.com>.

On Mon, 2006-07-24 at 15:15 +0200, karl wettin wrote:
> Yes, it affects PhraseQuery. Only "the x men are" will match.

I'm stupid. Forget about it. I should of course analyze the query too.


Re: dash-words

Posted by karl wettin <ka...@gmail.com>.

On Mon, 2006-07-24 at 13:51 +0200, karl wettin wrote:
> On Mon, 2006-07-24 at 00:34 -0400, Yonik Seeley wrote:
> > > filter words with a dash
> > >
> > > ["x-men"]
> > > ["xmen"]
> > > ["x", "men"]
> > >
> > > The problem is ["x", "men"] requiring a distance between the terms
> > > and thus also matching "x-men men".
> > 
> > WordDelimiterFilter from Solr does this
> 
> > It also has the false match problem you mention...
> 
> Will it affect a phrase query?
> 
> I.e. would "the xmen are" be a no-match as the filtered index data
> would be "the x (men|xmen|x-men) are here"?
> 
> I'll write a test now.

Yes, it affects PhraseQuery. Only "the x men are" will match.



package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import java.io.Reader;
import java.util.HashSet;


public class TestWordDelimiterFilter {

    public static void main(String[] args) throws Exception {
        final String field = "field";

        Directory dir = new RAMDirectory();
        Analyzer a = new Analyzer();

        IndexWriter w = new IndexWriter(dir, a, true);
        Document d = new Document();
        d.add(new Field(field, "the x-men are here", Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
        w.addDocument(d);
        w.close();

        IndexSearcher is = new IndexSearcher(dir);

        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "x-men"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "xmen"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "x"));
        pq.add(new Term(field, "men"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        is.close();
        dir.close();

    }

    public static class Analyzer extends org.apache.lucene.analysis.Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new StandardAnalyzer(new HashSet()).tokenStream(fieldName, reader);
            ts = new WordDelimiterFilter(ts, 1,1,0,0,0);
            return ts;
        }
    }

}



Re: dash-words

Posted by karl wettin <ka...@gmail.com>.

On Mon, 2006-07-24 at 00:34 -0400, Yonik Seeley wrote:
> > filter words with a dash
> >
> > ["x-men"]
> > ["xmen"]
> > ["x", "men"]
> >
> > The problem is ["x", "men"] requiring a distance between the terms
> > and thus also matching "x-men men".
> 
> WordDelimiterFilter from Solr does this

> It also has the false match problem you mention...

Will it affect a phrase query?

I.e. would "the xmen are here" be a no-match as the filtered index data
would be "the x (men|xmen|x-men) are here"?

I'll write a test now.


Re: dash-words

Posted by Yonik Seeley <ys...@gmail.com>.

On 7/23/06, karl wettin <ka...@gmail.com> wrote:
> I want to filter words with a dash in them.
>
> ["x-men"]
> ["xmen"]
> ["x", "men"]
>
> All of above should be synonyms. The problem is ["x", "men"] requiring a
> distance between the terms and thus also matching "x-men men".

WordDelimiterFilter from Solr does this:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089

It also has the false match problem you mention... "x xmen" would
match a document with x-men, although this hasn't been a problem in
practice.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
