Posted to java-user@lucene.apache.org by karl wettin <ka...@gmail.com> on 2006/07/24 03:52:36 UTC
dash-words
I want to filter words with a dash in them.
["x-men"]
["xmen"]
["x", "men"]
All of the above should be synonyms. The problem is that ["x", "men"] requires a
distance between the terms and thus also matches "x-men men". Or does it? How
about storing ["x", "men"] as the first term and then setting a negative
position increment?
I could just try that before posting :) but chose not to, as I have a
feeling many of you have already implemented something like this and that
there are multiple solutions to the problem.
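A minimal sketch of the three forms listed above, in plain Java with no Lucene dependency. DashWordVariants is a hypothetical helper, not a Lucene or Solr class. It only makes the difficulty visible: two of the variants are a single token, while the third is a sequence of two tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene or Solr): lists the token
// sequences that the post wants to treat as synonyms of a dash-word.
final class DashWordVariants {

    static List<List<String>> variants(String word) {
        List<List<String>> out = new ArrayList<>();
        out.add(Arrays.asList(word));                    // ["x-men"]
        if (word.indexOf('-') < 0) {
            return out;                                  // no dash, nothing to derive
        }
        out.add(Arrays.asList(word.replace("-", "")));   // ["xmen"]

        List<String> parts = new ArrayList<>();
        for (String part : word.split("-")) {
            if (!part.isEmpty()) {
                parts.add(part);                         // skip empties from leading dashes
            }
        }
        out.add(parts);                                  // ["x", "men"]
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("x-men")); // prints [[x-men], [xmen], [x, men]]
    }
}
```

The last variant spans two positions, which is why a single position increment cannot line it up with the one-token forms.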
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: dash-words
Posted by Martin Braun <mb...@uni-hd.de>.
Yonik Seeley schrieb:
> On 7/23/06, karl wettin <ka...@gmail.com> wrote:
>> I want to filter words with a dash in them.
>>
>> ["x-men"]
>> ["xmen"]
>> ["x", "men"]
>>
>> All of the above should be synonyms. The problem is ["x", "men"] requiring a
>> distance between the terms and thus also matching "x-men men".
>
> WordDelimiterFilter from Solr does this:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
>
I can recommend this too. I use it and it works fine! I just do a
LowerCaseFilter afterwards to avoid the downside:
"if source text is "powershot" then a query of "PowerShot" won't match!"
>
> It also has the false match problem you mention... "x xmen" would
> match a document with x-men, although this hasn't been a problem in
> practice.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server
>
--
Universitaetsbibliothek Heidelberg Tel: +49 6221 54-2580
Ploeck 107-109, D-69117 Heidelberg Fax: +49 6221 54-2623
Re: dash-words
Posted by Yonik Seeley <ys...@gmail.com>.
On 7/24/06, karl wettin <ka...@gmail.com> wrote:
> > WordDelimiterFilter from Solr does this
>
> > It also has the false match problem you mention...
>
> Will it affect a phrase query?
Yes... adding some slop to phrase queries is the best way to deal with that.
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
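A rough sketch of why slop helps, written as a toy model in plain Java rather than against the Lucene API. It assumes the position layout reported elsewhere in this thread (the catenated "xmen" shares the position of "men") and a simplified notion of slop: each matched position is shifted back by the term's offset in the phrase, and the spread of the shifted values must not exceed the slop. SloppyPhraseSketch and its methods are hypothetical names, and Lucene's real scorer is more involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of sloppy phrase matching; not Lucene's actual scorer.
final class SloppyPhraseSketch {

    // Position -> terms for "the x-men are here" (assumed layout).
    static Map<Integer, List<String>> xMenDocument() {
        Map<Integer, List<String>> doc = new TreeMap<>();
        doc.put(1, Arrays.asList("the"));
        doc.put(2, Arrays.asList("x"));
        doc.put(3, Arrays.asList("men", "xmen"));
        doc.put(4, Arrays.asList("are"));
        doc.put(5, Arrays.asList("here"));
        return doc;
    }

    static boolean matches(Map<Integer, List<String>> doc, List<String> phrase, int slop) {
        return search(doc, phrase, 0, Integer.MAX_VALUE, Integer.MIN_VALUE, slop);
    }

    // Tries every position of phrase term i, tracking the smallest and largest
    // value of (document position - offset in phrase) seen so far.
    private static boolean search(Map<Integer, List<String>> doc, List<String> phrase,
                                  int i, int min, int max, int slop) {
        if (i == phrase.size()) {
            return true;
        }
        for (Map.Entry<Integer, List<String>> e : doc.entrySet()) {
            if (!e.getValue().contains(phrase.get(i))) {
                continue;
            }
            int shifted = e.getKey() - i;
            int lo = Math.min(min, shifted);
            int hi = Math.max(max, shifted);
            if (hi - lo <= slop && search(doc, phrase, i + 1, lo, hi, slop)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> doc = xMenDocument();
        System.out.println(matches(doc, Arrays.asList("the", "x", "men", "are"), 0)); // true
        System.out.println(matches(doc, Arrays.asList("the", "xmen", "are"), 0));     // false
        System.out.println(matches(doc, Arrays.asList("the", "xmen", "are"), 1));     // true
    }
}
```

In this model "the xmen are" misses by exactly one position, because "x" occupies a slot that the query does not account for, so a slop of 1 is enough to recover the match.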
Re: dash-words
Posted by Yonik Seeley <ys...@gmail.com>.
On 7/25/06, karl wettin <ka...@gmail.com> wrote:
> On Mon, 2006-07-24 at 21:16 -0400, Yonik Seeley wrote:
>
> > > > I can't figure out what the parameters do. ;)
> >
> > Hopefully the wiki link I gave before will explain the parameters.
>
> Oh, I so totally missed that. Do you want me to java-doc it up and send
> you the patch?
Sure!
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: dash-words
Posted by karl wettin <ka...@gmail.com>.
On Mon, 2006-07-24 at 21:16 -0400, Yonik Seeley wrote:
> > > I can't figure out what the parameters do. ;)
>
> Hopefully the wiki link I gave before will explain the parameters.
Oh, I so totally missed that. Do you want me to java-doc it up and send
you the patch?
Re: dash-words
Posted by Yonik Seeley <ys...@gmail.com>.
> > I can't figure out what the parameters do. ;)
Hopefully the wiki link I gave before will explain the parameters.
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: dash-words
Posted by karl wettin <ka...@gmail.com>.
On Tue, 2006-07-25 at 11:42 -0400, Yonik Seeley wrote:
> > > Yes, it will fail without slop... I don't think there is a
> > > practical way around that.
It would of course be much easier if Lucene supported multiple token
dimensions instead of position increment only.
> the x-men are here
> x men
> xmen
I briefly talked about this on the dev list (some months ago) when
position boosts were up for discussion. I don't know how much work this
really is, but it would be such a nice feature.
Re: dash-words
Posted by Chris Hostetter <ho...@fucit.org>.
: with a query like this +arbeiterjugend +west-berlin I get no results.
:
: org.apache.lucene.queryParser.QueryParser.parse makes this query (with
: WordDelimiterFilter) with Default QueryParser.AND_OPERATOR:
:
: +titel:arbeiterjugend +titel:"west (berlin westberlin)"
:
: with +arbeiterjugend +westberlin I get the result.
:
: It seems that the synonyms don't work with the query. How do you solve
: this in Solr? Do I have to build a TermQuery?
First off, when using WordDelimiterFilter it's generally a good idea to
use a slightly different configuration of the Filter in your indexing
analyzer than in your query analyzer -- this is discussed a bit in the
wiki...
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
http://wiki.apache.org/solr/SolrRelevancyCookbook#head-353fcfa33e5c4a0a5959aa3d8d33c5a3a61f2683
...this can help avoid situations like you describe.
But in general, what you are running into is a basic constraint: Analyzers
can produce tokens with a "zero gap", indicating that they occupy the same
spot as the previous token, but there is no way for the analyzer to
indicate that a sequence of 1 or more tokens occupies the same space as
another sequence of 1 or more tokens. So when QueryParser asks the analyzer
to make a token stream out of "west-berlin", the analyzer has no way to
return a token stream that can easily be recognized as [ [[west]
[berlin]] or [westberlin] ].
This does in fact prove to be a large problem when dealing with "multi
word synonyms" (also discussed in the wiki mentioned above) but can
generally be dealt with in the WordDelimiterFilter.
-Hoss
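The constraint described above can be sketched with a toy model in plain Java (no Lucene). MultiWordSynonymSketch, its constants and its methods are hypothetical names. The first query shape is the positional one from the quoted parse, "west (berlin westberlin)", where each slot lists the terms allowed at that position. The second is the shape the analyzer cannot express, a choice between whole sequences. The sketch assumes the title is indexed as one token per word, since "westberlin" contains no delimiter to split on.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the two query shapes discussed above; not Lucene code.
final class MultiWordSynonymSketch {

    // Assumed index-time tokens of the title, one per position.
    static final List<String> TITLE =
            Arrays.asList("lage", "der", "arbeiterjugend", "in", "westberlin");

    // What the parsed query looks like: slot 0 = west, slot 1 = berlin OR westberlin.
    static final List<List<String>> PARSED_QUERY = Arrays.asList(
            Arrays.asList("west"),
            Arrays.asList("berlin", "westberlin"));

    // What would be needed: the sequence [west, berlin] OR the sequence [westberlin].
    static final List<List<String>> DESIRED_QUERY = Arrays.asList(
            Arrays.asList("west", "berlin"),
            Arrays.asList("westberlin"));

    // True if some window of the document fills every slot with an allowed term.
    static boolean matchesSlots(List<String> doc, List<List<String>> slots) {
        for (int start = 0; start + slots.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < slots.size() && ok; i++) {
                ok = slots.get(i).contains(doc.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    // True if any one of the alternative sequences occurs contiguously.
    static boolean matchesAnySequence(List<String> doc, List<List<String>> sequences) {
        for (List<String> sequence : sequences) {
            if (Collections.indexOfSubList(doc, sequence) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesSlots(TITLE, PARSED_QUERY));        // false
        System.out.println(matchesAnySequence(TITLE, DESIRED_QUERY)); // true
    }
}
```

The slot form fails because it always demands "west" in the first slot, and the title contains no such token.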
Re: dash-words
Posted by Martin Braun <mb...@uni-hd.de>.
Hi Yonik,
>> So a Phrase search to "The xmen story" will fail. With a slop of 1 the
>> doc will be found.
>>
>> But when generating the query I won't know when to use a slop. So adding
>> slops isn't a nice solution.
>
> If you can't tolerate slop, this is a problem.
I now use the WordDelimiterFilter without slop, because in other cases
it's an improvement. But I (or rather my app) have now stumbled over a
non-phrase query.
I am searching for this title (sorry for the German example):
"lage der arbeiterjugend in westberlin" (indexed with
WordDelimiterFilter + lowercase)
With a query like +arbeiterjugend +west-berlin I get no results.
org.apache.lucene.queryParser.QueryParser.parse makes this query (with
WordDelimiterFilter) with Default QueryParser.AND_OPERATOR:
+titel:arbeiterjugend +titel:"west (berlin westberlin)"
with +arbeiterjugend +westberlin I get the result.
It seems that the synonyms don't work with the query. How do you solve
this in Solr? Do I have to build a TermQuery?
thanks in advance,
martin
Re: dash-words
Posted by Yonik Seeley <ys...@gmail.com>.
On 7/25/06, Martin Braun <mb...@uni-hd.de> wrote:
> Hi Yonik,
>
> >> I can't figure out what the parameters do. ;)
> >
> > Yes, it will fail without slop... I don't think there is a practical
> > way around that.
>
> I am trying to analyze your WordDelimiterFilter.
>
> If I have x-men, after analyzing (with catenateAll) I get this:
>
>
> Analyzing "The x-men story"
> de.unihd.ub.ftsearch.WordIndexAnalyzer:
> [the] [x] [men] [xmen] [story]
>
>
> 1: [the:0->3:word]
> 2: [x:4->5:word]
> 3: [men:6->9:word] [xmen:4->9:word]
> 4: [story:10->15:word]
>
> 1: [the]
> 2: [x]
> 3: [men] [xmen]
> 4: [story]
>
>
> So a Phrase search to "The xmen story" will fail. With a slop of 1 the
> doc will be found.
>
> But when generating the query I won't know when to use a slop. So adding
> slops isn't a nice solution.
If you can't tolerate slop, this is a problem.
The only 100% solution that I could think of to this problem is to
re-index the entire stream (with a very large position gap in between)
for each variant.
"the x men story"
"the xmen story"
Problems:
1) combinatorial explosion very quickly (not practical at all)
2) messes up idfs pretty badly
Phrase slop is the easiest workaround, especially if you wanted slop anyway.
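A small sketch of the combinatorial explosion mentioned in problem 1, in plain Java with no Lucene dependency. StreamVariants is a hypothetical helper, and it makes a simplifying assumption: every dash-word contributes only two forms, fully split and fully joined. Each dash-word then doubles the number of complete streams that would have to be indexed.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the "re-index every variant" idea; not Lucene code.
final class StreamVariants {

    // Returns every whole-stream variant of the text, choosing for each
    // dash-word either the split form ("x men") or the joined form ("xmen").
    static List<String> streamVariants(String text) {
        List<String> streams = new ArrayList<>();
        streams.add("");
        for (String word : text.trim().split("\\s+")) {
            List<String> alternatives = new ArrayList<>();
            if (word.contains("-")) {
                alternatives.add(word.replace("-", " "));
                alternatives.add(word.replace("-", ""));
            } else {
                alternatives.add(word);
            }
            // Extend every stream built so far with every alternative.
            List<String> next = new ArrayList<>();
            for (String prefix : streams) {
                for (String alt : alternatives) {
                    next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                }
            }
            streams = next;
        }
        return streams;
    }

    public static void main(String[] args) {
        // prints [the x men story, the xmen story]
        System.out.println(streamVariants("the x-men story"));
        // three dash-words: prints 8
        System.out.println(streamVariants("x-men meet spider-man in west-berlin").size());
    }
}
```

With n dash-words in a field this yields 2^n copies of the stream, and every copy repeats the ordinary words as well, which is presumably what distorts the idfs in problem 2.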
> Would it be a solution to put the concatenated synonyms at both
> positions? Or are there any drawbacks to this?
I considered that too... but it increases false matches, and it still
doesn't fix many phrase queries.
> 1: [the]
> 2: [x] [xmen]
> 3: [men] [xmen]
> 4: [story]
While "the xmen" and "xmen story" will now both match, "the xmen
story" will still fail to match.
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
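The three outcomes described in this reply can be checked with a toy exact-phrase matcher in plain Java (no Lucene). BothPositionsSketch is a hypothetical name, and the position layout is the one proposed in the quoted message, with "xmen" placed at both position 2 and position 3.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy exact-phrase matcher over a position -> terms map; not Lucene code.
final class BothPositionsSketch {

    // The proposed layout: the catenated term is stored at both positions.
    static Map<Integer, List<String>> proposedLayout() {
        Map<Integer, List<String>> doc = new TreeMap<>();
        doc.put(1, Arrays.asList("the"));
        doc.put(2, Arrays.asList("x", "xmen"));
        doc.put(3, Arrays.asList("men", "xmen"));
        doc.put(4, Arrays.asList("story"));
        return doc;
    }

    // True if the phrase terms occur at consecutive positions (no slop).
    static boolean matchesExactly(Map<Integer, List<String>> doc, String... phrase) {
        for (int start : doc.keySet()) {
            boolean ok = true;
            for (int i = 0; i < phrase.length && ok; i++) {
                List<String> terms = doc.get(start + i);
                ok = terms != null && terms.contains(phrase[i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> doc = proposedLayout();
        System.out.println(matchesExactly(doc, "the", "xmen"));          // true
        System.out.println(matchesExactly(doc, "xmen", "story"));        // true
        System.out.println(matchesExactly(doc, "the", "xmen", "story")); // false
        System.out.println(matchesExactly(doc, "xmen", "xmen"));         // true, a false match
    }
}
```

"the xmen story" still fails because "xmen" would have to sit directly after "the" and directly before "story" at the same time, while the document keeps two positions between those words.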
Re: dash-words
Posted by Martin Braun <mb...@uni-hd.de>.
Hi Yonik,
>> I can't figure out what the parameters do. ;)
>
> Yes, it will fail without slop... I don't think there is a practical
> way around that.
I am trying to analyze your WordDelimiterFilter.
If I have x-men, after analyzing (with catenateAll) I get this:
Analyzing "The x-men story"
de.unihd.ub.ftsearch.WordIndexAnalyzer:
[the] [x] [men] [xmen] [story]
1: [the:0->3:word]
2: [x:4->5:word]
3: [men:6->9:word] [xmen:4->9:word]
4: [story:10->15:word]
1: [the]
2: [x]
3: [men] [xmen]
4: [story]
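The position listing above can be reproduced with a toy model in plain Java. This is not Solr's WordDelimiterFilter; DashTokenPositions is a hypothetical class that only imitates the reported behaviour under simple assumptions: lowercase the text, split on whitespace and dashes, give every part its own position, and emit the catenated form at the position of the last part (a position increment of 0).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the analysis output shown above; not Solr's WordDelimiterFilter.
final class DashTokenPositions {

    // Maps each 1-based position to the terms that occupy it.
    static Map<Integer, List<String>> analyze(String text) {
        Map<Integer, List<String>> positions = new TreeMap<>();
        int pos = 0;
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            List<String> parts = new ArrayList<>();
            for (String part : word.split("-")) {
                if (!part.isEmpty()) {
                    parts.add(part);
                }
            }
            for (String part : parts) {
                pos++; // position increment of 1 for every word part
                positions.computeIfAbsent(pos, k -> new ArrayList<>()).add(part);
            }
            if (parts.size() > 1) {
                // Catenated form: increment 0, so it shares the last part's position.
                positions.get(pos).add(String.join("", parts));
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // prints {1=[the], 2=[x], 3=[men, xmen], 4=[story]}
        System.out.println(analyze("The x-men story"));
    }
}
```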
So a Phrase search to "The xmen story" will fail. With a slop of 1 the
doc will be found.
But when generating the query I won't know when to use a slop. So adding
slops isn't a nice solution.
Would it be a solution to put the concatenated synonyms at both
positions? Or are there any drawbacks to this?
1: [the]
2: [x] [xmen]
3: [men] [xmen]
4: [story]
thanks,
martin
Re: dash-words
Posted by Yonik Seeley <ys...@gmail.com>.
On 7/24/06, karl wettin <ka...@gmail.com> wrote:
> On Mon, 2006-07-24 at 15:17 +0200, karl wettin wrote:
> > On Mon, 2006-07-24 at 15:15 +0200, karl wettin wrote:
> > > Yes, it affects PhraseQuery. Only "the x men are" will match.
> >
> > I'm stupid. Forget about it. I should of course analyze the query too.
>
> But still it fails on xmen. Could it have something to do with this:
> WordDelimiterFilter(ts, 1,1,0,0,0);
>
> ?
>
> I can't figure out what the parameters do. ;)
Yes, it will fail without slop... I don't think there is a practical
way around that.
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
Re: dash-words
Posted by karl wettin <ka...@gmail.com>.
On Mon, 2006-07-24 at 15:17 +0200, karl wettin wrote:
> On Mon, 2006-07-24 at 15:15 +0200, karl wettin wrote:
> > Yes, it affects PhraseQuery. Only "the x men are" will match.
>
> I'm stupid. Forget about it. I should of course analyze the query too.
But still it fails on xmen. Could it have something to do with this:
WordDelimiterFilter(ts, 1,1,0,0,0);
?
I can't figure out what the parameters do. ;)
The new code:
package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import java.io.Reader;
import java.util.HashSet;

public class TestWordDelimiterFilter {

    public static void main(String[] args) throws Exception {
        final String field = "field";

        // Index a single document that contains the dash-word.
        Directory dir = new RAMDirectory();
        Analyzer a = new Analyzer();
        IndexWriter w = new IndexWriter(dir, a, true);
        Document d = new Document();
        d.add(new Field(field, "the x-men are here", Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
        w.addDocument(d);
        w.close();

        // Parse each phrase with the same analyzer and print the hit count.
        IndexSearcher is = new IndexSearcher(dir);
        QueryParser qp = new QueryParser(field, a);
        System.out.println(is.search(qp.parse("\"the x men are\"")).length());
        System.out.println(is.search(qp.parse("\"the xmen are\"")).length()); // expected 1, get 0.
        System.out.println(is.search(qp.parse("\"the x-men are\"")).length());
        is.close();
        dir.close();
    }

    public static class Analyzer extends org.apache.lucene.analysis.Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Empty stop-word set, so "the" and "are" stay in the stream.
            TokenStream ts = new StandardAnalyzer(new HashSet()).tokenStream(fieldName, reader);
            // Arguments (assumed order, using the option names from the Solr wiki):
            // generateWordParts, generateNumberParts, catenateWords, catenateNumbers, catenateAll.
            ts = new WordDelimiterFilter(ts, 1,1,0,0,0);
            return ts;
        }
    }
}
Re: dash-words
Posted by karl wettin <ka...@gmail.com>.
On Mon, 2006-07-24 at 15:15 +0200, karl wettin wrote:
> Yes, it affects PhraseQuery. Only "the x men are" will match.
I'm stupid. Forget about it. I should of course analyze the query too.
Re: dash-words
Posted by karl wettin <ka...@gmail.com>.
On Mon, 2006-07-24 at 13:51 +0200, karl wettin wrote:
> On Mon, 2006-07-24 at 00:34 -0400, Yonik Seeley wrote:
> > > filter words with a dash
> > >
> > > ["x-men"]
> > > ["xmen"]
> > > ["x", "men"]
> > >
> > > The problem is ["x", "men"] requiring a distance between the terms
> > > and thus also matching "x-men men".
> >
> > WordDelimiterFilter from Solr does this
>
> > It also has the false match problem you mention...
>
> Will it affect a phrase query?
>
> I.e. would "the xmen are" be a no-match as the filtered index data
> would be "the x (men|xmen|x-men) are here"?
>
> I'll write a test now.
Yes, it affects PhraseQuery. Only "the x men are" will match.
package org.apache.solr.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import java.io.Reader;
import java.util.HashSet;

public class TestWordDelimiterFilter {

    public static void main(String[] args) throws Exception {
        final String field = "field";

        // Index a single document that contains the dash-word.
        Directory dir = new RAMDirectory();
        Analyzer a = new Analyzer();
        IndexWriter w = new IndexWriter(dir, a, true);
        Document d = new Document();
        d.add(new Field(field, "the x-men are here", Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO));
        w.addDocument(d);
        w.close();

        IndexSearcher is = new IndexSearcher(dir);

        // Terms are added as-is: a hand-built PhraseQuery is not run through the analyzer.
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "x-men"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "xmen"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        pq = new PhraseQuery();
        pq.add(new Term(field, "the"));
        pq.add(new Term(field, "x"));
        pq.add(new Term(field, "men"));
        pq.add(new Term(field, "are"));
        System.out.println(is.search(pq).length());

        is.close();
        dir.close();
    }

    public static class Analyzer extends org.apache.lucene.analysis.Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Empty stop-word set, so "the" and "are" stay in the stream.
            TokenStream ts = new StandardAnalyzer(new HashSet()).tokenStream(fieldName, reader);
            ts = new WordDelimiterFilter(ts, 1,1,0,0,0);
            return ts;
        }
    }
}
Re: dash-words
Posted by karl wettin <ka...@gmail.com>.
On Mon, 2006-07-24 at 00:34 -0400, Yonik Seeley wrote:
> > filter words with a dash
> >
> > ["x-men"]
> > ["xmen"]
> > ["x", "men"]
> >
> > The problem is ["x", "men"] requiring a distance between the terms
> > and thus also matching "x-men men".
>
> WordDelimiterFilter from Solr does this
> It also has the false match problem you mention...
Will it affect a phrase query?
I.e. would "the xmen are here" be a no-match as the filtered index data
would be "the x (men|xmen|x-men) are here"?
I'll write a test now.
Re: dash-words
Posted by Yonik Seeley <ys...@gmail.com>.
On 7/23/06, karl wettin <ka...@gmail.com> wrote:
> I want to filter words with a dash in them.
>
> ["x-men"]
> ["xmen"]
> ["x", "men"]
>
> All of the above should be synonyms. The problem is ["x", "men"] requiring a
> distance between the terms and thus also matching "x-men men".
WordDelimiterFilter from Solr does this:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089
It also has the false match problem you mention... "x xmen" would
match a document with x-men, although this hasn't been a problem in
practice.
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server