You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by legrand thomas <th...@yahoo.fr> on 2009/06/03 08:25:51 UTC

Re: Index and search terms containing character "-"

Hi,

A KeywordAnalyzer solved my problem. 
Luke allowed me to understand the queries and the content of the index.

Thanks  (Erick & Balasubramanian Sudaakeran)
Tom

--- En date de : Dim 31.5.09, Erick Erickson <er...@gmail.com> a écrit :

De: Erick Erickson <er...@gmail.com>
Objet: Re: Index and search terms containing character "-"
À: java-user@lucene.apache.org
Date: Dimanche 31 Mai 2009, 15h33

Simple analyzer does two things: splits tokens on non-letter characters and
lowercases them.

So, in your test you've indexed the tokens "jack" and "bauer" in your second
document, the hyphen is completely lost during tokenization and you have
two tokens for that document.

Using the term query "jack-bauer" is looking for a *single* token that is
exactly "jack-bauer", which is nowhere in your index. Term queries assume
that you know exactly what result you want, they don't try to transform the
input at all. Had you run "jack-bauer" through the query parser with
SimpleAnalyzer, you'd be searching for a document with the two terms
"jack" and "bauer", and you'd have hit your second document but not your
first.

I'd strongly recommend you get a copy of Luke, it's invaluable for questions
like this
because it lets you look at what's actually in your index. It'll also show
you how
queries get broken down when pushed through various analyzers...

BTW, nice test case for demonstrating what you were seeing, it makes
answering
*vastly* easier.....

HTH
Erick

On Sun, May 31, 2009 at 5:55 AM, legrand thomas <th...@yahoo.fr>wrote:

> Hi,
>
> I have a problem using TermQuery and FuzzyQuery for terms containing the
> character "-". Considering I've indexed "jack" and "jack-bauer" as 2
> tokenized captions, I get no result when searching for "jack-bauer".
> Moreover, "jack" with a TermQuery returns the two captions.
>
> What should I do to get "jack-bauer" with new TermQuery("jack-bauer") ?
>
> A full test case is given below.
>
> Thanks,
> Tom
>
>
> import junit.framework.Assert;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.store.FSDirectory;
> import org.junit.Test;
>
> public class IDebugIndexTest {
>
>     @Test
>     public void TermQueryTest() {
>
>         Analyzer analyser = new SimpleAnalyzer();
>
>         try {
>             // write docs to new index
>             IndexWriter writer = new IndexWriter(FSDirectory
>                     .getDirectory("/tmp/idx_test"), analyser, true);
>
>             Document jack = new Document();
>             jack.add(new Field("caption", "jack", Field.Store.YES,
>                     Field.Index.TOKENIZED));
>             writer.addDocument(jack);
>
>             Document jackBauer = new Document();
>             jackBauer.add(new Field("caption", "jack-bauer",
> Field.Store.YES,
>                     Field.Index.TOKENIZED));
>             writer.addDocument(jackBauer);
>
>             writer.close();
>
>             // try to search
>             IndexSearcher s = new
> IndexSearcher(IndexReader.open(FSDirectory
>                     .getDirectory("/tmp/idx_test")));
>
>             // The next assertion is ok
>             Hits jackHits = s
>             .search(new TermQuery(new Term("caption", "jack")));
>             Assert.assertEquals(jackHits.length(), 2);
>
>             // The next assertion fails !!!
>             Hits jackBauerHits = s.search(new TermQuery(new Term("caption",
>             "jack-bauer")));
>             Assert.assertEquals(jackBauerHits.length(), 1);
>
>         } catch (Exception e) {
>             Assert.fail();
>         }
>
>     }
> }
>
>
>
>
>
>



      

Re: Index and search terms containing character "-"

Posted by Erick Erickson <er...@gmail.com>.
Just be aware that KeywordAnalyzer won't tokenize at all. That is,if you
expect to index "jack-bauer" and hit on "jack" or "bauer" it won't.

Best
Erick

On Wed, Jun 3, 2009 at 2:25 AM, legrand thomas <th...@yahoo.fr>wrote:

> Hi,
>
> A KeywordAnalyzer solved my problem.
> Luke allowed me to understand the queries and the content of the index.
>
> Thanks  (Erick & Balasubramanian Sudaakeran)
> Tom
>
> --- En date de : Dim 31.5.09, Erick Erickson <er...@gmail.com> a
> écrit :
>
> De: Erick Erickson <er...@gmail.com>
> Objet: Re: Index and search terms containing character "-"
> À: java-user@lucene.apache.org
> Date: Dimanche 31 Mai 2009, 15h33
>
> Simple analyzer does two things: splits tokens on non-letter characters and
> lowercases them.
>
> So, in your test you've indexed the tokens "jack" and "bauer" in your
> second
> document, the hyphen is completely lost during tokenization and you have
> two tokens for that document.
>
> Using the term query "jack-bauer" is looking for a *single* token that is
> exactly "jack-bauer", which is nowhere in your index. Term queries assume
> that you know exactly what result you want, they don't try to transform the
> input at all. Had you run "jack-bauer" through the query parser with
> SimpleAnalyzer, you'd be searching for a document with the two terms
> "jack" and "bauer", and you'd have hit your second document but not your
> first.
>
> I'd strongly recommend you get a copy of Luke, it's invaluable for
> questions
> like this
> because it lets you look at what's actually in your index. It'll also show
> you how
> queries get broken down when pushed through various analyzers...
>
> BTW, nice test case for demonstrating what you were seeing, it makes
> answering
> *vastly* easier.....
>
> HTH
> Erick
>
> On Sun, May 31, 2009 at 5:55 AM, legrand thomas <thomaslegrand14@yahoo.fr
> >wrote:
>
> > Hi,
> >
> > I have a problem using TermQuery and FuzzyQuery for terms containing the
> > character "-". Considering I've indexed "jack" and "jack-bauer" as 2
> > tokenized captions, I get no result when searching for "jack-bauer".
> > Moreover, "jack" with a TermQuery returns the two captions.
> >
> > What should I do to get "jack-bauer" with new TermQuery("jack-bauer") ?
> >
> > A full test case is given below.
> >
> > Thanks,
> > Tom
> >
> >
> > import junit.framework.Assert;
> >
> > import org.apache.lucene.analysis.Analyzer;
> > import org.apache.lucene.analysis.SimpleAnalyzer;
> > import org.apache.lucene.document.Document;
> > import org.apache.lucene.document.Field;
> > import org.apache.lucene.index.IndexReader;
> > import org.apache.lucene.index.IndexWriter;
> > import org.apache.lucene.index.Term;
> > import org.apache.lucene.search.Hits;
> > import org.apache.lucene.search.IndexSearcher;
> > import org.apache.lucene.search.TermQuery;
> > import org.apache.lucene.store.FSDirectory;
> > import org.junit.Test;
> >
> > public class IDebugIndexTest {
> >
> >     @Test
> >     public void TermQueryTest() {
> >
> >         Analyzer analyser = new SimpleAnalyzer();
> >
> >         try {
> >             // write docs to new index
> >             IndexWriter writer = new IndexWriter(FSDirectory
> >                     .getDirectory("/tmp/idx_test"), analyser, true);
> >
> >             Document jack = new Document();
> >             jack.add(new Field("caption", "jack", Field.Store.YES,
> >                     Field.Index.TOKENIZED));
> >             writer.addDocument(jack);
> >
> >             Document jackBauer = new Document();
> >             jackBauer.add(new Field("caption", "jack-bauer",
> > Field.Store.YES,
> >                     Field.Index.TOKENIZED));
> >             writer.addDocument(jackBauer);
> >
> >             writer.close();
> >
> >             // try to search
> >             IndexSearcher s = new
> > IndexSearcher(IndexReader.open(FSDirectory
> >                     .getDirectory("/tmp/idx_test")));
> >
> >             // The next assertion is ok
> >             Hits jackHits = s
> >             .search(new TermQuery(new Term("caption", "jack")));
> >             Assert.assertEquals(jackHits.length(), 2);
> >
> >             // The next assertion fails !!!
> >             Hits jackBauerHits = s.search(new TermQuery(new
> Term("caption",
> >             "jack-bauer")));
> >             Assert.assertEquals(jackBauerHits.length(), 1);
> >
> >         } catch (Exception e) {
> >             Assert.fail();
> >         }
> >
> >     }
> > }
> >
> >
> >
> >
> >
> >
>
>
>
>
>