Posted to java-user@lucene.apache.org by lu...@nitwit.de on 2004/02/16 19:20:20 UTC

Word not in index

Hi!

I build a list of all unique words in all my docs from 
WhitespaceAnalyzer.tokenStream(). I also index all my docs using a 
GermanAnalyzer in another index. There are plenty of words in the word list 
that don't return any hits when searching the doc index built using the 
GermanAnalyzer - and these are not stop words.

Why is this?

Thanks a lot!
Timo

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Word not in index

Posted by lu...@nitwit.de.

On Monday 16 February 2004 20:56, Otis Gospodnetic wrote:
> Timo, by the nature of your questions it seems like you didn't see the
> Articles section of Lucene's site.  There are links to several articles
>
> --- lucene@nitwit.de wrote:
> > Well, not sure whether I understood.

Well, it was actually a case problem, too... :)

Re: Word not in index

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Timo, by the nature of your questions it seems like you didn't see the
Articles section of Lucene's site.  There are links to several articles
there.  A few of them explain indexing (intro + more advanced), at
least one explains QueryParser and maybe Analyzer, and a few explain
vanilla searching.

Otis

--- lucene@nitwit.de wrote:
> On Monday 16 February 2004 19:45, Markus Spath wrote:
> > Analyzers preprocess the text to be indexed; different Analyzers will
> > generate different text-tokens that are indexed. Only you can know which
> > Analyzer fits your needs, but you need to apply this one consistently for
> > indexing, searching and generating lists of unique words, if you want to
> > get predictable results.
> 
> Well, not sure whether I understood.
> 
> GermanAnalyzer - just as any other analyzer - indexes all words except
> stop words, right? What's actually the sense of a search engine if I
> cannot search for words in the text? :-)

Re: Word not in index

Posted by lu...@nitwit.de.

On Monday 16 February 2004 19:45, Markus Spath wrote:
> Analyzers preprocess the text to be indexed; different Analyzers will
> generate different text-tokens that are indexed. Only you can know which
> Analyzer fits your needs, but you need to apply this one consistently for
> indexing, searching and generating lists of unique words, if you want to
> get predictable results.

Well, not sure whether I understood.

GermanAnalyzer - just as any other analyzer - indexes all words except stop 
words, right? What's actually the sense of a search engine if I cannot search 
for words in the text? :-)

Re: Word not in index

Posted by Markus Spath <ms...@arcor.de>.

lucene@nitwit.de wrote:
> Hi!
> 
> I build a list of all unique words in all my docs from 
> WhitespaceAnalyzer.tokenStream(). I also index all my docs using a 
> GermanAnalyzer in another index. There are plenty of words in the word list 
> that don't return any hits when searching the doc index built using the 
> GermanAnalyzer - and these are not stop words.
> 
> Why is this?
> 

Analyzers preprocess the text to be indexed; different Analyzers will generate 
different text-tokens that are indexed. Only you can know which Analyzer fits 
your needs, but you need to apply this one consistently for indexing, searching 
and generating lists of unique words, if you want to get predictable results.
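
To make the mismatch concrete, here is a minimal sketch in plain Java. It does
not use Lucene at all: whitespaceTokens and normalizingTokens are hypothetical
stand-ins for "split on whitespace only" and "split on non-letters and
lowercase". A real GermanAnalyzer also stems and drops stop words, which this
toy does not model. It only shows why words taken from one token stream can be
missing from an index built with another.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: these are NOT Lucene classes or Lucene's real analysis chains.
class AnalyzerMismatchSketch {

    // Stand-in for whitespace-only analysis: split on whitespace, keep tokens as-is.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Hypothetical normalizing analysis: split on non-letters, then lowercase.
    static List<String> normalizingTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "Albert Einstein wurde in Ulm geboren.";

        // The "index" holds the terms produced by the normalizing analysis.
        Set<String> index = new HashSet<>(normalizingTokens(doc));

        // The word list comes from the whitespace analysis of the same text.
        for (String word : whitespaceTokens(doc)) {
            boolean rawHit = index.contains(word);
            boolean analyzedHit = index.containsAll(normalizingTokens(word));
            System.out.println(word + " -> raw lookup: " + rawHit
                    + ", analyzed lookup: " + analyzedHit);
        }
    }
}
```

In this sketch "Albert" and "geboren." miss when looked up verbatim, because
the index only holds "albert" and "geboren". Every word is found once it is
passed through the same analysis that built the index.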

markus

Re: Word not in index

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Timo,

You are asking a lot of good questions, but also questions for which 
answers already exist.  Just dig a little deeper and you will see.  
Have a look at my java.net article (titled "Lucene Intro") and you will 
find utility code that highlights how analyzers work.  Tinker with that a 
bit, then come back and ask further questions :))

	Erik


On Feb 16, 2004, at 1:35 PM, lucene@nitwit.de wrote:

> On Monday 16 February 2004 19:20, lucene@nitwit.de wrote:
>> Why is this?
>
> Another curiosity is that apparently the case does matter:
> "albert" (Einstein :) does return hits, but "Albert" does not - even
> though the docs contain "Albert" and not "albert".
>
> Can somebody explain?
>
> Thanks!
> Timo

Re: Word not in index

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Custom? :)

Otis

--- lucene@nitwit.de wrote:
> On Monday 16 February 2004 19:57, Otis Gospodnetic wrote:
> > Searches ARE case sensitive, it is just that some Analyzers lowercase
> > all tokens.  If you are using WhitespaceAnalyzer, then tokens will not
> 
> GermanAnalyzer apparently is one of them. Too bad :-( Is there a 
> case-sensitive alternative out there?

Re: Word not in index

Posted by lu...@nitwit.de.

On Monday 16 February 2004 19:57, Otis Gospodnetic wrote:
> Searches ARE case sensitive, it is just that some Analyzers lowercase
> all tokens.  If you are using WhitespaceAnalyzer, then tokens will not

GermanAnalyzer apparently is one of them. Too bad :-( Is there a 
case-sensitive alternative out there?

Re: Word not in index

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Searches ARE case sensitive, it is just that some Analyzers lowercase
all tokens.  If you are using WhitespaceAnalyzer, then tokens will not
be lowercased, so a search for albert and Albert may yield different
results.
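
As a rough illustration of that point, here is a toy inverted index in plain
Java. It is an assumption-laden sketch, not Lucene's implementation: the class
CaseSensitiveLookupSketch and its lowercase flag are made up for this example.
Term lookup is an exact string match, so whether "Albert" or "albert" hits
depends only on what the index-time analysis stored.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; an illustration only, not Lucene's implementation.
class CaseSensitiveLookupSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final boolean lowercase;

    CaseSensitiveLookupSketch(boolean lowercase) {
        this.lowercase = lowercase;
    }

    // Index-time "analysis": whitespace split, optionally lowercased.
    void add(int docId, String text) {
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            String term = lowercase ? token.toLowerCase(Locale.ROOT) : token;
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact, case-sensitive term lookup; the query term is NOT analyzed here.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        // Index built with a lowercasing analysis: only "albert" is stored.
        CaseSensitiveLookupSketch lowercased = new CaseSensitiveLookupSketch(true);
        lowercased.add(1, "Albert Einstein");
        System.out.println(lowercased.search("albert")); // [1]
        System.out.println(lowercased.search("Albert")); // []

        // Index built without lowercasing: only "Albert" is stored.
        CaseSensitiveLookupSketch verbatim = new CaseSensitiveLookupSketch(false);
        verbatim.add(1, "Albert Einstein");
        System.out.println(verbatim.search("albert")); // []
        System.out.println(verbatim.search("Albert")); // [1]
    }
}
```

The first case matches what Timo reported: the document text says "Albert",
but the stored term is "albert", so only the lowercase query finds it unless
the query is run through the same analysis first.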

Otis

--- lucene@nitwit.de wrote:
> On Monday 16 February 2004 19:20, lucene@nitwit.de wrote:
> > Why is this?
> 
> Another curiosity is that apparently the case does matter: 
> "albert" (Einstein :) does return hits, but "Albert" does not - even
> though the docs contain "Albert" and not "albert".
> 
> Can somebody explain?
> 
> Thanks!
> Timo

Re: Word not in index

Posted by lu...@nitwit.de.

On Monday 16 February 2004 19:20, lucene@nitwit.de wrote:
> Why is this?

Another curiosity is that apparently the case does matter: 
"albert" (Einstein :) does return hits, but "Albert" does not - even though 
the docs contain "Albert" and not "albert".

Can somebody explain?

Thanks!
Timo
