You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by mchaput <mc...@aw.sgi.com> on 2003/04/21 22:56:09 UTC
Bigram search (help!)
Hi all,
Well, I got around my previous problem by switching to a different HTML
parser.
Now I have an even more subtle and frustrating problem! :(
I'm using Che Dong's CJKTokenizer/CJKAnalyzer to do bigram tokenizing of
Japanese text, with an unpatched Lucene 1.3RC1.
The tokenizer is working, here's the debug output of the tokens as they
go by (using a WinDVD help file as a test):
\u30ba\u30fc
\u30fc\u30e0
windvd
\u3067\u306f
\u4efb\u610f
\u610f\u306e
\u306e\u9078
The terms are showing up properly in the index (dumping the terms from
the index shows the character pairs are there).
When I create a query with search string \u30ba\u30fc\u30e0 I get
something reasonable:
contents:"\u30ba\u30fc \u30fc\u30e0 "
(class org.apache.lucene.search.PhraseQuery)
So far so good, *BUT*, searching for this query gives no results! As you
can see from the token stream above, this query SHOULD work, but it doesn't.
I'm at a loss. Can anyone think of what might be going wrong?
--
|
Matt Chaput | A l i a s | W a v e f r o n t
Information Designer | 210 King St. E. Toronto, ON, Canada M5A 1J7
mchaput@aw.sgi.com | (416) 874-8268
|
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Bigram search (help!)
Posted by Che Dong <ch...@hotmail.com>.
Do you use same analyser while Indexing and searching?
Che, Dong
http://www.chedong.com
----- Original Message -----
From: "mchaput" <mc...@aw.sgi.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, April 22, 2003 4:56 AM
Subject: Bigram search (help!)
> Hi all,
>
> Well, I got around my previous problem by switching to a different HTML
> parser.
>
> Now I have an even more subtle and frustrating problem! :(
>
> I'm using Che Dong's CJKTokenizer/CJKAnalyzer to do bigram tokenizing of
> Japanese text, with an unpatched Lucene 1.3RC1.
>
> The tokenizer is working, here's the debug output of the tokens as they
> go by (using a WinDVD help file as a test):
>
> \u30ba\u30fc
> \u30fc\u30e0
> windvd
> \u3067\u306f
> \u4efb\u610f
> \u610f\u306e
> \u306e\u9078
>
> The terms are showing up properly in the index (dumping the terms from
> the index shows the character pairs are there).
>
> When I create a query with search string \u30ba\u30fc\u30e0 I get
> something reasonable:
>
> contents:"\u30ba\u30fc \u30fc\u30e0 "
> (class org.apache.lucene.search.PhraseQuery)
>
> So far so good, *BUT*, searching for this query gives no results! As you
> can see from the token stream above, this query SHOULD work, but it doesn't.
>
> I'm at a loss. Can anyone think of what might be going wrong?
>
>
> --
> |
> Matt Chaput | A l i a s | W a v e f r o n t
> Information Designer | 210 King St. E. Toronto, ON, Canada M5A 1J7
> mchaput@aw.sgi.com | (416) 874-8268
> |
> "A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
Re: Bigram search (help!)
Posted by mchaput <mc...@aw.sgi.com>.
I forgot to mention a couple of things that might help narrow down the
problem:
Searching for two characters (ie a bigram pair term that is known to be
in the index) returns 0 results also.
Searching for a single character (ie a single Japanese character term
that is known to be in the index) *DOES* find results.
Very strange.
Thanks,
Matt
--
|
Matt Chaput | A l i a s | W a v e f r o n t
Information Designer | 210 King St. E. Toronto, ON, Canada M5A 1J7
mchaput@aw.sgi.com | (416) 874-8268
|
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org