Posted to java-user@lucene.apache.org by mchaput <mc...@aw.sgi.com> on 2003/04/21 22:56:09 UTC

Bigram search (help!)

Hi all,

Well, I got around my previous problem by switching to a different HTML 
parser.

Now I have an even more subtle and frustrating problem! :(

I'm using Che Dong's CJKTokenizer/CJKAnalyzer to do bigram tokenizing of 
Japanese text, with an unpatched Lucene 1.3RC1.

The tokenizer is working. Here's the debug output of the tokens as they 
go by (using a WinDVD help file as a test):

\u30ba\u30fc
\u30fc\u30e0
windvd
\u3067\u306f
\u4efb\u610f
\u610f\u306e
\u306e\u9078
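
For readers unfamiliar with bigram tokenizing, the overlapping pairs above 
can be reproduced with a minimal plain-Java sketch. This is not the 
CJKTokenizer source: the class name BigramSketch and the helper bigrams() 
are hypothetical, and emitting a lone character as its own token is an 
assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Hypothetical helper (not part of Lucene or CJKTokenizer):
    // splits one run of CJK characters into overlapping two-character tokens.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        // Assumption: a run of exactly one character becomes a single token.
        if (run.length() == 1) {
            out.add(run);
            return out;
        }
        // Slide a two-character window over the run, one character at a time.
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // The three-character search string from this message.
        for (String token : bigrams("\u30ba\u30fc\u30e0")) {
            StringBuilder escaped = new StringBuilder();
            for (char c : token.toCharArray()) {
                // Print each character as a four-digit hex escape.
                escaped.append(String.format("\\u%04x", (int) c));
            }
            System.out.println(escaped);
        }
    }
}
```

Run on the three-character search string, this prints the same two pairs 
that head the token dump above.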

The terms are showing up properly in the index (dumping the terms from 
the index shows the character pairs are there).

When I create a query with search string \u30ba\u30fc\u30e0 I get 
something reasonable:

contents:"\u30ba\u30fc \u30fc\u30e0 "
(class org.apache.lucene.search.PhraseQuery)

So far so good, *BUT*, searching for this query gives no results! As you 
can see from the token stream above, this query SHOULD work, but it doesn't.

I'm at a loss. Can anyone think of what might be going wrong?


-- 
                       |
Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7
mchaput@aw.sgi.com    |   (416) 874-8268
                       |
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Bigram search (help!)

Posted by Che Dong <ch...@hotmail.com>.
Do you use the same analyzer when indexing and searching?
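
To illustrate why this matters, here is a rough plain-Java sketch of what 
can happen when the index and the query are tokenized differently. It does 
not use Lucene at all: the class AnalyzerMismatchSketch and its methods are 
hypothetical stand-ins, and it is not a diagnosis of the problem reported 
in this thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalyzerMismatchSketch {
    // Stand-in for a bigram analyzer (hypothetical; not the CJKAnalyzer source).
    static List<String> bigramTerms(String text) {
        List<String> out = new ArrayList<>();
        if (text.length() == 1) {
            out.add(text);
            return out;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    // Stand-in for an analyzer that emits one term per character.
    static List<String> unigramTerms(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    // A query can only match if every one of its terms exists in the index.
    static boolean allTermsIndexed(Set<String> index, List<String> queryTerms) {
        return !queryTerms.isEmpty() && index.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String text = "\u30ba\u30fc\u30e0";
        // "Index" the text with the bigram stand-in.
        Set<String> index = new HashSet<>(bigramTerms(text));
        // Same tokenization on both sides: the query terms are in the index.
        System.out.println(allTermsIndexed(index, bigramTerms(text)));  // true
        // Different tokenization at search time: none of the terms exist.
        System.out.println(allTermsIndexed(index, unigramTerms(text))); // false
    }
}
```

The point of the sketch is only that query terms are looked up exactly as 
the search-time analyzer produced them, so they have to line up with what 
the index-time analyzer wrote.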

Che, Dong
http://www.chedong.com


Re: Bigram search (help!)

Posted by mchaput <mc...@aw.sgi.com>.
I forgot to mention a couple of things that might help narrow down the 
problem:

Searching for two characters (i.e., a bigram pair term that is known to be 
in the index) also returns 0 results.

Searching for a single character (i.e., a single Japanese character term 
that is known to be in the index) *DOES* find results.

Very strange.

Thanks,

Matt
