Posted to java-user@lucene.apache.org by Koji Sekiguchi <ko...@m4.dion.ne.jp> on 2005/09/06 04:22:05 UTC

Highlighter apply to Japanese

Hi again,

I'm using the Highlighter to highlight terms in Japanese text,
but I cannot get the output I expect.

If I use StandardAnalyzer or SnowballAnalyzer with English text,
getBestFragment() returns the expected output:

Sample: (SnowballAnalyzer)
Text: A meeting will be held in the City Hall
TokenStream:
[a][meet][will][be][held][in][the][citi][hall]
Query Text: meet
Output: A <B>meeting</B> will be held in the City Hall

But if I use JapaneseAnalyzer, the most popular Analyzer in Japan
for getting a TokenStream from Japanese text, the Highlighter
highlights the whole text:

Sample: (JapaneseAnalyzer)
Text: AMeetingWillBeHeldInTheCityHall
TokenStream:
[A][Meeting][Will][Be][Held][In][The][City][Hall]
Query Text: Meeting
Output: <B>AMeetingWillBeHeldInTheCityHall</B>

Please note that I used alphabetic characters for the Text in the second
sample so that most readers of this mailing list can read it; in reality,
the Text was written in Japanese characters. As you can see, JapaneseAnalyzer,
which uses a Japanese dictionary in the background to extract tokens
from the text stream, recognizes the tokens and produces a TokenStream.
But highlighter.getBestFragment() highlighted the whole text.

Do I need to implement a Fragmenter to highlight tokens correctly
for Japanese text?

Thanks in advance,

Koji




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Highlighter apply to Japanese

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.

Hi Chris,

Thank you for your info.
With CJKAnalyzer, the diagnostics are as follows:

	pos	start	end
	Inc	Ofst	Ofst
[Aa]	1	0	2
[aa]	1	1	3
[aB]	1	2	4
[BC]	1	3	5
[Cc]	1	4	6
[cD]	1	5	7
[Dd]	1	6	8
[dE]	1	7	9
[EF]	1	8	10
[FG]	1	9	11
[Gg]	1	10	12
[gH]	1	11	13
[Hh]	1	12	14
[hI]	1	13	15
[Ii]	1	14	16
[iJ]	1	15	17
[JK]	1	16	18
[Kk]	1	17	19
[kL]	1	18	20
[LM]	1	19	21
[Mm]	1	20	22
[mN]	1	21	23

<B>AaaBCcDdEFGgHhIiJKkLMmN</B>

CJKAnalyzer produces a TokenStream in which all tokens overlap,
as Mark pointed out.
But JapaneseAnalyzer produces a stream of tokens that
do not overlap, as I showed in my previous mail.
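
For illustration, the overlap in the table above can be reproduced without Lucene. This is a minimal sketch, assuming the analyzer emits one token per adjacent character pair as the table suggests; BigramOffsets and its methods are hypothetical names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the bigram tokenization shown in the table above.
// It only pairs adjacent characters; it is not CJKAnalyzer itself.
final class BigramOffsets {

    // Returns {startOffset, endOffset} for every adjacent character pair in text.
    static List<int[]> bigrams(String text) {
        List<int[]> offsets = new ArrayList<int[]>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            offsets.add(new int[] { i, i + 2 });
        }
        return offsets;
    }

    // True if every token after the first starts before the previous token ends.
    static boolean allOverlap(List<int[]> offsets) {
        for (int i = 1; i < offsets.size(); i++) {
            if (offsets.get(i)[0] >= offsets.get(i - 1)[1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "AaaBCcDdEFGgHhIiJKkLMmN";
        for (int[] o : bigrams(text)) {
            System.out.println("[" + text.substring(o[0], o[1]) + "]\t" + o[0] + "\t" + o[1]);
        }
        System.out.println("all overlap: " + allOverlap(bigrams(text)));
    }
}
```

Each bigram starts one character before the previous one ends, so a highlighter that merges tokens with overlapping offsets would chain the whole string into a single group.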

BTW, I couldn't find CJKHighlighter and CJKHighlighterAnalyzer in
the sandbox...

Koji

> -----Original Message-----
> From: Chris Lu [mailto:chris.lu@gmail.com] 
> Sent: Tuesday, September 06, 2005 3:53 PM
> To: java-user@lucene.apache.org
> Subject: Re: Highlighter apply to Japanese
> 
> 
> Hi, Koji,
> 
> I had the same problem as you. This is because CJK's n-gram analysis
> is different from single character's.
> 
> My get around is to use CJKHighlighter and 
> CJKHighlightAnalyzer in sandbox.

Re: Highlighter apply to Japanese

Posted by Chris Lu <ch...@gmail.com>.

Hi, Koji,

I had the same problem as you. This is because CJK n-gram analysis
differs from single-character analysis.

My workaround is to use CJKHighlighter and CJKHighlightAnalyzer from the sandbox.

-- 
Chris Lu
------------
Lucene Search RAD on Any Database
http://www.dbsight.net



RE: Highlighter apply to Japanese

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.

Hi Mark,

With the change, the problem was completely solved!

Sample: (JapaneseAnalyzer)
Text: AMeetingWillBeHeldInTheCityHall

TokenStream:
[A][Meeting][Will][Be][Held][In][The][City][Hall]

Query Text: Meeting
Output: A<B>Meeting</B>WillBeHeldInTheCityHall

Query Text: CityHall
Output: AMeetingWillBeHeldInThe<B>City</B><B>Hall</B>

If I use CJKAnalyzer, which produces a stream of tokens
that all overlap, the problem still occurs,
but with JapaneseAnalyzer the highlighter works fine.
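
The outputs above follow from treating each non-overlapping token as its own group. Here is a rough sketch, assuming contiguous offsets like those of the JapaneseAnalyzer stream and assuming a token is highlighted when its text equals a query term. HighlightSketch and highlight are hypothetical names, not the Highlighter API, and no stemming or scoring is modelled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, Lucene-free sketch of per-token highlighting.
final class HighlightSketch {

    // Wraps each token whose text is in queryTerms in <B>...</B>.
    // offsets[i] = {startOffset, endOffset}; tokens must be sorted and non-overlapping.
    static String highlight(String text, int[][] offsets, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] o : offsets) {
            out.append(text, last, o[0]);          // copy text between tokens unchanged
            String term = text.substring(o[0], o[1]);
            if (queryTerms.contains(term)) {
                out.append("<B>").append(term).append("</B>");
            } else {
                out.append(term);
            }
            last = o[1];
        }
        out.append(text.substring(last));          // copy text after the last token
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "AMeetingWillBeHeldInTheCityHall";
        // Offsets for [A][Meeting][Will][Be][Held][In][The][City][Hall]
        int[][] offsets = {
            {0, 1}, {1, 8}, {8, 12}, {12, 14}, {14, 18}, {18, 20}, {20, 23}, {23, 27}, {27, 31}
        };
        Set<String> meeting = new HashSet<String>(Arrays.asList("Meeting"));
        Set<String> cityHall = new HashSet<String>(Arrays.asList("City", "Hall"));
        // prints A<B>Meeting</B>WillBeHeldInTheCityHall
        System.out.println(highlight(text, offsets, meeting));
        // prints AMeetingWillBeHeldInThe<B>City</B><B>Hall</B>
        System.out.println(highlight(text, offsets, cityHall));
    }
}
```

Because City and Hall are separate groups, each gets its own pair of tags even though they are adjacent in the text.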

Thank you very much,

Koji


> -----Original Message-----
> From: mark harwood [mailto:markharw00d@yahoo.co.uk] 
> Sent: Tuesday, September 06, 2005 7:22 PM
> To: java-user@lucene.apache.org
> Subject: RE: Highlighter apply to Japanese
> 
> 
> Try change TokenGroup.isDistinct();
> 
> Maybe the offset test code should be >= rather than >
> ie
> 
> 	boolean isDistinct(Token token)
> 	{
> 		return token.startOffset()>=endOffset;
> 	}
> 
> I've just tried the change with the Junit test and all
> seems well still with the non CJK tests.

RE: Highlighter apply to Japanese

Posted by mark harwood <ma...@yahoo.co.uk>.

Try changing TokenGroup.isDistinct().

Maybe the offset test should use >= rather than >,
i.e.

	boolean isDistinct(Token token)
	{
		return token.startOffset() >= endOffset;
	}

I've just tried the change with the JUnit tests and all
still seems well with the non-CJK tests.
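
To see why the comparison matters, here is a rough, Lucene-free sketch of offset-based grouping. It assumes a group's endOffset is the largest end offset seen so far in that group. GroupingSketch, countGroups and the inclusive flag are hypothetical names for illustration, not the actual TokenGroup code.

```java
// Hypothetical sketch: counts how many groups an offset comparison produces.
final class GroupingSketch {

    // A token is "distinct" from the current group if
    //   startOffset >  endOffset   (inclusive == false), or
    //   startOffset >= endOffset   (inclusive == true).
    static int countGroups(int[][] offsets, boolean inclusive) {
        int groups = 0;
        int endOffset = -1;                 // end of the current group; unused until a group exists
        for (int[] o : offsets) {
            boolean distinct = groups == 0
                    || (inclusive ? o[0] >= endOffset : o[0] > endOffset);
            if (distinct) {
                groups++;
                endOffset = o[1];
            } else {
                endOffset = Math.max(endOffset, o[1]);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Contiguous tokens, no whitespace: each start equals the previous end.
        int[][] contiguous = { {0, 3}, {3, 4}, {4, 6}, {6, 8} };
        // Whitespace-separated tokens, as in "A meeting will".
        int[][] spaced = { {0, 1}, {2, 9}, {10, 14} };

        System.out.println(countGroups(contiguous, false)); // prints 1
        System.out.println(countGroups(contiguous, true));  // prints 4
        System.out.println(countGroups(spaced, false));     // prints 3
    }
}
```

With >, contiguous tokens collapse into one group, which would explain why the whole text was highlighted, while whitespace-separated English tokens are split either way.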



		

RE: Highlighter apply to Japanese

Posted by Koji Sekiguchi <ko...@m4.dion.ne.jp>.

I added the diagnostic code you suggested, and the result is as follows:

Text: AaaBCcDdEFGgHhIiJKkLMmN

	Pos	start	end
	Inc	Ofst	Ofst
[Aaa]	1	0	3
[B]	1	3	4
[Cc]	1	4	6
[Dd]	1	6	8
[E]	1	8	9
[F]	1	9	10
[Gg]	1	10	12
[Hh]	1	12	14
[Ii]	1	14	16
[J]	1	16	17
[Kk]	1	17	19
[L]	1	19	20
[Mm]	1	20	22
[N]	1	22	23

Output:
<B>AaaBCcDdEFGgHhIiJKkLMmN</B>

It seems to me that JapaneseAnalyzer produces
correct tokens.

Any thoughts?

Koji

> -----Original Message-----
> From: markharw00d [mailto:markharw00d@yahoo.co.uk] 
> Sent: Tuesday, September 06, 2005 3:37 PM
> To: java-user@lucene.apache.org
> Subject: Re: Highlighter apply to Japanese
> 
> 
> I don't know the behaviour of the Japanese Analyzer you are using.
> Can you add to your example diagnosis the Token.getPositionIncrement, 
> Token.startOffset and Token.endOffset for each of the tokens?
> 
> The highlighter groups tokens with overlapping start and end offsets 
> into a single TokenGroup for the purposes of highlighting. 
> This allows 
> TokenStreams which produce multiple synonyms for the same 
> source token 
> to work. This behaviour was also required to get the CJKAnalyzer to 
> work. It could be that the Analyzer you are using is 
> producing a stream 
> of tokens which *all* overlap?
> 
> Cheers
> Mark

Re: Highlighter apply to Japanese

Posted by markharw00d <ma...@yahoo.co.uk>.

I don't know the behaviour of the Japanese Analyzer you are using.
Can you add Token.getPositionIncrement(), Token.startOffset() and
Token.endOffset() for each of the tokens to your example diagnostics?

The highlighter groups tokens with overlapping start and end offsets 
into a single TokenGroup for the purposes of highlighting. This allows 
TokenStreams which produce multiple synonyms for the same source token 
to work. This behaviour was also required to get the CJKAnalyzer to 
work. It could be that the Analyzer you are using is producing a stream 
of tokens which *all* overlap?
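
As an illustration of the grouping described above, here is a minimal sketch in which a synonym emitted at the same offsets as its source token lands in the same group. It is an assumption-laden stand-in, not the highlighter's code: SynonymGroupSketch, Tok and group are hypothetical names, and "gather" is an invented synonym for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of grouping tokens whose offsets overlap.
final class SynonymGroupSketch {

    // Minimal token: term text plus character offsets into the source text.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // A token joins the current group if it starts before the group's end;
    // otherwise it opens a new group.
    static List<List<String>> group(List<Tok> tokens) {
        List<List<String>> groups = new ArrayList<List<String>>();
        int groupEnd = -1;
        for (Tok t : tokens) {
            if (groups.isEmpty() || t.start >= groupEnd) {
                groups.add(new ArrayList<String>());
                groupEnd = t.end;
            } else {
                groupEnd = Math.max(groupEnd, t.end);
            }
            groups.get(groups.size() - 1).add(t.term);
        }
        return groups;
    }

    public static void main(String[] args) {
        // "A meeting will", with an invented synonym stacked on "meeting".
        List<Tok> tokens = Arrays.asList(
                new Tok("a", 0, 1),
                new Tok("meet", 2, 9),
                new Tok("gather", 2, 9),
                new Tok("will", 10, 14));
        System.out.println(group(tokens)); // prints [[a], [meet, gather], [will]]
    }
}
```

Highlighting per group rather than per token means the source word is wrapped once, no matter how many synonyms share its offsets.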

Cheers
Mark


		