Posted to java-user@lucene.apache.org by Matthew Hall <mh...@informatics.jax.org> on 2009/06/30 16:41:19 UTC

Re: Highlighter fails using JapaneseAnalyzer

Does the same thing happen when you use SimpleAnalyzer, or StandardAnalyzer?

I have a sneaking suspicion that the : in your contents string is what's
causing your issue here, as : is a reserved character that denotes a
field specification. But I could be wrong.

Try swapping analyzers: if you no longer have the same issue with
Simple, try Standard. If the same problem shows up there, I think
you might need to do something about the :.

Matt

k.sayama wrote:
> hello.
>
> I've tried to highlight a string using Highlighter (2.4.1) and
> JapaneseAnalyzer,
> but the following code extract shows the problem.
>
> String F = "f";
> String CONTENTS = "AAA :BBB CCC";
> JapaneseAnalyzer analyzer = new JapaneseAnalyzer();
> QueryParser qp = new QueryParser( F, analyzer );
> Query query = qp.parse( "BBB" );
> Highlighter h = new Highlighter( new QueryScorer( query, F ) );
>
> System.out.println( h.getBestFragment( analyzer, F, CONTENTS ) );
>
> The system outputs
> <B>AAA</B> :BBB CCC
>
> When you change CONTENTS to "AAA _BBB CCC"
> the system outputs
>
> AAA _<B>BBB</B> CCC
>
> Are there any problems?
> Thanks in advance
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


-- 
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Highlighter fails using JapaneseAnalyzer

Posted by Matthew Hall <mh...@informatics.jax.org>.

Out of curiosity, when you try your other test string "aaa _bbb ccc" 
what do the token byte offsets show?

Matt

Mark Harwood wrote:
>
> On 1 Jul 2009, at 17:39, k.sayama wrote:
>
>> I could verify Token byte offsets
>>
>> The system outputs
>> aaa:0:3
>> bbb:0:3
>> ccc:4:7
>>
>
> That explains the highlighter behaviour. Clearly BBB is not at 
> position 0-3 in the String you supplied
>
>>>> String CONTENTS = "AAA :BBB CCC";
>
> Looks like the Tokenizer needs fixing. Is this yours or a standard 
> Lucene class? If the latter, raising a JIRA bug with a JUnit test
> would be the best way to get things moving.
>
>
> Cheers
> Mark
>



Re: Highlighter fails using JapaneseAnalyzer

Posted by "k.sayama" <sa...@nifty.com>.

Hi

The Tokenizer is not a standard Lucene class.
To get startOffset and endOffset right, I edited the Tokenizer.
It is operating correctly now.

I want to verify more patterns. 

thanks
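
The edited Tokenizer was not posted to the list. As a rough illustration only, the sketch below shows one way a tokenizer can keep startOffset and endOffset tied to the original text: it walks the whole string with a single index that is never reset at a delimiter such as ':'. The names OffsetTrackingTokenizer, Tok and tokenize are hypothetical, and this is plain Java, not the Lucene Tokenizer API and not the actual fix made to the JapaneseAnalyzer tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a tokenizer; not a Lucene class.
final class OffsetTrackingTokenizer {

    // A term plus its character offsets into the original text.
    static final class Tok {
        final String term;
        final int start; // inclusive
        final int end;   // exclusive

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + ":" + start + ":" + end;
        }
    }

    // Splits on anything that is not a letter or digit, lower-cases each term,
    // and records offsets measured from the start of the whole input.
    static List<Tok> tokenize(String text) {
        List<Tok> tokens = new ArrayList<Tok>();
        int i = 0; // absolute position; deliberately never reset
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip delimiters such as ' ', ':' and '_'
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new Tok(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("AAA :BBB CCC")) {
            System.out.println(t);
        }
    }
}
```

For "AAA :BBB CCC" this prints aaa:0:3, bbb:5:8 and ccc:9:12, so each offset pair points back at the characters the term came from.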


Re: Highlighter fails using JapaneseAnalyzer

Posted by Mark Harwood <ma...@yahoo.co.uk>.

On 1 Jul 2009, at 17:39, k.sayama wrote:

> I could verify Token byte offsets
>
> The system outputs
> aaa:0:3
> bbb:0:3
> ccc:4:7
>

That explains the highlighter behaviour. Clearly BBB is not at  
position 0-3 in the String you supplied

>>> String CONTENTS = "AAA :BBB CCC";

Looks like the Tokenizer needs fixing. Is this yours or a standard  
Lucene class? If the latter, raising a JIRA bug with a JUnit test
would be the best way to get things moving.

Cheers
Mark
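
To see why a wrong offset moves the highlight, here is a minimal sketch of offset-based highlighting. It is a deliberate simplification and not the Lucene Highlighter: the class OffsetHighlightSketch and its highlight method are hypothetical, and the method simply wraps whatever characters lie between the given start and end offsets in <B> tags.

```java
// Hypothetical illustration; the real Highlighter does much more than this.
final class OffsetHighlightSketch {

    // Wraps text[start, end) in <B>...</B>, trusting the offsets blindly.
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<B>" + text.substring(start, end) + "</B>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String contents = "AAA :BBB CCC";
        // Offsets reported in the thread for the token "bbb".
        System.out.println(highlight(contents, 0, 3));
        // Offsets where "BBB" actually sits in the string.
        System.out.println(highlight(contents, 5, 8));
    }
}
```

With the reported offsets 0-3 the tags land on AAA, which matches the output in the original report; with 5-8 they land on BBB.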


Re: Highlighter fails using JapaneseAnalyzer

Posted by "k.sayama" <sa...@nifty.com>.

I could verify Token byte offsets

The system outputs
aaa:0:3
bbb:0:3
ccc:4:7

The offset is reset to 0 partway through the string.

Is this a problem in the Analyzer, or in the Tokenizer?


Re: Highlighter fails using JapaneseAnalyzer

Posted by mark harwood <ma...@yahoo.co.uk>.

>> How should I verify it?

Make sure the Token.startOffset and endOffset properties of Tokens produced by your TokenStream correctly define the location of Token.termBuffer in the original text.
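
One way to run that check, sketched here in plain Java under stated assumptions: take each token's term text with its start and end offsets (in Lucene 2.4 these would come from Token.startOffset() and Token.endOffset() while iterating your TokenStream) and compare the term with the substring of the original text at those offsets. OffsetCheck and its check method are hypothetical helpers, and the case-insensitive comparison assumes the analyzer does nothing to a term beyond lower-casing it.

```java
// Hypothetical helper; the token data would come from your own TokenStream.
final class OffsetCheck {

    // True if the characters at [start, end) in text spell the term,
    // ignoring case (assumes the analyzer only lower-cases terms).
    static boolean check(String text, String term, int start, int end) {
        if (start < 0 || end > text.length() || start > end) {
            return false; // offsets fall outside the text
        }
        return text.substring(start, end).equalsIgnoreCase(term);
    }

    public static void main(String[] args) {
        String contents = "AAA :BBB CCC";
        // term, start, end triples as reported elsewhere in this thread
        String[][] reported = { {"aaa", "0", "3"}, {"bbb", "0", "3"}, {"ccc", "4", "7"} };
        for (String[] t : reported) {
            int start = Integer.parseInt(t[1]);
            int end = Integer.parseInt(t[2]);
            System.out.println(t[0] + ":" + start + ":" + end + " -> "
                    + (check(contents, t[0], start, end) ? "ok" : "MISMATCH"));
        }
    }
}
```

Run against the reported values, only aaa:0:3 passes; bbb:0:3 and ccc:4:7 are flagged because those offsets point at "AAA" and ":BB" respectively.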




Re: Highlighter fails using JapaneseAnalyzer

Posted by "k.sayama" <sa...@nifty.com>.

Sorry,
I don't know how to verify the Token byte offsets produced by JapaneseAnalyzer.
How should I verify it?


Re: Highlighter fails using JapaneseAnalyzer

Posted by mark harwood <ma...@yahoo.co.uk>.

Can you verify the Token byte offsets produced by this particular analyzer are correct?




Re: Highlighter fails using JapaneseAnalyzer

Posted by "k.sayama" <sa...@nifty.com>.

hi

I tested it with SimpleAnalyzer, StandardAnalyzer, and CJKAnalyzer,
but the problem did not happen.

I think the problem is in JapaneseAnalyzer.
Can this problem be solved?
