Posted to java-user@lucene.apache.org by Matthew Hall <mh...@informatics.jax.org> on 2009/06/30 16:41:19 UTC
Re: Highligheter fails using JapaneseAnalyzer
Does the same thing happen when you use SimpleAnalyzer, or StandardAnalyzer?
I have a sneaking suspicion that the : in your contents string is what's
causing your issue here, as : is a reserved character that denotes a
field specification. But I could be wrong.
Try swapping analyzers: if you no longer have the same issue with
Simple, try Standard. Assuming the same problem shows up there, I think
you might need to do something about the :.
Matt
k.sayama wrote:
> hello.
>
> I've tried to highlight a string using Highlighter (2.4.1) and
> JapaneseAnalyzer,
> but the following code extract shows the problem:
>
> String F = "f";
> String CONTENTS = "AAA :BBB CCC";
> JapaneseAnalyzer analyzer = new JapaneseAnalyzer();
> QueryParser qp = new QueryParser( F, analyzer );
> Query query = qp.parse( "BBB" );
> Highlighter h = new Highlighter( new QueryScorer( query, F ) );
>
> System.out.println( h.getBestFragment( analyzer, F, CONTENTS ) );
>
> The system outputs
> <B>AAA</B> :BBB CCC
>
> When you change CONTENTS to "AAA _BBB CCC"
> the system outputs
>
> AAA _<B>BBB</B> CCC
>
> Are there any problems?
> Thanks in advance
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
--
Matthew Hall
Software Engineer
Mouse Genome Informatics
mhall@informatics.jax.org
(207) 288-6012
Re: Highligheter fails using JapaneseAnalyzer
Posted by Matthew Hall <mh...@informatics.jax.org>.
Out of curiosity, when you try your other test string "aaa _bbb ccc"
what do the token byte offsets show?
Matt
Re: Highligheter fails using JapaneseAnalyzer
Posted by "k.sayama" <sa...@nifty.com>.
Hi
The Tokenizer is not a standard Lucene class.
I edited the Tokenizer so that it reports startOffset and endOffset correctly.
It is working correctly now.
I want to verify more patterns.
Thanks
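
The edited Tokenizer was not posted to the list, so the following is only a rough sketch of the general idea: keep one running position over the whole input and never reset it when a delimiter such as ':' is skipped. The class and method names are hypothetical, and no Lucene classes are used. For what it is worth, the reported offsets (bbb:0:3, ccc:4:7) are what you would get if positions were counted from just after the ':'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: this is NOT the JapaneseAnalyzer tokenizer
// from the thread, and it does not use any Lucene classes.
class OffsetTrackingTokenizer {

    /** A term plus its [start, end) character offsets in the original text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + ":" + start + ":" + end;
        }
    }

    /** Splits on anything that is not a letter or digit, keeping absolute offsets. */
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0; // single running position; it is never reset
        while (i < text.length()) {
            // Skip delimiters such as ' ', ':' and '_' while still advancing i.
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                out.add(new Tok(term, start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints aaa:0:3, bbb:5:8 and ccc:9:12, one per line.
        for (Tok t : tokenize("AAA :BBB CCC")) {
            System.out.println(t);
        }
    }
}
```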
Re: Highligheter fails using JapaneseAnalyzer
Posted by Mark Harwood <ma...@yahoo.co.uk>.
On 1 Jul 2009, at 17:39, k.sayama wrote:
> I could verify Token byte offsets
>
> The system outputs
> aaa:0:3
> bbb:0:3
> ccc:4:7
>
That explains the highlighter behaviour. Clearly BBB is not at
position 0-3 in the String you supplied
>>> String CONTENTS = "AAA :BBB CCC";
Looks like the Tokenizer needs fixing. Is this yours or a standard
Lucene class? If the latter, raising a JIRA bug with a JUnit test
would be the best way to get things moving.
Cheers
Mark
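
To make the connection between the offsets and the output concrete, here is a minimal sketch of offset-based highlighting. It is not the Lucene Highlighter code; the class and method names are made up for illustration. It only shows that wrapping the characters at offsets 0-3 reproduces the output from the original report, while the true offsets of BBB give the expected result.

```java
// Hypothetical illustration only: a stand-in for what an offset-based
// highlighter does with the start and end offsets a token reports.
class OffsetHighlightDemo {

    /** Wraps text[start, end) in <B> tags. */
    static String highlight(String text, int start, int end) {
        return text.substring(0, start)
                + "<B>" + text.substring(start, end) + "</B>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String contents = "AAA :BBB CCC";

        // Offsets reported for "bbb" in the thread (0 and 3).
        System.out.println(highlight(contents, 0, 3));
        // prints: <B>AAA</B> :BBB CCC

        // Offsets where "BBB" really sits in the string (5 and 8).
        int start = contents.indexOf("BBB");
        System.out.println(highlight(contents, start, start + 3));
        // prints: AAA :<B>BBB</B> CCC
    }
}
```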
Re: Highligheter fails using JapaneseAnalyzer
Posted by "k.sayama" <sa...@nifty.com>.
I was able to check the Token byte offsets.
The system outputs
aaa:0:3
bbb:0:3
ccc:4:7
It looks like the offset is being re-initialized.
Is this a problem in the Analyzer, or in the Tokenizer?
Re: Highligheter fails using JapaneseAnalyzer
Posted by mark harwood <ma...@yahoo.co.uk>.
>>How should I verify it?
Make sure the Token.startOffset and endOffset properties of Tokens produced by your TokenStream correctly define the location of Token.termBuffer in the original text.
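
One way to run that check, sketched without any Lucene classes: collect each token's term text, start offset and end offset (for example by printing them from your TokenStream), then compare the term with the substring of the original text at those offsets. The class and method names below are hypothetical. The comparison ignores case and assumes the analyzer does not otherwise rewrite terms (no stemming or normalization).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: checks that each token's offsets
// point at that token's own text in the original string.
class OffsetCheck {

    /** Returns one description per token whose offsets do not match its term. */
    static List<String> findBadOffsets(String text, String[] terms, int[] starts, int[] ends) {
        List<String> bad = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            boolean inRange = starts[i] >= 0 && starts[i] <= ends[i] && ends[i] <= text.length();
            String slice = inRange ? text.substring(starts[i], ends[i]) : null;
            // Case-insensitive because analyzers commonly lowercase terms.
            if (slice == null || !slice.equalsIgnoreCase(terms[i])) {
                bad.add(terms[i] + ":" + starts[i] + ":" + ends[i] + " -> " + slice);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        String contents = "AAA :BBB CCC";
        // Term and offset values as reported later in this thread.
        String[] terms = {"aaa", "bbb", "ccc"};
        int[] starts = {0, 0, 4};
        int[] ends = {3, 3, 7};

        for (String problem : findBadOffsets(contents, terms, starts, ends)) {
            System.out.println(problem);
        }
        // prints:
        // bbb:0:3 -> AAA
        // ccc:4:7 -> :BB
    }
}
```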
Re: Highligheter fails using JapaneseAnalyzer
Posted by "k.sayama" <sa...@nifty.com>.
Sorry,
I cannot verify the Token byte offsets produced by JapaneseAnalyzer.
How should I verify them?
Re: Highligheter fails using JapaneseAnalyzer
Posted by mark harwood <ma...@yahoo.co.uk>.
Can you verify the Token byte offsets produced by this particular analyzer are correct?
Re: Highligheter fails using JapaneseAnalyzer
Posted by "k.sayama" <sa...@nifty.com>.
Hi
I tested with SimpleAnalyzer, StandardAnalyzer, and CJKAnalyzer,
and the problem did not happen with any of them.
I think the problem is in JapaneseAnalyzer.
Can this problem be solved?