Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/10/13 11:04:29 UTC

Highlighting content field problem when using JiebaTokenizerFactory

Hi,

I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
Solr. It works fine with the segmentation when I'm using
the Analysis function on the Solr Admin UI.

However, when I try to do highlighting in Solr, it does not highlight in the
correct place. For example, when I search for 自然环境与企业本身,
it highlights 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的

Even when I search for an English word like responsibility, it highlights
<em> *responsibilit<em>*y.

Basically, the highlighting goes off by 1 character/space consistently.

This problem only happens in the content field, and not in any other fields.
Does anyone know what could be causing the issue?

I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
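
For context, Lucene highlighters place the <em> tags using the start/end offsets
that the analysis chain reports for each token, so a consistent off-by-one usually
means the tokenizer's offsets are shifted relative to the original text. A small
diagnostic sketch (the class name and sample text are illustrative; the analyzer
passed in would be one configured like the content field, e.g. with the Jieba
tokenizer factory):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class OffsetDump {
    // Prints every token with its reported offsets and the slice of the
    // original text those offsets point at; with correct offsets the slice
    // should line up with the token (ignoring filters that change term text).
    public static void dump(Analyzer analyzer, String field, String text) throws Exception {
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString() + " [" + off.startOffset() + "," + off.endOffset()
                        + ") -> '" + text.substring(off.startOffset(), off.endOffset()) + "'");
            }
            ts.end();
        }
    }
}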


Regards,
Edwin

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Scott,

I've created a Jira issue for this: SOLR-8334.

Regards,
Edwin


On 24 November 2015 at 00:36, Scott Stults <sstults@opensourceconnections.com> wrote:

> Edwin,
>
> Congrats on getting it to work! Would you please create a Jira issue for
> this and add the patch? You won't need the inline change comments -- a good
> description in the ticket itself will work best.
>
> k/r,
> Scott
>
> On Sun, Nov 22, 2015 at 10:13 PM, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>
> > I've tried making some minor modifications to the code in
> > JiebaSegmenter.java, and the highlighting seems to be fine now.
> >
> > Basically, I created another int called offset2 in the process() method:
> > int offset2 = 0;
> >
> > Then I changed offset to offset2 in this part of the code in the
> > process() method.
> >
> >         if (sb.length() > 0)
> >             if (mode == SegMode.SEARCH) {
> >                 for (Word token : sentenceProcess(sb.toString())) {
> >                     // tokens.add(new SegToken(token, offset, offset += token.length()));
> >                     tokens.add(new SegToken(token, offset2, offset2 += token.length()));         // Change to offset2 by Edwin
> >                 }
> >             } else {
> >                 for (Word token : sentenceProcess(sb.toString())) {
> >                     if (token.length() > 2) {
> >                         Word gram2;
> >                         int j = 0;
> >                         for (; j < token.length() - 1; ++j) {
> >                             gram2 = token.subSequence(j, j + 2);
> >                             if (wordDict.containsWord(gram2.getToken()))
> >                                 // tokens.add(new SegToken(gram2, offset + j, offset + j + 2));
> >                                 tokens.add(new SegToken(gram2, offset2 + j, offset2 + j + 2));      // Change to offset2 by Edwin
> >                         }
> >                     }
> >                     if (token.length() > 3) {
> >                         Word gram3;
> >                         int j = 0;
> >                         for (; j < token.length() - 2; ++j) {
> >                             gram3 = token.subSequence(j, j + 3);
> >                             if (wordDict.containsWord(gram3.getToken()))
> >                                 // tokens.add(new SegToken(gram3, offset + j, offset + j + 3));
> >                                 tokens.add(new SegToken(gram3, offset2 + j, offset2 + j + 3));      // Change to offset2 by Edwin
> >                         }
> >                     }
> >                     // tokens.add(new SegToken(token, offset, offset += token.length()));
> >                     tokens.add(new SegToken(token, offset2, offset2 += token.length()));        // Change to offset2 by Edwin
> >                 }
> >             }
> >
> >
> > Not sure if this is just a workaround, or whether it can be used as a
> > permanent solution.
> >
> > Regards,
> > Edwin
> >
> >
> > On 28 October 2015 at 15:29, Zheng Lin Edwin Yeo <ed...@gmail.com>
> > wrote:
> >
> > > Hi Scott,
> > >
> > > I have tried to edit the SegToken.java file in the jieba-analysis-1.0.0
> > > package with a +1 added to both the startOffset and endOffset values (see
> > > code below), and now the <em> tag in the content field is shifted to the
> > > correct place. However, this means that in the title and other fields,
> > > where the <em> tag is originally at the correct place, they will get the
> > > "org.apache.lucene.search.highlight.InvalidTokenOffsetsException"
> > > exception. I have temporarily used another tokenizer for the other fields
> > > first.
> > >
> > >     public SegToken(Word word, int startOffset, int endOffset) {
> > >         this.word = word;
> > >         this.startOffset = startOffset+1;
> > >         this.endOffset = endOffset+1;
> > >     }
> > >
> > > However, I don't think this can be a permanent solution, so I'm trying to
> > > zoom in further into the code, to see what the difference is between the
> > > content field and the other fields.
> > >
> > > I have also found that although JiebaTokenizer works better for Chinese
> > > characters, it doesn't work well for English characters. For example, if I
> > > search for "water", the JiebaTokenizer will cut it as follows:
> > > w|at|er
> > > It can't cut it as a full word, which HMMChineseTokenizer is able to do.
> > >
> > > Here's my configuration in schema.xml:
> > >
> > > <fieldType name="text_chinese2" class="solr.TextField"
> > > positionIncrementGap="100">
> > >  <analyzer type="index">
> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >  segMode="SEARCH"/>
> > > <filter class="solr.CJKWidthFilterFactory"/>
> > > <filter class="solr.CJKBigramFilterFactory"/>
> > > <filter class="solr.StopFilterFactory"
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > > <filter class="solr.PorterStemFilterFactory"/>
> > > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > > maxGramSize="15"/>
> > >  </analyzer>
> > >  <analyzer type="query">
> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >  segMode="SEARCH"/>
> > > <filter class="solr.CJKWidthFilterFactory"/>
> > > <filter class="solr.CJKBigramFilterFactory"/>
> > > <filter class="solr.StopFilterFactory"
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > > <filter class="solr.PorterStemFilterFactory"/>
> > >           </analyzer>
> > >   </fieldType>
> > >
> > > Does anyone know if JiebaTokenizer is optimised to handle English
> > > characters as well?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 27 October 2015 at 15:57, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > > wrote:
> > >
> > >> Hi Scott,
> > >>
> > >> Thank you for providing the links and references. Will look through
> > them,
> > >> and let you know if I find any solutions or workaround.
> > >>
> > >> Regards,
> > >> Edwin
> > >>
> > >>
> > >> On 27 October 2015 at 11:13, Scott Chu <sc...@udngroup.com> wrote:
> > >>
> > >>>
> > >>> Take a look at Michael's two articles; they might help you clarify the
> > >>> idea of highlighting in Solr:
> > >>>
> > >>> Changing Bits: Lucene's TokenStreams are actually graphs!
> > >>>
> > >>>
> > >>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> > >>>
> > >>> Also take a look at the 4th paragraph in another of his articles:
> > >>>
> > >>> Changing Bits: A new Lucene highlighter is born
> > >>>
> > >>>
> > >>> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
> > >>>
> > >>> Currently, I can't figure out the possible cause of your problem unless
> > >>> I get spare time to test it on my own, which is not available these days
> > >>> (I've got some projects to close)!
> > >>>
> > >>> If you find the solution or workaround, pls. let us know. Good luck
> > >>> again!
> > >>>
> > >>> Scott Chu,scott.chu@udngroup.com
> > >>> 2015/10/27
> > >>>
> > >>> ----- Original Message -----
> > >>> *From: *Scott Chu <sc...@udngroup.com>
> > >>> *To: *solr-user <so...@lucene.apache.org>
> > >>> *Date: *2015-10-27, 10:27:45
> > >>> *Subject: *Re: Highlighting content field problem when using
> > >>> JiebaTokenizerFactory
> > >>>
> > >>> Hi Edwin,
> > >>>
> > >>>     It took me a lot of time to see if there's anything that can help
> > >>> you pin down the cause of your problem. Maybe this might help you a bit:
> > >>>
> > >>> [SOLR-4722] Highlighter which generates a list of query term
> > position(s)
> > >>> for each item in a list of documents, or returns null if highlighting
> > is
> > >>> disabled. - AS...
> > >>> https://issues.apache.org/jira/browse/SOLR-4722
> > >>>
> > >>> This one is modified from FastVectorHighLighter, so ensure those 3
> > term*
> > >>> attributes are on.
> > >>>
> > >>> Scott Chu,scott.chu@udngroup.com
> > >>> 2015/10/27
> > >>>
> > >>> ----- Original Message -----
> > >>> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> > >>> *To: *solr-user <so...@lucene.apache.org>
> > >>> *Date: *2015-10-23, 10:42:32
> > >>> *Subject: *Re: Highlighting content field problem when using
> > >>> JiebaTokenizerFactory
> > >>>
> > >>> Hi Scott,
> > >>>
> > >>> Thank you for your response.
> > >>>
> > >>> 1. You said the problem only happens on the "contents" field, so maybe
> > >>> there's something wrong with the contents of that field. Does it contain
> > >>> anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions
> > >>> something about HTML stripping causing highlight problems. Maybe you can
> > >>> try purifying that field to be close to pure text and see if the
> > >>> highlighting comes out OK.
> > >>> *A) I checked that SOLR-42 is about the
> > >>> HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
> > >>> tokenizer is already deprecated too. I've tried with all kinds of content
> > >>> for rich-text documents, and all of them have the same problem.*
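
As an aside on the HTML-stripping angle: in current Solr the usual way to drop
markup before tokenizing is a char filter rather than that deprecated tokenizer,
and char filters map offsets back to the original stored text so highlighting
stays aligned. A rough sketch of what that could look like on the field type from
this thread (illustrative only, not a confirmed fix for this particular problem):

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
  <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
</analyzer>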
> > >>>
> > >>> 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> > >>> highlighter. You could switch to other tokenizers, e.g. Standard, CJK,
> > >>> SmartChinese (I don't use this since I am dealing with Traditional
> > >>> Chinese, but I see you are dealing with Simplified Chinese), or the
> > >>> 3rd-party MMSeg, and see if the problem goes away. However, when I was
> > >>> googling similar problems, I saw you asked the same question in August at
> > >>> Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer but
> > >>> doesn't have your problem. So I see this as less of a suspect.
> > >>> *A) I was thinking about the incompatibility issue too, as I previously
> > >>> thought that JiebaTokenizer is optimised for Solr 4.x, so it may have
> > >>> issues in 5.x. But the person from Huaban/Jieba-analysis said that he
> > >>> doesn't have this problem in Solr 5.1. I also faced the same problem in
> > >>> Solr 5.1, and although I'm using Solr 5.3.0 now, the same problem
> > >>> persists.*
> > >>>
> > >>> I'm looking at the indexing process too, to see if there's any problem
> > >>> there. But I just can't figure out why it only happens with
> > >>> JiebaTokenizer, and only for the content field.
> > >>>
> > >>>
> > >>> Regards,
> > >>> Edwin
> > >>>
> > >>>
> > >>> On 23 October 2015 at 09:41, Scott Chu <scott.chu@udngroup.com> wrote:
> > >>>
> > >>> > Hi Edwin,
> > >>> >
> > >>> > Since you've tested all my suggestions and the problem is still
> > there,
> > >>> I
> > >>>
> > >>> > can't think of anything wrong with your configuration. Now I can
> only
> > >>> > suspect two things:
> > >>> >
> > >>> > 1. You said the problem only happens on the "contents" field, so maybe
> > >>> > there's something wrong with the contents of that field. Does it contain
> > >>> > anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions
> > >>> > something about HTML stripping causing highlight problems. Maybe you can
> > >>> > try purifying that field to be close to pure text and see if the
> > >>> > highlighting comes out OK.
> > >>> >
> > >>> > 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> > >>> > highlighter. You could switch to other tokenizers, e.g. Standard, CJK,
> > >>> > SmartChinese (I don't use this since I am dealing with Traditional
> > >>> > Chinese, but I see you are dealing with Simplified Chinese), or the
> > >>> > 3rd-party MMSeg, and see if the problem goes away. However, when I was
> > >>> > googling similar problems, I saw you asked the same question in August
> > >>> > at Huaban/Jieba-analysis, and somebody said he also uses JiebaTokenizer
> > >>> > but doesn't have your problem. So I see this as less of a suspect.
> > >>> >
> > >>> > The theory of your problem could be that something in the indexing
> > >>> > process causes wrong position info for that field, and when Solr does
> > >>> > the highlighting, it retrieves the wrong position info and marks the
> > >>> > wrong positions of the highlight target terms.
> > >>> >
> > >>> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
> > >>> > 2015/10/23
> > >>> >
> > >>> > ----- Original Message -----
> > >>> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >>> > *To: *solr-user <solr-user@lucene.apache.org>
> > >>> > *Date: *2015-10-22, 22:22:14
> > >>> > *Subject: *Re: Highlighting content field problem when using
> > >>> > JiebaTokenizerFactory
> > >>> >
> > >>> > Hi Scott,
> > >>> >
> > >>> > Thank you for your response and suggestions.
> > >>> >
> > >>> > In response to your questions, here are the answers:
> > >>> >
> > >>> > 1. I took a look at Jieba. It uses a dictionary and it seems to do a
> > >>> > good job on CJK. I suspect this problem may come from those filters
> > >>> > (note: I can understand you may use CJKWidthFilter to convert Japanese,
> > >>> > but I don't understand why you use CJKBigramFilter and
> > >>> > EdgeNGramFilter). Have you tried commenting out those filters, say
> > >>> > leaving only Jieba and StopFilter, and seeing if this problem
> > >>> > disappears?
> > >>> > *A) Yes, I have tried commenting out the other filters and leaving
> > >>> > only Jieba and StopFilter. The problem is still there.*
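
For anyone repeating that experiment, a stripped-down field type with only Jieba
and StopFilter might look like the sketch below; the type name is made up for the
test, and the tokenizer/filter settings are the ones already used in this thread:

<fieldType name="text_chinese_minimal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
  </analyzer>
</fieldType>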
> > >>> >
> > >>> > 2. Does this problem occur only on Chinese search words? Does it
> > >>> > happen on English search words?
> > >>> > *A) Yes, the same problem occurs on English words. For example,
> when
> > I
> > >>> > search for "word", it will highlight in this way: <em> wor<em>d*
> > >>> >
> > >>> > 3. To use FastVectorHighlighter, you seem to have to enable 3 term*
> > >>> > parameters in the field declaration? I see only one is enabled. Please
> > >>> > refer to the answer in this stackoverflow question:
> > >>> > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > >>> > *A) I have tried enabling all 3 term* attributes for the
> > >>> > FastVectorHighlighter too, but the same problem persists as well.*
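
For reference, the three term* attributes being discussed are termVectors,
termPositions and termOffsets on the field itself; FastVectorHighlighter needs
all three. A minimal sketch of such a field declaration, reusing the content
field already shown in this thread:

<field name="content" type="text_chinese" indexed="true" stored="true"
       omitNorms="true" termVectors="true" termPositions="true" termOffsets="true"/>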
> > >>> >
> > >>> >
> > >>> > Regards,
> > >>> > Edwin
> > >>> >
> > >>> >
> > >>> > On 22 October 2015 at 16:25, Scott Chu <scott.chu@udngroup.com> wrote:
> > >>> >
> > >>> > > Hi solr-user,
> > >>> > >
> > >>> > > Can't judge the cause on fast glimpse of your definition but some
> > >>> > > suggestions I can give:
> > >>> > >
> > >>> > > 1. I took a look at Jieba. It uses a dictionary and it seems to do a
> > >>> > > good job on CJK. I suspect this problem may come from those filters
> > >>> > > (note: I can understand you may use CJKWidthFilter to convert
> > >>> > > Japanese, but I don't understand why you use CJKBigramFilter and
> > >>> > > EdgeNGramFilter). Have you tried commenting out those filters, say
> > >>> > > leaving only Jieba and StopFilter, and seeing if this problem
> > >>> > > disappears?
> > >>> > >
> > >>> > > 2. Does this problem occur only on Chinese search words? Does it
> > >>> > > happen on English search words?
> > >>> > >
> > >>> > > 3. To use FastVectorHighlighter, you seem to have to enable 3 term*
> > >>> > > parameters in the field declaration? I see only one is enabled. Please
> > >>> > > refer to the answer in this stackoverflow question:
> > >>> > > http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > >>> > >
> > >>> > >
> > >>> > > Scott Chu, scott.chu@udngroup.com
> > >>> > > 2015/10/22
> > >>> > >
> > >>> > > ----- Original Message -----
> > >>> > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >>> > > *To: *solr-user <solr-user@lucene.apache.org>
> > >>> > > *Date: *2015-10-20, 12:04:11
> > >>> > > *Subject: *Re: Highlighting content field problem when using
> > >>> >
> > >>> > > JiebaTokenizerFactory
> > >>> > >
> > >>> > > Hi Scott,
> > >>> > >
> > >>> > > Here's my schema.xml for content and title, which uses text_chinese.
> > >>> > > The problem only occurs in content, and not in title.
> > >>> > >
> > >>> > > <field name="content" type="text_chinese" indexed="true" stored="true"
> > >>> > > omitNorms="true" termVectors="true"/>
> > >>> > > <field name="title" type="text_chinese" indexed="true" stored="true"
> > >>> > > omitNorms="true" termVectors="true"/>
> > >>> > >
> > >>> > >
> > >>> > > <fieldType name="text_chinese" class="solr.TextField"
> > >>> > > positionIncrementGap="100">
> > >>> > > <analyzer type="index">
> > >>> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >>> > > segMode="SEARCH"/>
> > >>> > > <filter class="solr.CJKWidthFilterFactory"/>
> > >>> > > <filter class="solr.CJKBigramFilterFactory"/>
> > >>> > > <filter class="solr.StopFilterFactory"
> > >>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >>> > > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > >>> > > maxGramSize="15"/>
> > >>> > > <filter class="solr.PorterStemFilterFactory"/>
> > >>> > > </analyzer>
> > >>> > > <analyzer type="query">
> > >>> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > >>> > > segMode="SEARCH"/>
> > >>> > > <filter class="solr.CJKWidthFilterFactory"/>
> > >>> > > <filter class="solr.CJKBigramFilterFactory"/>
> > >>> > > <filter class="solr.StopFilterFactory"
> > >>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > >>> > > <filter class="solr.PorterStemFilterFactory"/>
> > >>> > > </analyzer>
> > >>> > > </fieldType>
> > >>> > >
> > >>> > >
> > >>> > > Here's my solrconfig.xml on the highlighting portion:
> > >>> > >
> > >>> > > <requestHandler name="/highlight" class="solr.SearchHandler">
> > >>> > > <lst name="defaults">
> > >>> > > <str name="echoParams">explicit</str>
> > >>> > > <int name="rows">10</int>
> > >>> > > <str name="wt">json</str>
> > >>> > > <str name="indent">true</str>
> > >>> > > <str name="df">text</str>
> > >>> > > <str name="fl">id, title, content_type, last_modified, url, score</str>
> > >>> > >
> > >>> > > <str name="hl">on</str>
> > >>> > > <str name="hl.fl">id, title, content, author, tag</str>
> > >>> > > <str name="hl.highlightMultiTerm">true</str>
> > >>> > > <str name="hl.preserveMulti">true</str>
> > >>> > > <str name="hl.encoder">html</str>
> > >>> > > <str name="hl.fragsize">200</str>
> > >>> > > <str name="group">true</str>
> > >>> > > <str name="group.field">signature</str>
> > >>> > > <str name="group.main">true</str>
> > >>> > > <str name="group.cache.percent">100</str>
> > >>> > > </lst>
> > >>> > > </requestHandler>
> > >>> > >
> > >>> > > <boundaryScanner name="breakIterator"
> > >>> > > class="solr.highlight.BreakIteratorBoundaryScanner">
> > >>> > > <lst name="defaults">
> > >>> > > <str name="hl.bs.type">WORD</str>
> > >>> > > <str name="hl.bs.language">en</str>
> > >>> > > <str name="hl.bs.country">SG</str>
> > >>> > > </lst>
> > >>> > > </boundaryScanner>
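
If the goal is to route these requests through FastVectorHighlighter, one more
default would normally be added to the handler above; a hedged sketch, assuming
the term* attributes are enabled on the highlighted fields:

<str name="hl.useFastVectorHighlighter">true</str>   <!-- inside the <lst name="defaults"> block -->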
> > >>> > >
> > >>> > >
> > >>> > > Meanwhile, I'll take a look at the articles too.
> > >>> > >
> > >>> > > Thank you.
> > >>> > >
> > >>> > > Regards,
> > >>> > > Edwin
> > >>> > >
> > >>> > >
> > >>> > > On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com> wrote:
> > >>> > >
> > >>> > > > Hi Edwin,
> > >>> > > >
> > >>> > > > I don't use Jieba on Chinese (I use only CJK, very fundamental, I
> > >>> > > > know), so I haven't experienced this problem.
> > >>> > > >
> > >>> > > > I'd suggest you post your schema.xml so we can see how you define
> > >>> > > > your content field and the field type it uses.
> > >>> > > >
> > >>> > > > In the meantime, refer to these articles; maybe the answer or a
> > >>> > > > workaround can be deduced from them.
> > >>> > > >
> > >>> > > > https://issues.apache.org/jira/browse/SOLR-3390
> > >>> > > >
> > >>> > > >
> > >>> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> > >>>
> > >>> > > >
> > >>> > > >
> > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > >>> > > >
> > >>> > > > Good luck!
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > Scott Chu, scott.chu@udngroup.com
> > >>> > > > 2015/10/20
> > >>> > > >
> > >>> > > > ----- Original Message -----
> > >>> > > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > >>> > > > *To: *solr-user <solr-user@lucene.apache.org>
> > >>> > > > *Date: *2015-10-13, 17:04:29
> > >>> > > > *Subject: *Highlighting content field problem when using
> > >>> > > > JiebaTokenizerFactory
> > >>> > > >
> > >>> > > > Hi,
> > >>> > > >
> > >>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
> > >>> characters
> > >>> > > in
> > >>> > > >
> > >>> > > > Solr. It works fine with the segmentation when I'm using
> > >>> > > > the Analysis function on the Solr Admin UI.
> > >>> > > >
> > >>> > > > However, when I try to do highlighting in Solr, it does not
> > >>> > > > highlight in the correct place. For example, when I search for
> > >>> > > > 自然環境与企業本身,
> > >>> > > > it highlights 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > >>> > > >
> > >>> > > > Even when I search for an English word like responsibility, it
> > >>> > > > highlights <em> *responsibilit<em>*y.
> > >>> > > >
> > >>> > > > Basically, the highlighting goes off by 1 character/space
> > >>> > > > consistently.
> > >>> > > >
> > >>> > > > This problem only happens in the content field, and not in any
> > >>> > > > other fields.
> > >>> > > >
> > >>> > > > Does anyone know what could be causing the issue?
> > >>> > > >
> > >>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > >>> > > >
> > >>> > > >
> > >>> > > > Regards,
> > >>> > > > Edwin
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Scott Stults <ss...@opensourceconnections.com>.
Edwin,

Congrats on getting it to work! Would you please create a Jira issue for
this and add the patch? You won't need the inline change comments -- a good
description in the ticket itself will work best.

k/r,
Scott


-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
I've tried making some minor modifications to the code in
JiebaSegmenter.java, and the highlighting seems to be fine now.

Basically, I created another int called offset2 in the process() method:
int offset2 = 0;

Then I changed offset to offset2 in this part of the code in the
process() method.

        if (sb.length() > 0)
            if (mode == SegMode.SEARCH) {
                for (Word token : sentenceProcess(sb.toString())) {
                    // tokens.add(new SegToken(token, offset, offset += token.length()));
                    tokens.add(new SegToken(token, offset2, offset2 += token.length()));         // Change to offset2 by Edwin
                }
            } else {
                for (Word token : sentenceProcess(sb.toString())) {
                    if (token.length() > 2) {
                        Word gram2;
                        int j = 0;
                        for (; j < token.length() - 1; ++j) {
                            gram2 = token.subSequence(j, j + 2);
                            if (wordDict.containsWord(gram2.getToken()))
                                // tokens.add(new SegToken(gram2, offset + j, offset + j + 2));
                                tokens.add(new SegToken(gram2, offset2 + j, offset2 + j + 2));      // Change to offset2 by Edwin
                        }
                    }
                    if (token.length() > 3) {
                        Word gram3;
                        int j = 0;
                        for (; j < token.length() - 2; ++j) {
                            gram3 = token.subSequence(j, j + 3);
                            if (wordDict.containsWord(gram3.getToken()))
                                // tokens.add(new SegToken(gram3, offset + j, offset + j + 3));
                                tokens.add(new SegToken(gram3, offset2 + j, offset2 + j + 3));      // Change to offset2 by Edwin
                        }
                    }
                    // tokens.add(new SegToken(token, offset, offset += token.length()));
                    tokens.add(new SegToken(token, offset2, offset2 += token.length()));        // Change to offset2 by Edwin
                }
            }


Not sure if this is just a workaround, or whether it can be used as a
permanent solution.
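
One way to sanity-check the patched offsets, independent of Solr, is to run the
segmenter directly and compare each SegToken's offsets against the input string.
A minimal sketch, assuming the huaban jieba-analysis package layout and the
SegToken fields shown above (names may differ slightly in other versions):

import java.util.List;

import com.huaban.analysis.jieba.JiebaSegmenter;
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode;
import com.huaban.analysis.jieba.SegToken;

public class JiebaOffsetCheck {
    public static void main(String[] args) {
        String sentence = "自然环境与企业本身";  // sample text from the original question
        JiebaSegmenter segmenter = new JiebaSegmenter();
        List<SegToken> tokens = segmenter.process(sentence, SegMode.SEARCH);
        for (SegToken t : tokens) {
            String slice = sentence.substring(t.startOffset, t.endOffset);
            String word = t.word.getToken();
            // With correct offsets, the slice of the original sentence equals the
            // token text, so the highlighter lands on the right characters.
            System.out.println(word + " [" + t.startOffset + "," + t.endOffset + ") -> " + slice
                    + (slice.equals(word) ? "" : "   <-- offset mismatch"));
        }
    }
}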

Regards,
Edwin


>>> >
>>> >
>>> > Regards,
>>> > Edwin
>>> >
>>> >
>>> > On 22 October 2015 at 16:25, Scott Chu <scott.chu@udngroup.com
>>> <+s...@udngroup.com>
>>> > <+scott.chu@udngroup.com <+s...@udngroup.com>>> wrote:
>>> >
>>> > > Hi solr-user,
>>> > >
>>> > > Can't judge the cause on fast glimpse of your definition but some
>>> > > suggestions I can give:
>>> > >
>>> > > 1. I take a look at Jieba. It uses a dictionary and it seems to do a
>>> good
>>> > > job on CJK. I doubt this problem may be from those filters (note: I
>>> can
>>> > > understand you may use CJKWidthFilter to convert Japanese but doesn't
>>> > > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
>>> > tried
>>> > > commenting out those filters, say leave only Jieba and StopFilter,
>>> and
>>>
>>> > see
>>> > > if this problem disppears?
>>> > >
>>> > > 2.Does this problem occur only on Chinese search words? Does it
>>> happen on
>>> > > English search words?
>>> > >
>>> > > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
>>> > > parameters in field declaration? I see only one is enabled. Please
>>> refer
>>> > to
>>> > > the answer in this stackoverflow question:
>>> > >
>>> >
>>> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>>> > >
>>> > >
>>> > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
>>> scott.chu@udngroup.com <+s...@udngroup.com>>
>>> > > 2015/10/22
>>> > >
>>> > > ----- Original Message -----
>>> > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
>>> <+e...@gmail.com>
>>> > <+edwinyeozl@gmail.com <+e...@gmail.com>>>
>>> > > *To: *solr-user <solr-user@lucene.apache.org
>>> <+s...@lucene.apache.org>
>>> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>>
>>> > > *Date: *2015-10-20, 12:04:11
>>> > > *Subject: *Re: Highlighting content field problem when using
>>> >
>>> > > JiebaTokenizerFactory
>>> > >
>>> > > Hi Scott,
>>> > >
>>> > > Here's my schema.xml for content and title, which uses text_chinese.
>>> The
>>> >
>>> > > problem only occurs in content, and not in title.
>>> > >
>>> > > <field name="content" type="text_chinese" indexed="true"
>>> stored="true"
>>> > > omitNorms="true" termVectors="true"/>
>>> > > <field name="title" type="text_chinese" indexed="true" stored="true"
>>> > > omitNorms="true" termVectors="true"/>
>>> > >
>>> > >
>>> > > <fieldType name="text_chinese" class="solr.TextField"
>>> > > positionIncrementGap="100">
>>> > > <analyzer type="index">
>>> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>>> > > segMode="SEARCH"/>
>>> > > <filter class="solr.CJKWidthFilterFactory"/>
>>> > > <filter class="solr.CJKBigramFilterFactory"/>
>>> > > <filter class="solr.StopFilterFactory"
>>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>>> > > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>>> > > maxGramSize="15"/>
>>> > > <filter class="solr.PorterStemFilterFactory"/>
>>> > > </analyzer>
>>> > > <analyzer type="query">
>>> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>>> > > segMode="SEARCH"/>
>>> > > <filter class="solr.CJKWidthFilterFactory"/>
>>> > > <filter class="solr.CJKBigramFilterFactory"/>
>>> > > <filter class="solr.StopFilterFactory"
>>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>>> > > <filter class="solr.PorterStemFilterFactory"/>
>>> > > </analyzer>
>>> > > </fieldType>
>>> > >
>>> > >
>>> > > Here's my solrconfig.xml on the highlighting portion:
>>> > >
>>> > > <requestHandler name="/highlight" class="solr.SearchHandler">
>>> > > <lst name="defaults">
>>> > > <str name="echoParams">explicit</str>
>>> > > <int name="rows">10</int>
>>> > > <str name="wt">json</str>
>>> > > <str name="indent">true</str>
>>> > > <str name="df">text</str>
>>> > > <str name="fl">id, title, content_type, last_modified, url, score
>>> </str>
>>> > >
>>> > > <str name="hl">on</str>
>>> > > <str name="hl.fl">id, title, content, author, tag</str>
>>> > > <str name="hl.highlightMultiTerm">true</str>
>>> > > <str name="hl.preserveMulti">true</str>
>>> > > <str name="hl.encoder">html</str>
>>> > > <str name="hl.fragsize">200</str>
>>> > > <str name="group">true</str>
>>> > > <str name="group.field">signature</str>
>>> > > <str name="group.main">true</str>
>>> > > <str name="group.cache.percent">100</str>
>>> > > </lst>
>>> > > </requestHandler>
>>> > >
>>> > > <boundaryScanner name="breakIterator"
>>> > > class="solr.highlight.BreakIteratorBoundaryScanner">
>>> > > <lst name="defaults">
>>> > > <str name="hl.bs.type">WORD</str>
>>> > > <str name="hl.bs.language">en</str>
>>> > > <str name="hl.bs.country">SG</str>
>>> > > </lst>
>>> > > </boundaryScanner>
>>> > >
>>> > >
>>> > > Meanwhile, I'll take a look at the articles too.
>>> > >
>>> > > Thank you.
>>> > >
>>> > > Regards,
>>> > > Edwin
>>> > >
>>> > >
>>> > > On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
>>> <+s...@udngroup.com>
>>> > <+scott.chu@udngroup.com <+s...@udngroup.com>>
>>> > > <+scott.chu@udngroup.com <+s...@udngroup.com> <+
>>> scott.chu@udngroup.com <+s...@udngroup.com>>>> wrote:
>>> > >
>>> > > > Hi Edwin,
>>> > > >
>>> > > > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
>>> > > > know) so I didn't experience this problem.
>>> > > >
>>> > > > I'd suggest you post your schema.xml so we can see how you define
>>> your
>>> >
>>> > > > content field and the field type it uses?
>>> > > >
>>> > > > In the mean time, refer to these articles, maybe the answer or
>>> > workaround
>>> > > > can be deducted from them.
>>> > > >
>>> > > > https://issues.apache.org/jira/browse/SOLR-3390
>>> > > >
>>> > > >
>>> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>>>
>>> > > >
>>> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>>> > > >
>>> > > > Good luck!
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
>>> scott.chu@udngroup.com <+s...@udngroup.com>> <+
>>> > scott.chu@udngroup.com <+s...@udngroup.com> <+
>>> scott.chu@udngroup.com <+s...@udngroup.com>>>
>>> > > > 2015/10/20
>>> > > >
>>> > > > ----- Original Message -----
>>> > > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
>>> <+e...@gmail.com>
>>> > <+edwinyeozl@gmail.com <+e...@gmail.com>>
>>> > > <+edwinyeozl@gmail.com <+e...@gmail.com> <+
>>> edwinyeozl@gmail.com <+e...@gmail.com>>>>
>>> > > > *To: *solr-user <solr-user@lucene.apache.org
>>> <+s...@lucene.apache.org>
>>> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>
>>> > > <+solr-user@lucene.apache.org <+s...@lucene.apache.org> <+
>>> solr-user@lucene.apache.org <+s...@lucene.apache.org>>>>
>>> >
>>> > > > *Date: *2015-10-13, 17:04:29
>>> > > > *Subject: *Highlighting content field problem when using
>>> > > > JiebaTokenizerFactory
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
>>> characters
>>> > > in
>>> > > >
>>> > > > Solr. It works fine with the segmentation when I'm using
>>> > > > the Analysis function on the Solr Admin UI.
>>> > > >
>>> > > > However, when I tried to do the highlighting in Solr, it is not
>>> > > > highlighting in the correct place. For example, when I search of
>>> > > 自然環境与企業本身,
>>> > > > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
>>> > > >
>>> > > > Even when I search for English character like responsibility, it
>>> > > highlight
>>> > > > <em> *responsibilit<em>*y.
>>> > > >
>>> > > > Basically, the highlighting goes off by 1 character/space
>>> consistently.
>>> > > >
>>> > > > This problem only happens in content field, and not in any other
>>> > fields.
>>> > >
>>> > > > Does anyone knows what could be causing the issue?
>>> > > >
>>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>>> > > >
>>> > > >
>>> > > > Regards,
>>> > > > Edwin
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Scott,

I have tried editing the SegToken.java file in the jieba-analysis-1.0.0
package to add +1 to both the startOffset and endOffset values (see code
below), and now the <em> tag is shifted to the correct place in the
content field. However, this means that the title and other fields, where
the <em> tag was originally in the correct place, now throw an
"org.apache.lucene.search.highlight.InvalidTokenOffsetsException". For
now I have temporarily switched those other fields to another tokenizer.

    public SegToken(Word word, int startOffset, int endOffset) {
        this.word = word;
        this.startOffset = startOffset+1;
        this.endOffset = endOffset+1;
    }

However, I don't think this can be a permanent solution, so I'm digging
further into the code to see what the difference is between the content
field and the other fields.
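
To narrow it down, I'm dumping the raw offsets straight off the tokenizer
with a small helper (just a sketch using the standard Lucene 5.x
TokenStream API, nothing Jieba-specific), comparing each token against the
substring its offsets actually point at in the original string:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class OffsetCheck {

        // Prints each token next to the substring its offsets cover in
        // the original text. If the offsets are off by one, the two
        // columns won't match.
        static void dumpOffsets(Tokenizer tokenizer, String text) throws Exception {
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.printf("%-15s [%d,%d) -> \"%s\"%n",
                        term.toString(),
                        offset.startOffset(), offset.endOffset(),
                        text.substring(offset.startOffset(), offset.endOffset()));
            }
            tokenizer.end();
            tokenizer.close();
        }
    }

I plan to run it once with the tokenizer created by the Jieba factory and
once with HMMChineseTokenizer on the same text (how exactly to construct
the Jieba tokenizer outside Solr is an assumption on my part), to see
whether the shift is already there before any filters run.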

I have also found that although JiebaTokenizer works better for Chinese
characters, it doesn't work well for English words. For example, if I
search for "water", JiebaTokenizer will cut it as follows:
w|at|er
It can't keep it as a full word, which HMMChineseTokenizer is able to do.

Here's my configuration in schema.xml:

<fieldType name="text_chinese2" class="solr.TextField"
positionIncrementGap="100">
 <analyzer type="index">
<tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
 segMode="SEARCH"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="15"/>
 </analyzer>
 <analyzer type="query">
<tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
 segMode="SEARCH"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
          </analyzer>
  </fieldType>

Does anyone know whether JiebaTokenizer is optimised to handle English
words as well?
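
If it isn't, one workaround I'm considering (just a sketch, the field and
type names here are made up and I haven't tested it yet) is to copy the
content into a separate English-only field and query/highlight that field
for English terms:

<field name="content_en" type="text_en_std" indexed="true" stored="false"
       termVectors="true"/>
<copyField source="content" dest="content_en"/>

<fieldType name="text_en_std" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>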

Regards,
Edwin


On 27 October 2015 at 15:57, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi Scott,
>
> Thank you for providing the links and references. Will look through them,
> and let you know if I find any solutions or workaround.
>
> Regards,
> Edwin
>
>
> On 27 October 2015 at 11:13, Scott Chu <sc...@udngroup.com> wrote:
>
>>
>> Take a look at Michael's 2 articles, they might help you calrify the idea
>> of highlighting in Solr:
>>
>> Changing Bits: Lucene's TokenStreams are actually graphs!
>>
>> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>>
>> Also take a look at 4th paragraph In his another article:
>>
>> Changing Bits: A new Lucene highlighter is born
>>
>> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>>
>> Currently, I can't figure out the possible cause of your problem unless I
>> got spare time to test it on my own, which is not available these days (Got
>> some projects to close)!
>>
>> If you find the solution or workaround, pls. let us know. Good luck again!
>>
>> Scott Chu,scott.chu@udngroup.com
>> 2015/10/27
>>
>> ----- Original Message -----
>> *From: *Scott Chu <sc...@udngroup.com>
>> *To: *solr-user <so...@lucene.apache.org>
>> *Date: *2015-10-27, 10:27:45
>> *Subject: *Re: Highlighting content field problem when using
>> JiebaTokenizerFactory
>>
>> Hi Edward,
>>
>>     Took a lot of time to see if there's anything can help you to define
>> the cause of your problem. Maybe this might help you a bit:
>>
>> [SOLR-4722] Highlighter which generates a list of query term position(s)
>> for each item in a list of documents, or returns null if highlighting is
>> disabled. - AS...
>> https://issues.apache.org/jira/browse/SOLR-4722
>>
>> This one is modified from FastVectorHighLighter, so ensure those 3 term*
>> attributes are on.
>>
>> Scott Chu,scott.chu@udngroup.com
>> 2015/10/27
>>
>> ----- Original Message -----
>> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
>> *To: *solr-user <so...@lucene.apache.org>
>> *Date: *2015-10-23, 10:42:32
>> *Subject: *Re: Highlighting content field problem when using
>> JiebaTokenizerFactory
>>
>> Hi Scott,
>>
>> Thank you for your respond.
>>
>> 1. You said the problem only happens on "contents" field, so maybe
>> there're
>> something wrong with the contents of that field. Doe it contain any
>> special
>> thing in them, e.g. HTML tags or symbols. I recall SOLR-42 mentions
>> something about HTML stripping will cause highlight problem. Maybe you can
>>
>> try purify that fields to be closed to pure text and see if highlight
>> comes
>> ok.
>> *A) I check that the SOLR-42 is mentioning about the
>> HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
>> tokenizer is already deprecated too. I've tried with all kinds of content
>> for rich-text documents, and all of them have the same problem.*
>>
>> 2. Maybe something imcompatible between JiebaTokenizer and Solr
>> highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
>> SmartChinese (I don't use this since I am dealing with Traditional Chinese
>>
>> but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and
>>
>> see if the problem goes away. However when I'm googling similar problem, I
>>
>> saw you asked same question on August at Huaban/Jieba-analysis and
>> somebody
>> said he also uses JiebaTokenizer but he doesn't have your problem. So I
>> see
>> this could be less suspect.
>> *A) I was thinking about the incompatible issue too, as I previously
>> thought that JiebaTokenizer is optimised for Solr 4.x, so it may have
>> issue
>> in 5.x. But the person from Hunban/Jieba-analysis said that he doesn't
>> have
>> this problem in Solr 5.1. I also face the same problem in Solr 5.1, and
>> although I'm using Solr 5.3.0 now, the same problem persist. *
>>
>> I'm looking at the indexing process too, to see if there's any problem
>> there. But just can't figure out why it only happen to JiebaTokenizer, and
>>
>> it only happen for content field.
>>
>>
>> Regards,
>> Edwin
>>
>>
>> On 23 October 2015 at 09:41, Scott Chu <scott.chu@udngroup.com
>> <+s...@udngroup.com>> wrote:
>>
>> > Hi Edwin,
>> >
>> > Since you've tested all my suggestions and the problem is still there, I
>>
>> > can't think of anything wrong with your configuration. Now I can only
>> > suspect two things:
>> >
>> > 1. You said the problem only happens on "contents" field, so maybe
>> > there're something wrong with the contents of that field. Doe it contain
>>
>> > any special thing in them, e.g. HTML tags or symbols. I recall SOLR-42
>> > mentions something about HTML stripping will cause highlight problem.
>> Maybe
>> > you can try purify that fields to be closed to pure text and see if
>> > highlight comes ok.
>> >
>> > 2. Maybe something imcompatible between JiebaTokenizer and Solr
>> > highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
>> > SmartChinese (I don't use this since I am dealing with Traditional
>> Chinese
>> > but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg
>> and
>> > see if the problem goes away. However when I'm googling similar
>> problem, I
>> > saw you asked same question on August at Huaban/Jieba-analysis and
>> somebody
>> > said he also uses JiebaTokenizer but he doesn't have your problem. So I
>> see
>> > this could be less suspect.
>> >
>> > The theory of your problem could be something in indexing process causes
>>
>> > wrong position info. for that field and when Solr do highlighting, it
>> > retrieves wrong position info. and mark wrong position of highlight
>> target
>> > terms.
>> >
>> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
>> > 2015/10/23
>> >
>> > ----- Original Message -----
>> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
>> <+e...@gmail.com>>
>> > *To: *solr-user <solr-user@lucene.apache.org
>> <+s...@lucene.apache.org>>
>> > *Date: *2015-10-22, 22:22:14
>> > *Subject: *Re: Highlighting content field problem when using
>> > JiebaTokenizerFactory
>> >
>> > Hi Scott,
>> >
>> > Thank you for your response and suggestions.
>> >
>> > With respond to your questions, here are the answers:
>> >
>> > 1. I take a look at Jieba. It uses a dictionary and it seems to do a
>> good
>> > job on CJK. I doubt this problem may be from those filters (note: I can
>> > understand you may use CJKWidthFilter to convert Japanese but doesn't
>> > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
>> tried
>> > commenting out those filters, say leave only Jieba and StopFilter, and
>> see
>> >
>> > if this problem disppears?
>> > *A) Yes, I have tried commenting out the other filters and only left
>> with
>> > Jieba and StopFilter. The problem is still there.*
>> >
>> > 2.Does this problem occur only on Chinese search words? Does it happen
>> on
>> > English search words?
>> > *A) Yes, the same problem occurs on English words. For example, when I
>> > search for "word", it will highlight in this way: <em> wor<em>d*
>> >
>> > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
>> > parameters in field declaration? I see only one is enabled. Please
>> refer to
>> > the answer in this stackoverflow question:
>> >
>> >
>> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>> > *A) I have tried to enable all 3 terms in the FastVectorHighlighter too,
>>
>> > but the same problem persists as well.*
>> >
>> >
>> > Regards,
>> > Edwin
>> >
>> >
>> > On 22 October 2015 at 16:25, Scott Chu <scott.chu@udngroup.com
>> <+s...@udngroup.com>
>> > <+scott.chu@udngroup.com <+s...@udngroup.com>>> wrote:
>> >
>> > > Hi solr-user,
>> > >
>> > > Can't judge the cause on fast glimpse of your definition but some
>> > > suggestions I can give:
>> > >
>> > > 1. I take a look at Jieba. It uses a dictionary and it seems to do a
>> good
>> > > job on CJK. I doubt this problem may be from those filters (note: I
>> can
>> > > understand you may use CJKWidthFilter to convert Japanese but doesn't
>> > > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
>> > tried
>> > > commenting out those filters, say leave only Jieba and StopFilter, and
>>
>> > see
>> > > if this problem disppears?
>> > >
>> > > 2.Does this problem occur only on Chinese search words? Does it
>> happen on
>> > > English search words?
>> > >
>> > > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
>> > > parameters in field declaration? I see only one is enabled. Please
>> refer
>> > to
>> > > the answer in this stackoverflow question:
>> > >
>> >
>> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>> > >
>> > >
>> > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
>> scott.chu@udngroup.com <+s...@udngroup.com>>
>> > > 2015/10/22
>> > >
>> > > ----- Original Message -----
>> > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
>> <+e...@gmail.com>
>> > <+edwinyeozl@gmail.com <+e...@gmail.com>>>
>> > > *To: *solr-user <solr-user@lucene.apache.org
>> <+s...@lucene.apache.org>
>> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>>
>> > > *Date: *2015-10-20, 12:04:11
>> > > *Subject: *Re: Highlighting content field problem when using
>> >
>> > > JiebaTokenizerFactory
>> > >
>> > > Hi Scott,
>> > >
>> > > Here's my schema.xml for content and title, which uses text_chinese.
>> The
>> >
>> > > problem only occurs in content, and not in title.
>> > >
>> > > <field name="content" type="text_chinese" indexed="true" stored="true"
>> > > omitNorms="true" termVectors="true"/>
>> > > <field name="title" type="text_chinese" indexed="true" stored="true"
>> > > omitNorms="true" termVectors="true"/>
>> > >
>> > >
>> > > <fieldType name="text_chinese" class="solr.TextField"
>> > > positionIncrementGap="100">
>> > > <analyzer type="index">
>> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>> > > segMode="SEARCH"/>
>> > > <filter class="solr.CJKWidthFilterFactory"/>
>> > > <filter class="solr.CJKBigramFilterFactory"/>
>> > > <filter class="solr.StopFilterFactory"
>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>> > > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>> > > maxGramSize="15"/>
>> > > <filter class="solr.PorterStemFilterFactory"/>
>> > > </analyzer>
>> > > <analyzer type="query">
>> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>> > > segMode="SEARCH"/>
>> > > <filter class="solr.CJKWidthFilterFactory"/>
>> > > <filter class="solr.CJKBigramFilterFactory"/>
>> > > <filter class="solr.StopFilterFactory"
>> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
>> > > <filter class="solr.PorterStemFilterFactory"/>
>> > > </analyzer>
>> > > </fieldType>
>> > >
>> > >
>> > > Here's my solrconfig.xml on the highlighting portion:
>> > >
>> > > <requestHandler name="/highlight" class="solr.SearchHandler">
>> > > <lst name="defaults">
>> > > <str name="echoParams">explicit</str>
>> > > <int name="rows">10</int>
>> > > <str name="wt">json</str>
>> > > <str name="indent">true</str>
>> > > <str name="df">text</str>
>> > > <str name="fl">id, title, content_type, last_modified, url, score
>> </str>
>> > >
>> > > <str name="hl">on</str>
>> > > <str name="hl.fl">id, title, content, author, tag</str>
>> > > <str name="hl.highlightMultiTerm">true</str>
>> > > <str name="hl.preserveMulti">true</str>
>> > > <str name="hl.encoder">html</str>
>> > > <str name="hl.fragsize">200</str>
>> > > <str name="group">true</str>
>> > > <str name="group.field">signature</str>
>> > > <str name="group.main">true</str>
>> > > <str name="group.cache.percent">100</str>
>> > > </lst>
>> > > </requestHandler>
>> > >
>> > > <boundaryScanner name="breakIterator"
>> > > class="solr.highlight.BreakIteratorBoundaryScanner">
>> > > <lst name="defaults">
>> > > <str name="hl.bs.type">WORD</str>
>> > > <str name="hl.bs.language">en</str>
>> > > <str name="hl.bs.country">SG</str>
>> > > </lst>
>> > > </boundaryScanner>
>> > >
>> > >
>> > > Meanwhile, I'll take a look at the articles too.
>> > >
>> > > Thank you.
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > >
>> > > On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
>> <+s...@udngroup.com>
>> > <+scott.chu@udngroup.com <+s...@udngroup.com>>
>> > > <+scott.chu@udngroup.com <+s...@udngroup.com> <+
>> scott.chu@udngroup.com <+s...@udngroup.com>>>> wrote:
>> > >
>> > > > Hi Edwin,
>> > > >
>> > > > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
>> > > > know) so I didn't experience this problem.
>> > > >
>> > > > I'd suggest you post your schema.xml so we can see how you define
>> your
>> >
>> > > > content field and the field type it uses?
>> > > >
>> > > > In the mean time, refer to these articles, maybe the answer or
>> > workaround
>> > > > can be deducted from them.
>> > > >
>> > > > https://issues.apache.org/jira/browse/SOLR-3390
>> > > >
>> > > >
>> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>>
>> > > >
>> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>> > > >
>> > > > Good luck!
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
>> scott.chu@udngroup.com <+s...@udngroup.com>> <+
>> > scott.chu@udngroup.com <+s...@udngroup.com> <+
>> scott.chu@udngroup.com <+s...@udngroup.com>>>
>> > > > 2015/10/20
>> > > >
>> > > > ----- Original Message -----
>> > > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
>> <+e...@gmail.com>
>> > <+edwinyeozl@gmail.com <+e...@gmail.com>>
>> > > <+edwinyeozl@gmail.com <+e...@gmail.com> <+edwinyeozl@gmail.com
>> <+e...@gmail.com>>>>
>> > > > *To: *solr-user <solr-user@lucene.apache.org
>> <+s...@lucene.apache.org>
>> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>
>> > > <+solr-user@lucene.apache.org <+s...@lucene.apache.org> <+
>> solr-user@lucene.apache.org <+s...@lucene.apache.org>>>>
>> >
>> > > > *Date: *2015-10-13, 17:04:29
>> > > > *Subject: *Highlighting content field problem when using
>> > > > JiebaTokenizerFactory
>> > > >
>> > > > Hi,
>> > > >
>> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
>> characters
>> > > in
>> > > >
>> > > > Solr. It works fine with the segmentation when I'm using
>> > > > the Analysis function on the Solr Admin UI.
>> > > >
>> > > > However, when I tried to do the highlighting in Solr, it is not
>> > > > highlighting in the correct place. For example, when I search of
>> > > 自然環境与企業本身,
>> > > > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
>> > > >
>> > > > Even when I search for English character like responsibility, it
>> > > highlight
>> > > > <em> *responsibilit<em>*y.
>> > > >
>> > > > Basically, the highlighting goes off by 1 character/space
>> consistently.
>> > > >
>> > > > This problem only happens in content field, and not in any other
>> > fields.
>> > >
>> > > > Does anyone knows what could be causing the issue?
>> > > >
>> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>> > > >
>> > > >
>> > > > Regards,
>> > > > Edwin
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>>
>>
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Scott,

Thank you for providing the links and references. Will look through them,
and let you know if I find any solutions or workaround.

Regards,
Edwin


On 27 October 2015 at 11:13, Scott Chu <sc...@udngroup.com> wrote:

>
> Take a look at Michael's 2 articles, they might help you calrify the idea
> of highlighting in Solr:
>
> Changing Bits: Lucene's TokenStreams are actually graphs!
>
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
>
> Also take a look at 4th paragraph In his another article:
>
> Changing Bits: A new Lucene highlighter is born
>
> http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
>
> Currently, I can't figure out the possible cause of your problem unless I
> got spare time to test it on my own, which is not available these days (Got
> some projects to close)!
>
> If you find the solution or workaround, pls. let us know. Good luck again!
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/27
>
> ----- Original Message -----
> *From: *Scott Chu <sc...@udngroup.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-27, 10:27:45
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Edward,
>
>     Took a lot of time to see if there's anything can help you to define
> the cause of your problem. Maybe this might help you a bit:
>
> [SOLR-4722] Highlighter which generates a list of query term position(s)
> for each item in a list of documents, or returns null if highlighting is
> disabled. - AS...
> https://issues.apache.org/jira/browse/SOLR-4722
>
> This one is modified from FastVectorHighLighter, so ensure those 3 term*
> attributes are on.
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/27
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-23, 10:42:32
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your respond.
>
> 1. You said the problem only happens on "contents" field, so maybe there're
> something wrong with the contents of that field. Doe it contain any special
> thing in them, e.g. HTML tags or symbols. I recall SOLR-42 mentions
> something about HTML stripping will cause highlight problem. Maybe you can
>
> try purify that fields to be closed to pure text and see if highlight comes
> ok.
> *A) I check that the SOLR-42 is mentioning about the
> HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
> tokenizer is already deprecated too. I've tried with all kinds of content
> for rich-text documents, and all of them have the same problem.*
>
> 2. Maybe something imcompatible between JiebaTokenizer and Solr
> highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese
>
> but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and
>
> see if the problem goes away. However when I'm googling similar problem, I
>
> saw you asked same question on August at Huaban/Jieba-analysis and somebody
> said he also uses JiebaTokenizer but he doesn't have your problem. So I see
> this could be less suspect.
> *A) I was thinking about the incompatible issue too, as I previously
> thought that JiebaTokenizer is optimised for Solr 4.x, so it may have issue
> in 5.x. But the person from Hunban/Jieba-analysis said that he doesn't have
> this problem in Solr 5.1. I also face the same problem in Solr 5.1, and
> although I'm using Solr 5.3.0 now, the same problem persist. *
>
> I'm looking at the indexing process too, to see if there's any problem
> there. But just can't figure out why it only happen to JiebaTokenizer, and
>
> it only happen for content field.
>
>
> Regards,
> Edwin
>
>
> On 23 October 2015 at 09:41, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>> wrote:
>
> > Hi Edwin,
> >
> > Since you've tested all my suggestions and the problem is still there, I
>
> > can't think of anything wrong with your configuration. Now I can only
> > suspect two things:
> >
> > 1. You said the problem only happens on "contents" field, so maybe
> > there're something wrong with the contents of that field. Doe it contain
>
> > any special thing in them, e.g. HTML tags or symbols. I recall SOLR-42
> > mentions something about HTML stripping will cause highlight problem.
> Maybe
> > you can try purify that fields to be closed to pure text and see if
> > highlight comes ok.
> >
> > 2. Maybe something imcompatible between JiebaTokenizer and Solr
> > highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
> > SmartChinese (I don't use this since I am dealing with Traditional
> Chinese
> > but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg
> and
> > see if the problem goes away. However when I'm googling similar problem,
> I
> > saw you asked same question on August at Huaban/Jieba-analysis and
> somebody
> > said he also uses JiebaTokenizer but he doesn't have your problem. So I
> see
> > this could be less suspect.
> >
> > The theory of your problem could be something in indexing process causes
>
> > wrong position info. for that field and when Solr do highlighting, it
> > retrieves wrong position info. and mark wrong position of highlight
> target
> > terms.
> >
> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
> > 2015/10/23
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>>
> > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>>
> > *Date: *2015-10-22, 22:22:14
> > *Subject: *Re: Highlighting content field problem when using
> > JiebaTokenizerFactory
> >
> > Hi Scott,
> >
> > Thank you for your response and suggestions.
> >
> > With respond to your questions, here are the answers:
> >
> > 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> > job on CJK. I doubt this problem may be from those filters (note: I can
> > understand you may use CJKWidthFilter to convert Japanese but doesn't
> > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
> tried
> > commenting out those filters, say leave only Jieba and StopFilter, and
> see
> >
> > if this problem disppears?
> > *A) Yes, I have tried commenting out the other filters and only left with
> > Jieba and StopFilter. The problem is still there.*
> >
> > 2.Does this problem occur only on Chinese search words? Does it happen on
> > English search words?
> > *A) Yes, the same problem occurs on English words. For example, when I
> > search for "word", it will highlight in this way: <em> wor<em>d*
> >
> > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> > parameters in field declaration? I see only one is enabled. Please refer
> to
> > the answer in this stackoverflow question:
> >
> >
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > *A) I have tried to enable all 3 terms in the FastVectorHighlighter too,
>
> > but the same problem persists as well.*
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 22 October 2015 at 16:25, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>
> > <+scott.chu@udngroup.com <+s...@udngroup.com>>> wrote:
> >
> > > Hi solr-user,
> > >
> > > Can't judge the cause on fast glimpse of your definition but some
> > > suggestions I can give:
> > >
> > > 1. I take a look at Jieba. It uses a dictionary and it seems to do a
> good
> > > job on CJK. I doubt this problem may be from those filters (note: I can
> > > understand you may use CJKWidthFilter to convert Japanese but doesn't
> > > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
> > tried
> > > commenting out those filters, say leave only Jieba and StopFilter, and
>
> > see
> > > if this problem disppears?
> > >
> > > 2.Does this problem occur only on Chinese search words? Does it happen
> on
> > > English search words?
> > >
> > > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> > > parameters in field declaration? I see only one is enabled. Please
> refer
> > to
> > > the answer in this stackoverflow question:
> > >
> >
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> > >
> > >
> > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
> scott.chu@udngroup.com <+s...@udngroup.com>>
> > > 2015/10/22
> > >
> > > ----- Original Message -----
> > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>
> > <+edwinyeozl@gmail.com <+e...@gmail.com>>>
> > > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>
> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>>
> > > *Date: *2015-10-20, 12:04:11
> > > *Subject: *Re: Highlighting content field problem when using
> >
> > > JiebaTokenizerFactory
> > >
> > > Hi Scott,
> > >
> > > Here's my schema.xml for content and title, which uses text_chinese.
> The
> >
> > > problem only occurs in content, and not in title.
> > >
> > > <field name="content" type="text_chinese" indexed="true" stored="true"
> > > omitNorms="true" termVectors="true"/>
> > > <field name="title" type="text_chinese" indexed="true" stored="true"
> > > omitNorms="true" termVectors="true"/>
> > >
> > >
> > > <fieldType name="text_chinese" class="solr.TextField"
> > > positionIncrementGap="100">
> > > <analyzer type="index">
> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > > segMode="SEARCH"/>
> > > <filter class="solr.CJKWidthFilterFactory"/>
> > > <filter class="solr.CJKBigramFilterFactory"/>
> > > <filter class="solr.StopFilterFactory"
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > > maxGramSize="15"/>
> > > <filter class="solr.PorterStemFilterFactory"/>
> > > </analyzer>
> > > <analyzer type="query">
> > > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > > segMode="SEARCH"/>
> > > <filter class="solr.CJKWidthFilterFactory"/>
> > > <filter class="solr.CJKBigramFilterFactory"/>
> > > <filter class="solr.StopFilterFactory"
> > > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > > <filter class="solr.PorterStemFilterFactory"/>
> > > </analyzer>
> > > </fieldType>
> > >
> > >
> > > Here's my solrconfig.xml on the highlighting portion:
> > >
> > > <requestHandler name="/highlight" class="solr.SearchHandler">
> > > <lst name="defaults">
> > > <str name="echoParams">explicit</str>
> > > <int name="rows">10</int>
> > > <str name="wt">json</str>
> > > <str name="indent">true</str>
> > > <str name="df">text</str>
> > > <str name="fl">id, title, content_type, last_modified, url, score
> </str>
> > >
> > > <str name="hl">on</str>
> > > <str name="hl.fl">id, title, content, author, tag</str>
> > > <str name="hl.highlightMultiTerm">true</str>
> > > <str name="hl.preserveMulti">true</str>
> > > <str name="hl.encoder">html</str>
> > > <str name="hl.fragsize">200</str>
> > > <str name="group">true</str>
> > > <str name="group.field">signature</str>
> > > <str name="group.main">true</str>
> > > <str name="group.cache.percent">100</str>
> > > </lst>
> > > </requestHandler>
> > >
> > > <boundaryScanner name="breakIterator"
> > > class="solr.highlight.BreakIteratorBoundaryScanner">
> > > <lst name="defaults">
> > > <str name="hl.bs.type">WORD</str>
> > > <str name="hl.bs.language">en</str>
> > > <str name="hl.bs.country">SG</str>
> > > </lst>
> > > </boundaryScanner>
> > >
> > >
> > > Meanwhile, I'll take a look at the articles too.
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>
> > <+scott.chu@udngroup.com <+s...@udngroup.com>>
> > > <+scott.chu@udngroup.com <+s...@udngroup.com> <+
> scott.chu@udngroup.com <+s...@udngroup.com>>>> wrote:
> > >
> > > > Hi Edwin,
> > > >
> > > > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > > > know) so I didn't experience this problem.
> > > >
> > > > I'd suggest you post your schema.xml so we can see how you define
> your
> >
> > > > content field and the field type it uses?
> > > >
> > > > In the mean time, refer to these articles, maybe the answer or
> > workaround
> > > > can be deducted from them.
> > > >
> > > > https://issues.apache.org/jira/browse/SOLR-3390
> > > >
> > > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>
> > > >
> > > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > > >
> > > > Good luck!
> > > >
> > > >
> > > >
> > > >
> > > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
> scott.chu@udngroup.com <+s...@udngroup.com>> <+
> > scott.chu@udngroup.com <+s...@udngroup.com> <+
> scott.chu@udngroup.com <+s...@udngroup.com>>>
> > > > 2015/10/20
> > > >
> > > > ----- Original Message -----
> > > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>
> > <+edwinyeozl@gmail.com <+e...@gmail.com>>
> > > <+edwinyeozl@gmail.com <+e...@gmail.com> <+edwinyeozl@gmail.com
> <+e...@gmail.com>>>>
> > > > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>
> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>
> > > <+solr-user@lucene.apache.org <+s...@lucene.apache.org> <+
> solr-user@lucene.apache.org <+s...@lucene.apache.org>>>>
> >
> > > > *Date: *2015-10-13, 17:04:29
> > > > *Subject: *Highlighting content field problem when using
> > > > JiebaTokenizerFactory
> > > >
> > > > Hi,
> > > >
> > > > I'm trying to use the JiebaTokenizerFactory to index Chinese
> characters
> > > in
> > > >
> > > > Solr. It works fine with the segmentation when I'm using
> > > > the Analysis function on the Solr Admin UI.
> > > >
> > > > However, when I tried to do the highlighting in Solr, it is not
> > > > highlighting in the correct place. For example, when I search of
> > > 自然環境与企業本身,
> > > > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > > >
> > > > Even when I search for English character like responsibility, it
> > > highlight
> > > > <em> *responsibilit<em>*y.
> > > >
> > > > Basically, the highlighting goes off by 1 character/space
> consistently.
> > > >
> > > > This problem only happens in content field, and not in any other
> > fields.
> > >
> > > > Does anyone knows what could be causing the issue?
> > > >
> > > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > > >
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Scott Chu <sc...@udngroup.com>.
Take a look at Michael's two articles; they might help you clarify how highlighting works in Solr:

Changing Bits: Lucene's TokenStreams are actually graphs!
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Also take a look at the 4th paragraph in his other article:

Changing Bits: A new Lucene highlighter is born
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html
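
If you want to experiment with that newer highlighter in Solr, the wiring is roughly like this (from memory, not tested on 5.3, so please double-check the class name and field option against your version). In solrconfig.xml:

<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>

and the highlighted field has to store offsets in the postings, e.g. in schema.xml:

<field name="content" type="text_chinese" indexed="true" stored="true" storeOffsetsWithPositions="true"/>

(which needs a re-index). Note that it reads offsets straight from the index, so if the tokenizer writes wrong offsets the highlight will still be wrong -- but that makes it a useful cross-check.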

Currently, I can't figure out the possible cause of your problem unless I get some spare time to test it on my own, which is not available these days (I've got some projects to close)!

If you find a solution or workaround, please let us know. Good luck again!

Scott Chu,scott.chu@udngroup.com
2015/10/27 
----- Original Message ----- 
From: Scott Chu 
To: solr-user 
Date: 2015-10-27, 10:27:45
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Edward,

    Took a lot of time to see if there's anything can help you to define the cause of your problem. Maybe this might help you a bit: 

[SOLR-4722] Highlighter which generates a list of query term position(s) for each item in a list of documents, or returns null if highlighting is disabled. - AS...
https://issues.apache.org/jira/browse/SOLR-4722

This one is modified from FastVectorHighLighter, so ensure those 3 term* attributes are on.

Scott Chu,scott.chu@udngroup.com
2015/10/27 
----- Original Message ----- 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-23, 10:42:32
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your respond.

1. You said the problem only happens on "contents" field, so maybe there're
something wrong with the contents of that field. Doe it contain any special
thing in them, e.g. HTML tags or symbols. I recall SOLR-42 mentions
something about HTML stripping will cause highlight problem. Maybe you can

try purify that fields to be closed to pure text and see if highlight comes
ok.
*A) I check that the SOLR-42 is mentioning about the
HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
tokenizer is already deprecated too. I've tried with all kinds of content
for rich-text documents, and all of them have the same problem.*

2. Maybe something imcompatible between JiebaTokenizer and Solr
highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with Traditional Chinese

but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and

see if the problem goes away. However when I'm googling similar problem, I

saw you asked same question on August at Huaban/Jieba-analysis and somebody
said he also uses JiebaTokenizer but he doesn't have your problem. So I see
this could be less suspect.
*A) I was thinking about the incompatible issue too, as I previously
thought that JiebaTokenizer is optimised for Solr 4.x, so it may have issue
in 5.x. But the person from Hunban/Jieba-analysis said that he doesn't have
this problem in Solr 5.1. I also face the same problem in Solr 5.1, and
although I'm using Solr 5.3.0 now, the same problem persist. *

I'm looking at the indexing process too, to see if there's any problem
there. But just can't figure out why it only happen to JiebaTokenizer, and

it only happen for content field.


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu <sc...@udngroup.com> wrote:

> Hi Edwin,
>
> Since you've tested all my suggestions and the problem is still there, I

> can't think of anything wrong with your configuration. Now I can only
> suspect two things:
>
> 1. You said the problem only happens on "contents" field, so maybe
> there're something wrong with the contents of that field. Doe it contain

> any special thing in them, e.g. HTML tags or symbols. I recall SOLR-42
> mentions something about HTML stripping will cause highlight problem. Maybe
> you can try purify that fields to be closed to pure text and see if
> highlight comes ok.
>
> 2. Maybe something imcompatible between JiebaTokenizer and Solr
> highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese
> but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and
> see if the problem goes away. However when I'm googling similar problem, I
> saw you asked same question on August at Huaban/Jieba-analysis and somebody
> said he also uses JiebaTokenizer but he doesn't have your problem. So I see
> this could be less suspect.
>
> The theory of your problem could be something in indexing process causes

> wrong position info. for that field and when Solr do highlighting, it
> retrieves wrong position info. and mark wrong position of highlight target
> terms.
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/23
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-22, 22:22:14
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your response and suggestions.
>
> With respond to your questions, here are the answers:
>
> 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I doubt this problem may be from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese but doesn't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leave only Jieba and StopFilter, and see
>
> if this problem disppears?
> *A) Yes, I have tried commenting out the other filters and only left with
> Jieba and StopFilter. The problem is still there.*
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
> *A) Yes, the same problem occurs on English words. For example, when I
> search for "word", it will highlight in this way: <em> wor<em>d*
>
> 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> parameters in field declaration? I see only one is enabled. Please refer to
> the answer in this stackoverflow question:
>
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> *A) I have tried to enable all 3 terms in the FastVectorHighlighter too,

> but the same problem persists as well.*
>
>
> Regards,
> Edwin
>
>
> On 22 October 2015 at 16:25, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>> wrote:
>
> > Hi solr-user,
> >
> > Can't judge the cause on fast glimpse of your definition but some
> > suggestions I can give:
> >
> > 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> > job on CJK. I doubt this problem may be from those filters (note: I can
> > understand you may use CJKWidthFilter to convert Japanese but doesn't
> > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
> tried
> > commenting out those filters, say leave only Jieba and StopFilter, and

> see
> > if this problem disppears?
> >
> > 2.Does this problem occur only on Chinese search words? Does it happen on
> > English search words?
> >
> > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> > parameters in field declaration? I see only one is enabled. Please refer
> to
> > the answer in this stackoverflow question:
> >
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> >
> >
> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
> > 2015/10/22
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>>
> > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>>
> > *Date: *2015-10-20, 12:04:11
> > *Subject: *Re: Highlighting content field problem when using
>
> > JiebaTokenizerFactory
> >
> > Hi Scott,
> >
> > Here's my schema.xml for content and title, which uses text_chinese. The
>
> > problem only occurs in content, and not in title.
> >
> > <field name="content" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> > <field name="title" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> >
> >
> > <fieldType name="text_chinese" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer type="index">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > maxGramSize="15"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > <analyzer type="query">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > </fieldType>
> >
> >
> > Here's my solrconfig.xml on the highlighting portion:
> >
> > <requestHandler name="/highlight" class="solr.SearchHandler">
> > <lst name="defaults">
> > <str name="echoParams">explicit</str>
> > <int name="rows">10</int>
> > <str name="wt">json</str>
> > <str name="indent">true</str>
> > <str name="df">text</str>
> > <str name="fl">id, title, content_type, last_modified, url, score </str>
> >
> > <str name="hl">on</str>
> > <str name="hl.fl">id, title, content, author, tag</str>
> > <str name="hl.highlightMultiTerm">true</str>
> > <str name="hl.preserveMulti">true</str>
> > <str name="hl.encoder">html</str>
> > <str name="hl.fragsize">200</str>
> > <str name="group">true</str>
> > <str name="group.field">signature</str>
> > <str name="group.main">true</str>
> > <str name="group.cache.percent">100</str>
> > </lst>
> > </requestHandler>
> >
> > <boundaryScanner name="breakIterator"
> > class="solr.highlight.BreakIteratorBoundaryScanner">
> > <lst name="defaults">
> > <str name="hl.bs.type">WORD</str>
> > <str name="hl.bs.language">en</str>
> > <str name="hl.bs.country">SG</str>
> > </lst>
> > </boundaryScanner>
> >
> >
> > Meanwhile, I'll take a look at the articles too.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
> >
> >
> > On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>
> > <+scott.chu@udngroup.com <+s...@udngroup.com>>> wrote:
> >
> > > Hi Edwin,
> > >
> > > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > > know) so I didn't experience this problem.
> > >
> > > I'd suggest you post your schema.xml so we can see how you define your
>
> > > content field and the field type it uses?
> > >
> > > In the mean time, refer to these articles, maybe the answer or
> workaround
> > > can be deducted from them.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-3390
> > >
> > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words

> > >
> > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > >
> > > Good luck!
> > >
> > >
> > >
> > >
> > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
> scott.chu@udngroup.com <+s...@udngroup.com>>
> > > 2015/10/20
> > >
> > > ----- Original Message -----
> > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>
> > <+edwinyeozl@gmail.com <+e...@gmail.com>>>
> > > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>
> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>>
>
> > > *Date: *2015-10-13, 17:04:29
> > > *Subject: *Highlighting content field problem when using
> > > JiebaTokenizerFactory
> > >
> > > Hi,
> > >
> > > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> > in
> > >
> > > Solr. It works fine with the segmentation when I'm using
> > > the Analysis function on the Solr Admin UI.
> > >
> > > However, when I tried to do the highlighting in Solr, it is not
> > > highlighting in the correct place. For example, when I search of
> > 自然環境与企業本身,
> > > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > >
> > > Even when I search for English character like responsibility, it
> > highlight
> > > <em> *responsibilit<em>*y.
> > >
> > > Basically, the highlighting goes off by 1 character/space consistently.
> > >
> > > This problem only happens in content field, and not in any other
> fields.
> >
> > > Does anyone knows what could be causing the issue?
> > >
> > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
>




Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Scott Chu <sc...@udngroup.com>.
Hi Edwin,

    It took a lot of time to see if there's anything that can help you pin down the cause of your problem. Maybe this might help you a bit: 

[SOLR-4722] Highlighter which generates a list of query term position(s) for each item in a list of documents, or returns null if highlighting is disabled. - AS...
https://issues.apache.org/jira/browse/SOLR-4722

This one is modified from the FastVectorHighlighter, so ensure those 3 term* attributes are on.
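
For reference, here is a sketch of what the content field could look like with all three term* attributes switched on (adapted from the schema you posted earlier, so treat it as untested):

<field name="content" type="text_chinese" indexed="true" stored="true"
       omitNorms="true" termVectors="true" termPositions="true" termOffsets="true"/>

The FastVectorHighlighter and the highlighters derived from it read highlight positions from these term vectors, so all three have to be present at index time, and any documents indexed before the change would need to be re-indexed.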

Scott Chu,scott.chu@udngroup.com
2015/10/27 
----- Original Message ----- 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-23, 10:42:32
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your respond.

1. You said the problem only happens on "contents" field, so maybe there're
something wrong with the contents of that field. Doe it contain any special
thing in them, e.g. HTML tags or symbols. I recall SOLR-42 mentions
something about HTML stripping will cause highlight problem. Maybe you can

try purify that fields to be closed to pure text and see if highlight comes
ok.
*A) I check that the SOLR-42 is mentioning about the
HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
tokenizer is already deprecated too. I've tried with all kinds of content
for rich-text documents, and all of them have the same problem.*

2. Maybe something imcompatible between JiebaTokenizer and Solr
highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with Traditional Chinese

but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and

see if the problem goes away. However when I'm googling similar problem, I

saw you asked same question on August at Huaban/Jieba-analysis and somebody
said he also uses JiebaTokenizer but he doesn't have your problem. So I see
this could be less suspect.
*A) I was thinking about the incompatible issue too, as I previously
thought that JiebaTokenizer is optimised for Solr 4.x, so it may have issue
in 5.x. But the person from Hunban/Jieba-analysis said that he doesn't have
this problem in Solr 5.1. I also face the same problem in Solr 5.1, and
although I'm using Solr 5.3.0 now, the same problem persist. *

I'm looking at the indexing process too, to see if there's any problem
there. But just can't figure out why it only happen to JiebaTokenizer, and

it only happen for content field.


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu <sc...@udngroup.com> wrote:

> Hi Edwin,
>
> Since you've tested all my suggestions and the problem is still there, I

> can't think of anything wrong with your configuration. Now I can only
> suspect two things:
>
> 1. You said the problem only happens on "contents" field, so maybe
> there're something wrong with the contents of that field. Doe it contain

> any special thing in them, e.g. HTML tags or symbols. I recall SOLR-42
> mentions something about HTML stripping will cause highlight problem. Maybe
> you can try purify that fields to be closed to pure text and see if
> highlight comes ok.
>
> 2. Maybe something imcompatible between JiebaTokenizer and Solr
> highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese
> but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and
> see if the problem goes away. However when I'm googling similar problem, I
> saw you asked same question on August at Huaban/Jieba-analysis and somebody
> said he also uses JiebaTokenizer but he doesn't have your problem. So I see
> this could be less suspect.
>
> The theory of your problem could be something in indexing process causes

> wrong position info. for that field and when Solr do highlighting, it
> retrieves wrong position info. and mark wrong position of highlight target
> terms.
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/23
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-22, 22:22:14
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your response and suggestions.
>
> With respond to your questions, here are the answers:
>
> 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I doubt this problem may be from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese but doesn't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leave only Jieba and StopFilter, and see
>
> if this problem disppears?
> *A) Yes, I have tried commenting out the other filters and only left with
> Jieba and StopFilter. The problem is still there.*
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
> *A) Yes, the same problem occurs on English words. For example, when I
> search for "word", it will highlight in this way: <em> wor<em>d*
>
> 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> parameters in field declaration? I see only one is enabled. Please refer to
> the answer in this stackoverflow question:
>
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> *A) I have tried to enable all 3 terms in the FastVectorHighlighter too,

> but the same problem persists as well.*
>
>
> Regards,
> Edwin
>
>
> On 22 October 2015 at 16:25, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>> wrote:
>
> > Hi solr-user,
> >
> > Can't judge the cause on fast glimpse of your definition but some
> > suggestions I can give:
> >
> > 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> > job on CJK. I doubt this problem may be from those filters (note: I can
> > understand you may use CJKWidthFilter to convert Japanese but doesn't
> > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
> tried
> > commenting out those filters, say leave only Jieba and StopFilter, and

> see
> > if this problem disppears?
> >
> > 2.Does this problem occur only on Chinese search words? Does it happen on
> > English search words?
> >
> > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> > parameters in field declaration? I see only one is enabled. Please refer
> to
> > the answer in this stackoverflow question:
> >
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> >
> >
> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
> > 2015/10/22
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>>
> > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>>
> > *Date: *2015-10-20, 12:04:11
> > *Subject: *Re: Highlighting content field problem when using
>
> > JiebaTokenizerFactory
> >
> > Hi Scott,
> >
> > Here's my schema.xml for content and title, which uses text_chinese. The
>
> > problem only occurs in content, and not in title.
> >
> > <field name="content" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> > <field name="title" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> >
> >
> > <fieldType name="text_chinese" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer type="index">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > maxGramSize="15"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > <analyzer type="query">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > </fieldType>
> >
> >
> > Here's my solrconfig.xml on the highlighting portion:
> >
> > <requestHandler name="/highlight" class="solr.SearchHandler">
> > <lst name="defaults">
> > <str name="echoParams">explicit</str>
> > <int name="rows">10</int>
> > <str name="wt">json</str>
> > <str name="indent">true</str>
> > <str name="df">text</str>
> > <str name="fl">id, title, content_type, last_modified, url, score </str>
> >
> > <str name="hl">on</str>
> > <str name="hl.fl">id, title, content, author, tag</str>
> > <str name="hl.highlightMultiTerm">true</str>
> > <str name="hl.preserveMulti">true</str>
> > <str name="hl.encoder">html</str>
> > <str name="hl.fragsize">200</str>
> > <str name="group">true</str>
> > <str name="group.field">signature</str>
> > <str name="group.main">true</str>
> > <str name="group.cache.percent">100</str>
> > </lst>
> > </requestHandler>
> >
> > <boundaryScanner name="breakIterator"
> > class="solr.highlight.BreakIteratorBoundaryScanner">
> > <lst name="defaults">
> > <str name="hl.bs.type">WORD</str>
> > <str name="hl.bs.language">en</str>
> > <str name="hl.bs.country">SG</str>
> > </lst>
> > </boundaryScanner>
> >
> >
> > Meanwhile, I'll take a look at the articles too.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
> >
> >
> > On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>
> > <+scott.chu@udngroup.com <+s...@udngroup.com>>> wrote:
> >
> > > Hi Edwin,
> > >
> > > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > > know) so I didn't experience this problem.
> > >
> > > I'd suggest you post your schema.xml so we can see how you define your
>
> > > content field and the field type it uses?
> > >
> > > In the mean time, refer to these articles, maybe the answer or
> workaround
> > > can be deducted from them.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-3390
> > >
> > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words

> > >
> > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > >
> > > Good luck!
> > >
> > >
> > >
> > >
> > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
> scott.chu@udngroup.com <+s...@udngroup.com>>
> > > 2015/10/20
> > >
> > > ----- Original Message -----
> > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>
> > <+edwinyeozl@gmail.com <+e...@gmail.com>>>
> > > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>
> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>>
>
> > > *Date: *2015-10-13, 17:04:29
> > > *Subject: *Highlighting content field problem when using
> > > JiebaTokenizerFactory
> > >
> > > Hi,
> > >
> > > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> > in
> > >
> > > Solr. It works fine with the segmentation when I'm using
> > > the Analysis function on the Solr Admin UI.
> > >
> > > However, when I tried to do the highlighting in Solr, it is not
> > > highlighting in the correct place. For example, when I search of
> > 自然環境与企業本身,
> > > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > >
> > > Even when I search for English character like responsibility, it
> > highlight
> > > <em> *responsibilit<em>*y.
> > >
> > > Basically, the highlighting goes off by 1 character/space consistently.
> > >
> > > This problem only happens in content field, and not in any other
> fields.
> >
> > > Does anyone knows what could be causing the issue?
> > >
> > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
>




Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Scott,

Thank you for your response.

1. You said the problem only happens on the "contents" field, so maybe there's
something wrong with the contents of that field. Does it contain anything
special, e.g. HTML tags or symbols? I recall SOLR-42 mentions something
about HTML stripping causing highlight problems. Maybe you can
try purifying that field to be close to pure text and see if the highlighting comes
out ok.
*A) I checked that SOLR-42 refers to the
HTMLStripWhiteSpaceTokenizerFactory, which I'm not using. I believe that
tokenizer is already deprecated too. I've tried all kinds of content
from rich-text documents, and all of them have the same problem.*

2. Maybe something is incompatible between JiebaTokenizer and the Solr
highlighter. You could switch to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with Traditional Chinese
but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
and see if the problem goes away. However, when I was googling for a similar problem, I
saw you asked the same question in August at Huaban/Jieba-analysis and somebody
said he also uses JiebaTokenizer but doesn't have your problem. So I see
this as a less likely suspect.
*A) I was thinking about the incompatibility issue too, as I previously
thought that JiebaTokenizer was optimised for Solr 4.x, so it may have issues
in 5.x. But the person from Huaban/Jieba-analysis said that he doesn't have
this problem in Solr 5.1. I also faced the same problem in Solr 5.1, and
although I'm using Solr 5.3.0 now, the same problem persists.*

I'm looking at the indexing process too, to see if there's any problem
there. But I just can't figure out why it only happens with JiebaTokenizer, and
only for the content field.


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu <sc...@udngroup.com> wrote:

> Hi Edwin,
>
> Since you've tested all my suggestions and the problem is still there, I
> can't think of anything wrong with your configuration. Now I can only
> suspect two things:
>
> 1. You said the problem only happens on "contents" field, so maybe
> there're something wrong with the contents of that field. Doe it contain
> any special thing in them, e.g. HTML tags or symbols. I recall SOLR-42
> mentions something about HTML stripping will cause highlight problem. Maybe
> you can try purify that fields to be closed to pure text and see if
> highlight comes ok.
>
> 2. Maybe something imcompatible between JiebaTokenizer and Solr
> highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese
> but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and
> see if the problem goes away. However when I'm googling similar problem, I
> saw you asked same question on August at Huaban/Jieba-analysis and somebody
> said he also uses JiebaTokenizer but he doesn't have your problem. So I see
> this could be less suspect.
>
> The theory of your problem could be something in indexing process causes
> wrong position info. for that field and when Solr do highlighting, it
> retrieves wrong position info. and mark wrong position of highlight target
> terms.
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/23
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-22, 22:22:14
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your response and suggestions.
>
> With respond to your questions, here are the answers:
>
> 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I doubt this problem may be from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese but doesn't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leave only Jieba and StopFilter, and see
>
> if this problem disppears?
> *A) Yes, I have tried commenting out the other filters and only left with
> Jieba and StopFilter. The problem is still there.*
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
> *A) Yes, the same problem occurs on English words. For example, when I
> search for "word", it will highlight in this way: <em> wor<em>d*
>
> 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> parameters in field declaration? I see only one is enabled. Please refer to
> the answer in this stackoverflow question:
>
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> *A) I have tried to enable all 3 terms in the FastVectorHighlighter too,
> but the same problem persists as well.*
>
>
> Regards,
> Edwin
>
>
> On 22 October 2015 at 16:25, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>> wrote:
>
> > Hi solr-user,
> >
> > Can't judge the cause on fast glimpse of your definition but some
> > suggestions I can give:
> >
> > 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> > job on CJK. I doubt this problem may be from those filters (note: I can
> > understand you may use CJKWidthFilter to convert Japanese but doesn't
> > understand why you use CJKBigramFilter and EdgeNGramFilter). Have you
> tried
> > commenting out those filters, say leave only Jieba and StopFilter, and
> see
> > if this problem disppears?
> >
> > 2.Does this problem occur only on Chinese search words? Does it happen on
> > English search words?
> >
> > 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> > parameters in field declaration? I see only one is enabled. Please refer
> to
> > the answer in this stackoverflow question:
> >
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
> >
> >
> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
> > 2015/10/22
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>>
> > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>>
> > *Date: *2015-10-20, 12:04:11
> > *Subject: *Re: Highlighting content field problem when using
>
> > JiebaTokenizerFactory
> >
> > Hi Scott,
> >
> > Here's my schema.xml for content and title, which uses text_chinese. The
>
> > problem only occurs in content, and not in title.
> >
> > <field name="content" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> > <field name="title" type="text_chinese" indexed="true" stored="true"
> > omitNorms="true" termVectors="true"/>
> >
> >
> > <fieldType name="text_chinese" class="solr.TextField"
> > positionIncrementGap="100">
> > <analyzer type="index">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> > maxGramSize="15"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > <analyzer type="query">
> > <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> > segMode="SEARCH"/>
> > <filter class="solr.CJKWidthFilterFactory"/>
> > <filter class="solr.CJKBigramFilterFactory"/>
> > <filter class="solr.StopFilterFactory"
> > words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> > <filter class="solr.PorterStemFilterFactory"/>
> > </analyzer>
> > </fieldType>
> >
> >
> > Here's my solrconfig.xml on the highlighting portion:
> >
> > <requestHandler name="/highlight" class="solr.SearchHandler">
> > <lst name="defaults">
> > <str name="echoParams">explicit</str>
> > <int name="rows">10</int>
> > <str name="wt">json</str>
> > <str name="indent">true</str>
> > <str name="df">text</str>
> > <str name="fl">id, title, content_type, last_modified, url, score </str>
> >
> > <str name="hl">on</str>
> > <str name="hl.fl">id, title, content, author, tag</str>
> > <str name="hl.highlightMultiTerm">true</str>
> > <str name="hl.preserveMulti">true</str>
> > <str name="hl.encoder">html</str>
> > <str name="hl.fragsize">200</str>
> > <str name="group">true</str>
> > <str name="group.field">signature</str>
> > <str name="group.main">true</str>
> > <str name="group.cache.percent">100</str>
> > </lst>
> > </requestHandler>
> >
> > <boundaryScanner name="breakIterator"
> > class="solr.highlight.BreakIteratorBoundaryScanner">
> > <lst name="defaults">
> > <str name="hl.bs.type">WORD</str>
> > <str name="hl.bs.language">en</str>
> > <str name="hl.bs.country">SG</str>
> > </lst>
> > </boundaryScanner>
> >
> >
> > Meanwhile, I'll take a look at the articles too.
> >
> > Thank you.
> >
> > Regards,
> > Edwin
> >
> >
> > On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>
> > <+scott.chu@udngroup.com <+s...@udngroup.com>>> wrote:
> >
> > > Hi Edwin,
> > >
> > > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > > know) so I didn't experience this problem.
> > >
> > > I'd suggest you post your schema.xml so we can see how you define your
>
> > > content field and the field type it uses?
> > >
> > > In the mean time, refer to these articles, maybe the answer or
> workaround
> > > can be deducted from them.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-3390
> > >
> > > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> > >
> > > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> > >
> > > Good luck!
> > >
> > >
> > >
> > >
> > > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com> <+
> scott.chu@udngroup.com <+s...@udngroup.com>>
> > > 2015/10/20
> > >
> > > ----- Original Message -----
> > > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>
> > <+edwinyeozl@gmail.com <+e...@gmail.com>>>
> > > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>
> > <+solr-user@lucene.apache.org <+s...@lucene.apache.org>>>
>
> > > *Date: *2015-10-13, 17:04:29
> > > *Subject: *Highlighting content field problem when using
> > > JiebaTokenizerFactory
> > >
> > > Hi,
> > >
> > > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> > in
> > >
> > > Solr. It works fine with the segmentation when I'm using
> > > the Analysis function on the Solr Admin UI.
> > >
> > > However, when I tried to do the highlighting in Solr, it is not
> > > highlighting in the correct place. For example, when I search of
> > 自然環境与企業本身,
> > > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> > >
> > > Even when I search for English character like responsibility, it
> > highlight
> > > <em> *responsibilit<em>*y.
> > >
> > > Basically, the highlighting goes off by 1 character/space consistently.
> > >
> > > This problem only happens in content field, and not in any other
> fields.
> >
> > > Does anyone knows what could be causing the issue?
> > >
> > > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Scott Chu <sc...@udngroup.com>.
Hi Edwin,

Since you've tested all my suggestions and the problem is still there, I can't think of anything wrong with your configuration. Now I can only suspect two things:

1. You said the problem only happens on the "contents" field, so maybe there's something wrong with the contents of that field. Does it contain anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions something about HTML stripping causing highlight problems. Maybe you can try purifying that field to be close to pure text and see if the highlighting comes out ok.

2. Maybe something is incompatible between JiebaTokenizer and the Solr highlighter. You could switch to other tokenizers, e.g. Standard, CJK, SmartChinese (I don't use this since I am dealing with Traditional Chinese but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg, and see if the problem goes away. However, when I was googling for a similar problem, I saw you asked the same question in August at Huaban/Jieba-analysis and somebody said he also uses JiebaTokenizer but doesn't have your problem. So I see this as a less likely suspect.
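
If you want a quick A/B test, a throwaway field type along these lines (just a rough, untested sketch; the type name is made up) would let you re-index a few documents with a different tokenizer and compare the highlight positions:

<fieldType name="text_chinese_std_test" class="solr.TextField" positionIncrementGap="100">
  <!-- hypothetical test type: only the tokenizer differs from your text_chinese -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

If the highlights line up correctly with this type, the offsets coming out of JiebaTokenizer are the prime suspect.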

The theory behind your problem could be that something in the indexing process produces wrong position info for that field, so when Solr does the highlighting it retrieves the wrong position info and marks the wrong positions for the highlight target terms.
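
One way to check the position info directly (a sketch; it assumes the implicit /analysis/field handler is available and a core named collection1, so adjust both to your setup) is to run the index-time analysis on a sample value and look at the start/end offsets reported for each token:

http://localhost:8983/solr/collection1/analysis/field?analysis.fieldtype=text_chinese&analysis.fieldvalue=test%20responsibility&wt=json&indent=true

If the offsets there are already shifted by one relative to the original text, the tokenizer rather than the highlighter is producing the wrong positions.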

Scott Chu,scott.chu@udngroup.com
2015/10/23 
----- Original Message ----- 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-22, 22:22:14
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your response and suggestions.

With respond to your questions, here are the answers:

1. I take a look at Jieba. It uses a dictionary and it seems to do a good
job on CJK. I doubt this problem may be from those filters (note: I can
understand you may use CJKWidthFilter to convert Japanese but doesn't
understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
commenting out those filters, say leave only Jieba and StopFilter, and see

if this problem disppears?
*A) Yes, I have tried commenting out the other filters and only left with
Jieba and StopFilter. The problem is still there.*

2.Does this problem occur only on Chinese search words? Does it happen on
English search words?
*A) Yes, the same problem occurs on English words. For example, when I
search for "word", it will highlight in this way: <em> wor<em>d*

3.To use FastVectorHighlighter, you seem to have to enable 3 term*
parameters in field declaration? I see only one is enabled. Please refer to
the answer in this stackoverflow question:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
*A) I have tried to enable all 3 terms in the FastVectorHighlighter too,
but the same problem persists as well.*


Regards,
Edwin


On 22 October 2015 at 16:25, Scott Chu <sc...@udngroup.com> wrote:

> Hi solr-user,
>
> Can't judge the cause on fast glimpse of your definition but some
> suggestions I can give:
>
> 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I doubt this problem may be from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese but doesn't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leave only Jieba and StopFilter, and see
> if this problem disppears?
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
>
> 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> parameters in field declaration? I see only one is enabled. Please refer to
> the answer in this stackoverflow question:
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/22
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-20, 12:04:11
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Here's my schema.xml for content and title, which uses text_chinese. The

> problem only occurs in content, and not in title.
>
> <field name="content" type="text_chinese" indexed="true" stored="true"
> omitNorms="true" termVectors="true"/>
> <field name="title" type="text_chinese" indexed="true" stored="true"
> omitNorms="true" termVectors="true"/>
>
>
> <fieldType name="text_chinese" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> segMode="SEARCH"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> maxGramSize="15"/>
> <filter class="solr.PorterStemFilterFactory"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
> segMode="SEARCH"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> <filter class="solr.PorterStemFilterFactory"/>
> </analyzer>
> </fieldType>
>
>
> Here's my solrconfig.xml on the highlighting portion:
>
> <requestHandler name="/highlight" class="solr.SearchHandler">
> <lst name="defaults">
> <str name="echoParams">explicit</str>
> <int name="rows">10</int>
> <str name="wt">json</str>
> <str name="indent">true</str>
> <str name="df">text</str>
> <str name="fl">id, title, content_type, last_modified, url, score </str>
>
> <str name="hl">on</str>
> <str name="hl.fl">id, title, content, author, tag</str>
> <str name="hl.highlightMultiTerm">true</str>
> <str name="hl.preserveMulti">true</str>
> <str name="hl.encoder">html</str>
> <str name="hl.fragsize">200</str>
> <str name="group">true</str>
> <str name="group.field">signature</str>
> <str name="group.main">true</str>
> <str name="group.cache.percent">100</str>
> </lst>
> </requestHandler>
>
> <boundaryScanner name="breakIterator"
> class="solr.highlight.BreakIteratorBoundaryScanner">
> <lst name="defaults">
> <str name="hl.bs.type">WORD</str>
> <str name="hl.bs.language">en</str>
> <str name="hl.bs.country">SG</str>
> </lst>
> </boundaryScanner>
>
>
> Meanwhile, I'll take a look at the articles too.
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>> wrote:
>
> > Hi Edwin,
> >
> > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > know) so I didn't experience this problem.
> >
> > I'd suggest you post your schema.xml so we can see how you define your

> > content field and the field type it uses?
> >
> > In the mean time, refer to these articles, maybe the answer or workaround
> > can be deducted from them.
> >
> > https://issues.apache.org/jira/browse/SOLR-3390
> >
> > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> >
> > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> >
> > Good luck!
> >
> >
> >
> >
> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
> > 2015/10/20
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>>
> > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>>
> > *Date: *2015-10-13, 17:04:29
> > *Subject: *Highlighting content field problem when using
> > JiebaTokenizerFactory
> >
> > Hi,
> >
> > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> in
> >
> > Solr. It works fine with the segmentation when I'm using
> > the Analysis function on the Solr Admin UI.
> >
> > However, when I tried to do the highlighting in Solr, it is not
> > highlighting in the correct place. For example, when I search of
> 自然環境与企業本身,
> > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> >
> > Even when I search for English character like responsibility, it
> highlight
> > <em> *responsibilit<em>*y.
> >
> > Basically, the highlighting goes off by 1 character/space consistently.
> >
> > This problem only happens in content field, and not in any other fields.
>
> > Does anyone knows what could be causing the issue?
> >
> > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> >
> >
> > Regards,
> > Edwin
> >
> >
> >
> >
> >
>
>
>
>
>




Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Scott,

Thank you for your response and suggestions.

In response to your questions, here are the answers:

1. I took a look at Jieba. It uses a dictionary and it seems to do a good
job on CJK. I suspect this problem may come from those filters (note: I can
understand you may use CJKWidthFilter to convert Japanese, but I don't
understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
commenting out those filters, say leaving only Jieba and StopFilter, and seeing
if this problem disappears?
*A) Yes, I have tried commenting out the other filters, leaving only
Jieba and StopFilter. The problem is still there.*

2. Does this problem occur only on Chinese search words? Does it happen on
English search words?
*A) Yes, the same problem occurs on English words. For example, when I
search for "word", it will highlight in this way: <em> wor<em>d*

3. To use FastVectorHighlighter, you seem to have to enable 3 term*
parameters in the field declaration? I see only one is enabled. Please refer to
the answer in this stackoverflow question:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
*A) I have tried enabling all 3 term* parameters for the FastVectorHighlighter
too, but the same problem persists.*


Regards,
Edwin


On 22 October 2015 at 16:25, Scott Chu <sc...@udngroup.com> wrote:

> Hi solr-user,
>
> Can't judge the cause on fast glimpse of your definition but some
> suggestions I can give:
>
> 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I doubt this problem may be from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese but doesn't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leave only Jieba and StopFilter, and see
> if this problem disppears?
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
>
> 3.To use FastVectorHighlighter, you seem to have to enable 3 term*
> parameters in field declaration? I see only one is enabled. Please refer to
> the answer in this stackoverflow question:
> http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
>
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/22
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-20, 12:04:11
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Here's my schema.xml for content and title, which uses text_chinese. The
> problem only occurs in content, and not in title.
>
> <field name="content" type="text_chinese" indexed="true" stored="true"
> omitNorms="true" termVectors="true"/>
>    <field name="title" type="text_chinese" indexed="true" stored="true"
> omitNorms="true" termVectors="true"/>
>
>
>   <fieldType name="text_chinese" class="solr.TextField"
> positionIncrementGap="100">
>  <analyzer type="index">
> <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>  segMode="SEARCH"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> maxGramSize="15"/>
> <filter class="solr.PorterStemFilterFactory"/>
>  </analyzer>
>  <analyzer type="query">
> <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
>  segMode="SEARCH"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> <filter class="solr.PorterStemFilterFactory"/>
>       </analyzer>
>    </fieldType>
>
>
> Here's my solrconfig.xml on the highlighting portion:
>
>   <requestHandler name="/highlight" class="solr.SearchHandler">
>       <lst name="defaults">
>            <str name="echoParams">explicit</str>
>            <int name="rows">10</int>
>            <str name="wt">json</str>
>            <str name="indent">true</str>
>   <str name="df">text</str>
>   <str name="fl">id, title, content_type, last_modified, url, score </str>
>
>   <str name="hl">on</str>
>            <str name="hl.fl">id, title, content, author, tag</str>
>   <str name="hl.highlightMultiTerm">true</str>
>            <str name="hl.preserveMulti">true</str>
>            <str name="hl.encoder">html</str>
>   <str name="hl.fragsize">200</str>
> <str name="group">true</str>
> <str name="group.field">signature</str>
> <str name="group.main">true</str>
> <str name="group.cache.percent">100</str>
>       </lst>
>   </requestHandler>
>
>     <boundaryScanner name="breakIterator"
> class="solr.highlight.BreakIteratorBoundaryScanner">
>  <lst name="defaults">
> <str name="hl.bs.type">WORD</str>
> <str name="hl.bs.language">en</str>
> <str name="hl.bs.country">SG</str>
>  </lst>
>     </boundaryScanner>
>
>
> Meanwhile, I'll take a look at the articles too.
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 20 October 2015 at 11:32, Scott Chu <scott.chu@udngroup.com
> <+s...@udngroup.com>> wrote:
>
> > Hi Edwin,
> >
> > I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> > know) so I didn't experience this problem.
> >
> > I'd suggest you post your schema.xml so we can see how you define your
> > content field and the field type it uses?
> >
> > In the mean time, refer to these articles, maybe the answer or workaround
> > can be deducted from them.
> >
> > https://issues.apache.org/jira/browse/SOLR-3390
> >
> > http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
> >
> > http://qnalist.com/questions/667066/highlighting-marks-wrong-words
> >
> > Good luck!
> >
> >
> >
> >
> > Scott Chu,scott.chu@udngroup.com <+s...@udngroup.com>
> > 2015/10/20
> >
> > ----- Original Message -----
> > *From: *Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> <+e...@gmail.com>>
> > *To: *solr-user <solr-user@lucene.apache.org
> <+s...@lucene.apache.org>>
> > *Date: *2015-10-13, 17:04:29
> > *Subject: *Highlighting content field problem when using
> > JiebaTokenizerFactory
> >
> > Hi,
> >
> > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> in
> >
> > Solr. It works fine with the segmentation when I'm using
> > the Analysis function on the Solr Admin UI.
> >
> > However, when I tried to do the highlighting in Solr, it is not
> > highlighting in the correct place. For example, when I search of
> 自然環境与企業本身,
> > it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
> >
> > Even when I search for English character like responsibility, it
> highlight
> > <em> *responsibilit<em>*y.
> >
> > Basically, the highlighting goes off by 1 character/space consistently.
> >
> > This problem only happens in content field, and not in any other fields.
>
> > Does anyone knows what could be causing the issue?
> >
> > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> >
> >
> > Regards,
> > Edwin
> >
> >
> >
> >
> >
>
>
>
>
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Scott Chu <sc...@udngroup.com>.
Hi solr-user,

I can't judge the cause from a quick glance at your definition, but here are some suggestions I can give:

1. I took a look at Jieba. It uses a dictionary and it seems to do a good job on CJK. I suspect this problem may come from those filters (note: I can understand you may use CJKWidthFilter to convert Japanese, but I don't understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out those filters, say leaving only Jieba and StopFilter, and seeing if this problem disappears?
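
Something like this stripped-down analyzer chain would do for that test (just a sketch built from the field type you posted, untested):

<analyzer type="index">
  <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
  <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
  <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
</analyzer>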

2. Does this problem occur only on Chinese search words? Does it happen on English search words?

3. To use FastVectorHighlighter, you seem to have to enable 3 term* parameters in the field declaration? I see only one is enabled. Please refer to the answer in this stackoverflow question: http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only


Scott Chu,scott.chu@udngroup.com
2015/10/22 
----- Original Message ----- 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-20, 12:04:11
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Here's my schema.xml for content and title, which uses text_chinese. The
problem only occurs in content, and not in title.

<field name="content" type="text_chinese" indexed="true" stored="true"
omitNorms="true" termVectors="true"/>
   <field name="title" type="text_chinese" indexed="true" stored="true"
omitNorms="true" termVectors="true"/>


  <fieldType name="text_chinese" class="solr.TextField"
positionIncrementGap="100">
 <analyzer type="index">
<tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
 segMode="SEARCH"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="15"/>
<filter class="solr.PorterStemFilterFactory"/>
 </analyzer>
 <analyzer type="query">
<tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
 segMode="SEARCH"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
   </fieldType>


Here's my solrconfig.xml on the highlighting portion:

  <requestHandler name="/highlight" class="solr.SearchHandler">
      <lst name="defaults">
           <str name="echoParams">explicit</str>
           <int name="rows">10</int>
           <str name="wt">json</str>
           <str name="indent">true</str>
  <str name="df">text</str>
  <str name="fl">id, title, content_type, last_modified, url, score </str>

  <str name="hl">on</str>
           <str name="hl.fl">id, title, content, author, tag</str>
  <str name="hl.highlightMultiTerm">true</str>
           <str name="hl.preserveMulti">true</str>
           <str name="hl.encoder">html</str>
  <str name="hl.fragsize">200</str>
<str name="group">true</str>
<str name="group.field">signature</str>
<str name="group.main">true</str>
<str name="group.cache.percent">100</str>
      </lst>
  </requestHandler>

    <boundaryScanner name="breakIterator"
class="solr.highlight.BreakIteratorBoundaryScanner">
 <lst name="defaults">
<str name="hl.bs.type">WORD</str>
<str name="hl.bs.language">en</str>
<str name="hl.bs.country">SG</str>
 </lst>
    </boundaryScanner>


Meanwhile, I'll take a look at the articles too.

Thank you.

Regards,
Edwin


On 20 October 2015 at 11:32, Scott Chu <sc...@udngroup.com> wrote:

> Hi Edwin,
>
> I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> know) so I didn't experience this problem.
>
> I'd suggest you post your schema.xml so we can see how you define your
> content field and the field type it uses?
>
> In the mean time, refer to these articles, maybe the answer or workaround
> can be deducted from them.
>
> https://issues.apache.org/jira/browse/SOLR-3390
>
> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>
> http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>
> Good luck!
>
>
>
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/20
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-13, 17:04:29
> *Subject: *Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
>
> Solr. It works fine with the segmentation when I'm using
> the Analysis function on the Solr Admin UI.
>
> However, when I tried to do the highlighting in Solr, it is not
> highlighting in the correct place. For example, when I search of 自然環境与企業本身,
> it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
>
> Even when I search for English character like responsibility, it highlight
> <em> *responsibilit<em>*y.
>
> Basically, the highlighting goes off by 1 character/space consistently.
>
> This problem only happens in content field, and not in any other fields.

> Does anyone knows what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>
>
>
>
>




Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Scott,

Here's my schema.xml for content and title, which uses text_chinese. The
problem only occurs in content, and not in title.

<field name="content" type="text_chinese" indexed="true" stored="true"
omitNorms="true" termVectors="true"/>
   <field name="title" type="text_chinese" indexed="true" stored="true"
omitNorms="true" termVectors="true"/>


  <fieldType name="text_chinese" class="solr.TextField"
positionIncrementGap="100">
 <analyzer type="index">
<tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
 segMode="SEARCH"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="15"/>
<filter class="solr.PorterStemFilterFactory"/>
 </analyzer>
 <analyzer type="query">
<tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
 segMode="SEARCH"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory"
words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
   </fieldType>


Here's my solrconfig.xml on the highlighting portion:

  <requestHandler name="/highlight" class="solr.SearchHandler">
      <lst name="defaults">
           <str name="echoParams">explicit</str>
           <int name="rows">10</int>
           <str name="wt">json</str>
           <str name="indent">true</str>
  <str name="df">text</str>
  <str name="fl">id, title, content_type, last_modified, url, score </str>

  <str name="hl">on</str>
           <str name="hl.fl">id, title, content, author, tag</str>
  <str name="hl.highlightMultiTerm">true</str>
           <str name="hl.preserveMulti">true</str>
           <str name="hl.encoder">html</str>
  <str name="hl.fragsize">200</str>
<str name="group">true</str>
<str name="group.field">signature</str>
<str name="group.main">true</str>
<str name="group.cache.percent">100</str>
      </lst>
  </requestHandler>

    <boundaryScanner name="breakIterator"
class="solr.highlight.BreakIteratorBoundaryScanner">
 <lst name="defaults">
<str name="hl.bs.type">WORD</str>
<str name="hl.bs.language">en</str>
<str name="hl.bs.country">SG</str>
 </lst>
    </boundaryScanner>


Meanwhile, I'll take a look at the articles too.

Thank you.

Regards,
Edwin


On 20 October 2015 at 11:32, Scott Chu <sc...@udngroup.com> wrote:

> Hi Edwin,
>
> I didn't use Jieba on Chinese (I use only CJK, very foundamental, I
> know) so I didn't experience this problem.
>
> I'd suggest you post your schema.xml so we can see how you define your
> content field and the field type it uses?
>
> In the mean time, refer to these articles, maybe the answer or workaround
> can be deducted from them.
>
> https://issues.apache.org/jira/browse/SOLR-3390
>
> http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words
>
> http://qnalist.com/questions/667066/highlighting-marks-wrong-words
>
> Good luck!
>
>
>
>
> Scott Chu,scott.chu@udngroup.com
> 2015/10/20
>
> ----- Original Message -----
> *From: *Zheng Lin Edwin Yeo <ed...@gmail.com>
> *To: *solr-user <so...@lucene.apache.org>
> *Date: *2015-10-13, 17:04:29
> *Subject: *Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
>
> Solr. It works fine with the segmentation when I'm using
> the Analysis function on the Solr Admin UI.
>
> However, when I tried to do the highlighting in Solr, it is not
> highlighting in the correct place. For example, when I search of 自然環境与企業本身,
> it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的
>
> Even when I search for English character like responsibility, it highlight
>  <em> *responsibilit<em>*y.
>
> Basically, the highlighting goes off by 1 character/space consistently.
>
> This problem only happens in content field, and not in any other fields.
> Does anyone knows what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>
>
>
>
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Scott Chu <sc...@udngroup.com>.
Hi Edwin,

I haven't used Jieba on Chinese (I use only CJK, very fundamental, I know) so I haven't experienced this problem. 

I'd suggest you post your schema.xml so we can see how you define your content field and the field type it uses.

In the meantime, refer to these articles; maybe the answer or a workaround can be deduced from them.

https://issues.apache.org/jira/browse/SOLR-3390

http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words

http://qnalist.com/questions/667066/highlighting-marks-wrong-words

Good luck!




Scott Chu,scott.chu@udngroup.com
2015/10/20 
----- Original Message ----- 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-13, 17:04:29
Subject: Highlighting content field problem when using JiebaTokenizerFactory


Hi,

I'm trying to use the JiebaTokenizerFactory to index Chinese characters in

Solr. It works fine with the segmentation when I'm using
the Analysis function on the Solr Admin UI.

However, when I tried to do the highlighting in Solr, it is not
highlighting in the correct place. For example, when I search of 自然環境与企業本身,
it highlight 認<em>為自然環</em><em>境</em><em>与企</em><em>業本</em>身的

Even when I search for English character like responsibility, it highlight
 <em> *responsibilit<em>*y.

Basically, the highlighting goes off by 1 character/space consistently.

This problem only happens in content field, and not in any other fields.
Does anyone knows what could be causing the issue?

I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.


Regards,
Edwin




Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.
Hi Scott,

Thank you for your reply.

I've tried setting that and also tried changing to the FastVectorHighlighter,
but it isn't working either. I got the same highlighting results as
previously.

Regards,
Edwin


On 19 October 2015 at 23:56, Scott Stults <sstults@opensourceconnections.com
> wrote:

> Edwin,
>
> Try setting hl.bs.language and hl.bs.country in your request or
> requestHandler:
>
>
> https://cwiki.apache.org/confluence/display/solr/FastVector+Highlighter#FastVectorHighlighter-UsingBoundaryScannerswiththeFastVectorHighlighter
>
>
> -Scott
>
> On Tue, Oct 13, 2015 at 5:04 AM, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com
> >
> wrote:
>
> > Hi,
> >
> > I'm trying to use the JiebaTokenizerFactory to index Chinese characters
> in
> > Solr. It works fine with the segmentation when I'm using
> > the Analysis function on the Solr Admin UI.
> >
> > However, when I tried to do the highlighting in Solr, it is not
> > highlighting in the correct place. For example, when I search of
> 自然环境与企业本身,
> > it highlight 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的
> >
> > Even when I search for English character like  responsibility, it
> highlight
> >  <em> *responsibilit<em>*y.
> >
> > Basically, the highlighting goes off by 1 character/space consistently.
> >
> > This problem only happens in content field, and not in any other fields.
> > Does anyone knows what could be causing the issue?
> >
> > I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
> >
> >
> > Regards,
> > Edwin
> >
>
>
>
> --
> Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
> | 434.409.2780
> http://www.opensourceconnections.com
>

Re: Highlighting content field problem when using JiebaTokenizerFactory

Posted by Scott Stults <ss...@opensourceconnections.com>.
Edwin,

Try setting hl.bs.language and hl.bs.country in your request or
requestHandler:

https://cwiki.apache.org/confluence/display/solr/FastVector+Highlighter#FastVectorHighlighter-UsingBoundaryScannerswiththeFastVectorHighlighter
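
In your request handler defaults that could look something like the following (a sketch only; the locale values are examples, and the boundary scanner parameters take effect with the FastVectorHighlighter):

<str name="hl.useFastVectorHighlighter">true</str>
<str name="hl.bs.language">zh</str>
<str name="hl.bs.country">CN</str>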


-Scott

On Tue, Oct 13, 2015 at 5:04 AM, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> Hi,
>
> I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
> Solr. It works fine with the segmentation when I'm using
> the Analysis function on the Solr Admin UI.
>
> However, when I tried to do the highlighting in Solr, it is not
> highlighting in the correct place. For example, when I search of 自然环境与企业本身,
> it highlight 认<em>为自然环</em><em>境</em><em>与企</em><em>业本</em>身的
>
> Even when I search for English character like  responsibility, it highlight
>  <em> *responsibilit<em>*y.
>
> Basically, the highlighting goes off by 1 character/space consistently.
>
> This problem only happens in content field, and not in any other fields.
> Does anyone knows what could be causing the issue?
>
> I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.
>
>
> Regards,
> Edwin
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com