Posted to dev@nutch.apache.org by "Jack Tang (JIRA)" <ji...@apache.org> on 2005/04/05 04:24:16 UTC

[jira] Created: (NUTCH-36) Chinese in Nutch

Chinese in Nutch
----------------

         Key: NUTCH-36
         URL: http://issues.apache.org/jira/browse/NUTCH-36
     Project: Nutch
        Type: Improvement
  Components: indexer, searcher  
 Environment: all
    Reporter: Jack Tang
    Priority: Minor


Nutch currently supports Chinese in a very simple way: NutchAnalysis segments CJK terms word by word (one CJK character per token).
So if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and 'Bar'), the result in the web GUI highlights 'FooBar' as well as 'Foo' and 'Bar' on their own. We expect Nutch to highlight only 'FooBar'.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by Jack Tang <hi...@gmail.com>.

Hi Kerang

I think it would be good to write our own CJK bi-gram segmentation.
The third-party CJKTokenizer duplicates a lot of work that
NutchAnalysis already does.
With "+| <SIGRAM: (<CJK>)+ >", the new CJKTokenizer would only need to handle CJK words.

Another idea for CJK segmentation is to make CJKTokenizer an
interface that can be configured in
nutch-default.xml/nutch-site.xml. I think this design would make it
easier to improve CJK segmentation in the future.

Comments?

Regards
/Jack

On 9/27/05, Kerang Lv (JIRA) <ji...@apache.org> wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12330588 ]
>
> Kerang Lv commented on NUTCH-36:
> --------------------------------
>
> Code like the following can be used to plug third-party CJK word
> segmentation into NutchAnalysis.jj. This example uses CJKTokenizer, which performs bi-gram segmentation.
> ================================================================================
> @@ -33,6 +33,7 @@
>  import org.apache.nutch.searcher.Query.Clause;
>
>  import org.apache.lucene.analysis.StopFilter;
> +import org.apache.lucene.analysis.cjk.CJKTokenizer;
>
>  import java.io.*;
>  import java.util.*;
> @@ -81,6 +82,14 @@
>  PARSER_END(NutchAnalysis)
>
>  TOKEN_MGR_DECLS : {
> +  /** use CJKTokenizer to process cjk character */
> +  private CJKTokenizer cjkTokenizer = null;
> +
> +  /** a global cjk token */
> +  private org.apache.lucene.analysis.Token cjkToken = null;
> +
> +  /** start offset of cjk sequence */
> +  private int cjkStartOffset = 0;
>
>    /** Constructs a token manager for the provided Reader. */
>    public NutchAnalysisTokenManager(Reader reader) {
> @@ -106,7 +115,46 @@
>      }
>
>    // chinese, japanese and korean characters
> -| <SIGRAM: <CJK> >
> +| <SIGRAM: (<CJK>)+ >
> +  {
> +    /**
> +     * use an instance of CJKTokenizer, cjkTokenizer, hold the maximum
> +     * matched cjk chars, and cjkToken for the current token;
> +     * reset matchedToken.image use cjkToken.termText();
> +     * reset matchedToken.beginColumn use cjkToken.startOffset();
> +     * reset matchedToken.endColumn use cjkToken.endOffset();
> +     * backup the last char when the next cjkToken is valid.
> +     */
> +    if(cjkTokenizer == null) {
> +      cjkTokenizer = new CJKTokenizer(new StringReader(image.toString()));
> +      cjkStartOffset = matchedToken.beginColumn;
> +      try {
> +        cjkToken = cjkTokenizer.next();
> +      } catch(IOException ioe) {
> +        cjkToken = null;
> +      }
> +    }
> +
> +    if(cjkToken != null && !cjkToken.termText().equals("")) {
> +      //sometime the cjkTokenizer returns an empty string, is it a bug?
> +      matchedToken.image = cjkToken.termText();
> +      matchedToken.beginColumn = cjkStartOffset + cjkToken.startOffset();
> +      matchedToken.endColumn = cjkStartOffset + cjkToken.endOffset();
> +      try {
> +        cjkToken = cjkTokenizer.next();
> +      } catch(IOException ioe) {
> +        cjkToken = null;
> +      }
> +      if(cjkToken != null && !cjkToken.termText().equals("")) {
> +        input_stream.backup(1);
> +      }
> +    }
> +
> +    if(cjkToken == null || cjkToken.termText().equals("")) {
> +      cjkTokenizer = null;
> +      cjkStartOffset = 0;
> +    }
> +  }
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by Jack Tang <hi...@gmail.com>.

Cutting,

I agree with you!
All segmentation of the character stream should be done in NutchAnalysis.jj.

Also, there is something wrong in my solution. I am very sorry
about my "impulsive" patch. I found the problem some days ago, and I am
working on it.
In my project I just replace my CJKAnalyzer with ContentAnalyzer in
NutchDocumentAnalyzer.

Here is the cause as far as I understand it:
Take the CJK character sequence "C1C2C3C4" ("C1" here means one CJK
character). After bi-gram segmentation, the result should be
"C1C2"(0,2), "C2C3"(1,3), "C3C4"(2,4). [NOTE: the first number in
brackets is the token's start offset and the second is its end offset.]
In other words, the offsets of the bi-gram terms should overlap (be
merged back onto the original text) when the new Tokens are returned.
The known problem in my solution is that the token positions are
completely wrong, like "C1C2"(0,2), "C2C3"(3,5), "C3C4"(6,8). So it
crashes when the search summary is displayed.
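
The offset problem described here can be checked mechanically. The following is only a minimal sketch in plain Java, not the Lucene Token class: Tok, bigrams, and offsetsMatch are hypothetical names introduced for illustration and are not part of Nutch or Lucene. It builds overlapping bi-grams with [start, end) offsets and verifies that each offset pair points back at its own term in the source text.

```java
import java.util.ArrayList;
import java.util.List;

final class BigramOffsetCheck {
    /** A term plus its [start, end) character offsets in the source text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Overlapping bi-grams: the i-th token covers [i, i + 2). */
    static List<Tok> bigrams(String text) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(new Tok(text.substring(i, i + 2), i, i + 2));
        }
        return out;
    }

    /** True only if every token's offsets select exactly its term from text. */
    static boolean offsetsMatch(String text, List<Tok> toks) {
        for (Tok t : toks) {
            if (t.start < 0 || t.end > text.length() || t.start > t.end) {
                return false; // substring() would throw on these offsets
            }
            if (!text.substring(t.start, t.end).equals(t.term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Four CJK characters standing in for C1C2C3C4.
        String text = "\u4e2d\u6587\u641c\u7d22";
        System.out.println(offsetsMatch(text, bigrams(text)));
    }
}
```

With the wrong offsets from the example, "C2C3"(3,5) already reaches past the end of a four-character string, which would be consistent with a failure when the summary code calls substring on the original text.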

/Jack

On Apr 12, 2005 6:20 AM, Doug Cutting (JIRA) <ji...@apache.org> wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62604 ]
> 
> Doug Cutting commented on NUTCH-36:
> -----------------------------------
> 
> I like what this patch does, but not how it does it.  Nutch should perform bi-gram segmentation of CJK character sequences.  This patch performs such segmentation in two places: in the character stream that is the input to the tokenizer, and in a filter that processes the output of the tokenizer.  I'm unclear why the latter is required.  The former should suffice, no?
> 
> But instead of segmenting in the character stream it should be done in the tokenizer itself.  I think this could be done with something like the following in NutchAnalysis.jj.
> 
> | <SIGRAM: <CJK> >
> 
> { if (prevCJK) {
>    matchedToken.image = prevCJK + matchedToken.image;
>  } else {
>    matchedToken.image = "_" + matchedToken.image;
>  }
> }
> 
> A little more would be required to maintain prevCJK.
> 
> Thoughts?
> 

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by Jack Tang <hi...@gmail.com>.

Hi Kerang

I have tested the query, and there is no problem with summary highlighting. It is really
amazing. This is the solution for Chinese bi-gram segmentation.

Regards
/Jack

On 9/22/05, Jack Tang <hi...@gmail.com> wrote:
> Hi Kerang
>
> Pretty nice hack!
> I will test highlighting in the query summary now...
> See you.
>
> /Jack
>
> On 9/22/05, Kerang Lv (JIRA) <ji...@apache.org> wrote:
> >     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12330192 ]
> >
> > Kerang Lv commented on NUTCH-36:
> > --------------------------------
> >
> > Enlightened by your last comment: the bi-gram segmentation could be done with the following in NutchAnalysis.jj:
> > | <SIGRAM: <CJK><CJK> >
> >   {
> >     input_stream.backup(1);
> >   }
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by Jack Tang <hi...@gmail.com>.

Hi Kerang

Pretty nice hack!
I will test highlighting in the query summary now...
See you.

/Jack

On 9/22/05, Kerang Lv (JIRA) <ji...@apache.org> wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12330192 ]
>
> Kerang Lv commented on NUTCH-36:
> --------------------------------
>
> Enlightened by your last comment: the bi-gram segmentation could be done with the following in NutchAnalysis.jj:
> | <SIGRAM: <CJK><CJK> >
>   {
>     input_stream.backup(1);
>   }
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

[jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62604 ]
     
Doug Cutting commented on NUTCH-36:
-----------------------------------

I like what this patch does, but not how it does it.  Nutch should perform bi-gram segmentation of CJK character sequences.  This patch performs such segmentation in two places: in the character stream that is the input to the tokenizer, and in a filter that processes the output of the tokenizer.  I'm unclear why the latter is required.  The former should suffice, no?

But instead of segmenting in the character stream it should be done in the tokenizer itself.  I think this could be done with something like the following in NutchAnalysis.jj.

| <SIGRAM: <CJK> >

{ if (prevCJK) {
    matchedToken.image = prevCJK + matchedToken.image;
  } else {
    matchedToken.image = "_" + matchedToken.image;
  }
}

A little more would be required to maintain prevCJK.
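
One way the prevCJK bookkeeping might look is sketched below in plain Java, outside the JavaCC grammar. This is an illustration under assumptions, not the Nutch implementation: prevCJK is taken to be a String that is null when the previous character was not CJK, isCjk is a hypothetical stand-in for the grammar's <CJK> character class, and non-CJK characters are simply skipped instead of being tokenized as words.

```java
import java.util.ArrayList;
import java.util.List;

final class PrevCjkSketch {
    /** Rough, hypothetical approximation of the grammar's CJK character class. */
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    /** Emits one token per CJK character, prefixed by the previous CJK character or "_". */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        String prevCJK = null; // null means "no CJK character immediately before"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                String image = String.valueOf(c);
                out.add((prevCJK != null ? prevCJK : "_") + image);
                prevCJK = image; // remember this character for the next token
            } else {
                prevCJK = null;  // any non-CJK character ends the current run
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : tokens("\u4e2d\u6587 a \u641c")) {
            System.out.println(t);
        }
    }
}
```

The only state to maintain is the reset at the end of each CJK run. In the real token manager that reset would have to happen wherever a non-CJK token is matched, which is presumably the "little more" the comment refers to.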

Thoughts?


[jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by "Kerang Lv (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12330588 ] 

Kerang Lv commented on NUTCH-36:
--------------------------------

Code like the following can be used to plug third-party CJK word
segmentation into NutchAnalysis.jj. This example uses CJKTokenizer, which performs bi-gram segmentation.
================================================================================
@@ -33,6 +33,7 @@
 import org.apache.nutch.searcher.Query.Clause;
 
 import org.apache.lucene.analysis.StopFilter;
+import org.apache.lucene.analysis.cjk.CJKTokenizer;
 
 import java.io.*;
 import java.util.*;
@@ -81,6 +82,14 @@
 PARSER_END(NutchAnalysis)
 
 TOKEN_MGR_DECLS : {
+  /** use CJKTokenizer to process cjk character */
+  private CJKTokenizer cjkTokenizer = null;
+
+  /** a global cjk token */
+  private org.apache.lucene.analysis.Token cjkToken = null;
+
+  /** start offset of cjk sequence */
+  private int cjkStartOffset = 0;
 
   /** Constructs a token manager for the provided Reader. */
   public NutchAnalysisTokenManager(Reader reader) {
@@ -106,7 +115,46 @@
     }
 
   // chinese, japanese and korean characters
-| <SIGRAM: <CJK> >
+| <SIGRAM: (<CJK>)+ >
+  {
+    /**
+     * use an instance of CJKTokenizer, cjkTokenizer, hold the maximum
+     * matched cjk chars, and cjkToken for the current token;
+     * reset matchedToken.image use cjkToken.termText();
+     * reset matchedToken.beginColumn use cjkToken.startOffset();
+     * reset matchedToken.endColumn use cjkToken.endOffset();
+     * backup the last char when the next cjkToken is valid. 
+     */
+    if(cjkTokenizer == null) {
+      cjkTokenizer = new CJKTokenizer(new StringReader(image.toString()));
+      cjkStartOffset = matchedToken.beginColumn;
+      try {
+        cjkToken = cjkTokenizer.next();
+      } catch(IOException ioe) {
+        cjkToken = null;
+      }
+    }
+
+    if(cjkToken != null && !cjkToken.termText().equals("")) {
+      //sometime the cjkTokenizer returns an empty string, is it a bug?
+      matchedToken.image = cjkToken.termText();
+      matchedToken.beginColumn = cjkStartOffset + cjkToken.startOffset();
+      matchedToken.endColumn = cjkStartOffset + cjkToken.endOffset();
+      try {
+        cjkToken = cjkTokenizer.next();
+      } catch(IOException ioe) {
+        cjkToken = null;
+      }
+      if(cjkToken != null && !cjkToken.termText().equals("")) {
+        input_stream.backup(1);
+      }
+    }
+
+    if(cjkToken == null || cjkToken.termText().equals("")) {
+      cjkTokenizer = null;
+      cjkStartOffset = 0;
+    }
+  }



[jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by "Kerang Lv (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12330192 ] 

Kerang Lv commented on NUTCH-36:
--------------------------------

Enlightened by your last comment: the bi-gram segmentation could be done with the following in NutchAnalysis.jj:
| <SIGRAM: <CJK><CJK> >
  {
    input_stream.backup(1);
  }
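
The effect of matching two characters and then backing up one can be simulated without JavaCC. The sketch below is a plain Java illustration under assumptions: BackupOneSketch and scan are hypothetical names, the input is assumed to be a single run of CJK characters, and an integer cursor stands in for the character stream.

```java
import java.util.ArrayList;
import java.util.List;

final class BackupOneSketch {
    /** Returns each matched pair as "term(start,end)" for one run of characters. */
    static List<String> scan(String run) {
        List<String> out = new ArrayList<>();
        int pos = 0;                         // cursor of the simulated char stream
        while (run.length() - pos >= 2) {    // the two-character rule can still match
            int begin = pos;
            pos += 2;                        // the rule consumes two characters
            out.add(run.substring(begin, pos) + "(" + begin + "," + pos + ")");
            pos -= 1;                        // equivalent of input_stream.backup(1)
        }
        return out;
    }

    public static void main(String[] args) {
        // ASCII letters stand in for C1..C4 to keep the output readable.
        System.out.println(scan("ABCD"));
    }
}
```

Because the cursor advances by one net character per match, consecutive tokens overlap by one character and their offsets stay aligned with the original text. A single trailing character (or a run of length one) is not matched by the two-character rule; the thread does not say how the grammar handles that case, and the sketch simply stops.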



[jira] Updated: (NUTCH-36) Chinese in Nutch

Posted by "Jack Tang (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-36?page=history ]

Jack Tang updated NUTCH-36:
---------------------------

    Attachment: &#26700

Attachment includes 
   1. patch of NutchAnalysis.jj
   2. patch of FastCharStream.java
   3. CJKTokenizer.java
   4. patch of NutchDocumentTokenizer.java


[jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12357135 ] 

Andrzej Bialecki  commented on NUTCH-36:
----------------------------------------

Jack,

Have you tested the latest patches attached to this issue + your fix for the summarizer? I can confirm that, technically speaking, they appear to do what was described, but knowing no Chinese I cannot judge whether they produce any useful output...


[jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by "Jack Tang (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12331394 ] 

Jack Tang commented on NUTCH-36:
--------------------------------

Kerang Lv's solution works well in NutchAnalysis, but there are still some bugs in Summarizer. Take a Chinese string (c1)(c2)(c3)(c4); the result of bi-gram segmentation is:

matched-image   start-offset   end-offset
(c1)(c2)        0              2
(c2)(c3)        1              3
(c3)(c4)        2              4

In search summaries, we should merge the tokens if their offsets overlap. You can do it like this:

Change the code

          if (highlight.contains(t.termText())) {
            excerpt.addToken(t.termText());
            excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
            excerpt.add(new Highlight(text.substring(t.startOffset(),t.endOffset())));
            offset = t.endOffset();
            endToken = Math.min(j+SUM_CONTEXT, tokens.length);
          }

to

          if (highlight.contains(t.termText())) {
              if(offset * 2 ==  (t.startOffset() + t.endOffset() ))  { // cjk bi-gram
                  excerpt.addToken(t.termText().substring(offset - t.startOffset()));
                  excerpt.add(new Fragment(text.substring(t.startOffset() + 1,offset)));
                  excerpt.add(new Highlight(text.substring(t.startOffset() + 1 ,t.endOffset())));
              }
              else   {
                   excerpt.addToken(t.termText());
                   excerpt.add(new Fragment(text.substring(offset, t.startOffset())));
                   excerpt.add(new Highlight(text.substring(t.startOffset() ,t.endOffset())));
              }
              offset = t.endOffset();
              endToken = Math.min(j+SUM_CONTEXT, tokens.length);
          }
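
The merging idea can also be stated on its own, independent of the Summarizer classes. The following is a generic interval-merge sketch in plain Java, not the change proposed above: HighlightMerge and merge are hypothetical names, spans are [start, end) pairs assumed to be sorted by start offset, and only strictly overlapping spans are joined.

```java
import java.util.ArrayList;
import java.util.List;

final class HighlightMerge {
    /** Merges overlapping [start, end) spans; input must be sorted by start. */
    static List<int[]> merge(List<int[]> spans) {
        List<int[]> out = new ArrayList<>();
        for (int[] s : spans) {
            int[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && s[0] < last[1]) {
                // Span starts inside the previous one: extend it.
                last[1] = Math.max(last[1], s[1]);
            } else {
                // No overlap: start a new highlight region (copy to avoid aliasing).
                out.add(new int[] { s[0], s[1] });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> spans = new ArrayList<>();
        spans.add(new int[] { 0, 2 });
        spans.add(new int[] { 1, 3 });
        spans.add(new int[] { 2, 4 });
        for (int[] m : merge(spans)) {
            System.out.println(m[0] + "," + m[1]);
        }
    }
}
```

For the three bi-grams in the table, the spans (0,2), (1,3), (2,4) collapse into a single region covering (0,4), so the four characters would be highlighted once instead of as three overlapping pieces.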



[jira] Commented: (NUTCH-36) Chinese in Nutch

Posted by "Jack Tang (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153 ]
     
Jack Tang commented on NUTCH-36:
--------------------------------

Follow the steps below to make Nutch support Chinese well.

1. Modify NutchAnalysis.jj. 
===========================================
@@ -106,7 +106,7 @@
     }
 
   // chinese, japanese and korean characters
-| <SIGRAM: <CJK> >
+| <SIGRAM: (<CJK>)+ >
===========================================

Why change "<SIGRAM:<CJK>>" to "<SIGRAM: (<CJK>)+>"? Because segmenting Chinese terms (I don't know Japanese and Korean well) is totally different from segmenting English. In other words, word-by-word segmentation is inefficient for indexing and searching Chinese text.


2. Modify FastCharStream.java
===========================================
@@ -18,6 +18,8 @@
 
 import java.io.*;
 
+import org.apache.lucene.analysis.Token;
+
 /** An efficient implementation of JavaCC's CharStream interface.  <p>Note that
  * this does not do line-number counting, but instead keeps track of the
  * character position of the token in the input, as required by Lucene's {@link
@@ -69,10 +71,15 @@
     if (charsRead == -1)
       throw new IOException("read past eof");
     else
-      bufferLength += charsRead;
+    {
+    	 charsRead = new CJKCharStream().readChineseChars(newPosition, charsRead);
+    	 bufferLength += charsRead;
+    }
   }
 
-  public final char BeginToken() throws IOException {
+  
+
+public final char BeginToken() throws IOException {
     tokenStart = bufferPosition;
     return readChar();
   }
@@ -117,4 +124,45 @@
   public final int getBeginLine() {
     return 1;
   }
+  
+  
+  final class CJKCharStream
+  {
+  	  	
+  	/**
+  	 * @param newPosition
+  	 * @param charsRead
+  	 * @return
+  	 * @throws IOException
+  	 */
+  	int readChineseChars(int newPosition, int charsRead) 
+  	throws IOException 
+	{
+  		String str = new String(buffer,newPosition,charsRead);
+  		CJKTokenizer tokenizer = new CJKTokenizer(new StringReader(str));
+  		Token token = tokenizer.next();
+  		StringBuffer sb = new StringBuffer();
+  		while(token != null)
+  		{
+  		 	sb.append(token.termText()).append(" ");
+  		 	token = tokenizer.next();
+  		 }
+  		 
+  		 
+  		   		 
+  		 while(sb.length()>buffer.length-newPosition)
+  		 { 
+  		          char[] newBuffer = new char[buffer.length*2];
+  		          System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+  		          buffer = newBuffer;
+  		 }
+  		 
+  		 for(int i=0;i<sb.length();i++){
+  		            buffer[newPosition+i]=sb.charAt(i);
+  		 }
+  		 
+  		 return sb.length();
+  	}
+  }
+  
 }

To support "<SIGRAM: (<CJK>)+>" in NutchAnalysis.jj, we do the Chinese term segmentation in FastCharStream, which runs before NutchAnalysis's parse method. The main component is CJKTokenizer, which splits Chinese terms into bi-grams.

3. Add CJKTokenizer.java

4. Modify NutchDocumentTokenizer.java
===========================================
@@ -46,8 +46,11 @@
         while (true) {
           t = tokenManager.getNextToken();
           switch (t.kind) {                       // skip query syntax tokens
-          case EOF: case WORD: case ACRONYM: case SIGRAM:
+          case EOF: case WORD: case ACRONYM:
             break loop;
+          case SIGRAM:
+          	CJKTokenizer cjkT = new CJKTokenizer(input);
+          	return cjkT.next();
           default:
           }
         }
===========================================
NutchDocumentTokenizer.tokenStream() is called by NutchDocumentAnalyzer, so in this way the modified NutchDocumentTokenizer class lets NutchDocumentAnalyzer support Chinese.
