Posted to dev@nutch.apache.org by cao yuzhong <ca...@hotmail.com> on 2005/04/12 08:37:02 UTC

Chinese in Nutch: My solution

Hi, everyone:

I have integrated Nutch with an intelligent Chinese
Lexical Analysis System, so Nutch can now segment
Chinese words effectively.

Following is my solution:

1. Modify NutchAnalysis.jj:

-|  <#CJK:                                        // non-alphabets
-      [
-       "\u3040"-"\u318f",
-       "\u3300"-"\u337f",
-       "\u3400"-"\u3d2d",
-       "\u4e00"-"\u9fff",
-       "\uf900"-"\ufaff"
-      ]
-    >    

+|  <#OTHER_CJK:  //japanese and korean characters    
+      [
+       "\u3040"-"\u318f",
+       "\u3300"-"\u337f",
+       "\u3400"-"\u3d2d",
+       "\uf900"-"\ufaff"
+      ]
+    >    
+|  <#CHINESE:   //chinese characters
+     [
+       "\u4e00"-"\u9fff"
+     ]    
+   >

-| <SIGRAM: <CJK> >

+| <SIGRAM: <OTHER_CJK> >
+| <CNWORD: (<CHINESE>)+ > //chinese words

- ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
+ ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> | token=<CNWORD>)

Chinese characters will be segmented intelligently, while Japanese
and Korean characters keep single-gram segmentation.
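
For reference, the character ranges above translate to plain-Java checks
like the following (a sketch only; the class and helper names are mine, not
Nutch code). This also explains the constants 19968 (0x4e00) and 40959
(0x9fff) used in step 3 below:

  class CjkRanges {
      // Sketch: mirrors the CHINESE / OTHER_CJK ranges of the grammar above.
      static boolean isChinese(char c) {
          return c >= '\u4e00' && c <= '\u9fff';    // CJK Unified Ideographs (19968..40959)
      }
      static boolean isOtherCjk(char c) {
          return (c >= '\u3040' && c <= '\u318f')   // Hiragana, Katakana, Bopomofo, Hangul Jamo
              || (c >= '\u3300' && c <= '\u337f')   // CJK Compatibility
              || (c >= '\u3400' && c <= '\u3d2d')   // part of CJK Extension A
              || (c >= '\uf900' && c <= '\ufaff');  // CJK Compatibility Ideographs
      }
  }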

2. Modify NutchDocumentTokenizer.java:

-case EOF: case WORD: case ACRONYM: case SIGRAM: 
+case EOF: case WORD: case ACRONYM: case SIGRAM: case CNWORD:

3. Modify FastCharStream.java:

+private static final caomo.ICTCLASCaller spliter = new caomo.ICTCLASCaller();
+private static final int IO_BUFFER_SIZE = 2048;

-buffer = new char[2048];
+buffer = new char[IO_BUFFER_SIZE];

-int charsRead = input.read(buffer, newPosition, buffer.length - newPosition);
+int charsRead = readString(newPosition);

+  // do intelligent Chinese word segmentation
+  private int readString(int newPosition) throws java.io.IOException {
+    char[] tempBuffer = new char[IO_BUFFER_SIZE / 2]; // raw characters read from the input
+    char[] hzBuffer = new char[IO_BUFFER_SIZE / 2];   // holds one run of Chinese characters
+
+    int len = input.read(tempBuffer, 0, IO_BUFFER_SIZE / 4);
+
+    int pos = -1;  // characters written to buffer; -1 signals end of stream
+    if (len > 0) {
+      pos = 0;
+
+      int hzPos = 0; // position in hzBuffer
+      char c = ' ';
+      int value = -1;
+      for (int i = 0; i < len; i++) { // iterate over tempBuffer
+          hzPos = 0;
+          c = tempBuffer[i];
+          value = (int) c;
+
+          if ((value < 19968) || (value > 40959)) { // non-Chinese character
+              buffer[pos + newPosition] = c;
+              pos++;
+          }
+          else { // Chinese character, unicode range '\u4e00'-'\u9fff'
+              hzBuffer[hzPos++] = ' '; // separate the run from the preceding text
+              hzBuffer[hzPos++] = c;
+              i++;
+              while (i < len) {
+                  c = tempBuffer[i];
+                  value = (int) c;
+                  // part of the Chinese character sequence: store it in hzBuffer
+                  if ((value >= 19968) && (value <= 40959)) {
+                      hzBuffer[hzPos++] = c;
+                      i++;
+                  }
+                  else
+                      break; // a complete Chinese string has been extracted
+              }
+
+              i--;
+              if (hzPos > 0) {
+                  String str = new String(hzBuffer, 0, hzPos);
+                  String str2 = spliter.segSentence(str); // perform Chinese word segmentation
+
+                  if (str2 != null) {
+                      // expand the buffer until the segmented string fits
+                      while (pos + newPosition + str2.length() > buffer.length) {
+                          char[] newBuffer = new char[buffer.length * 2];
+                          System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+                          buffer = newBuffer;
+                      }
+
+                      for (int j = 0; j < str2.length(); j++) {
+                          buffer[pos + newPosition] = str2.charAt(j);
+                          pos++;
+                      }
+                  } else { // segmentation failed: copy the run unchanged
+                      for (int j = 0; j < str.length(); j++) {
+                          buffer[pos + newPosition] = str.charAt(j);
+                          pos++;
+                      }
+                  }
+              }
+          }
+      }
+    }
+
+    return pos;
+  }


I use ICTCLAS to perform Chinese word segmentation. ICTCLAS does not
simply perform bi-gram segmentation but uses an approach based on a
multi-layer HMM. Its segmentation precision is 97.58%.
ICTCLAS is free for researchers. See:
http://www.nlp.org.cn/project/project.php?proj_id=6
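
For clarity, the wrapper is called like this (a usage sketch; the exact
output format of segSentence is my assumption, based on the space-insertion
behavior described in step 4 below):

  caomo.ICTCLASCaller spliter = new caomo.ICTCLASCaller();
  // Assumed behavior: returns the input with a space between recognized
  // words, e.g. "我是一个学生" -> "我 是 一个 学生", or null on failure.
  String segmented = spliter.segSentence("我是一个学生");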

4. Modify Summarizer.java:

+  // reset startOffset and endOffset of all tokens
+  private void resetTokenOffset(Token[] tokens, String text)
+  {
+      String text3 = text.toLowerCase();
+
+      char[] textArray = text3.toCharArray();
+      int tokenStart = 0;
+      char[] tokenArray = null;
+      int j;
+      Token preToken = new Token(" ", 0, 1);
+      Token curToken = new Token(" ", 0, 1);
+      Token nextToken = null;
+      int startSearch = 0;
+      while (true) {
+          tokenArray = null;
+          for (int i = startSearch; i < textArray.length; i++) {
+
+              if (tokenStart == tokens.length)
+                  break;
+
+              if (tokenArray == null) {
+                  tokenArray = tokens[tokenStart].termText().toCharArray();
+                  preToken = curToken;
+                  curToken = tokens[tokenStart];
+                  nextToken = null;
+              }
+
+              // deals with the following situation (common grams):
+              // text:           about buaa a welcome from buaa president
+              // token sequence: about buaa buaa-a a a-welcome welcome from buaa president
+              if ((preToken.termText().charAt(0) == curToken.termText().charAt(0)) &&
+                  (preToken.termText().length() < curToken.termText().length())) {
+                  if (curToken.termText().startsWith(preToken.termText() + "-")) { // e.g. buaa-a starts with buaa-
+                      if (tokenStart + 1 < tokens.length) {
+                          nextToken = tokens[tokenStart + 1];
+                          if (curToken.termText().endsWith("-" + nextToken.termText())) { // matches buaa buaa-a a
+                              int curTokenLength = curToken.endOffset() -
+                                      curToken.startOffset();
+                              curToken.setStartOffset(preToken.startOffset());
+                              curToken.setEndOffset(preToken.startOffset() +
+                                      curTokenLength);
+                              tokenStart++;
+                              tokenArray = null;
+                              i = preToken.startOffset();
+                              startSearch = i; // start position in textArray for the next turn, if needed
+                              continue;
+                          }
+                      }
+                  }
+              }
+              //------------------------
+
+              j = 0;
+              if (textArray[i] == tokenArray[j]) {
+
+                  if (i + tokenArray.length - 1 >= textArray.length) {
+                      // token would run past the end of the text: no match possible
+                  } else {
+
+                      int k = i + 1;
+                      for (j = 1; j < tokenArray.length; j++) {
+                          if (textArray[k++] != tokenArray[j])
+                              break; // no match
+                      }
+                      if (j >= tokenArray.length) { // match found
+                          curToken.setStartOffset(i);
+                          curToken.setEndOffset(i + tokenArray.length);
+
+                          i = i + tokenArray.length - 1;
+                          tokenStart++;
+                          startSearch = i; // start position in textArray for the next turn, if needed
+                          tokenArray = null;
+                      }
+                  }
+              }
+          }
+          if (tokenStart == tokens.length)
+                  break; // all tokens have been reset
+
+          if (tokenStart < tokens.length) { // next turn
+              curToken.setStartOffset(preToken.startOffset());
+              curToken.setEndOffset(preToken.endOffset());
+
+              tokenStart++; // skip this token
+          }
+
+      } // end of while(true)
+  }

Under the line Token[] tokens = getTokens(text) in
getSummary(String text, Query query), add:

+resetTokenOffset(tokens, text);

I perform Chinese word segmentation after the tokenizer and insert a space
between every two Chinese words, so I need to reset all tokens' startOffset
and endOffset in Summarizer.java.
To do this, I added the method resetTokenOffset(Token[] tokens, String text)
to Summarizer.java, and I had to add two methods, setStartOffset(int start)
and setEndOffset(int end), to Lucene's Token.java.
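
For completeness, the two added methods are plain setters (a sketch,
assuming the package-private startOffset and endOffset fields of Lucene's
Token class):

  // Added to org.apache.lucene.analysis.Token (sketch):
  public void setStartOffset(int start) { this.startOffset = start; }
  public void setEndOffset(int end)     { this.endOffset = end; }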



By the above four steps, Nutch can search Chinese web sites
nearly perfectly. You can try it. I just made Nutch do it,
but my solution is less than perfect.

If Chinese word segmentation could be done in NutchAnalysis.jj,
before the tokenizer, then we would not need to reset the tokens'
offsets in Summarizer.java and everything would be perfect.
But it seems too difficult to perform intelligent Chinese word
segmentation in NutchAnalysis.jj. Maybe even impossible?


Any suggestions?



Best regards

Cao Yuzhong
2005-04-12



Re: Chinese in Nutch: My solution

Posted by Jack Tang <hi...@gmail.com>.
Hi Cao

Great job!

On Apr 12, 2005 2:37 PM, cao yuzhong <ca...@hotmail.com> wrote:
> Hi, everyone:
> 
> I have integrated Nutch with an intelligent Chinese
> Lexical Analysis System, so Nutch can now segment
> Chinese words effectively.
> 
> Following is my solution:
> 
> 1. Modify NutchAnalysis.jj:
> 
> -|  <#CJK:                                        // non-alphabets
> -      [
> -       "\u3040"-"\u318f",
> -       "\u3300"-"\u337f",
> -       "\u3400"-"\u3d2d",
> -       "\u4e00"-"\u9fff",
> -       "\uf900"-"\ufaff"
> -      ]
> -    >
> 
> +|  <#OTHER_CJK:  //japanese and korean characters
> +      [
> +       "\u3040"-"\u318f",
> +       "\u3300"-"\u337f",
> +       "\u3400"-"\u3d2d",
> +       "\uf900"-"\ufaff"
> +      ]
> +    >
> +|  <#CHINESE:   //chinese characters
> +     [
> +       "\u4e00"-"\u9fff"
> +     ]
> +   >
> 
> -| <SIGRAM: <CJK> >
> 
> +| <SIGRAM: <OTHER_CJK> >
> +| <CNWORD: (<CHINESE>)+ > //chinese words
> 
> - ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
> + ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> | token=<CNWORD>)
> 
> Chinese characters will be segmented intelligently, while Japanese
> and Korean characters keep single-gram segmentation.

If there are some Japanese/Korean developers here, we will be very glad :)


> 2. Modify NutchDocumentTokenizer.java:
> 
> -case EOF: case WORD: case ACRONYM: case SIGRAM:
> +case EOF: case WORD: case ACRONYM: case SIGRAM: case CNWORD:
> 
> 3. Modify FastCharStream.java:
> I use ICTCLAS to perform Chinese word segmentation. ICTCLAS does not
> simply perform bi-gram segmentation but uses an approach based on a
> multi-layer HMM. Its segmentation precision is 97.58%.
> ICTCLAS is free for researchers. See:
> http://www.nlp.org.cn/project/project.php?proj_id=6

Cool, and I should learn more....

> 4. Modify Summarizer.java:

> If Chinese word segmentation could be done in NutchAnalysis.jj,
> before the tokenizer, then we would not need to reset the tokens'
> offsets in Summarizer.java and everything would be perfect.

True. You will find the truth in the NutchAnalysisTokenManager.jjFillToken() method.
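
Roughly, the generated jjFillToken() looks like this (quoted from memory of
JavaCC-generated sources; the exact body varies with the JavaCC version),
which shows why one token match produces exactly one Token there:

  protected Token jjFillToken()
  {
     Token t = Token.newToken(jjmatchedKind);
     t.kind = jjmatchedKind;
     String im = jjstrLiteralImages[jjmatchedKind];
     t.image = (im == null) ? input_stream.GetImage() : im;
     t.beginLine = input_stream.getBeginLine();
     t.beginColumn = input_stream.getBeginColumn();
     t.endLine = input_stream.getEndLine();
     t.endColumn = input_stream.getEndColumn();
     return t;
  }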


> But it seems too difficult to perform intelligent Chinese word
> segmentation in NutchAnalysis.jj. Maybe even impossible?

In fact, the Chinese segmentation issue is equivalent to this question:
given one English sentence S = "Nutchisasearchengine", how can we
get/guess the result R = "Nutch is a search engine" to the best of our
abilities?
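
To make the analogy concrete, here is a toy dictionary-based word-break in
Java (the class name and dictionary are illustrative assumptions only; it
returns one valid split, whereas real segmenters such as ICTCLAS rank the
candidate splits with an HMM):

  import java.util.*;

  class WordBreak {
      // Recovers a split of s into dictionary words, e.g.
      // "nutchisasearchengine" -> [nutch, is, a, search, engine].
      static List<String> segment(String s, Set<String> dict) {
          int n = s.length();
          int[] prev = new int[n + 1];   // prev[i] = start of a word ending at i
          Arrays.fill(prev, -1);
          prev[0] = 0;                   // the empty prefix is segmentable
          for (int i = 1; i <= n; i++)
              for (int j = i - 1; j >= 0; j--)
                  if (prev[j] >= 0 && dict.contains(s.substring(j, i))) {
                      prev[i] = j;
                      break;
                  }
          if (prev[n] < 0) return null;  // no segmentation exists
          LinkedList<String> words = new LinkedList<String>();
          for (int i = n; i > 0; i = prev[i])
              words.addFirst(s.substring(prev[i], i));
          return words;
      }

      public static void main(String[] args) {
          Set<String> dict = new HashSet<String>(Arrays.asList(
                  "nutch", "is", "a", "search", "engine"));
          System.out.println(segment("nutchisasearchengine", dict));
          // prints: [nutch, is, a, search, engine]
      }
  }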

> Any suggestions?
> 
> Best regards
> 
> Cao Yuzhong
> 2005-04-12
> 
> 

Regards
/Jack