
Posted to dev@nutch.apache.org by kauu <ba...@gmail.com> on 2006/03/27 01:48:35 UTC

hi,how to use the ICTCLASCall

Hi all,
  I ran into a problem while integrating Nutch 0.7.1 with an intelligent Chinese
Lexical Analysis System.
  I followed this page:
http://www.nutchhacks.com/ftopic391.php&highlight=chinese
which was written by caoyuzhong.
  When I build my modified Java files with ant, javac reports that it cannot
find the symbol caomo.ICTCLASCaller
in this line:
private final static caomo.ICTCLASCaller spliter = new caomo.ICTCLASCaller();

So my question is: how do I deal with this?
Any reply will be appreciated!
--
www.babatu.com

Re: hi,how to use the ICTCLASCall

Posted by kauu <ba...@gmail.com>.

Hi all,
  I have a big problem integrating ICTCLAS with Nutch 0.7.1.
I followed the page "http://www.nutchhacks.com/ftopic391.php&highlight=chinese",
but when I build Nutch with ant I get a lot of errors (see the build output
after the instructions quoted below).

I have modified the files in the org.apache.nutch.analysis directory. My
question is: should I also modify Lucene, and if so, how?

Any reply will be appreciated.


I have integrated Nutch with an intelligent Chinese
Lexical Analysis System, so Nutch can now segment
Chinese words effectively.

My solution follows:

1.modify NutchAnalysis.jj:

-| <#CJK: // non-alphabets
- [
- "\u3040"-"\u318f",
- "\u3300"-"\u337f",
- "\u3400"-"\u3d2d",
- "\u4e00"-"\u9fff",
- "\uf900"-"\ufaff"
- ]
- >

+| <#OTHER_CJK: //japanese and korean characters
+ [
+ "\u3040"-"\u318f",
+ "\u3300"-"\u337f",
+ "\u3400"-"\u3d2d",
+ "\uf900"-"\ufaff"
+ ]
+ >
+| <#CHINESE: //chinese characters
+ [
+ "\u4e00"-"\u9fff"
+ ]
+ >

-| <SIGRAM: <CJK> >

+| <SIGRAM: <OTHER_CJK> >
+| <CNWORD: (<CHINESE>)+ > //chinese words

- ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
+ ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> | token=<CNWORD>)

With this change, Chinese characters are segmented intelligently, while
Japanese and Korean characters keep single-gram segmentation.
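
The split between the CHINESE and OTHER_CJK character classes can be checked outside JavaCC with a small sketch like the one below. The class and method names (CjkRangeDemo, classify) are invented for illustration and are not part of Nutch; only the Unicode ranges are taken from the grammar change above.

```java
// Hypothetical helper, not part of Nutch: mirrors the character ranges of the
// CHINESE and OTHER_CJK token definitions from the modified grammar.
class CjkRangeDemo {

    /** Returns "CHINESE", "OTHER_CJK" or "OTHER" for a single UTF-16 char. */
    static String classify(char c) {
        // CHINESE: U+4E00 to U+9FFF
        if (c >= 0x4E00 && c <= 0x9FFF) {
            return "CHINESE";
        }
        // OTHER_CJK: the four remaining ranges of the old CJK definition
        if ((c >= 0x3040 && c <= 0x318F)
                || (c >= 0x3300 && c <= 0x337F)
                || (c >= 0x3400 && c <= 0x3D2D)
                || (c >= 0xF900 && c <= 0xFAFF)) {
            return "OTHER_CJK";
        }
        return "OTHER";
    }

    public static void main(String[] args) {
        // One Latin letter, one Chinese character, one hiragana character
        char[] samples = { 'a', (char) 0x4E2D, (char) 0x304B };
        for (char c : samples) {
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()
                    + " -> " + classify(c));
        }
    }
}
```

Characters classified as CHINESE would end up in CNWORD tokens, and those classified as OTHER_CJK would still become one SIGRAM token each.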

2.modify NutchDocumentTokenizer.java

-case EOF: case WORD: case ACRONYM: case SIGRAM:
+case EOF: case WORD: case ACRONYM: case SIGRAM: case CNWORD:

3.modify FastCharStream.java

+private final static caomo.ICTCLASCaller spliter = new caomo.ICTCLASCaller();
+private final int IO_BUFFER_SIZE=2048;

-buffer = new char[2048];
+buffer = new char[IO_BUFFER_SIZE];

-int charsRead = input.read(buffer, newPosition, buffer.length-newPosition);
+int charsRead=readString(newPosition);

+ // do intelligent Chinese word segmentation
+private int readString(int newPosition) throws java.io.IOException {
+ char[] tempBuffer = new char[IO_BUFFER_SIZE / 2]; //read from io
+ char[] hzBuffer = new char[IO_BUFFER_SIZE / 2]; //store Chinese characters string
+ int len=0;
+
+ len = input.read(tempBuffer, 0, IO_BUFFER_SIZE / 4);
+
+
+ int pos=-1; //position in buffer
+ if (len > 0) {
+ pos=0;
+
+ int hzPos=0; //position in hzBuffer
+ char c=' ';
+ int value=-1;
+ for(int i=0;i<len;i++){ //iterate tempBuffer
+ hzPos=0;
+ c=tempBuffer[i];
+ value=(int)c;
+
+ if( (value<19968)||(value>40959) ){ //non-chinese characters
+ buffer[pos + newPosition] = c;
+ pos++;
+ }
+ else{ //Chinese character unicode: '\u4e00---'\u9fff'
+ hzBuffer[hzPos++]=' ';
+ hzBuffer[hzPos] = c;
+ hzPos++;
+ i++;
+ while(i<len){
+ c=tempBuffer[i];
+ value=(int)c;
+ //Chinese character sequence,store it in hzBuffer
+ if ( (value>=19968)&&(value<=40959) ){
+ hzBuffer[hzPos] = c;
+ hzPos++;
+ i++;
+ }
+ else
+ break; //have extracted a Chinese String
+ }
+
+ i--;
+ if(hzPos>0){
+ String str = new String(hzBuffer, 0, hzPos);
+ String str2 = spliter.segSentence(str); // perform Chinese word segmentation
+
+ if(str2!=null){
+
+ while(pos+newPosition+str2.length()>buffer.length){ //expand the buffer until the segmented string fits
+ char[] newBuffer = new char[buffer.length*2];
+ System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+ buffer = newBuffer;
+ }
+
+ for(int j=0;j<str2.length();j++){
+ buffer[pos + newPosition] = str2.charAt(j);
+ pos++;
+ }
+ }else{
+ for(int j=0;j<str.length();j++){
+ buffer[pos + newPosition] = str.charAt(j);
+ pos++;
+ }
+
+ }
+ }
+ }
+ }
+
+ }
+
+ return pos;
+ }


I use ICTCLAS to perform Chinese word segmentation. ICTCLAS does not
simply perform bi-gram segmentation; it uses an approach based on a
multi-layer HMM. Its reported segmentation precision is 97.58%.
ICTCLAS is free for researchers. See:
http://www.nlp.org.cn/project/project.php?proj_id=6
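
Stripped of the buffer handling, the idea behind readString can be sketched roughly as follows: copy non-Chinese characters through unchanged, collect each maximal run of Chinese characters (U+4E00 to U+9FFF), and replace the run with the segmenter's output. This is a simplified illustration, not the FastCharStream code. ChineseRunDemo, segmentRuns and spaceEveryChar are hypothetical names, and the segmenter is passed in as a plain function because the real caomo.ICTCLASCaller needs the ICTCLAS DLL and data files. The stand-in used here only puts a space before every character and says nothing about how ICTCLAS actually segments.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch, not part of Nutch or ICTCLAS.
class ChineseRunDemo {

    /** Same range test as readString: 19968 (U+4E00) to 40959 (U+9FFF). */
    static boolean isChinese(char c) {
        return c >= 0x4E00 && c <= 0x9FFF;
    }

    /** Replaces every run of Chinese characters with the segmenter's output. */
    static String segmentRuns(String text, UnaryOperator<String> segmenter) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (!isChinese(c)) {
                out.append(c);          // non-Chinese characters pass through
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && isChinese(text.charAt(i))) {
                i++;                    // extend the run of Chinese characters
            }
            // As in readString, the run is prefixed with a space before segmentation.
            String run = " " + text.substring(start, i);
            String segmented = segmenter.apply(run);
            // If the segmenter returns null, fall back to the unsegmented run.
            out.append(segmented != null ? segmented : run);
        }
        return out.toString();
    }

    /** Stand-in segmenter (an assumption): a space before every character. */
    static String spaceEveryChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.trim().toCharArray()) {
            sb.append(' ').append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Nutch\u4e2d\u6587\u641c\u7d22 0.7.1";
        System.out.println(segmentRuns(text, ChineseRunDemo::spaceEveryChar));
    }
}
```

Because the segmented text is longer than the input, any real implementation has to grow its output buffer as it writes, which is what the buffer-expansion loop in readString is for.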

4.modify Summarizer.java

+ //reset startOffset and endOffset of tokens
+ private void resetTokenOffset(Token[] tokens,String text)
+ {
+ String text3=text.toLowerCase();
+
+ char[] textArray=text3.toCharArray();
+ int tokenStart=0;
+ char[] tokenArray=null;
+ int j;
+ Token preToken=new Token(" ",0,1);
+ Token curToken=new Token(" ",0,1);
+ Token nextToken=null;
+ int startSearch=0;
+ while(true){
+ tokenArray = null;
+ for (int i = startSearch; i < textArray.length; i++) {
+
+ if (tokenStart == tokens.length)
+ break;
+
+ if (tokenArray == null) {
+ tokenArray = tokens[tokenStart].termText().toCharArray();
+ preToken = curToken;
+ curToken = tokens[tokenStart];
+ nextToken = null;
+
+ }
+
+ //deals with following situation:(common grams)
+ //text: about buaa a welcome from buaa president
+ //token sequences:about buaa buaa-a a a-welcome welcome from buaa president
+ if ((preToken.termText().charAt(0) == curToken.termText().charAt(0)) &&
+ (preToken.termText().length() < curToken.termText().length())) {
+ if (curToken.termText().startsWith(preToken.termText() + "-")) { //buaa-a starts with buaa-
+ if (tokenStart + 1 < tokens.length) {
+ nextToken = tokens[tokenStart + 1];
+ if (curToken.termText().endsWith("-" + nextToken.termText())) { //meets buaa buaa-a a
+ int curTokenLength = curToken.endOffset() - curToken.startOffset();
+ curToken.setStartOffset(preToken.startOffset());
+ curToken.setEndOffset(preToken.startOffset() + curTokenLength);
+ tokenStart++;
+ tokenArray = null;
+ i = preToken.startOffset();
+ startSearch=i;//the start position in textArray for the next turn,if need.
+ continue;
+ }
+ }
+
+ }
+ }
+ //------------------------
+
+ j = 0;
+ if (textArray[i] == tokenArray[j]) {
+
+ if (i + tokenArray.length - 1 >= textArray.length) {
+ //do nothing?
+ } else {
+
+ int k = i + 1;
+ for (j = 1; j < tokenArray.length; j++) {
+ if (textArray[k++] != tokenArray[j])
+ break; //not meets
+ }
+ if (j >= tokenArray.length) { //meets
+ curToken.setStartOffset(i);
+ curToken.setEndOffset(i + tokenArray.length);
+
+ i = i + tokenArray.length - 1;
+ tokenStart++;
+ startSearch=i;//the start position in textArray for the next turn,if need.
+ tokenArray = null;
+ }
+ }
+ }
+ }
+ if (tokenStart == tokens.length)
+ break; //have resetted all tokens
+
+ if (tokenStart < tokens.length ) { //next turn
+ curToken.setStartOffset(preToken.startOffset());
+ curToken.setEndOffset(preToken.endOffset());
+
+ tokenStart++; //skip this token
+
+ }
+
+ }//the end of while(true)
+ }

Then, under the line "Token[] tokens = getTokens(text)"
in getSummary(String text, Query query), add:

+resetTokenOffset(tokens, text);

Because I perform Chinese word segmentation after the tokenizer and insert
a space between adjacent Chinese words, the token offsets no longer match
the original text. So I need to reset every token's startOffset and
endOffset in Summarizer.java.
To do this, I added the method resetTokenOffset(Token[] tokens, String text)
to Summarizer.java, and I had to add two methods, setStartOffset(int start)
and setEndOffset(int end), to Lucene's Token.java.
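
As a rough illustration of what resetTokenOffset relies on, here is a stand-in token class with the two setters and a much simplified realignment loop. SimpleToken, OffsetRealignDemo and realign are hypothetical names for this sketch. This is not Lucene's Token class, and the loop leaves out the common-gram handling of the full method above.

```java
import java.util.Locale;

// Hypothetical stand-in for a token with mutable offsets (not Lucene's Token).
class SimpleToken {
    private final String termText;
    private int startOffset;
    private int endOffset;

    SimpleToken(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    String termText() { return termText; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    // The two setters that the Summarizer change needs on the token class.
    void setStartOffset(int start) { this.startOffset = start; }
    void setEndOffset(int end) { this.endOffset = end; }
}

class OffsetRealignDemo {

    /** Re-anchors each token at its next occurrence in the lower-cased text. */
    static void realign(SimpleToken[] tokens, String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        int from = 0;
        for (SimpleToken t : tokens) {
            int at = lower.indexOf(t.termText(), from);
            if (at < 0) {
                continue;               // not found: leave the offsets untouched
            }
            t.setStartOffset(at);
            t.setEndOffset(at + t.termText().length());
            from = at + t.termText().length();
        }
    }

    public static void main(String[] args) {
        // Original text: four Chinese characters with no spaces between words.
        String original = "\u4e2d\u6587\u641c\u7d22";
        // Offsets as they might look after spaces were inserted during segmentation.
        SimpleToken[] tokens = {
            new SimpleToken("\u4e2d\u6587", 1, 3),
            new SimpleToken("\u641c\u7d22", 4, 6)
        };
        realign(tokens, original);
        for (SimpleToken t : tokens) {
            System.out.println(t.termText() + " " + t.startOffset() + "-" + t.endOffset());
        }
    }
}
```

Without equivalent setters on the real token class, any call to setStartOffset or setEndOffset in Summarizer.java fails to compile with "cannot find symbol".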



With the four steps above, Nutch can search Chinese web sites
nearly perfectly. You can try it. I have only made Nutch work this way,
though, and my solution is less than perfect.

If Chinese word segmentation could be done in NutchAnalysis.jj
before the tokenizer, we would not need to reset the tokens' offsets in
Summarizer.java, and everything would be perfect.
But it seems too difficult to perform intelligent Chinese word
segmentation in NutchAnalysis.jj. Perhaps even impossible?


Any suggestions?



Buildfile: build.xml

init:

compile-core:
    [javac] Compiling 247 source files to E:\search\new\nutch-0.7.1\build\classes
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Query.java:408: unreported exception org.apache.nutch.analysis.ParseException; must be caught or declared to be thrown
    [javac]     return fixup(NutchAnalysis.parseQuery(queryString));
    [javac]                                          ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:140: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                              curToken.setStartOffset(preToken.startOffset());
    [javac] ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:141: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                               curToken.setEndOffset(preToken.startOffset() + curTokenLength);
    [javac] ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:164: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                                 curToken.setStartOffset(i);
    [javac] ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:165: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                                 curToken.setEndOffset(i + tokenArray.length);
    [javac] ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:179: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                   curToken.setStartOffset(preToken.startOffset());
    [javac]                                           ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:180: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]                   curToken.setEndOffset(preToken.endOffset());
    [javac]                                           ^
    [javac] Note: * uses or overrides a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 7 errors

BUILD FAILED
E:\search\new\nutch-0.7.1\build.xml:70: Compile failed; see the compiler error output for details.

Total time: 39 seconds


--
www.babatu.com

Re: hi,how to use the ICTCLASCall

Posted by kauu <ba...@gmail.com>.

I got it!
Thank goodness!
I'm so happy to tell everyone that I got it working, and I will write it up
for anyone else who needs it.

--
www.babatu.com

Re: hi,how to use the ICTCLASCall

Posted by kauu <ba...@gmail.com>.

Thanks anyway.

--
www.babatu.com

Re: hi,how to use the ICTCLASCall

Posted by Yong-gang Cao <ch...@gmail.com>.

Please visit http://chiefadminofficer.googlepages.com/mycodes for the source
code of ICTCLASCaller and the DLL it uses.
You also need to get the data files from the ICTCLAS source site
(http://www.nlp.org.cn/project/project.php?proj_id=6) to run ICTCLASCaller.
Notice: use of the code and the DLL is restricted by the ICTCLAS copyright
(not mine).
Usage details are in the comments of ICTCLASCaller.java.
Good luck!

--
http://spaces.msn.com/members/caomo
Beijing University of Aeronautics and Astronautics (BeiHang University)
P.B.: 2-53# MailBox, 37 Xueyuan Road ,Beijing, 100083  P.R.China