Posted to user@nutch.apache.org by Onur Deniz <de...@yahoo.com> on 2008/09/16 15:09:51 UTC

Modifying a core class (Content.java) using plugins?

Is it possible?

Well, in Eclipse it worked. I added some encoding code to Content.java using HtmlParser (a plugin). It works in Eclipse (I have tested it using SegmentReader only, not with any unit tests).

But when compiling with Ant I get compile errors.

 
Here is the modification to Content.java in the nutch-0.9.tar.gz release (not trunk).
I have replaced the line:
   buffer.append(new String(content)); // try default encoding
with
    // Parse the content with the parse-html plugin to learn the detected charset.
    Configuration conf = NutchConfiguration.create();
    HtmlParser parser = new HtmlParser();
    parser.setConf(conf);
    Parse parse = parser.getParse(this);
    String encoding = parse.getData().getParseMeta().get("OriginalCharEncoding");
    // Placeholder text that is appended if Java cannot decode with that charset.
    String localEncodedString = "java incompatible encoding";
    try {
      localEncodedString = new String(content, encoding);
    } catch (Exception e) {
      e.printStackTrace();
    }
    buffer.append(localEncodedString);
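
The decoding step on its own can be sketched without any Nutch classes. This is only a minimal sketch, assuming a modern JDK (it uses java.nio.charset.StandardCharsets); the class name DecodeSketch and the method decode are hypothetical and not part of Nutch. Unlike the snippet above, it falls back to the platform-default charset instead of a placeholder string when the charset name is missing or unsupported.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: decodes raw bytes with a named charset.
class DecodeSketch {

  /**
   * Decodes content with the named charset. Falls back to the platform
   * default when the name is null, malformed, or not supported by this JVM.
   */
  static String decode(byte[] content, String encoding) {
    Charset charset = Charset.defaultCharset();
    if (encoding != null) {
      try {
        charset = Charset.forName(encoding);
      } catch (IllegalArgumentException e) {
        // Covers both IllegalCharsetNameException and UnsupportedCharsetException;
        // the platform default chosen above is kept.
      }
    }
    return new String(content, charset);
  }

  public static void main(String[] args) {
    // 0xE9 is the ISO-8859-1 encoding of U+00E9 (e with acute accent).
    byte[] latin1 = {(byte) 0xE9};
    String decoded = decode(latin1, "ISO-8859-1");
    System.out.println(decoded.equals("\u00e9"));

    // An unknown charset name falls back to the default instead of failing.
    byte[] ascii = "abc".getBytes(StandardCharsets.US_ASCII);
    System.out.println(decode(ascii, "no-such-charset"));
  }
}
```

Checking the charset name up front avoids relying on a broad catch of Exception, and the caller always gets some decoded text back.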

Here are the compile errors:
compile-core:
    [javac] Compiling 165 source files to /home/onur/nutch-0.9/build/classes
    [javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:39: package org.apache.nutch.parse.html does not exist
    [javac] import org.apache.nutch.parse.html.HtmlParser;
    [javac]                                   ^
    [javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:240: cannot find symbol
    [javac] symbol  : class HtmlParser
    [javac] location: class org.apache.nutch.protocol.Content
    [javac] 	    HtmlParser parser = new HtmlParser();
    [javac] 	    ^
    [javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:240: cannot find symbol
    [javac] symbol  : class HtmlParser
    [javac] location: class org.apache.nutch.protocol.Content
    [javac] 	    HtmlParser parser = new HtmlParser();
    [javac] 	                            ^
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 3 errors

BUILD FAILED
/home/onur/nutch-0.9/build.xml:106: Compile failed; see the compiler error output for details.


Do I need to change any other configuration to fix this? (parse-html is already listed in the plugin.includes property in nutch-default.xml; I also tried adding it in nutch-site.xml, but that did not help.)
Or is core code not meant to use plugins?

Any ideas?

(By the way, what I'm trying to do here is make the -get functionality honor the page's encoding. It normally returns the content in the platform-default encoding (UTF-8).)

Thanks


onur deniz



      

Re: Modifying a core class (Content.java) using plugins?

Posted by Onur Deniz <de...@yahoo.com>.
I looked at the .classpath and build.xml files, and I think I know what the problem is now.

It works in Eclipse because the .classpath file already lists all the plugin sources and libs, so they are on the build path before anything is compiled.

But with Ant, build.xml has:
  <!-- ====================================================== -->
  <!-- Compile the Java files                                 -->
  <!-- ====================================================== -->
  <target name="compile" depends="compile-core, compile-plugins"/>

  <target name="compile-core" depends="init">
    <javac 
     encoding="${build.encoding}" 
     srcdir="${src.dir}"
     includes="org/apache/nutch/**/*.java"
     destdir="${build.classes}"
     debug="${javac.debug}"
     optimize="${javac.optimize}"
     target="${javac.version}"
     source="${javac.version}"
     deprecation="${javac.deprecation}">
      <classpath refid="classpath"/>
    </javac>    
  </target>

  <target name="compile-plugins">
    <ant dir="src/plugin" target="deploy" inheritAll="false"/>
  </target>

I don't know Ant well or how build.xml is interpreted, but it seems to compile the core classes first and then the plugins. I think that is why I get those compile errors. (I tried changing the order, and it still gives compile errors, which I think is expected.)

So, here are the new questions.

Would changing build.xml solve my problem (compiling parse-html first, adding its libs to the classpath, and then running compile-core)? I think parse-html itself contains classes that depend on core classes.

If that does not solve it, any other suggestions?

(I thought about writing a plugin, but that also seems impossible: Content.java is final, or at least it is not an extension point. Actually, I know little about writing plugins.)
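
One general way around a circular compile-time dependency like this is for the core code to compile only against an interface and to look up the implementation by name at run time. The following is a minimal sketch of that idea in plain Java, not Nutch code: EncodingDetector, FixedEncodingDetector and RuntimeLookupSketch are hypothetical names, and Nutch's own plugin system resolves extensions through its plugin repository and extension points rather than a bare Class.forName.

```java
// Hypothetical extension point: the only type the "core" side compiles against.
interface EncodingDetector {
  String detect(byte[] content);
}

// Hypothetical "plugin" implementation. The core side never imports it;
// it is referenced only by its class name at run time.
class FixedEncodingDetector implements EncodingDetector {
  public String detect(byte[] content) {
    return "UTF-8"; // stand-in for real charset detection
  }
}

class RuntimeLookupSketch {

  /** Instantiates an EncodingDetector implementation from its class name. */
  static EncodingDetector load(String className) {
    try {
      Class<?> cls = Class.forName(className);
      return cls.asSubclass(EncodingDetector.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot load detector " + className, e);
    }
  }

  public static void main(String[] args) {
    // Assumes the classes live in the default package; pass a fully
    // qualified name as the first argument otherwise.
    String name = args.length > 0 ? args[0] : "FixedEncodingDetector";
    EncodingDetector detector = load(name);
    System.out.println(detector.detect(new byte[0]));
  }
}
```

With this shape the core classes compile without the implementation on the classpath, so the order in which core and plugins are built no longer matters for compilation; the implementation only has to be loadable when the code runs.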


regards


onur deniz 

