Posted to dev@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/10/17 15:53:54 UTC
writing a new parse-exe plugin
Hi all,
I'm trying to write a new plugin that will download files whose content type is
application/x-dosexec (EXE files).
I've followed the "write your own plugin" tutorial in the wiki and taken the
following steps (some of them are not mentioned in the tutorial):
1. Created a new dir under $NUTCH_HOME/src/plugins/parse-exe
2. Created a new $NUTCH_HOME/src/plugins/parse-exe/plugin.xml [displayed
below]
3. Created a new $NUTCH_HOME/src/plugins/parse-exe/build.xml [displayed
below]
4. Wrote the Java code in
$NUTCH_HOME/src/plugin/parse-exe/src/java/org/apache/nutch/parse/exe/ExeParser.java
5. Added "nutch-extensionpoints" and "parse-exe" to the 'plugin.includes'
property in $NUTCH_HOME/conf/nutch-site.xml
6. Added code to $NUTCH_HOME/conf/parse-plugins.xml [written below]
7. Added code to $NUTCH_HOME/src/plugins/build.xml [written below]
8. Copied $NUTCH_HOME/build/plugins/parse-exe/parse-exe.jar to
$NUTCH_HOME/plugins/parse-exe
9. Ran ant (build successful)
The log shows that Nutch registers the plugin:
2007-10-17 15:15:55,657 INFO plugin.PluginRepository - Registered Plugins:
2007-10-17 15:15:55,657 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-10-17 15:15:55,657 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-10-17 15:15:55,657 INFO plugin.PluginRepository - Exe Parse Plug-in (parse-exe)
But when the fetcher encounters an x-dosexec file, it throws an exception:
2007-10-17 15:17:16,146 WARN parse.ParseUtil - No suitable parser found when trying to parse content http://www.foo.com/yyy/foo.exe of type application/x-dosexec
2007-10-17 15:17:16,146 WARN fetcher.Fetcher - Error parsing: http://www.foo.com/yyy/foo.exe: failed(2,200): org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-dosexec url=http://www.foo.com/yyy/movie30.exe
(Sorry, but the URL has been masked for security reasons.)
Am I missing something?
Thanks!
[$NUTCH_HOME/src/plugins/build.xml]
<ant dir="parse-exe" target="deploy"/>
[parse-plugins.xml]
<mimeType name="application/x-dosexec">
  <plugin id="parse-exe" />
</mimeType>
[plugin.xml] // copied and changed from parse-pdf
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="parse-exe"
   name="Exe Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <library name="parse-exe.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-log4j"/>
   </requires>

   <extension id="org.apache.nutch.parse.exe"
              name="ExeParse"
              point="org.apache.nutch.parse.Parser">
      <implementation id="org.apache.nutch.parse.exe.ExeParse"
                      class="org.apache.nutch.parse.exe.ExeParse">
         <parameter name="contentType" value="application/x-dosexec"/>
         <parameter name="pathSuffix" value=""/>
      </implementation>
   </extension>

</plugin>
-----------------------------------------------------------------------------------------------------------------
[build.xml]
<?xml version="1.0"?>
<project name="parse-exe" default="jar-core">
  <import file="../build-plugin.xml"/>
</project>
------------------------------------------------------------------------
[ExeParser.java]
package org.apache.nutch.parse.exe;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.protocols.Response;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.LogUtil;

public class ExeParser implements Parser {
  public static final Log LOG = LogFactory.getLog("org.apache.nutch.parse.exe");
  private Configuration conf;

  public Parse getParse(Content content) {
    try {
      byte[] raw = content.getContent();
      // enter here my code (I will replace this with real code)
      LOG.info("EDRI:: you have reached the parse-exe plugin!");
      System.out.println("EDRI:: system.out.print... parse-exe");

      // Fail if the number of bytes received differs from the Content-Length header.
      String contentLength = content.getMetadata().get(Response.CONTENT_LENGTH);
      if (contentLength != null && raw.length != Integer.parseInt(contentLength)) {
        return new ParseStatus(ParseStatus.FAILED, ParseStatus.FAILED_TRUNCATED,
            "Content truncated at " + raw.length
            + " bytes. Parser can't handle incomplete exe file.")
            .getEmptyParse(getConf());
      }
    } catch (Exception e) { // runtime exception
      if (LOG.isWarnEnabled()) {
        LOG.warn("General exception in EXE parser: " + e.getMessage());
        e.printStackTrace(LogUtil.getWarnStream(LOG));
      }
      return new ParseStatus(ParseStatus.FAILED,
          "Can't be handled as exe document. " + e).getEmptyParse(getConf());
    }

    // I'm not sure what to return here if I only need to download the file
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, "", null, null, null);
    parseData.setConf(this.conf);
    return new ParseImpl("", parseData);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
--
Eyal Edri
Re: writing a new parse-exe plugin
Posted by eyal edri <ey...@gmail.com>.
I found out how to associate multiple content types with a single plugin:
add a mimeType entry for each content type to conf/parse-plugins.xml (the same
plugin can be listed under more than one type):
<mimeType name="application/x-dosexec">
  <plugin id="parse-exe" />
</mimeType>
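To illustrate the idea, here is a sketch that routes several EXE-related content types to the same plugin. The MIME types are the ones named in the follow-up question in this thread; whether a given server actually reports any of them is an assumption to check against your own fetch logs. Note that application/octet-stream is a generic binary type, so mapping it (left out here) would send every unidentified binary file to parse-exe.

```xml
<!-- Sketch for conf/parse-plugins.xml: one mimeType entry per content type,
     all pointing at the same plugin id. -->
<mimeType name="application/x-dosexec">
  <plugin id="parse-exe" />
</mimeType>
<mimeType name="application/x-exe">
  <plugin id="parse-exe" />
</mimeType>
<mimeType name="application/x-msdos">
  <plugin id="parse-exe" />
</mimeType>
<mimeType name="application/x-msdownload">
  <plugin id="parse-exe" />
</mimeType>
```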
--
Eyal Edri
Re: writing a new parse-exe plugin
Posted by eyal edri <ey...@gmail.com>.
Excellent! :)
That did the trick!
Any chance of creating a new page on the plugin central for writing a
Nutch 0.9 plugin, stating the checklist below?
(I would have uploaded it myself, but I don't have the rights to create a new page.)
The checklist (relevant for a parse plugin, i.e. one implementing the Parser
extension point):
1. Create a new dir under $NUTCH_HOME/src/plugins/parse-XXX
2. Create a new $NUTCH_HOME/src/plugins/parse-XXX/plugin.xml [shown in the
original post]
3. Create a new $NUTCH_HOME/src/plugins/parse-XXX/build.xml [shown in the
original post]
4. Write the Java code in
$NUTCH_HOME/src/plugin/parse-XXX/src/java/org/apache/nutch/parse/XXX/XXXParser.java
5. Add "nutch-extensionpoints" and "parse-XXX" to the 'plugin.includes'
property in $NUTCH_HOME/conf/nutch-site.xml
6. Add the new MIME type and the alias to $NUTCH_HOME/conf/parse-plugins.xml
7. Add the plugin to $NUTCH_HOME/src/plugins/build.xml [shown in the
original post]
8. Copy $NUTCH_HOME/build/plugins/parse-XXX/parse-XXX.jar to
$NUTCH_HOME/plugins/parse-XXX
9. Run ant (build successful)
I've got a few more questions, just to tie up the loose ends:
1. The EXE extension has several content types associated with it (e.g.
application/x-exe, application/x-msdos, application/x-msdownload,
application/octet-stream). How can I configure parse-exe to capture all of them?
2. I've noticed that after every build I need to copy
build/parse-exe/parse-exe.jar to plugins/parse-exe. Is there any way to make
the build write it directly to plugins/parse-exe?
3. I get a NullPointerException from the fetcher after parse-exe runs. Can
you guide me on what I should return from parse-exe?
The parse-exe plugin (the getParse function):
public class ExeParser implements Parser {
  public static final Log LOG = LogFactory.getLog(ExeParser.class);
  private Configuration conf;
  public static final String DOWNLOAD_DIR = "/home/eyale/HTTPSEC/nutch-0.9/DOWNLOADS/";

  public ExeParser() {
    LOG.info("EDRI:: created exe-parser object");
  }

  public Parse getParse(Content content) {
    String resultText = null;
    String resultTitle = null;
    Outlink[] outlinks = null;

    try {
      byte[] raw = content.getContent();
      // enter here my code

      // Fail if the number of bytes received differs from the Content-Length header.
      String contentLength = content.getMetadata().get(Response.CONTENT_LENGTH);
      if (contentLength != null && raw.length != Integer.parseInt(contentLength)) {
        return new ParseStatus(ParseStatus.FAILED, ParseStatus.FAILED_TRUNCATED,
            "Content truncated at " + raw.length
            + " bytes. Parser can't handle incomplete exe file.")
            .getEmptyParse(getConf());
      }
      // download the file
      downloadContentType(content);
    } catch (Exception e) { // runtime exception
      if (LOG.isWarnEnabled()) {
        LOG.warn("General exception in EXE parser: " + e.getMessage());
        e.printStackTrace(LogUtil.getWarnStream(LOG));
      }
      return new ParseStatus(ParseStatus.FAILED,
          "Can't be handled as exe document. " + e).getEmptyParse(getConf());
    }

    final ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
        resultTitle, outlinks, content.getMetadata());
    return new ParseImpl(resultText, parseData);
  }
  // (downloadContentType, setConf and getConf are not shown in this excerpt)
}
Thanks!!!
--
Eyal Edri
Re: writing a new parse-exe plugin
Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Eyal,
Did you also modify parse-plugins.xml at the bottom to add an alias for
parse-exe to point to the actual extension point id? I'm guessing that's
your problem. Check out the bottom of parse-plugins.xml for an example of
this.
Let me know if you still need more help and we'll go from there.
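For reference, the alias section described above is the <aliases> element at the end of conf/parse-plugins.xml; each entry maps a plugin id to the implementation id declared in that plugin's plugin.xml. A sketch for this plugin follows, assuming the implementation id is exactly the one from the plugin.xml posted in this thread (org.apache.nutch.parse.exe.ExeParse). If your plugin.xml declares a different id, use that one instead.

```xml
<aliases>
  <!-- existing aliases for the bundled parsers stay as they are -->
  <alias name="parse-exe"
         extension-id="org.apache.nutch.parse.exe.ExeParse" />
</aliases>
```

The name attribute has to equal the plugin id used in the mimeType entries, and extension-id has to equal the id attribute of the <implementation> element in plugin.xml.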
Thanks,
Chris
______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.