Posted to user@nutch.apache.org by Jason Manfield <ra...@yahoo.com> on 2005/05/02 19:24:07 UTC
How do I enable PDF/Word etc. parsing in nutch?
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: How do I enable PDF/Word etc. parsing in nutch?
Posted by EM <em...@cpuedge.com>.
Add it to the list of plugins in your config file.
RE: How do I enable PDF/Word etc. parsing in nutch?
Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Jason,
Step 1: open <nutch_home>/nutch-default.xml and edit the plugin.includes property so that it reads:
<property>
<name>plugin.includes</name>
<!-- enable your plugins here -->
<value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
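To sanity-check an edited plugin.includes value before crawling, a small script can read the property and test plugin directory names against it. This is only a sketch: it uses Python's `re` module rather than the Java regex engine Nutch runs on, it assumes a directory name must match the expression in full, and the helper names (`read_plugin_includes`, `is_included`) and the sample directory names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The property from Step 1, with <value> on a single line.
PROPERTY_XML = """\
<property>
<name>plugin.includes</name>
<value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
</property>
"""


def read_plugin_includes(xml_text: str) -> str:
    """Return the regular expression stored in the plugin.includes property."""
    prop = ET.fromstring(xml_text)
    if prop.findtext("name") != "plugin.includes":
        raise ValueError("not the plugin.includes property")
    value = prop.findtext("value")
    if value is None:
        raise ValueError("property has no <value> element")
    # Assumption: whitespace inside <value> comes from line wrapping
    # (as in the mailed copy), so it is removed before matching.
    return "".join(value.split())


def is_included(plugin_dir: str, pattern: str) -> bool:
    """Return True if the whole directory name matches the pattern.

    Assumption: a name must match in full, not merely contain a match.
    """
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    pattern = read_plugin_includes(PROPERTY_XML)
    # Illustrative directory names; adjust to what is under src/plugin.
    for name in ["parse-pdf", "parse-msword", "parse-html", "parse-js"]:
        state = "included" if is_included(name, pattern) else "excluded"
        print(f"{name}: {state}")
```

If a parser you expect to be active shows up as excluded, the usual cause is a missing alternative inside the parse-(...) group.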
Step 2: make sure that the plugin is built:
From the <nutch_home> directory, perform the following:
# ensure that the core classes are built
% ant compile-core
# ensure that the plugins are built
% ant compile-plugins
Note that the compile-plugins task assumes that your plugin build info is
in <nutch_home>/src/plugin/build.xml. If you're building a new plugin,
you'll have to add its ant compile info there; just follow the examples of
the other plugins.
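If you rebuild often, the two ant calls from Step 2 can be wrapped in a short script that stops at the first failure. This is a minimal sketch, assuming `ant` is on your PATH and that you pass the Nutch home directory as the first argument; the function names are hypothetical and only the two targets named above are used.

```python
import subprocess
import sys
from pathlib import Path

# The two targets from Step 2, in the order they must run.
ANT_TARGETS = ["compile-core", "compile-plugins"]


def build_commands(targets=ANT_TARGETS):
    """Return one ant command line (as an argument list) per target."""
    return [["ant", target] for target in targets]


def build(nutch_home: Path) -> None:
    """Run each ant target from nutch_home, raising on the first failure."""
    for command in build_commands():
        # check=True raises CalledProcessError if ant exits non-zero,
        # so compile-plugins never runs after a failed compile-core.
        subprocess.run(command, cwd=nutch_home, check=True)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: build_nutch.py <nutch_home>")
    build(Path(sys.argv[1]))
```

Running the core target first mirrors the order given above, since the plugins compile against the core classes.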
Step 3: you're done.
Good luck.
Thanks,
Chris
______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
> -----Original Message-----
> From: Jason Manfield [mailto:rarish911@yahoo.com]
> Sent: Monday, May 02, 2005 10:24 AM
> To: nutch-user@incubator.apache.org
> Subject: How do I enable PDF/Word etc. parsing in nutch?