Posted to user@nutch.apache.org by Jason Manfield <ra...@yahoo.com> on 2005/05/02 19:24:07 UTC

How do I enable PDF/Word etc. parsing in nutch?

  
__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: How do I enable PDF/Word etc. parsing in nutch?

Posted by EM <em...@cpuedge.com>.
Add the parser plugins to the list of plugins (the plugin.includes property) in your config file.


RE: How do I enable PDF/Word etc. parsing in nutch?

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi Jason,

 Step 1: edit <nutch_home>/nutch-default.xml and change the plugin.includes
 property so that its value names the parsers you want (here, msword and pdf):

<property>
  <name>plugin.includes</name>

  <!-- enable your plugins here -->
  <value>protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.  By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
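
To sanity-check an expression like the one above before crawling, here is a minimal sketch in plain Java. It assumes that a plugin is included only when its whole directory name matches the expression (a full match, not a substring search). The class PluginIncludesCheck and the helper isIncluded are hypothetical names made up for this illustration and are not part of the Nutch API; the directory names in main are just sample inputs.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // The plugin.includes value from Step 1, joined onto one logical line.
    static final String INCLUDES =
        "protocol-(http|file)|urlfilter-regex|parse-(text|html|rss|msword|pdf)"
        + "|index-basic|query-(basic|site|url)";

    // Assumption: the entire directory name must match the expression.
    // Matcher.matches() requires a full match, unlike Matcher.find().
    static boolean isIncluded(String regex, String pluginDir) {
        return Pattern.compile(regex).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        // Sample directory names, chosen only to illustrate both outcomes.
        String[] dirs = {
            "parse-pdf", "parse-msword", "parse-html", "parse-ext", "protocol-ftp"
        };
        for (String dir : dirs) {
            String verdict = isIncluded(INCLUDES, dir) ? "included" : "excluded";
            System.out.println(dir + " -> " + verdict);
        }
    }
}
```

If a parser you expect to be enabled is reported as excluded, the usual cause is a typo in the value or a stray line break inside it.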

 Step 2: make sure that the plugin is built:

  From the <nutch_home> directory, perform the following:
 
  # ensure that the core classes are built
  % ant compile-core

  # ensure that the plugins are built
  % ant compile-plugins

Note that the compile-plugins task assumes that your plugin's build info is
in <nutch_home>/src/plugin/build.xml. If you're building a new plugin,
you'll have to add its ant compile info there; just follow the examples of
the other plugins.

 Step 3: you're done.


Good luck.


Thanks,
  Chris



______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246

_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -----Original Message-----
> From: Jason Manfield [mailto:rarish911@yahoo.com]
> Sent: Monday, May 02, 2005 10:24 AM
> To: nutch-user@incubator.apache.org
> Subject: How do I enable PDF/Word etc. parsing in nutch?
> 