Posted to user@nutch.apache.org by bob knob <an...@yahoo.com> on 2006/04/11 17:57:21 UTC
Enabling different file types
Hi, it's me again,
If I'm going to use Nutch, I need xls, ppt, & doc file
types to be searchable if at all possible. The wiki
says most file types are disabled by default, but they
can be turned on by changing conf/nutch-site.xml.
Unfortunately there is no documentation that I can
find for this file... any ideas how to do it, or
sample xml that somebody could send over?
Thanks,
Bob
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: Enabling different file types
Posted by Rajesh Munavalli <fi...@gmail.com>.
Follow these steps for nutch-0.7.2:
(1) Modify the following property in nutch-default.xml.
For example, if you want to include the "doc" file type, change the <value> node so
that it contains "parse-(text|html|doc)", as shown below.
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|doc)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
(2) The next step is to develop the appropriate plugin for the particular
file type. The parser needs to implement the "Parser" interface
(org.apache.nutch.parse) in Nutch.
More details can be found at the following link:
http://wiki.apache.org/nutch/WritingPluginExample
(3) Modify the plugin.xml. The link above describes everything in detail.
Here is an example plugin.xml I wrote for an XHTML parser. Note the
"contentType", which matches the file type you are trying to parse.
<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-xhtml" name="Xhtml Parse Plug-in" version="1.0.0"
provider-name="dessci.com">
<runtime>
<library name="parse-xhtml.jar">
<export name="*"/>
</library>
<library name="nekohtml-0.9.4.jar"/>
<library name="tagsoup-1.0rc3.jar"/>
</runtime>
<extension id="com.dessci.search.nutch.parse.xhtml"
name="XhtmlParse"
point="org.apache.nutch.parse.Parser">
<implementation id="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
class="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
contentType="application/xhtml+xml"
pathSuffix=""/>
</extension>
</plugin>
Hope this helps,
--Rajesh Munavalli
Re: Enabling different file types
Posted by Rajesh Munavalli <fi...@gmail.com>.
Have a look at http://jakarta.apache.org/poi/
On 4/11/06, bob knob <an...@yahoo.com> wrote:
>
> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?
Re: Enabling different file types
Posted by Jérôme Charron <je...@gmail.com>.
> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?
They are available in the trunk version, not in the 0.7.x releases.
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
Re: Enabling different file types
Posted by bob knob <an...@yahoo.com>.
Okay but it sounds like I need parser plugins for
word, excel and powerpoint - plugins only has a
parser-msword directory. Has anyone created plugins
for excel & powerpoint?
Re: Enabling different file types
Posted by Jérôme Charron <je...@gmail.com>.
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?
Simply add the plugin names to the plugin.includes property.
For instance, to activate Word, PowerPoint and Excel parsing, add the following to
this property:
... |parse-msexcel|parse-mspowerpoint|parse-msword| ...
or, in a shorter syntax:
... |parse-ms(excel|powerpoint|word)| ...
This is described on the Wiki page
http://wiki.apache.org/nutch/WritingPluginExample
in the section "Getting Nutch to Use Your Plugin".
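Since a sample was asked for, here is a rough sketch of what the override could look like as a <property> element placed inside the root element of conf/nutch-site.xml. The rest of the plugin list is an assumption: it is copied from the 0.7.2 value quoted earlier in this thread, so check it against the plugin.includes value in your own nutch-default.xml before using it. Also note that, as mentioned elsewhere in this thread, the Excel and PowerPoint parsers are only in trunk, so on 0.7.x only parse-msword would actually be found.

```xml
<!-- Sketch only: goes inside the root element of conf/nutch-site.xml. -->
<!-- The base plugin list is assumed from the 0.7.2 default quoted in this thread; -->
<!-- adjust it to match the plugin.includes value in your own nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- parse-ms(excel|powerpoint|word) matches the plugin directories -->
  <!-- parse-msexcel, parse-mspowerpoint and parse-msword. -->
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|parse-ms(excel|powerpoint|word)|index-basic|query-(basic|site|url)</value>
</property>
```

The value is a regular expression over plugin directory names, so a directory that does not exist under plugins/ is simply never matched; naming a parser here does not create it.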
Regards
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/