Posted to user@nutch.apache.org by bob knob <an...@yahoo.com> on 2006/04/11 16:05:11 UTC

Auto-crawling & re-crawling the web site

Hi,

I am currently evaluating Nutch for use on an intranet
site search engine. I am by no means an expert in this
field although I am trying to learn more about it.

1. I was reading one of the articles referenced on the
Nutch site:

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

and I was a little concerned by its warning
about "re-crawling" the site. I understand that
there are several steps of crawling, building the
index, etc., but it sounded to me like new pages on my
web site would be ignored until I restarted the Nutch
server even after I've re-crawled. Am I correct about
this? How do most people deal with it?

2. It seems like I would want to re-crawl or re-index
the site on a nightly basis. All of this seems to be
done with shell scripts, and I wonder what options are
available to someone working on a Windows platform. I
could run cygrunsrv/cron on Windows I guess. Is there
some reason more of this scripting couldn't be redone
as a Java program? Also, has anybody considered
creating a Windows service to manage indexing/crawling
like the one that manages the Tomcat web server?

Thanks,
Bob

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Enabling different file types

Posted by Rajesh Munavalli <fi...@gmail.com>.
Follow these steps for nutch-0.7.2:

(1) Modify the following property in nutch-default.xml.
For example, to include the "doc" file type (handled by the parse-msword
plugin), change the <value> node to contain "parse-(text|html|msword)"
as shown below.

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

(2) The next step is to develop the appropriate plugin for the particular
file type. The parser needs to implement the "Parser" interface
(org.apache.nutch.parse) in Nutch.

More details can be found at the following link:
http://wiki.apache.org/nutch/WritingPluginExample

(3) Modify the plugin.xml. The link above describes everything in detail.
Here is an example plugin.xml I wrote for an XHTML parser. Note the
"contentType", which matches the file type you are trying to parse.

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-xhtml" name="Xhtml Parse Plug-in" version="1.0.0"
provider-name="dessci.com">

    <runtime>
      <library name="parse-xhtml.jar">
         <export name="*"/>
      </library>
      <library name="nekohtml-0.9.4.jar"/>
      <library name="tagsoup-1.0rc3.jar"/>
   </runtime>

   <extension id="com.dessci.search.nutch.parse.xhtml"
              name="XhtmlParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                      class="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                      contentType="application/xhtml+xml"
                      pathSuffix=""/>

   </extension>

</plugin>



Hope this helps,

--Rajesh Munavalli
On 4/11/06, bob knob <an...@yahoo.com> wrote:
>
> Hi, it's me again,
>
> If I'm going to use Nutch, I need xls, ppt, & doc file
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?
>
> Thanks,
> Bob

Re: Enabling different file types

Posted by Rajesh Munavalli <fi...@gmail.com>.
Have a look at http://jakarta.apache.org/poi/

On 4/11/06, bob knob <an...@yahoo.com> wrote:
>
> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?
>
> --- Jérôme Charron <je...@gmail.com>
> wrote:
>
> > > types to be searchable if at all possible. The
> > wiki
> > > says most file types are disabled by default, but
> > they
> > > can be turned on by changing conf/nutch-site.xml.
> > > Unfortunately there is no documentation that I can
> > > find for this file... any ideas how to do it, or
> > > sample xml that somebody could send over?
> >
> > Simply add the plugin name in the plugin.includes
> > property.
> > For instance, to activate word, powerpoint and excel
> > parsing, just add in
> > this property :
> > ... |parse-msexcel|parse-mspowerpoint|parse-msword|
> > ...
> > or in a shorter syntax :
> > ... |parse-ms(excel|powerpoint|word)| ...
> >
> > This is described on the Wiki in the page :
> > http://wiki.apache.org/nutch/WritingPluginExample
> > Section "Getting Nutch to Use Your Plugin"
> >
> >
> > Regards
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/
> >

Re: Enabling different file types

Posted by Jérôme Charron <je...@gmail.com>.
> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?

They are available in the trunk version, not in the 0.7.x releases.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Enabling different file types

Posted by bob knob <an...@yahoo.com>.
Okay but it sounds like I need parser plugins for
word, excel and powerpoint - plugins only has a
parser-msword directory. Has anyone created plugins
for excel & powerpoint? 

--- Jérôme Charron <je...@gmail.com>
wrote:

> > types to be searchable if at all possible. The
> wiki
> > says most file types are disabled by default, but
> they
> > can be turned on by changing conf/nutch-site.xml.
> > Unfortunately there is no documentation that I can
> > find for this file... any ideas how to do it, or
> > sample xml that somebody could send over?
> 
> Simply add the plugin name in the plugin.includes
> property.
> For instance, to activate word, powerpoint and excel
> parsing, just add in
> this property :
> ... |parse-msexcel|parse-mspowerpoint|parse-msword|
> ...
> or in a shorter syntax :
> ... |parse-ms(excel|powerpoint|word)| ...
> 
> This is described on the Wiki in the page :
> http://wiki.apache.org/nutch/WritingPluginExample
> Section "Getting Nutch to Use Your Plugin"
> 
> 
> Regards
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 



Re: Enabling different file types

Posted by Jérôme Charron <je...@gmail.com>.
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?

Simply add the plugin name to the plugin.includes property.
For instance, to activate Word, PowerPoint and Excel parsing, just add to
this property:
... |parse-msexcel|parse-mspowerpoint|parse-msword| ...
or, in a shorter syntax:
... |parse-ms(excel|powerpoint|word)| ...

This is described on the Wiki on the page:
http://wiki.apache.org/nutch/WritingPluginExample
in the section "Getting Nutch to Use Your Plugin"
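
As a rough sketch of what the complete property could look like in
conf/nutch-site.xml: the surrounding plugin names here are an assumption,
copied from the nutch-0.7.2 default value quoted elsewhere in this thread,
so check them against your own nutch-default.xml before using this.

```xml
<!-- Sketch only: goes inside the root element of conf/nutch-site.xml.
     The non-MS plugin names are assumed from the nutch-0.7.2 default
     quoted in this thread; adjust them to match your installation.
     A name in this list only has an effect if the matching plugin
     directory actually exists in your plugins folder. -->
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|msexcel|mspowerpoint|msword)|index-basic|query-(basic|site|url)</value>
</property>
```

Values set in nutch-site.xml override those in nutch-default.xml, so the
default file can be left untouched.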


Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Enabling different file types

Posted by bob knob <an...@yahoo.com>.
Hi, it's me again,

If I'm going to use Nutch, I need xls, ppt, & doc file
types to be searchable if at all possible. The wiki
says most file types are disabled by default, but they
can be turned on by changing conf/nutch-site.xml.
Unfortunately there is no documentation that I can
find for this file... any ideas how to do it, or
sample xml that somebody could send over?

Thanks,
Bob
