
Posted to user@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/04/11 12:17:04 UTC

Re: Small dev question

Gal Nitzan wrote:
> Hi Andrzej,
>
> I have two questions in regards to ParseOutputFormat.java:
>
> 1. On line 102 a String[] is used. Do you think it might be better to use an
> ArrayList? It would save a few cycles down the road: it would spare you from
> using "validCount" and would save you the "if" on line 121. I can make a patch
> if you think I'm correct on this.
>   

I doubt it would save anything, and even if it did, the savings would be 
negligible. Creating a new entry in an ArrayList and hooking it up to the 
list has some cost, too.
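
For readers comparing the two approaches, here is a minimal Java sketch. It is not the ParseOutputFormat code (the lines referred to above are not reproduced in this thread); the class name ValidLinkSketch, the isValid check and the sample URLs are hypothetical, chosen only to show a preallocated array with a validCount counter next to an ArrayList.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not the actual ParseOutputFormat code.
class ValidLinkSketch {

    // Hypothetical validity check standing in for whatever filtering the real code applies.
    static boolean isValid(String url) {
        return url != null && url.startsWith("http://");
    }

    // Style 1: preallocated array plus a counter of the valid entries.
    static String[] withArray(String[] candidates) {
        String[] valid = new String[candidates.length];
        int validCount = 0;
        for (String c : candidates) {
            if (isValid(c)) {
                valid[validCount++] = c;
            }
        }
        // Trim the array only if some candidates were rejected.
        if (validCount < valid.length) {
            valid = Arrays.copyOf(valid, validCount);
        }
        return valid;
    }

    // Style 2: a growable list; no counter and no trimming step.
    static String[] withList(String[] candidates) {
        List<String> valid = new ArrayList<>();
        for (String c : candidates) {
            if (isValid(c)) {
                valid.add(c);
            }
        }
        return valid.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] in = {"http://a.example/", null, "ftp://b.example/", "http://c.example/"};
        System.out.println(Arrays.toString(withArray(in)));
        System.out.println(Arrays.toString(withList(in)));
    }
}
```

Both methods return the same array. The list version drops the counter and the trimming check, but it allocates and grows its own backing array as entries are added, which is the kind of cost mentioned above.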

> 2. If I understand the functionality correctly, on line 87 a new CrawlDatum is
> created for the fetched page. The interval is set to 0.0. Could you please
> explain why it is set to 0.0?
>   
That's only a special additional CrawlDatum, which serves as a signature container. You see, if we don't parse at the same time as we fetch then we can't put the signature in the same CrawlDatum (see the logic in Fetcher.FetcherThread.output()), so we need another instance, to pick up the signature when running updatedb.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: Small dev question

Posted by Gal Nitzan <gn...@usa.net>.

Thank you very much for your prompt reply.

I see what you mean.

Regards,

Gal.



Re: Enabling different file types

Posted by Rajesh Munavalli <fi...@gmail.com>.

Follow these steps for nutch-0.7.2:

(1) Modify the plugin.includes property in nutch-default.xml.
For example, if you want to include the "doc" file type, change the <value>
node to contain "parse-(text|html|doc)" as shown below.

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|doc)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

(2) The next step is to develop the appropriate plugin for the particular
file type. The parser needs to implement the "Parser" interface
(org.apache.nutch.parse) in Nutch.

More details can be found at the following link:
http://wiki.apache.org/nutch/WritingPluginExample

(3) Modify the plugin.xml. The link above describes everything in detail.
Here is an example plugin.xml I wrote for an XHTML parser. Note the
"contentType", which matches the file type you are trying to parse.

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="parse-xhtml" name="Xhtml Parse Plug-in" version="1.0.0"
        provider-name="dessci.com">

   <runtime>
      <library name="parse-xhtml.jar">
         <export name="*"/>
      </library>
      <library name="nekohtml-0.9.4.jar"/>
      <library name="tagsoup-1.0rc3.jar"/>
   </runtime>

   <extension id="com.dessci.search.nutch.parse.xhtml"
              name="XhtmlParse"
              point="org.apache.nutch.parse.Parser">

      <implementation id="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                      class="com.dessci.search.nutch.parse.xhtml.XhtmlParser"
                      contentType="application/xhtml+xml"
                      pathSuffix=""/>

   </extension>

</plugin>

Hope this helps,

--Rajesh Munavalli
On 4/11/06, bob knob <an...@yahoo.com> wrote:
>
> Hi, it's me again,
>
> If I'm going to use Nutch, I need xls, ppt, & doc file
> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?
>
> Thanks,
> Bob
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com
>

Re: Enabling different file types

Posted by Rajesh Munavalli <fi...@gmail.com>.

Have a look at http://jakarta.apache.org/poi/

On 4/11/06, bob knob <an...@yahoo.com> wrote:
>
> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?
>
> --- Jérôme Charron <je...@gmail.com> wrote:
>
> > > types to be searchable if at all possible. The wiki
> > > says most file types are disabled by default, but they
> > > can be turned on by changing conf/nutch-site.xml.
> > > Unfortunately there is no documentation that I can
> > > find for this file... any ideas how to do it, or
> > > sample xml that somebody could send over?
> >
> > Simply add the plugin name in the plugin.includes property.
> > For instance, to activate word, powerpoint and excel parsing, just add in
> > this property :
> > ... |parse-msexcel|parse-mspowerpoint|parse-msword| ...
> > or in a shorter syntax :
> > ... |parse-ms(excel|powerpoint|word)| ...
> >
> > This is described on the Wiki in the page :
> > http://wiki.apache.org/nutch/WritingPluginExample
> > Section "Getting Nutch to Use Your Plugin"
> >
> >
> > Regards
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/
> >
>
>

Re: Enabling different file types

Posted by Jérôme Charron <je...@gmail.com>.

> Okay but it sounds like I need parser plugins for
> word, excel and powerpoint - plugins only has a
> parser-msword directory. Has anyone created plugins
> for excel & powerpoint?

They are available in the trunk version, not in the 0.7.x releases.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Enabling different file types

Posted by bob knob <an...@yahoo.com>.

Okay but it sounds like I need parser plugins for
word, excel and powerpoint - plugins only has a
parser-msword directory. Has anyone created plugins
for excel & powerpoint? 

--- Jérôme Charron <je...@gmail.com> wrote:

> > types to be searchable if at all possible. The wiki
> > says most file types are disabled by default, but they
> > can be turned on by changing conf/nutch-site.xml.
> > Unfortunately there is no documentation that I can
> > find for this file... any ideas how to do it, or
> > sample xml that somebody could send over?
>
> Simply add the plugin name in the plugin.includes property.
> For instance, to activate word, powerpoint and excel parsing, just add in
> this property :
> ... |parse-msexcel|parse-mspowerpoint|parse-msword| ...
> or in a shorter syntax :
> ... |parse-ms(excel|powerpoint|word)| ...
>
> This is described on the Wiki in the page :
> http://wiki.apache.org/nutch/WritingPluginExample
> Section "Getting Nutch to Use Your Plugin"
>
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/
>



Re: Enabling different file types

Posted by Jérôme Charron <je...@gmail.com>.

> types to be searchable if at all possible. The wiki
> says most file types are disabled by default, but they
> can be turned on by changing conf/nutch-site.xml.
> Unfortunately there is no documentation that I can
> find for this file... any ideas how to do it, or
> sample xml that somebody could send over?

Simply add the plugin name to the plugin.includes property.
For instance, to activate Word, PowerPoint and Excel parsing, just add this
to the property:
... |parse-msexcel|parse-mspowerpoint|parse-msword| ...
or, in a shorter syntax:
... |parse-ms(excel|powerpoint|word)| ...

This is described on the Wiki page:
http://wiki.apache.org/nutch/WritingPluginExample
Section "Getting Nutch to Use Your Plugin"
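
A small Java sketch of how such a pattern selects plugins, assuming (as the property description quoted elsewhere in this thread suggests) that the value is treated as a regular expression that must match the whole plugin directory name. The class PluginIncludesSketch and the helper included() are hypothetical and are not Nutch code; the entries surrounding the parse-ms part are filled in for illustration from the plugin.includes value shown elsewhere in this thread.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not how Nutch itself is implemented.
class PluginIncludesSketch {

    // Long form: each Microsoft parser plugin is listed separately.
    static final String LONG_FORM =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
        + "|parse-msexcel|parse-mspowerpoint|parse-msword"
        + "|index-basic|query-(basic|site|url)";

    // Short form: the three names are folded into one alternation group.
    static final String SHORT_FORM =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
        + "|parse-ms(excel|powerpoint|word)"
        + "|index-basic|query-(basic|site|url)";

    // Assumption: a plugin is included when the regex matches its entire name.
    static boolean included(String pluginIncludes, String pluginName) {
        return Pattern.compile(pluginIncludes).matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"parse-msexcel", "parse-mspowerpoint", "parse-msword",
                          "parse-html", "parse-pdf"};
        for (String name : names) {
            System.out.println(name
                + " long=" + included(LONG_FORM, name)
                + " short=" + included(SHORT_FORM, name));
        }
    }
}
```

Under that assumption the two forms select the same plugin names, so the shorter syntax is only a more compact way of writing the same list.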


Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Enabling different file types

Posted by bob knob <an...@yahoo.com>.

Hi, it's me again,

If I'm going to use Nutch, I need xls, ppt, & doc file
types to be searchable if at all possible. The wiki
says most file types are disabled by default, but they
can be turned on by changing conf/nutch-site.xml.
Unfortunately there is no documentation that I can
find for this file... any ideas how to do it, or
sample xml that somebody could send over?

Thanks,
Bob


Auto-crawling & re-crawling the web site

Posted by bob knob <an...@yahoo.com>.

Hi,

I am currently evaluating Nutch for use on an intranet
site search engine. I am by no means an expert in this
field although I am trying to learn more about it.

1. I was reading one of the articles referenced on the
Nutch site:

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

and I was a little concerned about its warning
about "re-crawling" the site. I understand that
there are several steps of crawling, building the
index, etc., but it sounded to me like new pages on my
web site would be ignored until I restarted the Nutch
server even after I've re-crawled. Am I correct about
this? How do most people deal with it?

2. It seems like I would want to re-crawl or re-index
the site on a nightly basis. All of this seems to be
done with shell scripts, and I wonder what options are
available to someone working on a Windows platform. I
could run cygrunsrv/cron on Windows I guess. Is there
some reason more of this scripting couldn't be redone
as a Java program? Also, has anybody considered
creating a Windows service to manage indexing/crawling
like the one that manages the Tomcat web server?

Thanks,
Bob
