Posted to user@nutch.apache.org by "Ricardo J. Méndez" <me...@gmail.com> on 2007/02/22 04:41:07 UTC

Customizing crawling

Hi,

I posted this to nutch-agent as well, but nutch-user seems to be more
active.

I've got a few questions about customizing the crawling process.  I
tried checking out the Wiki, but many of the pages linked from
"Becoming a Nutch Developer" are still unwritten, so any pointers you
can provide would be very welcome.

While some of the issues were covered on the recent "focused crawls"
thread, I still have a few questions.

1) Which types of links does Nutch follow? Only HREFs?  If so, I'd like
it to follow some <link /> references from the page's Header.  I know
that I can obtain the link reference with a Parse plugin, but how should
I add the reference to the list of items to be crawled?

2) Which type of plugin or response from one - if any - determines what
items go into the database?  For instance, can I write a plugin that
returns "false" if I don't want the database to store a PDF, or a Word
document?  Or maybe a specific page, based on something found by a Parse
plugin?

Thanks in advance,



Ricardo J. Méndez
http://ricardo.strangevistas.net/

Re: Customizing crawling

Posted by Renaud Richardet <re...@apache.org>.
Ricardo J. Méndez wrote:
> Hi,
>
> Just noticed I didn't reply to a question.
>
>
>   
>> Which pages are still unwritten?
>>     
>
> I was reading http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer ,
> which has wiki links to wiki pages for CrawlDb, LinkDb,
> DeleteDuplicates, ParseSegment, CrawlDbMerger and DistributedSearch
> that, when clicked, display
>
> "This page does not exist yet. You can create a new empty page, or use
> one of the page templates. Before creating the page, please check if a
> similar page already exists."
>   
It's because whenever you write a CamelCase word, the wiki assumes 
it's a link to another wiki page. But in the case above, those are just a few 
Java classes, for which there are no wiki pages...

Renaud

Re: Customizing crawling

Posted by "Ricardo J. Méndez" <me...@gmail.com>.
Hi,

Just noticed I didn't reply to a question.


> Which pages are still unwritten?

I was reading http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer ,
which has wiki links to wiki pages for CrawlDb, LinkDb,
DeleteDuplicates, ParseSegment, CrawlDbMerger and DistributedSearch
that, when clicked, display

"This page does not exist yet. You can create a new empty page, or use
one of the page templates. Before creating the page, please check if a
similar page already exists."

I'm reading the actual classes as I go.  Best regards,



Ricardo J. Méndez
http://ricardo.strangevistas.net/

Re: Customizing crawling

Posted by "Ricardo J. Méndez" <me...@gmail.com>.
Very clear answer, thanks a lot Dennis!



Ricardo J. Méndez
http://ricardo.strangevistas.net/

Re: Customizing crawling

Posted by "Ricardo J. Méndez" <me...@gmail.com>.
Hi again,

Dennis Kubes wrote:

> Nutch gets outlinks from the pages it parses.  This is either during the
> fetch process with parsing enabled or during just a parse process (see
> org.apache.nutch.parse.ParseSegment).  The content is parsed via plugins
> configured in parse-plugins.xml in the conf directory.  During the parse
> links are created as Outlink objects that are added to a ParseData
> object that is itself added to a Parse object.  During the writing out
> of the parse object (ParseOutputFormat) the outlinks are saved as
> CrawlDatums in the crawl_parse directory under the segment.  Then during
> the UpdateDb job (see CrawlDb) this crawl_parse is merged into the
> master Crawl Database.  That is the long answer.
> 
> Short answer: when you parse, get the Outlinks and add them to the
> ParseData -> Parse object; they will then be updated automatically into the
> CrawlDb when the UpdateDb job is run, and will be fetched when the
> next Fetch job is run.

I was attempting to do this from an HtmlParseFilter plugin, at which
point the data is already parsed and the Outlinks have already been
created.  I thought there might be a way to modify the Outlinks at this
point, but I haven't found one.

It looks like the work I'm interested in is being done in
DOMContentUtils.getOutlinks, the relevant bit of code from HtmlParser being:

      utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
      outlinks = (Outlink[])l.toArray(new Outlink[l.size()]);

Soon after, the outlinks are assigned to the ParseData object, and
there's no method to modify that array.
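
The only workaround I can think of so far is to rebuild the ParseData from
an HtmlParseFilter instead of modifying the array in place.  Something along
these lines is what I have in mind -- just a sketch, with the class name made
up and the signatures taken from my reading of the source, so they may not
match exactly:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Outlink;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseData;
  import org.apache.nutch.parse.ParseImpl;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  /** Sketch of an HtmlParseFilter that appends extra outlinks. */
  public class LinkTagParseFilter implements HtmlParseFilter {

    private Configuration conf;

    public Parse filter(Content content, Parse parse,
                        HTMLMetaTags metaTags, DocumentFragment doc) {
      Outlink[] old = parse.getData().getOutlinks();

      List extra = new ArrayList();
      // ... walk 'doc' here, collect the href values of the <link />
      // elements I care about, and build an Outlink for each one ...

      Outlink[] merged = new Outlink[old.length + extra.size()];
      System.arraycopy(old, 0, merged, 0, old.length);
      for (int i = 0; i < extra.size(); i++) {
        merged[old.length + i] = (Outlink) extra.get(i);
      }

      // There is no setter for the outlink array, so rebuild the
      // ParseData and wrap everything in a new ParseImpl.
      ParseData oldData = parse.getData();
      ParseData newData = new ParseData(oldData.getStatus(),
          oldData.getTitle(), merged, oldData.getContentMeta(),
          oldData.getParseMeta());
      return new ParseImpl(parse.getText(), newData);
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }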

Is there a plugin type that would allow me to extend this without
altering the HtmlParse plugin, or at least DOMContentUtils?


I'm just getting acquainted with how Nutch is organized, so please be
patient if I ask an obvious question.  Thanks in advance,



Ricardo J. Méndez
http://ricardo.strangevistas.net/

Re: Customizing crawling

Posted by Dennis Kubes <nu...@dragonflymc.com>.

Ricardo J. Méndez wrote:
> Hi,
> 
> I posted this to nutch-agent as well, but nutch-user seems to be more
> active.
> 
> I've got a few questions about customizing the crawling process.  I
> tried checking out the Wiki, but many of the pages linked from
> "Becoming a Nutch Developer" are still unwritten, so any pointers you
> can provide would be very welcome.

Which pages are still unwritten?
> 
> While some of the issues were covered on the recent "focused crawls"
> thread, I still have a few questions.
> 
> 1) Which types of links does Nutch follow? Only HREFs?  If so, I'd like
> it to follow some <link /> references from the page's Header.  I know
> that I can obtain the link reference with a Parse plugin, but how should
> I add the reference to the list of items to be crawled?

Nutch gets outlinks from the pages it parses.  This is either during the 
fetch process with parsing enabled or during just a parse process (see 
org.apache.nutch.parse.ParseSegment).  The content is parsed via plugins 
configured in parse-plugins.xml in the conf directory.  During the parse 
links are created as Outlink objects that are added to a ParseData 
object that is itself added to a Parse object.  During the writing out 
of the parse object (ParseOutputFormat) the outlinks are saved as 
CrawlDatums in the crawl_parse directory under the segment.  Then during 
the UpdateDb job (see CrawlDb) this crawl_parse is merged into the 
master Crawl Database.  That is the long answer.

Short answer: when you parse, get the Outlinks and add them to the 
ParseData -> Parse object; they will then be updated automatically into the 
CrawlDb when the UpdateDb job is run, and will be fetched when the 
next Fetch job is run.
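
To make that concrete, the core of a custom parser plugin would look
roughly like the sketch below.  The class name is made up and the
signatures are written from memory of the current API, so double check
them against the source (the parse-text plugin is a good small example
to read):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.metadata.Metadata;
  import org.apache.nutch.parse.Outlink;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseData;
  import org.apache.nutch.parse.ParseImpl;
  import org.apache.nutch.parse.ParseStatus;
  import org.apache.nutch.parse.Parser;
  import org.apache.nutch.protocol.Content;

  /** Sketch of a parser that hands its outlinks to Nutch. */
  public class MyParser implements Parser {

    private Configuration conf;

    public Parse getParse(Content content) {
      // Extract text, title and links however the format requires;
      // the placeholders just keep the sketch short.
      String text = new String(content.getContent());
      String title = "";
      Outlink[] outlinks = new Outlink[0];  // the links Nutch should follow

      ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, title,
          outlinks, content.getMetadata(), new Metadata());

      // ParseOutputFormat writes these outlinks out as CrawlDatums under
      // crawl_parse; the UpdateDb job then merges them into the CrawlDb.
      return new ParseImpl(text, parseData);
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }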
> 
> 2) Which type of plugin or response from one - if any - determines what
> items go into the database?  For instance, can I write a plugin that
> returns "false" if I don't want the database to store a PDF, or a Word
> document?  Or maybe a specific page, based on something found by a Parse
> plugin?

You can write url filters and url normalizers (scope outlink) that will 
prevent items from going into the CrawlDb.  Or if you are writing your 
own parse plugin, simply don't add the link to the Outlinks.
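
For example, a url filter roughly like the sketch below (class name made up)
would drop PDF and Word links before they ever make it into the CrawlDb.
The stock urlfilter-regex plugin does the same thing with an entry like
-\.(pdf|doc)$ in regex-urlfilter.txt.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  /** Sketch of a url filter that rejects PDF and Word documents. */
  public class NoOfficeDocsFilter implements URLFilter {

    private Configuration conf;

    // Return the url to keep it, or null to reject it.
    public String filter(String urlString) {
      String lower = urlString.toLowerCase();
      if (lower.endsWith(".pdf") || lower.endsWith(".doc")) {
        return null;
      }
      return urlString;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }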

Dennis Kubes
> 
> Thanks in advance,
> 
> 
> 
> Ricardo J. Méndez
> http://ricardo.strangevistas.net/