Posted to user@nutch.apache.org by "Hartl, Florian" <fl...@sap.com> on 2011/12/14 02:11:17 UTC

how to adjust 'content'

Hi there,

I am fairly new to Nutch, and this is my first message to this mailing list.
Even after a fair amount of research I have not been able to solve my issue. I hope you can help me.

My question is whether it is possible to adjust the data that is stored in "content" so that
1.      a "." is added after every crawled part of a website. So if there are two headlines right next to each other (<h1>headline1</h1><h2>headline2</h2>), the result would be: "headline1. headline2"
2.      it is somehow indicated whether a crawled word/phrase is a link or not.

If yes, how much work would it be? And most importantly how would it work?

Thanks for your help!


Re: how to adjust 'content'

Posted by Sebastian Nagel <wa...@googlemail.com>.
On 12/14/2011 07:41 AM, Avni, Itamar wrote:
 > Regarding (1), I'd suggest plugging in your own additional implementation of HtmlParseFilter, where
 > you can manipulate the content as you like and set it back as the ParseText of the returned ParseResult.

On 12/14/2011 02:11 AM, Hartl, Florian wrote:
 > 2.      it is somehow indicated whether a crawled word/phrase is a link or not.

If you write your own parse-filter plugin as suggested by Itamar, you have to traverse
the DOM tree and construct the content anew. Links are just elements named "a", "img", "iframe",
etc., and their children contain the "anchor texts" (the words that are links).
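
To illustrate the traversal described above, here is a minimal sketch that uses only
the JDK's org.w3c.dom API, not Nutch itself. The class name DomTextSketch, the
"[link: ...]" marker, and the set of elements that get a trailing "." are all
assumptions made for illustration. Only "a" elements are marked; "img", "iframe", etc.
are left out. In a real parse filter the Node passed to toText would be the DOM tree
handed to the filter method, and the resulting string would still have to be set back
as the parse text.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Illustrative only: rebuilds plain text from a DOM tree. Not part of Nutch. */
final class DomTextSketch {

    // Assumption: elements after which a "." should be appended.
    private static final Set<String> BLOCKS =
            Set.of("h1", "h2", "h3", "h4", "h5", "h6", "p", "li");

    /** Walks the tree below root and returns the reconstructed text. */
    static String toText(Node root) {
        StringBuilder out = new StringBuilder();
        walk(root, out);
        return out.toString().trim();
    }

    private static void walk(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            // Crude whitespace normalization: collapse runs, trim the ends.
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                separate(out);
                out.append(text);
            }
            return;
        }
        // HTML parsers may report element names in upper or lower case.
        String name = node.getNodeName().toLowerCase(Locale.ROOT);
        boolean isLink = name.equals("a");
        int before = out.length();
        if (isLink) {
            separate(out);
            out.append("[link: "); // hypothetical marker for anchor text
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out);
        }
        if (isLink) {
            out.append("]");
        }
        // Close a block with "." only if it produced text and has no end mark yet.
        if (BLOCKS.contains(name) && out.length() > before) {
            char last = out.charAt(out.length() - 1);
            if (last != '.' && last != '!' && last != '?') {
                out.append('.');
            }
        }
    }

    /** Appends a single space unless the buffer is empty or already ends in one. */
    private static void separate(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
            out.append(' ');
        }
    }

    /** Demo helper: parses a well-formed XHTML snippet and renders it. */
    static String render(String xhtml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return toText(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xhtml, e);
        }
    }

    public static void main(String[] args) {
        // prints: headline1. headline2.
        System.out.println(render("<div><h1>headline1</h1><h2>headline2</h2></div>"));
        // prints: see [link: the docs] first.
        System.out.println(render(
                "<p>see <a href=\"http://example.org/\">the docs</a> first</p>"));
    }
}
```

Note that the sketch always puts a space between adjacent text nodes, so inline markup
in the middle of a word (e.g. <b>bo</b>ld) would be split. A real filter would need a
more careful whitespace rule.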

If you are only interested in the "anchor texts" without their context, you can access
the links directly via (I hope I'm correct, so better check the documentation):
   parseResult.get(url).getData().getOutlinks()

The filter method you have to implement receives both as arguments: the DOM tree (a DocumentFragment) and
the ParseResult.

 > If yes, how much work would it be? And most importantly how would it work?
There is a pretty good tutorial on writing a plugin:
   http://wiki.apache.org/nutch/WritingPluginExample
You could also look at one of the parse-filter plugins that ship with Nutch, e.g.,
microformats-reltag or creativecommons.

Regards, Sebastian

RE: how to adjust 'content'

Posted by "Avni, Itamar" <It...@verint.com>.
Regarding (1), I'd suggest plugging in your own additional implementation of HtmlParseFilter, where you can manipulate the content as you like and set it back as the ParseText of the returned ParseResult.
I'm less familiar with (2).

Itamar
