Posted to user@nutch.apache.org by Rohan Thakur <ro...@gmail.com> on 2013/03/11 10:34:43 UTC

does nutch take care of any format change in the websites that is been crawled

hi

I am new to Nutch. I wanted to know whether Nutch handles format changes
in the pages we have set it to crawl, without requiring any manual
changes on our side. For example, suppose we want to extract the price
and model number from particular URLs and have configured Nutch to do
so. If the site then changes the way the model name and price are
displayed on those pages, say by changing the tags, will we still be
able to extract the required data without changing anything in Nutch?

thanks

regards
Rohan

Re: does nutch take care of any format change in the websites that is been crawled

Posted by Gora Mohanty <go...@mimirtech.com>.
On 11 March 2013 15:04, Rohan Thakur <ro...@gmail.com> wrote:
> hi
>
> I am new to Nutch. I wanted to know whether Nutch handles format changes
> in the pages we have set it to crawl, without requiring any manual
> changes on our side. For example, suppose we want to extract the price
> and model number from particular URLs and have configured Nutch to do
> so. If the site then changes the way the model name and price are
> displayed on those pages, say by changing the tags, will we still be
> able to extract the required data without changing anything in Nutch?

If I understand you correctly, you are asking if Nutch
can automatically adapt to changes in a web page's
structure. The answer is no, beyond maybe something
trivial that can be captured by an extension to Nutch's
HtmlParser. Maybe you could give an example of what
you are trying to accomplish.
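
To illustrate why extraction does not adapt on its own, here is a
minimal sketch in plain Java. It is not Nutch code and does not use any
Nutch API: the class name PriceExtractor, the rule PRICE_RULE, and both
page snippets are made up for the example, and the snippets are assumed
to be well-formed XHTML so that the JDK's XPath support can read them.
The rule is written against the markup as it looked when the extraction
was configured.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class PriceExtractor {
    // Hypothetical rule, tied to the page layout at configuration time.
    static final String PRICE_RULE = "//span[@class='price']";

    // Returns the text of the first node matched by the rule,
    // or an empty string when the rule matches nothing.
    static String extract(String xhtml, String rule) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            InputSource source = new InputSource(new StringReader(xhtml));
            return xpath.evaluate(rule, source).trim();
        } catch (XPathExpressionException e) {
            // Raised for an invalid rule or for markup that is not well-formed.
            throw new IllegalArgumentException("cannot evaluate rule: " + rule, e);
        }
    }

    public static void main(String[] args) {
        // Assumed markup before and after a site redesign.
        String before = "<div><span class='price'>499</span></div>";
        String after = "<div><p id='cost'>499</p></div>";

        System.out.println("before: '" + extract(before, PRICE_RULE) + "'");
        System.out.println("after:  '" + extract(after, PRICE_RULE) + "'");
    }
}
```

With the first snippet the rule returns 499. With the second it returns
an empty string, because nothing on the page matches the old tag and
class any more. The price is still on the page, but the rule has to be
updated by hand before it is found again.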

Regards,
Gora