Posted to user@nutch.apache.org by NamNH <n....@gmail.com> on 2006/09/08 09:59:10 UTC
Customize the crawl process
I want to customize the crawl process by changing the way pages are
stored. As far as I know, Nutch stores fetched web pages in a binary file, page
by page. After a link analysis step, Nutch crawls to the destination
page and downloads it. When pages are stored, I want to write only the links to a
separate text/binary file, with the structure shown in the example below.
E.g., assume that page A links to pages B and C, and that we number them 1, 2 and
3. I want my file to contain one pair per line:
1 2
1 3
and so on.
How can I do this with Nutch? Please give me some hints. Thank you very
much.
--
NamNH
-------------------------------------------
Contacts
Cell 0912500501
Office 8581530
Re: Customize the crawl process
Posted by Dennis Kubes <nu...@dragonflymc.com>.
You would need to modify Fetcher line 433 to use a text output format,
like this:
job.setOutputFormat(TextOutputFormat.class);
and you would need to modify Fetcher line 307 to collect only the
information you are looking for, maybe something like this:
Outlink[] links = parse.getData().getOutlinks();
for (int i = 0; i < links.length; i++) {
output.collect(key, links[i]);
}
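The loop above collects the outlinks themselves, so the output would presumably
contain URL text rather than the page numbers asked for in the question. Below
is a rough sketch of the numbering step in plain Java, with no Nutch or Hadoop
dependencies. The class EdgeListWriter and its methods are made up for
illustration, and it assumes the links are already available as (source URL,
target URL) pairs; ids are simply handed out in order of first appearance.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: numbers URLs and emits "from to" lines.
public class EdgeListWriter {
    // URL -> id, assigned as 1, 2, 3, ... in order of first appearance.
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Returns the id for a URL, assigning the next free id if it is new.
    public int idFor(String url) {
        Integer id = ids.get(url);
        if (id == null) {
            id = ids.size() + 1;
            ids.put(url, id);
        }
        return id;
    }

    // Turns (source URL, target URL) pairs into "sourceId targetId" lines.
    public List<String> toEdgeList(List<String[]> links) {
        List<String> lines = new ArrayList<>();
        for (String[] link : links) {
            int from = idFor(link[0]);
            int to = idFor(link[1]);
            lines.add(from + " " + to);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed input: page A links to pages B and C.
        List<String[]> links = new ArrayList<>();
        links.add(new String[] {"http://example.com/A", "http://example.com/B"});
        links.add(new String[] {"http://example.com/A", "http://example.com/C"});
        for (String line : new EdgeListWriter().toEdgeList(links)) {
            System.out.println(line);
        }
    }
}
```

For the example in the question this prints "1 2" and then "1 3". Keeping the
map in memory only works for a single process and a modest number of URLs; a
large distributed crawl would need some other way to assign ids consistently.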
Dennis