Posted to dev@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2005/08/15 19:20:03 UTC
Fetcher, ParseText, ParseData - need to modify
I just caught some output from Fetcher.FetcherThread.outputPage(.) and
noticed that some anchors appear in the text, along with the contents of
some <option> tags.
LOG.info("ParseText = "+text);
LOG.info("ParseData = "+ parseData);
I'd like to modify this behaviour: ParseText should contain only the subset
of the text that I need, and ParseData should contain all anchors.
Where should I start? It would be nice to have plugins that modify Fetcher
behaviour...
Re: FW: Fetcher, ParseText, ParseData - need to modify
Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
To change Nutch's standard HTML parsing, the best place to start would
probably be the parse-html plugin.
Regards
Piotr
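
As a rough illustration of the kind of change discussed in this thread, the sketch below walks a DOM tree, skips dropdown content (<select>/<option>), and collects anchors separately from the body text. It is not Nutch's parse-html code: the class TextAndAnchors and its methods are hypothetical names, it assumes the goal is to keep anchor text out of ParseText, and it uses the JDK's XML parser, so the input must be well-formed markup (real-world HTML would need a lenient HTML parser to build the DOM first).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch.
class TextAndAnchors {
    final StringBuilder text = new StringBuilder();   // would feed ParseText
    final List<String> anchors = new ArrayList<>();   // would feed ParseData

    // Assumption: input is well-formed (XHTML-like) markup.
    static TextAndAnchors extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            TextAndAnchors out = new TextAndAnchors();
            out.walk(doc.getDocumentElement());
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    private void walk(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            appendText(node.getNodeValue());
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // ignore comments, processing instructions, etc.
        }
        String name = node.getNodeName().toLowerCase(Locale.ROOT);
        if (name.equals("select") || name.equals("option")) {
            return; // drop dropdown content from the extracted text
        }
        if (name.equals("a")) {
            Element a = (Element) node;
            // Record the link and its anchor text, but keep it out of the body text.
            anchors.add(a.getAttribute("href") + " -> " + a.getTextContent().trim());
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c);
        }
    }

    private void appendText(String s) {
        String t = s.trim();
        if (t.isEmpty()) {
            return;
        }
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(t);
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Digital cameras on sale.</p>"
                + "<select><option>Any</option><option>Camcorders</option></select>"
                + "<a href=\"/cases\">Cases</a><p>Free shipping.</p></body></html>";
        TextAndAnchors r = extract(page);
        System.out.println("ParseText = " + r.text);
        System.out.println("ParseData anchors = " + r.anchors);
    }
}
```

Returning early on <select> and <a> is the whole trick: their subtrees are never visited, so their text cannot leak into the body text. Whether anchor text should be excluded entirely or kept in both places is a design choice that depends on what the index needs.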
Fuad Efendi wrote:
> 1. This is part of ParseText:
> Any Accessories Backup Devices & Media Barebone Systems Camcorder
> Accessories Camcorders Cases & External Enclosures CD / DVD Drives &
> Media Cooling Devices Digital Camera Accessories Digital Cameras
>
> - it is the content of a dropdown (<option> tags in the HTML)
>
>
> 2. I have some sub-text in ParseText which seems to be anchor text; I
> compared it visually with the web page...
FW: Fetcher, ParseText, ParseData - need to modify
Posted by Fuad Efendi <fu...@efendi.ca>.
1. This is part of ParseText:
Any Accessories Backup Devices & Media Barebone Systems Camcorder
Accessories Camcorders Cases & External Enclosures CD / DVD Drives &
Media Cooling Devices Digital Camera Accessories Digital Cameras
- it is the content of a dropdown (<option> tags in the HTML)
2. I have some sub-text in ParseText which seems to be anchor text; I
compared it visually with the web page...