Posted to dev@nutch.apache.org by Fuad Efendi <fu...@efendi.ca> on 2005/08/15 19:20:03 UTC

Fetcher, ParseText, ParseData - need to modify

I just captured some output from Fetcher.FetcherThread.outputPage(.) and
noticed that some anchor text ends up in the text, along with the content
of some <OPTION> tags.
          LOG.info("ParseText = "+text);
          LOG.info("ParseData = "+ parseData);

I'd like to modify this behaviour: ParseText should contain only the subset
of the text that I need, and ParseData should contain all anchors.

Where should I start? It would be nice to have plugins that modify Fetcher
behaviour...

Re: FW: Fetcher, ParseText, ParseData - need to modify

Posted by Piotr Kosiorowski <pk...@gmail.com>.
Hello,
To change Nutch's standard HTML parsing, the best place to start would
probably be the parse-html plugin.
Regards
Piotr
Fuad Efendi wrote:
> 1. This is part of ParseText:
> Any Accessories Backup Devices & Media Barebone Systems Camcorder
> Accessories Camcorders Cases & External Enclosures CD / DVD Drives &
> Media Cooling Devices Digital Camera Accessories Digital Cameras
> 
> - this is the content of a dropdown (<OPTION> elements in the HTML)
> 
> 
> 2. Some of the text in ParseText appears to be anchor text; I compared
> it visually with the web page.
> 
> 
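
As a rough illustration of the change discussed in this thread (this is not
the actual parse-html code), the sketch below walks a DOM tree, skips the
content of <select>/<option> elements, and collects anchor text in a separate
list instead of mixing it into the page text. The names TextAnchorSplit,
Result and split are hypothetical. It also assumes well-formed XHTML input so
that the JDK's XML parser can be used; real-world pages would need a tolerant
HTML parser in front of the same walk.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: separates page text from anchor text.
public class TextAnchorSplit {

    /** Holds the filtered page text and the anchor texts found on the page. */
    public static final class Result {
        public final StringBuilder text = new StringBuilder();
        public final List<String> anchors = new ArrayList<>();
    }

    /** Parses well-formed XHTML (an assumption of this sketch) and splits it. */
    public static Result split(String xhtml) throws Exception {
        Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)))
                .getDocumentElement();
        Result result = new Result();
        walk(root, result);
        return result;
    }

    private static void walk(Node node, Result out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            append(out.text, node.getNodeValue());
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // ignore comments, processing instructions, etc.
        }
        String name = node.getNodeName().toLowerCase(Locale.ROOT);
        if (name.equals("select") || name.equals("option")) {
            return; // drop dropdown entries from the text entirely
        }
        if (name.equals("a")) {
            // Keep anchor text out of the page text; record it separately.
            String anchor = node.getTextContent().trim().replaceAll("\\s+", " ");
            if (!anchor.isEmpty()) {
                out.anchors.add(anchor);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), out);
        }
    }

    /** Appends text with whitespace collapsed, separated by single spaces. */
    private static void append(StringBuilder sb, String raw) {
        String s = raw.trim().replaceAll("\\s+", " ");
        if (s.isEmpty()) {
            return;
        }
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(s);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Digital cameras on sale.</p>"
                + "<select><option>Any</option><option>Camcorders</option></select>"
                + "<a href=\"/cases\">Cases &amp; External Enclosures</a>"
                + "<p>Free shipping.</p></body></html>";
        Result r = split(page);
        System.out.println("ParseText = " + r.text);
        System.out.println("anchors   = " + r.anchors);
    }
}
```

Whether anchor text should be dropped from the page text entirely, or kept
there and also recorded separately, is a design choice; the sketch drops it
because that matches the request in the original message.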


FW: Fetcher, ParseText, ParseData - need to modify

Posted by Fuad Efendi <fu...@efendi.ca>.
1. This is part of ParseText:
Any Accessories Backup Devices & Media Barebone Systems Camcorder
Accessories Camcorders Cases & External Enclosures CD / DVD Drives &
Media Cooling Devices Digital Camera Accessories Digital Cameras

- this is the content of a dropdown (<OPTION> elements in the HTML)


2. Some of the text in ParseText appears to be anchor text; I compared
it visually with the web page.

