Posted to user@nutch.apache.org by Ragy Eleish <ra...@gmail.com> on 2006/02/23 19:13:44 UTC

Extracting multiple entries from a single URL

Hi,

I need to get multiple search result entries from a single URL. For
example, I want to index the photo captions at this URL,
http://racer007.albumpost.com/montreal, without having to navigate to each
picture page, because sometimes there is no individual picture page.

I did it by writing an HTMLParserFilter, modifying ParseData and Fetcher,
and then disabling the duplicate-cleaning code in the CrawlerTool. I did this in
Nutch 0.7.1. Is there a better way of doing this?

Regards

--Ragy

Re: Extracting multiple entries from a single URL

Posted by Stefan Groschupf <sg...@media-style.com>.

Hi Ragy,
Well, this may be difficult. The problem is that Nutch 0.7 and 0.8 are
centered around URLs.
You have the crawl DB or webdb, and the keys of these entries are
URLs. So getting the complete workflow running with multiple documents
per URL would be difficult.
However, what you can do in Nutch 0.8 is write your own MapReduce job,
where the mapper extracts the content and the reducer generates
an index from it.
In that case you would also need to write a custom UI, since the data
structures will be different, e.g. you would not have the segments data.
Anyway, it is possible, but it takes some work.
HTH
Stefan
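
A minimal sketch of the mapper/reducer idea described above, written in
plain Java rather than against the Nutch or Hadoop APIs. Everything named
here is hypothetical: the MultiEntrySketch class, the Entry record, the
"#n" suffix used to give each caption its own key, and the assumption that
captions live in the alt attribute of <img> tags. The map step turns one
page into several entries; the reduce step groups them into a small
inverted index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; it does not use the Nutch or Hadoop APIs.
class MultiEntrySketch {

    // Hypothetical record: one search entry extracted from a page.
    static final class Entry {
        final String key;      // page URL plus "#n", so each caption has its own key
        final String caption;

        Entry(String key, String caption) {
            this.key = key;
            this.caption = caption;
        }
    }

    // Assumption: captions are the alt text of <img> tags. Real pages may differ.
    private static final Pattern ALT = Pattern.compile(
            "<img\\b[^>]*\\balt=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // "Map" step: one fetched page in, many entries out.
    static List<Entry> map(String url, String html) {
        List<Entry> out = new ArrayList<>();
        Matcher m = ALT.matcher(html);
        int n = 0;
        while (m.find()) {
            String caption = m.group(1).trim();
            if (!caption.isEmpty()) {
                out.add(new Entry(url + "#" + n, caption));
                n++;
            }
        }
        return out;
    }

    // "Reduce" step: group entry keys by caption term into a tiny inverted index.
    static Map<String, TreeSet<String>> reduce(List<Entry> entries) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (Entry e : entries) {
            for (String term : e.caption.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(e.key);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String url = "http://example.com/album";
        String html = "<img src=\"a.jpg\" alt=\"Old Port at night\">"
                + "<img src=\"b.jpg\" alt=\"Night market\">";
        // Prints the term -> entry-key index built from the two captions.
        System.out.println(reduce(map(url, html)));
    }
}
```

Because the keys are no longer plain page URLs, a search front end built
on such an index would have to map each key back to its page itself, which
is the custom UI work mentioned above.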

---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com