Posted to user@nutch.apache.org by Ragy Eleish <ra...@gmail.com> on 2006/02/23 19:13:44 UTC
Extracting multiple entries from a single URL
Hi,
I need to get multiple search result entries from a single URL. For
example, I want to index the photo captions at
http://racer007.albumpost.com/montreal without having to navigate to each
picture page, because sometimes there is no individual picture page.
I did it by writing an HTMLParserFilter, modifying ParseData and Fetcher,
and then disabling the duplicate-cleaning code in the CrawlerTool. I did this
in Nutch 0.7.1. Is there a better way of doing this?
Regards
--Ragy
Re: Extracting multiple entries from a single URL
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi Ragy,
Well, this may be difficult. The problem is that Nutch 0.7 and 0.8 are
centered around URLs.
You have the crawl DB or webdb, and the keys of these entries are
URLs. So getting the complete workflow running with multiple documents
per URL would be difficult.
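To illustrate the constraint: a store keyed by URL can hold only one record per page, so several captions from the same page overwrite each other. The following is a minimal sketch, assuming a plain dictionary stands in for the crawl DB and that a hypothetical `#caption-N` suffix is appended to make keys unique. Neither is Nutch's actual data structure or API, and the example URL and captions are made up.

```python
def store_by_url(captions_by_page):
    """Keep one record per URL, the way a URL-keyed store would."""
    db = {}
    for url, captions in captions_by_page:
        for caption in captions:
            db[url] = caption  # each caption overwrites the previous one
    return db


def store_by_synthetic_key(captions_by_page):
    """Give every caption its own key (hypothetical '#caption-N' suffix)."""
    db = {}
    for url, captions in captions_by_page:
        for n, caption in enumerate(captions):
            db[f"{url}#caption-{n}"] = caption
    return db


# Made-up sample data: one page carrying three captions.
pages = [("http://example.com/album", ["Old port", "Mont Royal", "Biodome"])]
```

With the URL as the key only the last caption survives, while the synthetic keys keep all three. Anything downstream that assumes one key per fetched URL (such as duplicate removal) would still have to be taught about the suffix.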
However, what you can do in Nutch 0.8 is write your own map reduce job,
where the mapper extracts the content and the reducer generates
an index from that.
In such a case you also need to write a custom UI, since the data
structures will be different, e.g. you would not have the segments data.
Anyway, it is possible, but it takes some work.
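The mapper/reducer split described above might look roughly like the sketch below. It is plain Python, not Hadoop or Nutch code. It assumes the captions live in the `alt` attribute of `<img>` tags and uses a hypothetical `#caption-N` key per caption; the function names `map_page` and `reduce_terms` and the sample HTML are invented for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CaptionParser(HTMLParser):
    """Collects the alt text of every <img> tag (assumed to hold the caption)."""

    def __init__(self):
        super().__init__()
        self.captions = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.captions.append(alt)


def map_page(url, html):
    """Map step: emit one (term, entry_key) pair per word of each caption."""
    parser = CaptionParser()
    parser.feed(html)
    parser.close()
    for n, caption in enumerate(parser.captions):
        entry_key = f"{url}#caption-{n}"  # hypothetical key, one per caption
        for term in caption.lower().split():
            yield term, entry_key


def reduce_terms(pairs):
    """Reduce step: group entry keys by term into a small inverted index."""
    index = defaultdict(set)
    for term, entry_key in pairs:
        index[term].add(entry_key)
    return {term: sorted(keys) for term, keys in index.items()}


# Made-up sample page with two captioned images.
html = ('<img src="1.jpg" alt="Old port at night">'
        '<img src="2.jpg" alt="Mont Royal at dawn">')
index = reduce_terms(map_page("http://example.com/album", html))
```

Each caption becomes its own searchable entry, so a query for "royal" resolves to the second caption only. As noted, the result is not a standard segment-backed index, so the search front end would have to be written to read it.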
HTH
Stefan
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com