Posted to user@nutch.apache.org by dp...@comcast.net on 2005/09/24 05:08:32 UTC

Documents in Nutch

Hello,

I'm trying to determine whether I could use Nutch for a project, and I'm having some conceptual difficulties.

It appears that Nutch indexes by page. Each page/url is a Lucene document, with fields for content, title, url, boost, etc... but I want to have a set of pages represented by a single document. Is that possible?

For example, suppose I have a merchant who sells collectibles, and I have some information on that merchant, such as name, general description, store location, hours of operation, contact information, etc... and I also have a URL to his site where more information is available. I want to be able to search for merchants based on keywords found in the name, general description, or any of the pages crawled by the url they specified. Is this possible?

It seems that to do this I'd have to add fields for name, general description, etc... to each of the pages (docs) crawled, which seems like a lot of redundancy. Is there a better way to do this?

Thanks,

-jim

Re: Documents in Nutch

Posted by EM <em...@cpuedge.com>.
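Add a unique identifier (for example, a merchant ID) as a field on each crawled document, and keep the merchant details (name, description, hours, and so on) in a separate external database keyed by that identifier.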
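
A rough sketch of that idea, in Python rather than against the actual Nutch or Lucene APIs: the `pages` list stands in for the page index, and an in-memory SQLite table stands in for the external database. The `merchant_id` field, the table layout, the sample data, and the `search_merchants` helper are all made up for illustration. Each page stays its own document and carries only the identifier, so the merchant details are stored once.

```python
import sqlite3

# Hypothetical external database: one row per merchant, stored exactly once.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE merchants ("
    "merchant_id TEXT PRIMARY KEY, name TEXT, description TEXT)"
)
db.executemany(
    "INSERT INTO merchants VALUES (?, ?, ?)",
    [
        ("m1", "Acme Collectibles", "Vintage toys and comic books"),
        ("m2", "Coin Corner", "Rare coins and stamps"),
    ],
)

# Stand-in for the page index: every crawled page is a separate document
# that carries only the merchant_id as an extra field.
pages = [
    {"url": "http://acme.example/", "merchant_id": "m1",
     "content": "welcome to our store"},
    {"url": "http://acme.example/trains", "merchant_id": "m1",
     "content": "model trains and railway sets"},
    {"url": "http://coins.example/", "merchant_id": "m2",
     "content": "silver dollars and proof sets"},
]


def search_merchants(keyword):
    """Return (merchant_id, name) for merchants matching the keyword in
    their name, their description, or any of their crawled pages."""
    kw = keyword.lower()

    # 1. Page hits: collect the identifiers of merchants whose pages match.
    ids = {p["merchant_id"] for p in pages if kw in p["content"].lower()}

    # 2. Database hits: match the keyword against name and description.
    like = f"%{kw}%"
    rows = db.execute(
        "SELECT merchant_id FROM merchants "
        "WHERE lower(name) LIKE ? OR lower(description) LIKE ?",
        (like, like),
    )
    ids.update(row[0] for row in rows)

    if not ids:
        return []

    # 3. Resolve the combined identifiers to one result per merchant.
    placeholders = ",".join("?" * len(ids))
    return db.execute(
        "SELECT merchant_id, name FROM merchants "
        f"WHERE merchant_id IN ({placeholders}) ORDER BY merchant_id",
        sorted(ids),
    ).fetchall()
```

Calling `search_merchants("trains")` finds the merchant through one of its pages, while `search_merchants("coins")` finds one through its description. Several page hits for the same merchant collapse into a single result because they share an identifier. Ranking across the two sources is left out of this sketch.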