Posted to user@nutch.apache.org by Li Zheng wei <ma...@hotmail.com> on 2007/06/10 23:55:29 UTC
How to add parsed metadata to Parse.getData?
Hi, I am in the process of indexing meta tags in HTML. I know that Nutch
does not add the meta tag information to the parse metadata itself, so I
need to write a plugin to do that.
The problem is:
the code I found for doing this,
parse.getData().getMeta().put(.......);
does not compile, and I don't know why.
Thanks very much!
Mark
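
The compile error is most likely because ParseData has no getMeta() method
returning a mutable map; in Nutch 0.8 and later the parse-level metadata
is a Metadata object reached via parse.getData().getParseMeta(), and
entries are added with add(name, value) rather than put. A minimal sketch,
where the Metadata class below is only a stand-in mimicking the real
org.apache.nutch.metadata.Metadata for illustration:

```java
import java.util.*;

// Stand-in for org.apache.nutch.metadata.Metadata (illustration only):
// the real class stores multi-valued name/value pairs and exposes
// add(String, String) and get(String).
class Metadata {
    private final Map<String, List<String>> data = new HashMap<>();

    void add(String name, String value) {
        data.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    String get(String name) {
        List<String> values = data.get(name);
        return values == null ? null : values.get(0);
    }
}

public class ParseMetaSketch {
    public static void main(String[] args) {
        // In a Nutch parse filter this object would come from
        // parse.getData().getParseMeta() instead of being created here.
        Metadata parseMeta = new Metadata();
        parseMeta.add("keywords", "nutch,crawler"); // add(), not put()
        System.out.println(parseMeta.get("keywords"));
    }
}
```

If the version in use is older than 0.8, the metadata API differs, so
check the ParseData javadoc for the exact accessor names.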
>From: Andrzej Bialecki <ab...@getopt.org>
>Reply-To: nutch-user@lucene.apache.org
>To: nutch-user@lucene.apache.org
>Subject: Re: Crawling the web and going into depth
>Date: Sun, 10 Jun 2007 18:58:40 +0200
>
>Enzo Michelangeli wrote:
>>----- Original Message ----- From: "Andrzej Bialecki"
>><ab...@getopt.org>
>>Sent: Sunday, June 10, 2007 5:48 PM
>>
>>>Enzo Michelangeli wrote:
>>>>----- Original Message ----- From: "Berlin Brown"
>>>><be...@gmail.com>
>>>>Sent: Sunday, June 10, 2007 11:24 AM
>>>>
>>>>>Yeah, but how do I crawl the actual pages like you would in an
>>>>>intranet crawl? For example, let's say that I have 20 URLs in my
>>>>>set from the DmozParser. Let's also say that I want to go 3 levels
>>>>>deep into those 20 URLs. Is that possible?
>>>>>
>>>>>For example, with the intranet crawl I would start with some seed
>>>>>URL and then crawl to some depth. How would I do that with URLs
>>>>>fetched from, for example, dmoz?
>>>>
>>>>The only way I can imagine is doing it on a host-by-host basis,
>>>>restricting the host you crawl at various stages with a
>>>>URLFilter, e.g. by changing the content of regex-urlfilter.txt.
>>>
>>>One simple and efficient way to limit the maximum depth (i.e. the
>>>number of path elements) for any given site is to ... count the
>>>slashes ;) You can do it in a regex, or you can implement your own
>>>URLFilter plugin that does exactly this.
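The slash-counting idea can be sketched as a standalone method; a real
plugin would implement Nutch's URLFilter interface, whose filter(String)
method returns the URL to keep it or null to reject it. The class name
and the MAX_DEPTH value below are illustrative assumptions:

```java
import java.net.URI;

public class DepthUrlFilter {
    // Maximum number of path elements (slashes) to allow; adjust to taste.
    static final int MAX_DEPTH = 3;

    // Mirrors the URLFilter contract: return the URL to keep it,
    // null to reject it.
    static String filter(String url) {
        try {
            String path = new URI(url).getPath();
            if (path == null) return url; // no path component at all
            int slashes = 0;
            for (char c : path.toCharArray()) {
                if (c == '/') slashes++;
            }
            return slashes <= MAX_DEPTH ? url : null;
        } catch (Exception e) {
            return null; // malformed URL: reject
        }
    }

    public static void main(String[] args) {
        // 3 path slashes: kept (prints the URL)
        System.out.println(filter("http://example.com/a/b/page.html"));
        // 5 path slashes: rejected (prints null)
        System.out.println(filter("http://example.com/a/b/c/d/page.html"));
    }
}
```

The same limit can likely be expressed as a single exclusion regex in
regex-urlfilter.txt, but a dedicated plugin keeps the intent explicit.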
>>
>>Well, it depends on what you mean by "depth": maybe Berlin wants to
>>limit the length of the chain of recursion (page1.html links to
>>page2.html, which links to page3.html, and we stop there). Also,
>>these days many sites, like blogs or CMS-based ones, have
>>dynamically-generated content, with no relationship between '/' and
>>the tree structure of the server's filesystem.
>
>Yes, there could be different definitions of depth.
>
>When it comes to depth as in the sense of proximity, i.e. how many
>levels removed the page is from the starting point - no problem with
>that either ;) Here's how you can do it: put a counter in
>CrawlDatum.metadata, and pass it around to newly discovered pages,
>increasing it by one. When you reach a limit, you stop adding
>outlinks from such pages.
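
The counter scheme above can be simulated outside Nutch as a plain
breadth-first walk over a link graph; here the depth map stands in for
the value that would travel in CrawlDatum.metadata, and the graph and
limit are made up for illustration:

```java
import java.util.*;

public class DepthLimitedCrawl {
    // Pages more than MAX_DEPTH links from the seed are never discovered.
    static final int MAX_DEPTH = 2;

    static Set<String> crawl(Map<String, List<String>> links, String seed) {
        Map<String, Integer> depth = new HashMap<>(); // the "metadata" counter
        Deque<String> queue = new ArrayDeque<>();
        depth.put(seed, 0);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            int d = depth.get(page);
            if (d >= MAX_DEPTH) continue; // at the limit: drop outlinks
            for (String out : links.getOrDefault(page, List.of())) {
                if (!depth.containsKey(out)) {
                    depth.put(out, d + 1); // pass the counter on, plus one
                    queue.add(out);
                }
            }
        }
        return depth.keySet();
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "page1", List.of("page2"),
                "page2", List.of("page3"),
                "page3", List.of("page4"));
        // page4 sits 3 links from the seed, beyond MAX_DEPTH, so it is
        // never fetched.
        System.out.println(crawl(links, "page1"));
    }
}
```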
>
>If I'm not mistaken it could be handled throughout the whole cycle
>if you use a ScoringPlugin.
>
>
>--
>Best regards,
>Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__||  \|  || |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
_________________________________________________________________
Enjoy the world's largest email system, MSN Hotmail. http://www.hotmail.com