Posted to user@nutch.apache.org by Li Zheng wei <ma...@hotmail.com> on 2007/06/10 23:55:29 UTC
How to add parsed metadata to Parse.getData?
Hi, I am in the process of indexing meta tags in HTML. I know that Nutch
does not add the meta tag information to the parse metadata itself, so I
need to write a plugin to do that.
The problem is:
the code I found for doing this,
parse.getData().getMeta().put(.......);
does not compile, and I don't know why.
Thanks very much!
Mark
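
The compile error is most likely because ParseData has no getMeta() method
returning a mutable map; in Nutch 0.8 and later the parse-level metadata
is a Metadata object reached via parse.getData().getParseMeta(), and
entries are added with add(name, value) rather than put. A minimal sketch,
where the Metadata class below is only a stand-in mimicking the real
org.apache.nutch.metadata.Metadata for illustration:

```java
import java.util.*;

// Stand-in for org.apache.nutch.metadata.Metadata (illustration only):
// the real class stores multi-valued name/value pairs and exposes
// add(String, String) and get(String).
class Metadata {
    private final Map<String, List<String>> data = new HashMap<>();

    void add(String name, String value) {
        data.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    String get(String name) {
        List<String> values = data.get(name);
        return values == null ? null : values.get(0);
    }
}

public class ParseMetaSketch {
    public static void main(String[] args) {
        // In a Nutch parse filter this object would come from
        // parse.getData().getParseMeta() instead of being created here.
        Metadata parseMeta = new Metadata();
        parseMeta.add("keywords", "nutch,crawler"); // add(), not put()
        System.out.println(parseMeta.get("keywords"));
    }
}
```

If the version in use is older than 0.8, the metadata API differs, so
check the ParseData javadoc for the exact accessor names.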
>From: Andrzej Bialecki <ab...@getopt.org>
>Reply-To: nutch-user@lucene.apache.org
>To: nutch-user@lucene.apache.org
>Subject: Re: Crawling the web and going into depth
>Date: Sun, 10 Jun 2007 18:58:40 +0200
>
>Enzo Michelangeli wrote:
>>----- Original Message ----- From: "Andrzej Bialecki"
>><ab...@getopt.org>
>>Sent: Sunday, June 10, 2007 5:48 PM
>>
>>>Enzo Michelangeli wrote:
>>>>----- Original Message ----- From: "Berlin Brown"
>>>><be...@gmail.com>
>>>>Sent: Sunday, June 10, 2007 11:24 AM
>>>>
>>>>>Yeah, but how do I crawl the actual pages like you would in an
>>>>>intranet crawl? For example, let's say that I have 20 URLs in my
>>>>>set from the DmozParser. Let's also say that I want to go 3 levels
>>>>>deep into those 20 URLs. Is that possible?
>>>>>
>>>>>For example, with the intranet crawl I would start with some seed
>>>>>URL and then crawl to some depth. How would I do that with URLs
>>>>>fetched from, for example, dmoz?
>>>>
>>>>The only way I can imagine is doing it on a host-by-host basis,
>>>>restricting the host you crawl at various stages with a
>>>>URLFilter, e.g. by changing the content of regex-urlfilter.txt.
>>>
>>>One simple and efficient way to limit the maximum depth (i.e. the
>>>number of path elements) for any given site is to ... count the
>>>slashes ;) You can do it in a regex, or you can implement your own
>>>URLFilter plugin that does exactly this.
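The slash-counting idea can be sketched as a standalone method; a real
plugin would implement Nutch's URLFilter interface, whose filter(String)
method returns the URL to keep it or null to reject it. The class name
and the MAX_DEPTH value below are illustrative assumptions:

```java
import java.net.URI;

public class DepthUrlFilter {
    // Maximum number of path elements (slashes) to allow; adjust to taste.
    static final int MAX_DEPTH = 3;

    // Mirrors the URLFilter contract: return the URL to keep it,
    // null to reject it.
    static String filter(String url) {
        try {
            String path = new URI(url).getPath();
            if (path == null) return url; // no path component at all
            int slashes = 0;
            for (char c : path.toCharArray()) {
                if (c == '/') slashes++;
            }
            return slashes <= MAX_DEPTH ? url : null;
        } catch (Exception e) {
            return null; // malformed URL: reject
        }
    }

    public static void main(String[] args) {
        // 3 path slashes: kept (prints the URL)
        System.out.println(filter("http://example.com/a/b/page.html"));
        // 5 path slashes: rejected (prints null)
        System.out.println(filter("http://example.com/a/b/c/d/page.html"));
    }
}
```

The same limit can likely be expressed as a single exclusion regex in
regex-urlfilter.txt, but a dedicated plugin keeps the intent explicit.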
>>
>>Well, it depends on what you mean by "depth": maybe Berlin wants to
>>limit the length of the chain of recursion (page1.html links to
>>page2.html, which links to page3.html, and we stop there). Also,
>>these days many sites, like blogs or CMS-based ones, have
>>dynamically-generated content, with no relationship between '/' and
>>the tree structure of the server's filesystem.
>
>Yes, there could be different definitions of depth.
>
>When it comes to depth as in the sense of proximity, i.e. how many
>levels removed the page is from the starting point - no problem with
>that either ;) Here's how you can do it: put a counter in
>CrawlDatum.metadata, and pass it around to newly discovered pages,
>increasing it by one. When you reach a limit, you stop adding
>outlinks from such pages.
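
The counter scheme above can be simulated outside Nutch as a plain
breadth-first walk over a link graph; here the depth map stands in for
the value that would travel in CrawlDatum.metadata, and the graph and
limit are made up for illustration:

```java
import java.util.*;

public class DepthLimitedCrawl {
    // Pages more than MAX_DEPTH links from the seed are never discovered.
    static final int MAX_DEPTH = 2;

    static Set<String> crawl(Map<String, List<String>> links, String seed) {
        Map<String, Integer> depth = new HashMap<>(); // the "metadata" counter
        Deque<String> queue = new ArrayDeque<>();
        depth.put(seed, 0);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String page = queue.poll();
            int d = depth.get(page);
            if (d >= MAX_DEPTH) continue; // at the limit: drop outlinks
            for (String out : links.getOrDefault(page, List.of())) {
                if (!depth.containsKey(out)) {
                    depth.put(out, d + 1); // pass the counter on, plus one
                    queue.add(out);
                }
            }
        }
        return depth.keySet();
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "page1", List.of("page2"),
                "page2", List.of("page3"),
                "page3", List.of("page4"));
        // page4 sits 3 links from the seed, beyond MAX_DEPTH, so it is
        // never fetched.
        System.out.println(crawl(links, "page1"));
    }
}
```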
>
>If I'm not mistaken it could be handled throughout the whole cycle
>if you use a ScoringPlugin.
>
>
>--
>Best regards,
>Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__||  \|  || |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
_________________________________________________________________
Enjoy the world's largest email system, MSN Hotmail. http://www.hotmail.com