Posted to dev@nutch.apache.org by Uroš Gruber <ur...@sir-mag.com> on 2006/09/06 19:43:05 UTC

[Fwd: Re: get CrawlDatum]

A while ago I posted this to the dev list but got no reply. I wonder 
whether this is the right approach and whether I should continue 
building this feature. Do you think this idea would help Nutch, or is 
it a dead end you've already discussed?

regards

Uros

Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand the code flow, the best place would be in Fetcher [262],
>>
>> but I'm not sure that the datum holds info about the URL being fetched.
>
> On the input to the fetcher you get a URL and a CrawlDatum (originally 
> coming from the crawldb). Check, for example, how the segment name is 
> passed around in metadata; you can use the same method.
>
Hi,

I made a draft patch, but there are still some problems I can see. I 
know the code needs to be cleaned up and tested, but right now I don't 
know what number to set for external URLs. For internal linking it 
works great.

Here is the whole idea of these changes.

Injected URLs always get hop 0. While fetching/updating/generating, the 
hop value is incremented by 1 (I still have no idea what to do with 
external links). Then I can add a config value such as max_hop to stop 
the fetcher and generator from creating URLs beyond that depth.

This way it's possible to limit crawling vertically (i.e. by depth).
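
To make this concrete, the draft does roughly the following at the 
point where the fetcher turns outlinks into new crawldb candidates. 
getHops()/setHops() are the accessors my patch adds to CrawlDatum, and 
db.max.hops is a property name I made up for the sketch:

  // sketch: emit outlinks only while we are still within max_hop
  int maxHops = conf.getInt("db.max.hops", 10); // made-up property name
  int hop = datum.getHops();                    // accessor added by the patch
  if (hop < maxHops) {                          // only follow if not too deep
    Outlink[] outlinks = parse.getData().getOutlinks();
    for (int i = 0; i < outlinks.length; i++) {
      CrawlDatum link = new CrawlDatum();
      link.setStatus(CrawlDatum.STATUS_LINKED);
      link.setHops(hop + 1);                    // one hop further from the seed
      // key type (Text vs. UTF8) depends on the Nutch version
      output.collect(new Text(outlinks[i].getToUrl()), link);
    }
  }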

Comments are welcome.




Re: [Fwd: Re: get CrawlDatum]

Posted by Uroš Gruber <ur...@sir-mag.com>.
Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> I made a draft patch, but there are still some problems I can see. I 
>> know the code needs to be cleaned up and tested, but right now I 
>> don't know what number to set for external URLs. For internal linking 
>> it works great.
>
> (the patch changes CrawlDatum itself; I think it would be better to 
> put the hop counter in CrawlDatum.metaData.)
>
I can try to do it with metaData.
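Something like this, mirroring how the segment name travels around. 
HOP_KEY and the helper names are made up for the sketch; the value 
types just follow what CrawlDatum.getMetaData() accepts:

  // placeholder key; real code would sit next to the other well-known keys
  private static final Text HOP_KEY = new Text("_hop_");

  static void setHop(CrawlDatum datum, int hop) {
    datum.getMetaData().put(HOP_KEY, new IntWritable(hop));
  }

  static int getHop(CrawlDatum datum) {
    // entries without a counter (e.g. freshly injected) count as hop 0
    IntWritable h = (IntWritable) datum.getMetaData().get(HOP_KEY);
    return (h == null) ? 0 : h.get();
  }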
>>
>> Here is the whole idea of these changes.
>>
>> Injected URLs always get hop 0. While fetching/updating/generating, 
>> the hop value is incremented by 1 (I still have no idea what to do 
>> with external links). Then I can add a config value such as max_hop 
>> to stop the fetcher and generator from creating URLs beyond that 
>> depth.
>>
>> This way it's possible to limit crawling vertically (i.e. by depth).
>>
>> Comments are welcome.
>
> Well, it really depends on what you want to do when you encounter an 
> external link. Do you want to restart the counter, i.e. crawl the new 
> site at full depth up to max_hop? Then set hop=0. Do you want to 
> terminate the crawl at that link? Then set hop=max_hop.
>
I talked with a friend about this and here is what we came up with. 
Let's say manually injected URLs are good, checked by a human, and are 
probably where you want to start from; so setting hop to 0 at injection 
is fine. While crawling, we already have some sort of filtering by host 
(regexp etc.), so we need not worry about URLs that aren't in our list; 
their hop can be set to anything, maybe max_hop.

But here is a scenario: we inject foo.com and bar.com. After crawling 
we find on foo.com a link to bar.com/hop/hop/index.html. We can set 
that URL's hop to 0 or to max, because we can update it later once we 
find the same URL while crawling bar.com itself.

Checking the hop needs to be done while updating, I think, so we don't 
end up with a bunch of URLs whose hop is greater than max_hop.
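
In the updatedb reduce step that would look roughly like this (getHop() 
as in the metaData sketch above, db.max.hops still a made-up name):

  // during updatedb: refuse to let too-deep URLs into the crawldb
  int maxHops = conf.getInt("db.max.hops", 10);
  if (getHop(result) > maxHops) {
    return;             // never emitted, so never generated or fetched
  }
  output.collect(key, result);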

I will try to make a decent patch for this to review, and if anyone 
else has ideas, please comment.

regards

Uros

Re: [Fwd: Re: get CrawlDatum]

Posted by Andrzej Bialecki <ab...@getopt.org>.
Uroš Gruber wrote:
> I made some draft patch. But there is still some problems I see. I 
> know code needs to be cleaned and test. But right now I don't know 
> what number set to external urls. For internal linking works great.

(the patch changes CrawlDatum itself; I think it would be better to put 
the hop counter in CrawlDatum.metaData.)

>
> Here is the whole idea of these changes.
>
> Injected URLs always get hop 0. While fetching/updating/generating, 
> the hop value is incremented by 1 (I still have no idea what to do 
> with external links). Then I can add a config value such as max_hop 
> to stop the fetcher and generator from creating URLs beyond that 
> depth.
>
> This way it's possible to limit crawling vertically (i.e. by depth).
>
> Comments are welcome.

Well, it really depends on what you want to do when you encounter an 
external link. Do you want to restart the counter, i.e. crawl the new 
site at full depth up to max_hop? Then set hop=0. Do you want to 
terminate the crawl at that link? Then set hop=max_hop.
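
In code the two policies are just different starting values, something 
like this (isExternal() and the variables here are placeholders, not 
existing Nutch code):

  // decide the hop for an outlink, depending on the policy you pick
  if (isExternal(fromUrl, toUrl)) {           // placeholder host comparison
    hop = restartOnExternal ? 0 : maxHops;    // full depth vs. stop right there
  } else {
    hop = parentHop + 1;                      // normal internal link
  }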

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com