Posted to dev@nutch.apache.org by Uroš Gruber <ur...@sir-mag.com> on 2006/08/30 09:52:20 UTC

get CrawlDatum

Hi,

Could someone point me to how to get the CrawlDatum for the key URL in 
ParseOutputFormat.write [83]?
I would like to add data to the link URLs, but this data depends on the 
data of the URL being crawled.

I hope I was clear enough about my problem.

regards

Uros

Re: get CrawlDatum

Posted by Uroš Gruber <ur...@sir-mag.com>.

Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand the code flow, the best place would be in Fetcher [262],
>> but I'm not sure that the datum holds info about the URL being fetched.
>
> On the input to the fetcher you get a URL and a CrawlDatum (originally 
> coming from the crawldb). Check, for example, how the segment name is 
> passed around in metadata; you can use the same method.
>
Hi,

I made a draft patch, but I still see some problems. I know the code 
needs to be cleaned up and tested. Right now I don't know what hop 
number to set for external URLs. For internal links it works great.

The whole idea of these changes is as follows.

Injected URLs always get hop 0. During fetching/updating/generating, the 
hop value is incremented by 1 (I still have no idea what to do with 
external links). Then I can add a config value such as max_hop to keep 
the fetcher and generator from creating more URLs.

This way it's possible to limit crawling vertically.
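
The hop scheme described above could be sketched roughly as follows. This is only an illustration in plain Java: `HopLimitSketch`, `HopUrl`, `inject`, `expand` and the `maxHop` parameter are hypothetical stand-ins, not Nutch classes or the code from the draft patch. The sketch treats every outlink the same way, so the open question about external links is left untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types; these are NOT Nutch's real classes.
final class HopLimitSketch {

    /** A URL together with its hop count (link distance from an injected URL). */
    static final class HopUrl {
        final String url;
        final int hop;

        HopUrl(String url, int hop) {
            this.url = url;
            this.hop = hop;
        }
    }

    /** Injected URLs always start at hop 0. */
    static HopUrl inject(String url) {
        return new HopUrl(url, 0);
    }

    /**
     * Outlinks get the parent's hop value plus 1.
     * If that would exceed maxHop, no new URLs are created.
     */
    static List<HopUrl> expand(HopUrl parent, List<String> outlinks, int maxHop) {
        List<HopUrl> next = new ArrayList<>();
        int childHop = parent.hop + 1;
        if (childHop > maxHop) {
            return next; // depth limit reached, stop going deeper
        }
        for (String link : outlinks) {
            next.add(new HopUrl(link, childHop));
        }
        return next;
    }

    public static void main(String[] args) {
        int maxHop = 1; // assumed config value, analogous to max_hop above
        HopUrl seed = inject("http://example.com/");
        List<HopUrl> level1 = expand(seed, Arrays.asList("http://example.com/a"), maxHop);
        List<HopUrl> level2 = expand(level1.get(0), Arrays.asList("http://example.com/a/b"), maxHop);
        System.out.println(level1.size() + " " + level2.size()); // prints "1 0"
    }
}
```

With `maxHop` set to 1, links found on the injected page are kept, while links found one level deeper are dropped.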

Comments are welcome.

regards,

Uros

Re: get CrawlDatum

Posted by Andrzej Bialecki <ab...@getopt.org>.

Uroš Gruber wrote:
> ParseData.metadata sounds nice, but I think I'm lost again :)
> If I understand the code flow, the best place would be in Fetcher [262],
> but I'm not sure that the datum holds info about the URL being fetched.

On the input to the fetcher you get a URL and a CrawlDatum (originally 
coming from the crawldb). Check, for example, how the segment name is 
passed around in metadata; you can use the same method.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: get CrawlDatum

Posted by Uroš Gruber <ur...@sir-mag.com>.

Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> Hi,
>>
>> Could someone point me to how to get the CrawlDatum for the key URL in 
>> ParseOutputFormat.write [83]?
>> I would like to add data to the link URLs, but this data depends on the 
>> data of the URL being crawled.
>
> You can't, because that instance of CrawlDatum is not available at 
> this point. Either you need to provide it on the input to the 
> map/reduce job (but then you will have to change input and output 
> formats), or you should prepare this information in advance during 
> parsing, and put it into ParseData.metadata.

ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that the datum holds info about the URL being fetched.

>>
>> I hope I was clear enough about my problem.
> I hope so too ;)
>

Re: get CrawlDatum

Posted by Andrzej Bialecki <ab...@getopt.org>.

Uroš Gruber wrote:
> Hi,
>
> Could someone point me to how to get the CrawlDatum for the key URL in 
> ParseOutputFormat.write [83]?
> I would like to add data to the link URLs, but this data depends on the 
> data of the URL being crawled.

You can't, because that instance of CrawlDatum is not available at this 
point. Either you need to provide it on the input to the map/reduce job 
(but then you will have to change input and output formats), or you 
should prepare this information in advance during parsing, and put it 
into ParseData.metadata.
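
The second option (prepare the value earlier and carry it along in metadata) might look roughly like the sketch below. It uses plain Java stand-ins only: `Datum`, `Parsed`, `HOP_KEY` and both methods are hypothetical and are not the real CrawlDatum, ParseData or ParseOutputFormat APIs. Defaulting a missing value to 0 is also just an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; these are NOT Nutch's real classes or method names.
final class MetadataPassingSketch {

    /** Assumed metadata key, chosen only for this example. */
    static final String HOP_KEY = "hop";

    /** Stand-in for the per-URL record that is available on the fetcher input. */
    static final class Datum {
        final int hop;

        Datum(int hop) {
            this.hop = hop;
        }
    }

    /** Stand-in for a parse result that carries a string-to-string metadata map. */
    static final class Parsed {
        final Map<String, String> metadata = new HashMap<>();
    }

    /** Early step: the datum is still available here, so copy what is needed. */
    static Parsed parseWithDatum(String url, Datum datum) {
        Parsed parsed = new Parsed();
        parsed.metadata.put(HOP_KEY, Integer.toString(datum.hop));
        return parsed;
    }

    /** Later step: no datum is available, only the metadata that travelled along. */
    static int hopForOutlinks(Parsed parsed) {
        String value = parsed.metadata.get(HOP_KEY);
        int parentHop = (value == null) ? 0 : Integer.parseInt(value); // assumed default
        return parentHop + 1;
    }

    public static void main(String[] args) {
        Parsed parsed = parseWithDatum("http://example.com/", new Datum(2));
        System.out.println(hopForOutlinks(parsed)); // prints "3"
    }
}
```

The point of the design is that the later step never needs the original record; everything it depends on has to be written into the metadata while the record is still in hand.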

>
> I hope I was clear enough about my problem.
I hope so too ;)
