Posted to dev@nutch.apache.org by Uroš Gruber <ur...@sir-mag.com> on 2006/08/30 09:52:20 UTC
get CrawlDatum
Hi,
Could someone point me to how to get the CrawlDatum for the key URL in
ParseOutputFormat.write [83]?
I would like to add data to the link URLs, but this data depends on the
data of the URL being crawled.
I hope I was clear enough about my problem.
regards
Uros
Re: get CrawlDatum
Posted by Uroš Gruber <ur...@sir-mag.com>.
Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand the code flow, the best place would be in Fetcher [262].
>>
>> But I'm not sure that the datum holds info about the URL being fetched.
>
> On the input to the fetcher you get a URL and a CrawlDatum (originally
> coming from the crawldb). Check for example how the segment name is
> passed around in metadata, you can use the same method.
>
Hi,
I made a draft patch, but I still see some problems. I know the code
needs to be cleaned up and tested, but right now I don't know what
number to set for external URLs. For internal links it works great.
The whole idea of these changes is as follows.
Injected URLs always get hop 0. While fetching/updating/generating, the
hop value is incremented by 1 (I still have no idea what to do with
external links). Then I can add a config value such as max_hop to stop
the fetcher and generator from creating more URLs.
This way it's possible to limit crawling vertically.
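A minimal sketch of the hop bookkeeping described above, under stated assumptions: the class HopLimiter and its methods are hypothetical names made up for illustration, they are not taken from the patch or from Nutch, and how external links should be numbered is left open here just as it is in the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating the hop idea; not part of Nutch. */
public class HopLimiter {
    /** Injected (seed) URLs start at hop 0. */
    public static final int INJECTED_HOP = 0;

    /** A link discovered on a page is one hop further than that page. */
    public static int childHop(int parentHop) {
        return parentHop + 1;
    }

    /** True if a URL at this hop may still be generated or fetched. */
    public static boolean withinLimit(int hop, int maxHop) {
        return hop <= maxHop;
    }

    /**
     * Returns the outlinks that may be added for a page at parentHop.
     * All outlinks of one page share the same hop value, so the result
     * is either every outlink or none.
     */
    public static List<String> allowedOutlinks(List<String> outlinks,
                                               int parentHop, int maxHop) {
        List<String> kept = new ArrayList<>();
        if (withinLimit(childHop(parentHop), maxHop)) {
            kept.addAll(outlinks);
        }
        return kept;
    }
}
```

With max_hop set to 2, for example, links found on a hop-1 page are kept (they become hop 2), while links found on a hop-2 page are dropped.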
Comments are welcome.
regards,
Uros
Re: get CrawlDatum
Posted by Andrzej Bialecki <ab...@getopt.org>.
Uroš Gruber wrote:
> ParseData.metadata sounds nice, but I think I'm lost again :)
> If I understand the code flow, the best place would be in Fetcher [262].
>
> But I'm not sure that the datum holds info about the URL being fetched.
On the input to the fetcher you get a URL and a CrawlDatum (originally
coming from the crawldb). Check for example how the segment name is
passed around in metadata, you can use the same method.
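To illustrate the general idea only, here is a small sketch in which plain maps stand in for the metadata that arrives with the URL and for the metadata stored with the fetched page. It does not use the real Nutch classes; MetadataCarrier, carry and the key name are hypothetical and chosen just for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; plain maps stand in for Nutch metadata. */
public class MetadataCarrier {
    /** Hypothetical key name, chosen for this example only. */
    public static final String HOP_KEY = "example.hop";

    /**
     * Copies one value from the metadata that came in with the URL
     * (standing in for the CrawlDatum) into a copy of the metadata kept
     * with the fetched page, so later stages can read it without the datum.
     * The input maps are not modified.
     */
    public static Map<String, String> carry(Map<String, String> datumMeta,
                                            Map<String, String> pageMeta,
                                            String key) {
        Map<String, String> out = new HashMap<>(pageMeta);
        String value = datumMeta.get(key);
        if (value != null) {
            out.put(key, value);
        }
        return out;
    }
}
```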
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: get CrawlDatum
Posted by Uroš Gruber <ur...@sir-mag.com>.
Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> Hi,
>>
>> Could someone point me to how to get the CrawlDatum for the key URL in
>> ParseOutputFormat.write [83]?
>> I would like to add data to the link URLs, but this data depends on the
>> data of the URL being crawled.
>
> You can't, because that instance of CrawlDatum is not available at
> this place. Either you need to provide it on the input to the
> map/reduce job (but then you will have to change input and output
> formats), or you should prepare this information in advance during
> parsing, and put it into ParseData.metadata.
ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262].
But I'm not sure that the datum holds info about the URL being fetched.
>>
>> I hope I was clear enough about my problem.
> I hope so too ;)
>
Re: get CrawlDatum
Posted by Andrzej Bialecki <ab...@getopt.org>.
Uroš Gruber wrote:
> Hi,
>
> Could someone point me to how to get the CrawlDatum for the key URL in
> ParseOutputFormat.write [83]?
> I would like to add data to the link URLs, but this data depends on the
> data of the URL being crawled.
You can't, because that instance of CrawlDatum is not available at this
place. Either you need to provide it on the input to the map/reduce job
(but then you will have to change input and output formats), or you
should prepare this information in advance during parsing, and put it
into ParseData.metadata.
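A rough sketch of the second option, assuming the value was already stored with the parse result at parse time. A plain map stands in for the parse metadata, and OutlinkStamper, stamp and the key name are hypothetical names for this example, not Nutch API. The point is that the writing step sees only the outlinks and the parse metadata, never the CrawlDatum.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; a plain map stands in for parse metadata. */
public class OutlinkStamper {
    /** Hypothetical key under which the parse step stored the value. */
    public static final String SOURCE_INFO_KEY = "example.source.info";

    /**
     * Maps each outlink URL to the value prepared for its source page.
     * Returns an empty map if the value was never put into the metadata.
     */
    public static Map<String, String> stamp(List<String> outlinks,
                                            Map<String, String> parseMeta) {
        Map<String, String> stamped = new LinkedHashMap<>();
        String info = parseMeta.get(SOURCE_INFO_KEY);
        if (info == null) {
            return stamped;
        }
        for (String url : outlinks) {
            stamped.put(url, info);
        }
        return stamped;
    }
}
```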
>
> I hope I was clear enough about my problem.
I hope so too ;)
--
Best regards,
Andrzej Bialecki <><