Posted to user@nutch.apache.org by DS jha <ae...@gmail.com> on 2007/07/25 01:17:58 UTC

getting document link graph

Hi -

I want to read the map of incoming and outgoing links of a document
and use it for some analysis.  Does Nutch store the link graph
once fetch/parse/index is complete?

After browsing through the code, it seems that incoming and outgoing
links are passed around between objects while a document is parsed and
stored, but is that information still available once the process is
complete, for example by reading the segment or index data?

Thanks,
Jha

Re: getting document link graph

Posted by Brian Whitman <br...@variogr.am>.

On Jul 24, 2007, at 7:17 PM, DS jha wrote:
> After browsing thru the code, it does seem that during document
> parsing and storing, incoming and outgoing links are getting passed
> around between objects but is that information available once the
> process is complete - by reading segment or index information?
>

bin/nutch readlinkdb /path/to/linkdb -dump dump_dir
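
Once the dump exists, the text files in dump_dir can be turned into an in-memory map for analysis. The following is only a rough sketch: it assumes each record in the dump looks like "<url><TAB>Inlinks:" followed by indented " fromUrl: <source> anchor: <text>" lines, which is not confirmed anywhere in this thread and should be checked against your own dump. The function name parse_linkdb_dump is made up for illustration.

```python
import re
from collections import defaultdict

# ASSUMPTION: dump records look like
#   <url>\tInlinks:
#    fromUrl: <source url> anchor: <anchor text>
# Verify this against the files in your own dump_dir before relying on it.
INLINK_RE = re.compile(r"^\s+fromUrl:\s+(\S+)\s+anchor:\s?(.*)$")


def parse_linkdb_dump(lines):
    """Return {target_url: [(source_url, anchor_text), ...]} (hypothetical helper)."""
    inlinks = defaultdict(list)
    target = None
    for raw in lines:
        line = raw.rstrip("\n")
        # A record header names the page that the following inlinks point to.
        if "\t" in line and line.endswith("Inlinks:"):
            target = line.split("\t", 1)[0]
            inlinks[target]  # register the target even if it has no inlinks
            continue
        match = INLINK_RE.match(line)
        if match and target is not None:
            inlinks[target].append((match.group(1), match.group(2).strip()))
    return dict(inlinks)
```

To use it, open one of the files written under dump_dir (the exact file names depend on your setup) and pass the file object to parse_linkdb_dump.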

Re: getting document link graph

Posted by Enis Soztutar <en...@gmail.com>.

The linkdb contains all the information about the web graph. After fetching
the segments, you should run bin/nutch invertlinks to build the linkdb,
which is a MapFile. The entries in the MapFile are <key,value> pairs,
where the keys are Text objects (containing URLs) and the values are Inlinks
objects. Also, FYI, the linkdb can easily be processed by map-reduce jobs.
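
Since the linkdb is keyed by URL with the incoming links as the value, the outgoing links can be recovered by inverting it again. Here is a minimal sketch of that idea in plain Python, mimicking the map and reduce steps on toy data. It does not use the Hadoop or Nutch APIs; a real job would read the MapFile with Hadoop's own classes. The function names and example URLs are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_inlinks(target, sources):
    """Map step: for one linkdb-style record, emit (source, target) pairs."""
    for source in sources:
        yield source, target


def reduce_outlinks(pairs):
    """Reduce step: group targets by source to obtain outgoing links."""
    outlinks = defaultdict(set)
    for source, target in pairs:
        outlinks[source].add(target)
    return {src: sorted(tgts) for src, tgts in outlinks.items()}


# Toy records shaped like linkdb entries: key = URL, value = URLs linking to it.
linkdb = {
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}

pairs = chain.from_iterable(map_inlinks(t, s) for t, s in linkdb.items())
outlinks = reduce_outlinks(pairs)

# In-degree falls straight out of the linkdb-style records.
in_degree = {url: len(sources) for url, sources in linkdb.items()}
```

With both maps in hand (inlinks from the linkdb, outlinks from the inversion), per-document link analysis can work on ordinary dictionaries.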
