Posted to user@nutch.apache.org by Paul Tomblin <pt...@xcski.com> on 2009/07/28 16:46:04 UTC

Dumping what I have?

The nutch data files are pretty opaque, and even "strings" can't extract
anything except the occasional URL.  Is there any code to dump the contents
of the various files in a human readable form?

-- 
http://www.linkedin.com/in/paultomblin

Re: Dumping what I have?

Posted by Paul Tomblin <pt...@xcski.com>.
Awesome!  Thanks.

On Tue, Jul 28, 2009 at 12:26 PM, reinhard schwab <re...@aon.at> wrote:

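> Yes, there are tools you can use to dump the contents of the crawl db,
> link db and segments.
>
> crawl=./crawl    # crawl directory (assumed to be ./crawl)
> dump=$crawl/dump
> bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
> bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
> # $1 is the segment directory, taken from the first positional argument
> bin/nutch readseg -dump $1 $dump/segments/$1
>
> You will get usage info if you call each command without arguments: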
>
> bin/nutch readdb
> bin/nutch readlinkdb
> bin/nutch readseg
>
> Paul Tomblin wrote:
> > The nutch data files are pretty opaque, and even "strings" can't extract
> > anything except the occasional URL.  Is there any code to dump the contents
> > of the various files in a human readable form?
> >
> >
>
>


-- 
http://www.linkedin.com/in/paultomblin

Re: Dumping what I have?

Posted by reinhard schwab <re...@aon.at>.
Yes, there are tools you can use to dump the contents of the crawl db,
link db and segments.

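crawl=./crawl    # crawl directory (assumed to be ./crawl)
dump=$crawl/dump
bin/nutch readdb $crawl/crawldb -dump $dump/crawldb
bin/nutch readlinkdb $crawl/linkdb -dump $dump/linkdb
# $1 is the segment directory, taken from the first positional argument
bin/nutch readseg -dump $1 $dump/segments/$1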

You will get usage info if you call each command without arguments:

bin/nutch readdb
bin/nutch readlinkdb
bin/nutch readseg
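
A minimal sketch of how the three dump commands above could be wrapped in one
function that also loops over every segment, so that no segment path has to be
passed by hand. The names NUTCH, CRAWL, DUMP and dump_all are assumptions made
up for illustration and are not part of Nutch; the sketch also assumes that the
segments live under $CRAWL/segments.

```shell
#!/bin/sh
# Sketch only: NUTCH, CRAWL and DUMP are assumed names and paths.
NUTCH=${NUTCH:-bin/nutch}     # path to the nutch launcher script
CRAWL=${CRAWL:-./crawl}       # crawl directory holding crawldb, linkdb, segments
DUMP=${DUMP:-$CRAWL/dump}     # where the text dumps should go

dump_all() {
    # Dump the crawl db and the link db.
    "$NUTCH" readdb "$CRAWL/crawldb" -dump "$DUMP/crawldb"
    "$NUTCH" readlinkdb "$CRAWL/linkdb" -dump "$DUMP/linkdb"

    # Dump each segment directory into its own output directory.
    for seg in "$CRAWL"/segments/*; do
        [ -d "$seg" ] || continue   # skip non-directories and an unmatched glob
        "$NUTCH" readseg -dump "$seg" "$DUMP/segments/$(basename "$seg")"
    done
}
```

Calling dump_all from the Nutch install directory should then leave one dump
directory per data structure under $DUMP.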

Paul Tomblin wrote:
> The nutch data files are pretty opaque, and even "strings" can't extract
> anything except the occasional URL.  Is there any code to dump the contents
> of the various files in a human readable form?
>
>   


Re: Dumping what I have?

Posted by schroedi <sc...@gmail.com>.
Hi Paul,

Yeah, there is a dump command:

bin/nutch readlinkdb crawl/linkdb/ -dump dumpdir
You can also dump the CrawlDB, but I don't know whether all of the data can be
dumped or whether this is useful for you...
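
For the CrawlDB, a small sketch along the same lines: it first prints the
summary statistics and then writes the text dump, which may help in judging
how much of the data the dump covers. The function name inspect_crawldb and
the NUTCH variable are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: NUTCH and inspect_crawldb are made-up names, not part of Nutch.
NUTCH=${NUTCH:-bin/nutch}

inspect_crawldb() {
    crawldb=$1   # e.g. crawl/crawldb
    out=$2       # output directory for the text dump
    # Print summary statistics for the crawl db.
    "$NUTCH" readdb "$crawldb" -stats
    # Write the entries of the crawl db as text under $out.
    "$NUTCH" readdb "$crawldb" -dump "$out"
}
```

For example: inspect_crawldb crawl/crawldb dumpdir/crawldb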

HTH

Mario

Paul Tomblin wrote:
> The nutch data files are pretty opaque, and even "strings" can't extract
> anything except the occasional URL.  Is there any code to dump the contents
> of the various files in a human readable form?
>
>   

-- 

Mario Schröder | http://www.finanz-checks.de
Office: +49 361 2152062
Phone: +49 34464 62301 Cell: +49 163 27 09 807
http://www.xing.com/go/invite/6035007.9c143c