Posted to user@nutch.apache.org by pa...@personifi.com on 2005/10/21 21:17:07 UTC

Create WebDB

Is there a way to recreate a webdb from your segments if the webdb got
damaged or deleted?


No valid local directories 2nd try

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

I try to run generate with mapred and I get:

051021 064740 Generator: starting
051021 064740 Generator: segment: crawl/segments/20051021064740
051021 064740 Generator: Selecting most-linked urls due for fetch.
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-default.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/mapred-default.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-site.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-default.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-site.xml
051021 064740 Client connection to 127.0.0.1:8011: starting
051021 064740 Client connection to 127.0.0.1:8009: starting
Exception in thread "main" java.io.IOException: No valid local directories.
       at org.apache.nutch.ipc.Client.call(Client.java:294)
       at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
       at $Proxy0.submitJob(Unknown Source)
       at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
       at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
       at org.apache.nutch.crawl.Generator.generate(Generator.java:191)
       at org.apache.nutch.crawl.Generator.main(Generator.java:258)

Any idea?

Regards,

Gal


Re: Create WebDB

Posted by Matt Kangas <ka...@gmail.com>.
i'm not aware of any tool that does this. but if i had to do this in  
a pinch, and i was on unix, i'd do the following:

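# start from a clean output file (-f: no error if it does not exist yet)
rm -f /tmp/urls.txt
cd <segments-dir>
# dump each 2005 segment and keep only the URL field of the "URL" lines
for X in 2005*; do
  nutch segread -dump -nocontent -noparsedata -noparsetext "$X" |
    grep '^URL' | cut -d " " -f 2- >> /tmp/urls.txt
done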

then create a new webdb and import /tmp/urls.txt. hope that helps.
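
Since the same URL can show up in more than one segment, it may be worth deduplicating the list before importing it. Below is a minimal sketch of the filter step from the loop above with a `sort -u` added. The function name `extract_urls` and the sample input are made up for illustration; the real dump format may differ, so check it against actual `segread` output first.

```shell
# Hypothetical helper (not part of Nutch): read dump text on stdin,
# keep lines that start with "URL", drop the first space-separated
# field, and remove duplicate URLs.
extract_urls() {
  grep '^URL' | cut -d ' ' -f 2- | sort -u
}

# Made-up sample input standing in for real dump output.
printf 'URL: http://example.com/a\nContent: ...\nURL: http://example.com/b\nURL: http://example.com/a\n' |
  extract_urls
```

In the loop above, the output of each `nutch segread` call could be piped through the same filter once, after all segments have been appended to one file.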

--matt

On Oct 21, 2005, at 3:17 PM, <pa...@personifi.com>  
<pa...@personifi.com> wrote:

> Is there a way to recreate a webdb from your segments if the webdb got
> damaged or deleted?

--
Matt Kangas / kangas@gmail.com



Re: Create WebDB

Posted by Michael Ji <fj...@yahoo.com>.
I think you can create an empty webdb by calling
adminTool and then running updateWebDB on the segments.
The catch is that you might lose Pages that were not
fetched in the current segment.

If you have all the old segments, you can iterate
through all of them to make sure no Page is
missing from the webDB.
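
Iterating over the segments could be sketched roughly as below. This is a dry run that only prints one update command per segment directory. The function name `plan_webdb_rebuild` and the variable `NUTCH_UPDATE_CMD` are made up for illustration, and the default command string `nutch updatedb` is an assumption about the command-line name of the update tool, so substitute whatever your Nutch version actually provides.

```shell
# Dry run: print one webdb update command per segment directory.
# Segment names are timestamps, so the shell's sorted glob lists them
# oldest first.
# NUTCH_UPDATE_CMD is a placeholder; "nutch updatedb" is an assumed default.
plan_webdb_rebuild() {
  db_dir=$1
  segments_dir=$2
  for seg in "$segments_dir"/*/; do
    # skip the unexpanded pattern when there are no segment directories
    [ -d "$seg" ] || continue
    echo "${NUTCH_UPDATE_CMD:-nutch updatedb} $db_dir ${seg%/}"
  done
}

# Example call (prints nothing if the segments directory is empty):
# plan_webdb_rebuild db segments
```

Reviewing the printed commands before running them makes it easy to confirm that every old segment is covered.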

Michael Ji,

--- paul@personifi.com wrote:

> Is there a way to recreate a webdb from your
> segments if the webdb got
> damaged or deleted?
> 
> 



		