Posted to user@nutch.apache.org by pa...@personifi.com on 2005/10/21 21:17:07 UTC
Create WebDB
Is there a way to recreate a webdb from your segments if the webdb got
damaged or deleted?
No valid local directories 2nd try
Posted by Gal Nitzan <gn...@usa.net>.
Hi,
I'm trying to run generate with mapred and I get:
051021 064740 Generator: starting
051021 064740 Generator: segment: crawl/segments/20051021064740
051021 064740 Generator: Selecting most-linked urls due for fetch.
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-default.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/mapred-default.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-site.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-default.xml
051021 064740 parsing file:/home/gnitzan/mapred/conf/nutch-site.xml
051021 064740 Client connection to 127.0.0.1:8011: starting
051021 064740 Client connection to 127.0.0.1:8009: starting
Exception in thread "main" java.io.IOException: No valid local directories.
    at org.apache.nutch.ipc.Client.call(Client.java:294)
    at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
    at $Proxy0.submitJob(Unknown Source)
    at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
    at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:191)
    at org.apache.nutch.crawl.Generator.main(Generator.java:258)
Any idea?
Regards,
Gal
Re: Create WebDB
Posted by Matt Kangas <ka...@gmail.com>.
i'm not aware of any tool that does this. but if i had to do this in
a pinch, and i was on unix, i'd do the following:
    # start from an empty list (-f: no error if the file is missing)
    rm -f /tmp/urls.txt
    cd <segments-dir>
    # dump each segment and keep only the text after "URL" on each URL line
    for X in 2005*; do
        nutch segread -dump -nocontent -noparsedata -noparsetext "$X" | grep '^URL' | cut -d " " -f 2- >> /tmp/urls.txt
    done
then create a new webdb and import /tmp/urls.txt. hope that helps.
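Since the same URL can show up in more than one segment, the collected list may contain repeats. A minimal sketch of a cleanup step before importing, assuming one URL per line; the function name dedupe_urls and the output path are hypothetical, not part of nutch:

```shell
#!/bin/sh
# Hypothetical helper (not a nutch command): collapse duplicate URLs
# gathered from several segments.
#   $1 = input list, one URL per line
#   $2 = output file
dedupe_urls() {
    # drop blank lines, then sort and keep a single copy of each URL
    grep -v '^[[:space:]]*$' "$1" | sort -u > "$2"
}
```

For example, dedupe_urls /tmp/urls.txt /tmp/urls.uniq.txt would leave the de-duplicated list in /tmp/urls.uniq.txt, which could then be imported instead of the raw file.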
--matt
On Oct 21, 2005, at 3:17 PM, <pa...@personifi.com> wrote:
> Is there a way to recreate a webdb from your segments if the webdb got
> damaged or deleted?
--
Matt Kangas / kangas@gmail.com
Re: Create WebDB
Posted by Michael Ji <fj...@yahoo.com>.
I think you can create an empty webdb by calling
adminTool and then running updateWebDB from the segments.
The catch is that you might lose Pages that were not
fetched in the current segment.
If you have all the old segments, you can iterate
through all of them to make sure no Page is
missing from the webDB.
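The iteration described above could be sketched roughly as follows. This is only an illustration: the subcommand forms "admin <db> -create" and "updatedb <db> <segment>" are assumptions based on the Nutch 0.7 tutorial and should be checked against the usage output of your own nutch script, and the NUTCH variable and the rebuild_webdb function are hypothetical names.

```shell
#!/bin/sh
# Sketch only: rebuild a webdb by replaying every segment into a fresh db.
# ASSUMPTION: "admin <db> -create" and "updatedb <db> <segment>" match your
# Nutch version (they follow the 0.7 tutorial); verify before relying on it.
NUTCH="${NUTCH:-nutch}"   # hypothetical override, e.g. NUTCH=bin/nutch

rebuild_webdb() {
    db_dir=$1
    segments_dir=$2
    # create the empty webdb first; stop if that fails
    "$NUTCH" admin "$db_dir" -create || return 1
    # the glob expands in sorted order, so timestamp-named segments
    # are applied oldest first
    for seg in "$segments_dir"/*; do
        [ -d "$seg" ] || continue   # skip anything that is not a directory
        "$NUTCH" updatedb "$db_dir" "$seg" || return 1
    done
}
```

It would be called as rebuild_webdb db segments. The function stops at the first failing command, so a broken segment does not go unnoticed.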
Michael Ji,
--- paul@personifi.com wrote:
> Is there a way to recreate a webdb from your
> segments if the webdb got
> damaged or deleted?