Posted to user@nutch.apache.org by Henrich Martin <ma...@googlemail.com> on 2011/01/07 13:14:05 UTC

Empty linkdb

Hello,

When I run the 'crawl' command under Cygwin, as in

nutch-1.2/bin/nutch crawl seed/urls -dir c1 -depth 3 -threads 1 >& c1.log

everything works as expected. In particular, the 'linkdb' is created and
populated correctly.

The 'hadoop' logs read:

2011-01-07 11:51:55,129 INFO  crawl.LinkDb - LinkDb: starting at 2011-01-07 11:51:55
2011-01-07 11:51:55,129 INFO  crawl.LinkDb - LinkDb: linkdb: c4/linkdb
2011-01-07 11:51:55,129 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2011-01-07 11:51:55,129 INFO  crawl.LinkDb - LinkDb: URL filter: true
2011-01-07 11:51:55,129 INFO  crawl.LinkDb - LinkDb: adding segment: file:/D:/mynutch/c4/segments/20110107114838
2011-01-07 11:51:55,129 INFO  crawl.LinkDb - LinkDb: adding segment: file:/D:/mynutch/c4/segments/20110107114949
2011-01-07 11:51:55,129 INFO  crawl.LinkDb - LinkDb: adding segment: file:/D:/mynutch/c4/segments/20110107115101
2011-01-07 11:51:55,144 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2011-01-07 11:52:12,270 INFO  crawl.LinkDb - LinkDb: finished at 2011-01-07 11:52:12, elapsed: 00:00:17


In contrast, when I run 'invertlinks' under Cygwin, as in

nutch-1.2/bin/nutch invertlinks c1/linkdb -dir c1/segments

on the same or any other input segments, the resulting 'linkdb' is created
but remains empty.

Then the 'hadoop' logs read:

2011-01-07 11:45:37,126 INFO  crawl.LinkDb - LinkDb: starting at 2011-01-07 11:45:37
2011-01-07 11:45:37,126 INFO  crawl.LinkDb - LinkDb: linkdb: c1/linkdb6
2011-01-07 11:45:37,126 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2011-01-07 11:45:37,126 INFO  crawl.LinkDb - LinkDb: URL filter: true
2011-01-07 11:45:37,142 INFO  crawl.LinkDb - LinkDb: adding segment: file:/D:/mynutch/c1/segments/20110106153349
2011-01-07 11:45:37,142 INFO  crawl.LinkDb - LinkDb: adding segment: file:/D:/mynutch/c1/segments/20110106153544
2011-01-07 11:45:37,142 INFO  crawl.LinkDb - LinkDb: adding segment: file:/D:/mynutch/c1/segments/20110106154120
2011-01-07 11:45:53,314 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-01-07 11:45:54,236 INFO  crawl.LinkDb - LinkDb: finished at 2011-01-07 11:45:54, elapsed: 00:00:17

Note that the two runs log different 'WARN' messages. I suspect a path
issue. Any ideas?
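
To make the comparison easier to repeat, here is a minimal sketch that lists
which loggers emitted 'WARN' lines in a log. It assumes the layout shown in
the logs above (date, time, level, logger name); the function name
'warn_loggers' is only an illustration and is not part of Nutch or Hadoop.

```shell
# Print the distinct logger names that emitted WARN lines.
# Assumes whitespace-separated fields: date, time, level, logger, "-", message.
# Wrapped continuation lines are skipped because their third field is not WARN.
warn_loggers() {
  awk '$3 == "WARN" { print $4 }' "$@" | sort -u
}
```

It reads the files given as arguments, or standard input if none are given,
so it could be run as 'warn_loggers logs/hadoop.log', assuming that is where
the log is written on a given installation.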

Thx