Posted to dev@nutch.apache.org by brainstorm <br...@gmail.com> on 2008/12/03 19:55:43 UTC

readlinkdb fails to dump linkdb

Using nutch 0.9 (hadoop 0.17.1):

[hadoop@cluster working]$ bin/nutch readlinkdb
/home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
LinkDb dump: starting
LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
java.io.IOException: Type mismatch in value from map: expected
org.apache.nutch.crawl.Inlinks, recieved
org.apache.nutch.crawl.CrawlDatum
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
        at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

LinkDbReader: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at org.apache.nutch.crawl.LinkDbReader.processDumpJob(LinkDbReader.java:110)
        at org.apache.nutch.crawl.LinkDbReader.run(LinkDbReader.java:127)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDbReader.main(LinkDbReader.java:114)

This is the first time I've used readlinkdb; the rest of the crawling
process works fine. I've searched JIRA and found no related bug.

I've also tried the latest Nutch trunk, but DFS is not working for me:

[hadoop@cluster trunk]$ bin/hadoop dfs -ls

Exception in thread "main" java.lang.RuntimeException:
java.lang.ClassNotFoundException:
org.apache.hadoop.hdfs.DistributedFileSystem
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:648)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1334)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1351)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:118)
        at org.apache.hadoop.fs.FsShell.init(FsShell.java:88)
        at org.apache.hadoop.fs.FsShell.run(FsShell.java:1698)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.fs.FsShell.main(FsShell.java:1847)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.hdfs.DistributedFileSystem
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:628)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:646)
        ... 10 more

Should I file both bugs in JIRA?

Re: readlinkdb fails to dump linkdb

Posted by Doğacan Güney <do...@gmail.com>.
On Thu, Dec 4, 2008 at 11:33 AM, brainstorm <br...@gmail.com> wrote:
> On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney <do...@gmail.com> wrote:
>> On Wed, Dec 3, 2008 at 8:55 PM, brainstorm <br...@gmail.com> wrote:
>>> Using nutch 0.9 (hadoop 0.17.1):
>>>
>>> [hadoop@cluster working]$ bin/nutch readlinkdb
>>> /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
>>> LinkDb dump: starting
>>> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
>>                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> It seems you are providing a crawldb as the argument. You should pass the linkdb.
>
>
> Thanks a lot for the hint, but I cannot find a "linkdb" dir anywhere
> on HDFS :_/ Can you point me to where it should be?

A linkdb is created with the invertlinks command, e.g.:

bin/nutch invertlinks crawl/linkdb crawl/segments/....
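
For example, a minimal end-to-end sketch (the paths and the dump output
name are illustrative, not from your setup):

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch readlinkdb crawl/linkdb -dump crawled_urls

The first command builds the linkdb by inverting the outlinks recorded
in your fetched segments; the second dumps it as text. Note that
readlinkdb takes the linkdb directory itself as its first argument,
never the crawldb.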

>
> [remaining quoted text and stack traces snipped]



-- 
Doğacan Güney

Re: readlinkdb fails to dump linkdb

Posted by brainstorm <br...@gmail.com>.
On Wed, Dec 3, 2008 at 8:29 PM, Doğacan Güney <do...@gmail.com> wrote:
> On Wed, Dec 3, 2008 at 8:55 PM, brainstorm <br...@gmail.com> wrote:
>> Using nutch 0.9 (hadoop 0.17.1):
>>
>> [hadoop@cluster working]$ bin/nutch readlinkdb
>> /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
>> LinkDb dump: starting
>> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
>                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> It seems you are providing a crawldb as the argument. You should pass the linkdb.


Thanks a lot for the hint, but I cannot find a "linkdb" dir anywhere
on HDFS :_/ Can you point me to where it should be?


>> [stack traces snipped]
>>
>> Should I file both bugs in JIRA?
>>
>
> This I'm not sure about, but did you try ant clean; ant? It may be
> a version mismatch.


Yes, I ran ant clean && ant before trying the above command. I also
tried, unsuccessfully, to upgrade the filesystem, and even recreated
it from scratch:

https://issues.apache.org/jira/browse/HADOOP-1212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650556#action_12650556
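
In case it helps, by "upgrade" and "from scratch" I mean roughly the
stock Hadoop procedure (a sketch; exact flags as in the bundled scripts):

bin/stop-all.sh
bin/start-dfs.sh -upgrade     # try the metadata upgrade first
bin/hadoop namenode -format   # failing that, recreate the filesystem
bin/start-all.sh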


>
> --
> Doğacan Güney
>

Re: readlinkdb fails to dump linkdb

Posted by Doğacan Güney <do...@gmail.com>.
On Wed, Dec 3, 2008 at 8:55 PM, brainstorm <br...@gmail.com> wrote:
> Using nutch 0.9 (hadoop 0.17.1):
>
> [hadoop@cluster working]$ bin/nutch readlinkdb
> /home/hadoop/crawl-20081201/crawldb -dump crawled_urls.txt
> LinkDb dump: starting
> LinkDb db: /home/hadoop/crawl-urls-20081201/crawldb
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It seems you are providing a crawldb as the argument; you should pass the
linkdb instead. (The dump job expects <url, Inlinks> records, while a
crawldb stores <url, CrawlDatum>, which is exactly the type mismatch in
your trace.)
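
I.e., something like this (the path is hypothetical; point it at
wherever your linkdb actually lives):

bin/nutch readlinkdb /home/hadoop/crawl-20081201/linkdb -dump crawled_urls.txt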

> [stack traces snipped]
>
> Should I file both bugs in JIRA?
>

This I'm not sure about, but did you try ant clean; ant? It may be
a version mismatch.
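
One thing worth checking, assuming the usual cause of this particular
ClassNotFoundException: DistributedFileSystem moved from the
org.apache.hadoop.dfs package to org.apache.hadoop.hdfs around Hadoop
0.18, so a leftover pre-0.18 hadoop jar on the classpath (e.g. from a
0.17.1 install) fails exactly like this. Something along these lines
(the jar glob is illustrative):

ls lib/hadoop-*.jar                                      # which hadoop jar does trunk ship?
unzip -l lib/hadoop-*.jar | grep DistributedFileSystem   # old (dfs) or new (hdfs) package?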


-- 
Doğacan Güney