Posted to user@nutch.apache.org by kranthi reddy <kr...@gmail.com> on 2008/07/11 07:57:40 UTC

CRAWLING USING HADOOP

Hi,

I am trying to crawl a few sites using Nutch and Hadoop. I have a cluster
of 10 PCs, and I have given Nutch as a job file to Hadoop. I am able to
execute most DFS commands, for example:

 bin_temp/hadoop dfs -put xxx yyy   (likewise ls, mkdir, etc.)

But when I try to run the Nutch crawl, I get the following error:

bin_temp/nutch crawl tempcrawl/urls -dir tempcrawl/crawl -depth 1

Exception in thread "main" java.net.SocketTimeoutException: timed out
waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:473)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
        at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:247)
        at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:105)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.initialize(DistributedFileSystem.java:67)
        at org.apache.hadoop.fs.FilterFileSystem.initialize(FilterFileSystem.java:57)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:160)
        at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:83)
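
(The stack trace shows DFSClient timing out while opening an RPC connection
to the namenode. A quick way to check whether a namenode is actually up and
answering is the standard report command; a sketch, assuming the stock
Hadoop 0.x CLI:)

 bin_temp/hadoop dfsadmin -report    (prints DFS capacity and live datanodes
                                      when the namenode is reachable)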

Someone please help me out.

When I remove hadoop-env.sh, hadoop-site.xml and the masters file, and
replace the slaves file with just "localhost", I am able to crawl perfectly
well (but only on the master PC :(( ).
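
(A minimal hadoop-site.xml for a distributed setup like this would look
roughly like the sketch below; "master" is a placeholder for the namenode
host, and the ports follow the NutchHadoopTutorial convention, so adjust
them to your cluster. The timeout above typically means nothing is
answering at the fs.default.name address:)

 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>master:9000</value>        <!-- namenode host:port; this is the
                                            address DFSClient dials -->
   </property>
   <property>
     <name>mapred.job.tracker</name>
     <value>master:9001</value>        <!-- jobtracker host:port -->
   </property>
 </configuration>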

Thank you in advance.
Kranthi reddy.B

Re: CRAWLING USING HADOOP

Posted by brainstorm <br...@gmail.com>.
Looks like you haven't run:

bin/hadoop namenode -format

*before anything else* (run start-all.sh *after* formatting the
namenode)... this is just a guess. I *do* recommend that you start from
scratch with this howto and follow it strictly, step by step:

http://wiki.apache.org/nutch/NutchHadoopTutorial

It worked for me... good luck! ;)
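
(Spelled out, the usual bring-up sequence on the master is roughly the
following; paths assume the stock bin/ layout from the tutorial:)

 bin/hadoop namenode -format      (run once, on the master only)
 bin/start-all.sh                 (starts the namenode, jobtracker, and the
                                   datanodes/tasktrackers listed in conf/slaves)
 bin/hadoop dfs -ls /             (sanity check: should answer without the
                                   RPC timeout seen above)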

Nutch performance

Posted by Anton Potekhin <an...@orbita1.ru>.
Hello! I would like to know how many pages Nutch can index daily and how
many searches it can handle. I understand that this depends on the
hardware, so I am not after exact numbers ;-). For example, I will use 4
servers, with the following configuration:
1) server1: JobTracker and NameNode
2) server2: first TaskTracker and first DataNode
3) server3: second TaskTracker and second DataNode
4) server4: Tomcat, for searching
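
(In Hadoop conf terms, that topology would map roughly to the sketch below;
the ports are the usual tutorial defaults, so adjust as needed:)

 hadoop-site.xml:   fs.default.name    = server1:9000
                    mapred.job.tracker = server1:9001
 conf/slaves:       server2
                    server3
                    (each slave runs one TaskTracker and one DataNode)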

How many pages can I index daily, and how many searches can this
configuration handle?

I realize a lot of this depends on the hardware, but in general, what would
you say? What would you change in this configuration? And what hardware do
you recommend for each server?