Posted to user@nutch.apache.org by kranthi reddy <kr...@gmail.com> on 2008/07/11 07:57:40 UTC
CRAWLING USING HADOOP
Hi,
I am trying to crawl a few sites using Nutch and Hadoop. I have a cluster
of 10 PCs, and I submit Nutch as a job file to Hadoop. I am able to
execute most DFS commands, such as:
bin_temp/hadoop dfs -put xxx yyy (and ls, mkdir, etc.)
But when I try to run Nutch, I get the following error:
bin_temp/nutch crawl tempcrawl/urls -dir tempcrawl/crawl -depth 1
Exception in thread "main" java.net.SocketTimeoutException: timed out waiting for rpc response
        at org.apache.hadoop.ipc.Client.call(Client.java:473)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:163)
        at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:247)
        at org.apache.hadoop.dfs.DFSClient.<init>(DFSClient.java:105)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.initialize(DistributedFileSystem.java:67)
        at org.apache.hadoop.fs.FilterFileSystem.initialize(FilterFileSystem.java:57)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:160)
        at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:119)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:91)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:83)
Could someone please help me out?
When I remove hadoop-env.sh, hadoop-site.xml, and the masters file, and
replace the contents of the slaves file with "localhost", I am able to crawl
perfectly well (but only on the master PC :( ).
Thank you in advance.
Kranthi reddy.B
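A SocketTimeoutException on the RPC to the namenode usually means no namenode is listening where the client expects one. A minimal diagnostic sketch, assuming a default fs.default.name port of 9000 and a hypothetical master hostname `master-host` (substitute the host and port from your own hadoop-site.xml):

```shell
# List the Java daemons actually running on the master; a healthy
# master should show NameNode and JobTracker here (jps ships with the JDK).
jps

# Check whether anything is listening on the namenode's RPC port.
# "master-host" and 9000 are assumptions - use your fs.default.name values.
nc -z master-host 9000 && echo "namenode reachable" || echo "no listener on 9000"

# If nothing is listening, the namenode log usually says why it died.
tail -n 50 logs/hadoop-*-namenode-*.log
```

If `jps` shows no NameNode, the problem is on the master's side (often an unformatted or misconfigured namenode) rather than in Nutch itself.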
Re: CRAWLING USING HADOOP
Posted by brainstorm <br...@gmail.com>.
It looks like you haven't run:
bin/hadoop namenode -format
*before anything else* (run start-all.sh *after* formatting the
namenode)... this is just a guess. I *do* recommend that you start from
scratch, reading this howto and following it strictly, step by step:
http://wiki.apache.org/nutch/NutchHadoopTutorial
It worked for me... good luck! ;)
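The order matters: the namenode must be formatted once, while the daemons are down, before the first start. A sketch of the sequence, assuming the standard Hadoop scripts are in bin/ of your install directory (note that formatting erases any existing HDFS data):

```shell
# Stop any stale daemons first, so nothing holds the old filesystem open.
bin/stop-all.sh

# Format the namenode exactly once - this WIPES existing HDFS contents.
bin/hadoop namenode -format

# Now start all daemons against the freshly formatted filesystem.
bin/start-all.sh
```

After this, re-run the `dfs -put` and then the `nutch crawl` command from the original message.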
On Fri, Jul 11, 2008 at 7:57 AM, kranthi reddy <kr...@gmail.com> wrote:
> [original message quoted in full above]
Nutch performance
Posted by Anton Potekhin <an...@orbita1.ru>.
Hello! I would like to know how many pages Nutch can index daily and how
many searches it can handle. I understand that this depends on the
hardware, so I am not looking for exact numbers ;-). For example, I
will use 4 servers, with the following configuration:
1) server1: JobTracker and NameNode
2) server2: first TaskTracker and first DataNode
3) server3: second TaskTracker and second DataNode
4) server4: Tomcat, for searching
How many pages could I index daily, and how many searches could this
configuration handle? I realize much of this depends on the hardware, but
in general, what would you say? Also, what would you change in this
configuration, and what hardware would you recommend for each server?
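For a layout like the one described, the cluster files would be small. A hedged sketch of the relevant config, using the server1..server4 names from the post (the port numbers 9000 and 9001 are only conventional defaults, not requirements):

```
# conf/masters (on server1)
server1

# conf/slaves (on server1) - the TaskTracker/DataNode machines;
# server4 is left out because it only runs Tomcat for search.
server2
server3

# conf/hadoop-site.xml (all nodes) - point every node at server1:
#   fs.default.name     = hdfs://server1:9000
#   mapred.job.tracker  = server1:9001
```

Throughput will depend far more on the fetch politeness settings and network bandwidth than on this topology, so benchmarking a small crawl first is the safest way to get real numbers.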