Posted to user@nutch.apache.org by brainstorm <br...@gmail.com> on 2008/08/05 02:05:24 UTC

Re: Distributed fetching only happening in one node ?

OK, the DFS warnings problem is solved; the hadoop-0.17.1 patch seems to fix
the warnings... BUT, on a 7-node Nutch cluster:

1) Fetching is only happening on *one* node despite several values
tested on settings:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
export HADOOP_HEAPSIZE

I've played with mapreduce (hadoop-site.xml) settings as advised on:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

But Nutch keeps crawling using only one node instead of seven
nodes... does anybody know why ?

I've had a look at the code, searching for:

conf.setNumMapTasks(int num), but found none, so I guess the
number of mappers & reducers is not limited programmatically.

2) Even on a single node, the fetching is really slow: 1 url or page
per second, at most.

Can anybody shed some light on this ? Pointing out which class/code I
should look into to modify this behaviour would also help.

Does anybody have a distributed Nutch crawling cluster working with all
nodes fetching during the fetch phase ?

I even ran some numbers with the wordcount example on 7 nodes at 100%
CPU usage, using a 425MB parsed-text file:

maps	reduces	heapsize	time
2	2	500	3m43.049s
4	4	500	4m41.846s
8	8	500	4m29.344s
16	16	500	3m43.672s
32	32	500	3m41.367s
64	64	500	4m27.275s
128	128	500	4m35.233s
256	256	500	3m41.916s
			
			
2	2	2000	4m31.434s
4	4	2000	
8	8	2000	
16	16	2000	4m32.213s
32	32	2000	
64	64	2000	
128	128	2000	
256	256	2000	4m38.310s

Thanks in advance,
Roman

On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <br...@gmail.com> wrote:
> While seeing DFS wireshark trace (and the corresponding RST's), the
> crawl continued to next step... seems that this WARNING is actually
> slowing down the whole crawling process (it took 36 minutes to
> complete the previous fetch) with just 3 urls seed file :-!!!
>
> I just posted a couple of exceptions/questions regarding DFS on hadoop
> core mailing list.
>
> PD: As a side note, the following error caught my attention:
>
> Fetcher: starting
> Fetcher: segment: crawl-ecxi/segments/20080715172458
> Too many fetch-failures
> task_200807151723_0005_m_000000_0: Fetcher: threads: 10
> task_200807151723_0005_m_000000_0: fetching http://upc.es/
> task_200807151723_0005_m_000000_0: fetching http://upc.edu/
> task_200807151723_0005_m_000000_0: fetching http://upc.cat/
> task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
> with: org.apache.nutch.protocol.http.api.HttpException:
> java.net.UnknownHostException: upc.cat
>
> Unknown host ?¿ Just try "http://upc.cat" on your browser, it *does*
> exist, it just gets redirected to www.upc.cat :-/
>
> On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <br...@gmail.com> wrote:
>> Yep, I know about wireshark, and wanted to avoid it to debug this
>> issue (perhaps there was a simple solution/known bug/issue)...
>>
>> I just launched wireshark on frontend with filter tcp.port == 50010,
>> and now I'm diving on the tcp stream... let's see if I see the light
>> (RST flag somewhere ?), thanks anyway for replying ;)
>>
>> Just for the record, the phase that stalls is fetcher during reduce:
>>
>> Jobid: job_200807151723_0005   User: hadoop
>> Name: fetch crawl-ecxi/segments/20080715172458
>> Map % Complete: 100.00%   Map Total: 2   Maps Completed: 2
>> Reduce % Complete: 16.66%   Reduce Total: 1   Reduces Completed: 0
>>
>> It's stuck on 16%, no traffic, no crawling, but still "running".
>>
>> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
>> <pm...@sim-gtech.com> wrote:
>>> Hi brain,
>>>        If I were you, I would download wireshark
>>> (http://www.wireshark.org/download.html) to see what is happening at the
>>> network layer and see if that provides any clues.  A socket exception
>>> that you don't expect is usually due to one side of the conversation not
>>> understanding the other side.  If you have 4 machines, then you have 4
>>> possible places where default firewall rules could be causing an issue.
>>> If it is not the firewall rules, the NAT rules could be a potential
>>> source of error.  Also, even a router hardware error could cause a
>>> problem.
>>>        If you understand TCP, just make sure that you see all the
>>> correct TCP stuff happening in wireshark.  If you don't understand
>>> wireshark's display, let me know, and I'll pass on some quickstart
>>> information.
>>>
>>>        If you already know all of this, I don't have any way to help
>>> you, as it looks like you're trying to accomplish something trickier
>>> with nutch than I have ever attempted.
>>>
>>> Patrick
>>>
>>> -----Original Message-----
>>> From: brainstorm [mailto:braincode@gmail.com]
>>> Sent: Tuesday, July 15, 2008 10:08 AM
>>> To: nutch-user@lucene.apache.org
>>> Subject: Re: Distributed fetching only happening in one node ?
>>>
>>> Boiling the problem down, I'm stuck on this:
>>>
>>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
>>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>>>        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>>>        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>>        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>>        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>>>        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>>>        at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>>>        at java.lang.Thread.run(Thread.java:595)
>>>
>>> I checked that the firewall settings between the node & the frontend are
>>> not blocking packets, and they aren't... does anyone know why this happens ?
>>> If not, could you suggest a convenient way to debug it ?
>>>
>>> Thanks !
>>>
>>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <br...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I'm running nutch+hadoop from trunk (rev) on a 4-machine Rocks
>>>> cluster: 1 frontend doing NAT for 3 leaf nodes. I know it's not the
>>>> best-suited network topology for inet crawling (the frontend being a
>>>> network bottleneck), but I think it's fine for testing purposes.
>>>>
>>>> I'm having issues with fetch mapreduce job:
>>>>
>>>> According to ganglia monitoring (network traffic), and hadoop
>>>> administrative interfaces, fetch phase is only being executed in the
>>>> frontend node, where I launched "nutch crawl". Previous nutch phases
>>>> were executed neatly distributed on all nodes:
>>>>
>>>> job_200807131223_0001   hadoop  inject urls
>>>>        maps: 2/2 (100.00%)   reduces: 1/1 (100.00%)
>>>> job_200807131223_0002   hadoop  crawldb crawl-ecxi/crawldb
>>>>        maps: 3/3 (100.00%)   reduces: 1/1 (100.00%)
>>>> job_200807131223_0003   hadoop  generate: select crawl-ecxi/segments/20080713123547
>>>>        maps: 3/3 (100.00%)   reduces: 1/1 (100.00%)
>>>> job_200807131223_0004   hadoop  generate: partition crawl-ecxi/segments/20080713123547
>>>>        maps: 4/4 (100.00%)   reduces: 2/2 (100.00%)
>>>>
>>>> I've checked that:
>>>>
>>>> 1) Nodes have inet connectivity, firewall settings
>>>> 2) There's enough space on local discs
>>>> 3) Proper processes are running on nodes
>>>>
>>>> frontend-node:
>>>> ==========
>>>>
>>>> [root@cluster ~]# jps
>>>> 29232 NameNode
>>>> 29489 DataNode
>>>> 29860 JobTracker
>>>> 29778 SecondaryNameNode
>>>> 31122 Crawl
>>>> 30137 TaskTracker
>>>> 10989 Jps
>>>> 1818 TaskTracker$Child
>>>>
>>>> leaf nodes:
>>>> ========
>>>>
>>>> [root@cluster ~]# cluster-fork jps
>>>> compute-0-1:
>>>> 23929 Jps
>>>> 15568 TaskTracker
>>>> 15361 DataNode
>>>> compute-0-2:
>>>> 32272 TaskTracker
>>>> 32065 DataNode
>>>> 7197 Jps
>>>> 2397 TaskTracker$Child
>>>> compute-0-3:
>>>> 12054 DataNode
>>>> 19584 Jps
>>>> 14824 TaskTracker$Child
>>>> 12261 TaskTracker
>>>>
>>>> 4) Logs only show fetching process (taking place only in the head
>>> node):
>>>>
>>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
>>>> http://valleycycles.net/
>>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>>> robots.txt for http://www.getting-forward.org/:
>>>> java.net.UnknownHostException: www.getting-forward.org
>>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>>> robots.txt for http://www.getting-forward.org/:
>>>> java.net.UnknownHostException: www.getting-forward.org
>>>>
>>>> What am I missing ? Why are there no fetching instances on the nodes ? I
>>>> used the following custom script to launch a pristine crawl each time:
>>>>
>>>> #!/bin/sh
>>>>
>>>> # 1) Stops hadoop daemons
>>>> # 2) Overwrites new url list on HDFS
>>>> # 3) Starts hadoop daemons
>>>> # 4) Performs a clean crawl
>>>>
>>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>>>>
>>>> CRAWL_DIR=${1:-crawl-ecxi}  # use the first argument if given, else the default
>>>> URL_DIR=${2:-urls}          # use the second argument if given, else the default
>>>>
>>>> echo $CRAWL_DIR
>>>> echo $URL_DIR
>>>>
>>>> echo "Leaving safe mode..."
>>>> ./hadoop dfsadmin -safemode leave
>>>>
>>>> echo "Removing seed urls directory and previous crawled content..."
>>>> ./hadoop dfs -rmr $URL_DIR
>>>> ./hadoop dfs -rmr $CRAWL_DIR
>>>>
>>>> echo "Removing past logs"
>>>>
>>>> rm -rf ../logs/*
>>>>
>>>> echo "Uploading seed urls..."
>>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>>>>
>>>> #echo "Entering safe mode..."
>>>> #./hadoop dfsadmin -safemode enter
>>>>
>>>> echo "******************"
>>>> echo "* STARTING CRAWL *"
>>>> echo "******************"
>>>>
>>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>>>>
>>>>
>>>> Next step I'm thinking on to fix the problem is to install
>>>> nutch+hadoop as specified in this past nutch-user mail:
>>>>
>>>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10225.html
>>>>
>>>> As I don't know if it's current practice on trunk (archived mail is
>>>> from Wed, 02 Jan 2008), I wanted to ask if there's another way to fix
>>>> it or if it's being worked on by someone... I haven't found a matching
>>>> bug on JIRA :_/
>>>>
>>>
>>
>

[SOLVED] Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
At last, fixed, thanks to Andrzej ! Fellow nutchers, please review
your hadoop-site.xml file, especially these settings:

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

You should set these values to something like 4 maps and 2 reduces.

*and*

mapred.map.tasks
mapred.reduce.tasks

You should set these values to something like 23 maps and 13 reduces.

Assuming you have an 8-node cluster ;)
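
Spelled out as a sketch (illustrative only, not something Nutch requires;
the class below is hypothetical, and the values are just the examples from
this thread for an 8-node cluster, so they will vary with your hardware):

// Illustrative sketch only: these keys normally live in each node's
// hadoop-site.xml; they are shown via the Hadoop Configuration API just
// to spell out the property names and example values.
import org.apache.hadoop.conf.Configuration;

public class ClusterTaskSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);    // per-node map slots
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2); // per-node reduce slots
    conf.setInt("mapred.map.tasks", 23);    // per-job total maps (a hint)
    conf.setInt("mapred.reduce.tasks", 13); // per-job total reduces
    System.out.println("mapred.map.tasks=" + conf.getInt("mapred.map.tasks", 2));
  }
}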

Regards,
Roman

On Thu, Aug 14, 2008 at 9:43 PM, brainstorm <br...@gmail.com> wrote:
> Sorry for the late reply... summer power outages in the building
> prevented me from running more tests on the cluster; now I'm back
> online... replying below.
>
> On Mon, Aug 11, 2008 at 5:59 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>> brainstorm wrote:
>>>
>>> On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>>>>
>>>> brainstorm wrote:
>>>>
>>>>> This is one example crawled segment:
>>>>>
>>>>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>>>>
>>>>> As you see, just one part-NNNN file is generated... in the conf file
>>>>> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
>>>>> suggested in previous emails).
>>>>
>>>> First of all - for a 7 node cluster the mapred.map.tasks should be set at
>>>> least to something around 23 or 31 or even higher, and the number of
>>>> reduce
>>>> tasks to e.g. 11.
>>>
>>>
>>>
>>> I see, now it makes more sense to me than just assigning 2 maps by
>>> default as suggested before... then, according to:
>>>
>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>
>>> Maps:
>>>
>>> Given:
>>> 64MB DFS blocks
>>> 500MB RAM per node
>>> 500MB on hadoop-env.sh HEAPSIZE variable (otherwise OutofHeapSpace
>>> exceptions occur)
>>>
>>> 31 maps... we'll see if it works. It would be cool to have a more
>>> precise "formula" to calculate this number in the nutch case. I assume
>>> that "23 to 31 or higher" is empirically determined by you: thanks for
>>> sharing your knowledge !
>>
>> That's already described on the Wiki page that you mention above ...
>>
>>
>>> Reduces:
>>> 1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) =
>>> 135
>>>
>>> This number is the total number of reduces running across the cluster
>>> nodes ?
>>
>> Hmm .. did you actually try running 11 simultaneous reduce tasks on each
>> node? It very much depends on the CPU, the amount of available RAM and the
>> heapsize of each task (mapred.child.java.opts). My experience is that it
>> takes beefy hardware to run more than ~4-5 reduce tasks per node - load
>> avg is above 20, CPU is pegged at 100% and disks are thrashing. YMMV of
>> course.
>>
>> Regarding the number - what you calculated is the upper bound of all
>> possible simultaneous tasks, assuming you have 7 nodes and each will run 11
>> tasks at the same time. This is not what I meant - I meant that you should
>> set the total number of reduces to 11 or so. What that page doesn't discuss
>> is that there is also some cost in job startup / finish, so there is a sweet
>> spot number somewhere that fits your current data size and your current
>> cluster. In other words, it's better not to run too many reduces, just the
>> right number so that individual sort operations run quickly, and tasks
>> occupy most of the available slots.
>>
>>
>>> In conclusion, as you predicted (and if the script is not horribly
>>> broken), the non-dmoz sample is quite homogeneous (there are lots of
>>> urls coming from auto-generated ad sites, for instance)... adding the
>>> fact that *a lot* of them lead to "Unknown host exceptions", the crawl
>>> ends being extremely slow.
>>>
>>> But that does not solve the fact that few nodes are actually fetching
>>> on DMOZ-based crawl. So next thing to try is to raise
>>> mapred.map.tasks.maximum as you suggested, should fix my issues... I
>>> hope so :/
>>
>> I suggest that you try first a value of 4-5, #maps = 23, and #reduces=7.
>>
>> Just to be sure ... are you sure you are running a distributed JobTracker?
>> Can you see the JobTracker UI in the browser?
>
>
>
> Yes, distributed JobTracker is running (full cluster mode), I can see
> all the tasks via :50030... but I'm having same results with your
> maps/reduces values: just two nodes are fetching.
>
> Could it be that the dmoz url input file (31KB) is not being split
> across all nodes due to the 64MB DFS block size ? (just one block
> "slot" for a 31KB file)... just wondering :/
>
>
>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>

Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
Sorry for the late reply... summer power outages in the building
prevented me from running more tests on the cluster; now I'm back
online... replying below.

On Mon, Aug 11, 2008 at 5:59 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> brainstorm wrote:
>>
>> On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>>>
>>> brainstorm wrote:
>>>
>>>> This is one example crawled segment:
>>>>
>>>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>>>
>>>> As you see, just one part-NNNN file is generated... in the conf file
>>>> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
>>>> suggested in previous emails).
>>>
>>> First of all - for a 7 node cluster the mapred.map.tasks should be set at
>>> least to something around 23 or 31 or even higher, and the number of
>>> reduce
>>> tasks to e.g. 11.
>>
>>
>>
>> I see, now it makes more sense to me than just assigning 2 maps by
>> default as suggested before... then, according to:
>>
>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>
>> Maps:
>>
>> Given:
>> 64MB DFS blocks
>> 500MB RAM per node
>> 500MB on hadoop-env.sh HEAPSIZE variable (otherwise OutofHeapSpace
>> exceptions occur)
>>
>> 31 maps... we'll see if it works. It would be cool to have a more
>> precise "formula" to calculate this number in the nutch case. I assume
>> that "23 to 31 or higher" is empirically determined by you: thanks for
>> sharing your knowledge !
>
> That's already described on the Wiki page that you mention above ...
>
>
>> Reduces:
>> 1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) =
>> 135
>>
>> This number is the total number of reduces running across the cluster
>> nodes ?
>
> Hmm .. did you actually try running 11 simultaneous reduce tasks on each
> node? It very much depends on the CPU, the amount of available RAM and the
> heapsize of each task (mapred.child.java.opts). My experience is that it
> takes beefy hardware to run more than ~4-5 reduce tasks per node - load
> avg is above 20, CPU is pegged at 100% and disks are thrashing. YMMV of
> course.
>
> Regarding the number - what you calculated is the upper bound of all
> possible simultaneous tasks, assuming you have 7 nodes and each will run 11
> tasks at the same time. This is not what I meant - I meant that you should
> set the total number of reduces to 11 or so. What that page doesn't discuss
> is that there is also some cost in job startup / finish, so there is a sweet
> spot number somewhere that fits your current data size and your current
> cluster. In other words, it's better not to run too many reduces, just the
> right number so that individual sort operations run quickly, and tasks
> occupy most of the available slots.
>
>
>> In conclusion, as you predicted (and if the script is not horribly
>> broken), the non-dmoz sample is quite homogeneous (there are lots of
>> urls coming from auto-generated ad sites, for instance)... adding the
>> fact that *a lot* of them lead to "Unknown host exceptions", the crawl
>> ends being extremely slow.
>>
>> But that does not solve the fact that few nodes are actually fetching
>> on DMOZ-based crawl. So next thing to try is to raise
>> mapred.map.tasks.maximum as you suggested, should fix my issues... I
>> hope so :/
>
> I suggest that you try first a value of 4-5, #maps = 23, and #reduces=7.
>
> Just to be sure ... are you sure you are running a distributed JobTracker?
> Can you see the JobTracker UI in the browser?



Yes, distributed JobTracker is running (full cluster mode), I can see
all the tasks via :50030... but I'm having same results with your
maps/reduces values: just two nodes are fetching.

Could it be that the dmoz url input file (31KB) is not being split
across all nodes due to the 64MB DFS block size ? (just one block
"slot" for a 31KB file)... just wondering :/



> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Distributed fetching only happening in one node ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
brainstorm wrote:
> On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>> brainstorm wrote:
>>
>>> This is one example crawled segment:
>>>
>>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>>
>>> As you see, just one part-NNNN file is generated... in the conf file
>>> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
>>> suggested in previous emails).
>> First of all - for a 7 node cluster the mapred.map.tasks should be set at
>> least to something around 23 or 31 or even higher, and the number of reduce
>> tasks to e.g. 11.
> 
> 
> 
> I see, now it makes more sense to me than just assigning 2 maps by
> default as suggested before... then, according to:
> 
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
> 
> Maps:
> 
> Given:
> 64MB DFS blocks
> 500MB RAM per node
> 500MB on hadoop-env.sh HEAPSIZE variable (otherwise OutofHeapSpace
> exceptions occur)
> 
> 31 maps... we'll see if it works. It would be cool to have a more
> precise "formula" to calculate this number in the nutch case. I assume
> that "23 to 31 or higher" is empirically determined by you: thanks for
> sharing your knowledge !

That's already described on the Wiki page that you mention above ...


> Reduces:
> 1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) = 135
> 
> This number is the total number of reduces running across the cluster nodes ?

Hmm .. did you actually try running 11 simultaneous reduce tasks on each 
node? It very much depends on the CPU, the amount of available RAM and 
the heapsize of each task (mapred.child.java.opts). My experience is 
that it takes beefy hardware to run more than ~4-5 reduce tasks per 
node - load avg is above 20, CPU is pegged at 100% and disks are 
thrashing. YMMV of course.

Regarding the number - what you calculated is the upper bound of all 
possible simultaneous tasks, assuming you have 7 nodes and each will run 
11 tasks at the same time. This is not what I meant - I meant that you 
should set the total number of reduces to 11 or so. What that page 
doesn't discuss is that there is also some cost in job startup / finish, 
so there is a sweet spot number somewhere that fits your current data 
size and your current cluster. In other words, it's better not to run 
too many reduces, just the right number so that individual sort 
operations run quickly, and tasks occupy most of the available slots.


> In conclusion, as you predicted (and if the script is not horribly
> broken), the non-dmoz sample is quite homogeneous (there are lots of
> urls coming from auto-generated ad sites, for instance)... adding the
> fact that *a lot* of them lead to "Unknown host exceptions", the crawl
> ends being extremely slow.
> 
> But that does not solve the fact that few nodes are actually fetching
> on DMOZ-based crawl. So next thing to try is to raise
> mapred.map.tasks.maximum as you suggested, should fix my issues... I
> hope so :/

I suggest that you try first a value of 4-5, #maps = 23, and #reduces=7.

Just to be sure ... are you sure you are running a distributed 
JobTracker? Can you see the JobTracker UI in the browser?

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> brainstorm wrote:
>
>> This is one example crawled segment:
>>
>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>
>> As you see, just one part-NNNN file is generated... in the conf file
>> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
>> suggested in previous emails).
>
> First of all - for a 7 node cluster the mapred.map.tasks should be set at
> least to something around 23 or 31 or even higher, and the number of reduce
> tasks to e.g. 11.



I see, now it makes more sense to me than just assigning 2 maps by
default as suggested before... then, according to:

http://wiki.apache.org/hadoop/HowManyMapsAndReduces

Maps:

Given:
64MB DFS blocks
500MB RAM per node
500MB on hadoop-env.sh HEAPSIZE variable (otherwise OutofHeapSpace
exceptions occur)

31 maps... we'll see if it works. It would be cool to have a more
precise "formula" to calculate this number in the nutch case. I assume
that "23 to 31 or higher" is empirically determined by you: thanks for
sharing your knowledge !

Reduces:
1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) = 135

This number is the total number of reduces running across the cluster nodes ?




> Secondly - you should not put this property in nutch-site.xml, instead it
> should be put in mapred-default.xml or hadoop-site.xml. I lost track of
> which version of Nutch / Hadoop you are using ... if it's Hadoop 0.12.x,
> then you need to be careful about where you put mapred.map.tasks, and it has
> to be placed in mapred-default.xml. If it's a more recent Hadoop version
> then you can put these values in hadoop-site.xml.



My fault ! I actually meant hadoop-site.xml... besides,
mapred-default.xml is ignored by hadoop in my case:

I'm using hadoop 0.17.1, included in the latest nutch trunk as of now.



> And finally - what is the distribution of urls in your seed list among
> unique hosts? I.e. how many urls come from a single host? Guessing from the
> path above - if you are trying to do a DMOZ crawl, then the distribution
> should be ok. I've done a DMOZ crawl a month ago, using the then current
> trunk/ and all was working well.



I've made the following ruby snippet to get an idea of the
distribution of the input url list (it is perhaps not a paragon of
correctness and accuracy, but I think it more or less shows what we're
looking for):

#!/usr/bin/ruby
# invert_urls.rb
require 'pp'

dist = {}

STDIN.readlines.each do |url|
  url = url.strip[7..-1]  # strip leading "http://"
  url = url.split(".")    # array context for "proper" reverse
  url.reverse!

  # hash context, discarding duplicate urls (just till the 1st level)
  dist.merge!({url[0] + '.' + url[1] => ""})
end

pp dist

rvalls@escerttop:~/bin$ wc -l urls.txt
2500001 urls.txt

rvalls@escerttop:~/bin$ ./invert_urls.rb < urls.txt > unique

rvalls@escerttop:~/bin$ wc -l unique
1706762 unique <---- coming from *different* hosts


So there are roughly 790000 (2500001-1706762) "repeated" urls... 31%
of the sample

rvalls@escerttop:~/bin$ head urls.txt
http://business-card-flyer-free-post.qoxa.info
http://www.download-art.com
http://catcenter.co.uk
http://761.hidbi.info
http://seemovie.movblogs.com
http://clearwaterart.com
http://www.travel-insurance-guide.org
http://www.pc-notdienst.at
http://projec-txt.cn
http://www.yoraispage.com

rvalls@escerttop:~/bin$ head unique
{"de.tsv-nellmersbach"=>"",
 "cn.color-that-pokemon"=>"",
 "com.bluestar-studio"=>"",
 "it.vaisardegna"=>"",
 "com.bramalearangersclub"=>"",
 "org.fpc-hou"=>"",
 "com.warhotel"=>"",
 "com.tokayblue"=>"",
 "be.wgreekwaves"=>"",
 "org.fairhopelibrary"=>"",

Comparing with DMOZ sample:

rvalls@escerttop:~/bin$ ./invert_urls.rb < random-dmoz-20080806.txt > unique
rvalls@escerttop:~/bin$ wc -l random-dmoz-20080806.txt
908 random-dmoz-20080806.txt
rvalls@escerttop:~/bin$ wc -l unique
788 unique

...13% repeated urls

In conclusion, as you predicted (and if the script is not horribly
broken), the non-dmoz sample is quite homogeneous (there are lots of
urls coming from auto-generated ad sites, for instance)... adding the
fact that *a lot* of them lead to "Unknown host exceptions", the crawl
ends being extremely slow.

But that does not solve the fact that few nodes are actually fetching
on DMOZ-based crawl. So next thing to try is to raise
mapred.map.tasks.maximum as you suggested, should fix my issues... I
hope so :/

Thanks !




> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Distributed fetching only happening in one node ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
brainstorm wrote:

> This is one example crawled segment:
> 
> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
> 
> As you see, just one part-NNNN file is generated... in the conf file
> (nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
> suggested in previous emails).

First of all - for a 7 node cluster the mapred.map.tasks should be set 
at least to something around 23 or 31 or even higher, and the number of 
reduce tasks to e.g. 11.

Secondly - you should not put this property in nutch-site.xml, instead 
it should be put in mapred-default.xml or hadoop-site.xml. I lost track 
of which version of Nutch / Hadoop you are using ... if it's Hadoop 
0.12.x, then you need to be careful about where you put 
mapred.map.tasks, and it has to be placed in mapred-default.xml. If it's 
a more recent Hadoop version then you can put these values in 
hadoop-site.xml.

And finally - what is the distribution of urls in your seed list among 
unique hosts? I.e. how many urls come from a single host? Guessing from 
the path above - if you are trying to do a DMOZ crawl, then the 
distribution should be ok. I've done a DMOZ crawl a month ago, using the 
then current trunk/ and all was working well.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
On Fri, Aug 8, 2008 at 1:18 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> brainstorm wrote:
>>
>> It was wondering... if I split the input urls like this:
>>
>> url1.txt url2.txt ... urlN.txt
>>
>> Will this input spread map jobs to N nodes ? Right now I'm using just
>
> No, it won't - because these files are first added to a crawldb, and only
> then Generator creates partial fetchlists out of the whole crawldb.
>
> Here's how it works:
>
> * Generator first prepares the list of candidate urls for fetching
>
> * then it applies limits e.g. maximum number of urls per host
>
> * and finally partitions the fetchlist so that all urls from the same host
> end up in the same partition. The number of output partitions from Generator
> is equal to the default number of map tasks. Why? because Fetcher will
> create one map task per each partition in the fetchlist.



Somebody said that 2 mapred.map.tasks was OK for a 7-node cluster
setup, but using greater values for mapred.map.tasks (tested from 2
up to 256) does not alter the output or fix the problem: no additional
part-XXXX files are generated for the maps and no additional nodes
participate in fetching :/

What should I do ?



> So - please check how many part-NNNNN files you have in the generated
> fetchlist.



This is one example crawled segment:

/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you see, just one part-NNNN file is generated... in the conf file
(nutch-site.xml) mapred.map.tasks is set to 2 (default value, as
suggested in previous emails).



Thanks for your support ! ;)


>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Distributed fetching only happening in one node ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
brainstorm wrote:
> It was wondering... if I split the input urls like this:
> 
> url1.txt url2.txt ... urlN.txt
> 
> Will this input spread map jobs to N nodes ? Right now I'm using just

No, it won't - because these files are first added to a crawldb, and 
only then Generator creates partial fetchlists out of the whole crawldb.

Here's how it works:

* Generator first prepares the list of candidate urls for fetching

* then it applies limits e.g. maximum number of urls per host

* and finally partitions the fetchlist so that all urls from the same 
host end up in the same partition. The number of output partitions from 
Generator is equal to the default number of map tasks. Why? because 
Fetcher will create one map task per each partition in the fetchlist.

So - please check how many part-NNNNN files you have in the generated 
fetchlist.
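
To illustrate that last point, here is a rough sketch of the host-based
partitioning idea (this is not the actual Nutch partitioner; the class name
and hash below are just illustrative assumptions):

// Rough sketch of the idea only (not the actual Nutch partitioner class):
// every url is assigned to a partition based on its host, so one partition
// equals one fetch map task and one polite per-host queue.
import java.net.URL;

public class HostPartitionSketch {
  static int partitionFor(String url, int numPartitions) throws Exception {
    String host = new URL(url).getHost();
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) throws Exception {
    // A homogeneous fetchlist (mostly urls from one host) collapses into a
    // single partition, which is why only one node ends up fetching.
    System.out.println(partitionFor("http://upc.es/", 23));
    System.out.println(partitionFor("http://upc.es/docs/", 23)); // same partition
    System.out.println(partitionFor("http://upc.edu/", 23));     // likely a different one
  }
}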


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Distributed fetching only happening in one node ?

Posted by Alexander Aristov <al...@gmail.com>.
It should make no difference, as all urls from all files in the directory
are injected first.

Alex

2008/8/8 brainstorm <br...@gmail.com>

> It was wondering... if I split the input urls like this:
>
> url1.txt url2.txt ... urlN.txt
>
> Will this input spread map jobs to N nodes ? Right now I'm using just
> one (big) urls.txt file (just 2 nodes actually fetching).
>
> Thanks in advance,
> Roman
>
> On Wed, Aug 6, 2008 at 11:51 PM, brainstorm <br...@gmail.com> wrote:
> > On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> >> brainstorm wrote:
> >>>
> >>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
> >>> values 2 and 1 respectively *in the past*, same results. Right now, I
> >>> have 32 for both: same results as those settings are just a hint for
> >>> nutch.
> >>>
> >>> Regarding number of threads *per host* I tried with 10 and 20 in the
> >>> past, same results.
> >>
> >> Indeed, the default number of maps and reduces can be changed for any
> >> particular job - the number of maps is adjusted according to the number
> of
> >> input splits (InputFormat.getSplits()), and the number of reduces can be
> >> adjusted programmatically in the application.
> >
> >
> >
> > For now, my focus is on using nutch commandline tool:
> >
> > $ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5
> >
> > I assume (perhaps incorrectly), that nutch will determine the number
> > of maps & reduces dynamically. Is it true or should I switch to a
> > custom coded crawler using nutch API ?
> >
> > Btw, having a look at "getSplits", I suspect that Fetcher does
> > precisely what I don't want it to do: does not split inputs... then,
> > the less input splits, the less maps will be spread on nodes on fetch
> > phase, am I wrong ?:
> >
> > public class Fetcher
> > (...)
> >
> >    /** Don't split inputs, to keep things polite. */
> >    public InputSplit[] getSplits(JobConf job, int nSplits)
> >      throws IOException {
> >      Path[] files = listPaths(job);
> >      FileSystem fs = FileSystem.get(job);
> >      InputSplit[] splits = new InputSplit[files.length];
> >      for (int i = 0; i < files.length; i++) {
> >        splits[i] = new FileSplit(files[i], 0,
> >            fs.getFileStatus(files[i]).getLen(), (String[])null);
> >      }
> >      return splits;
> >    }
> >  }
> >
> >
> > Thanks in advance,
> > Roman
> >
> >
> >
> >> Back to your issue: I suspect that your fetchlist is highly homogenous,
> i.e.
> >> contains urls from a single host. Nutch makes sure that all urls from a
> >> single host end up in a single map task, to ensure the politeness
> settings,
> >> so that's probably why you see only a single map task fetching all urls.
> >>
> >>
> >> --
> >> Best regards,
> >> Andrzej Bialecki     <><
> >>  ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> >>
> >
>



-- 
Best Regards
Alexander Aristov

Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
It was wondering... if I split the input urls like this:

url1.txt url2.txt ... urlN.txt

Will this input spread map jobs to N nodes ? Right now I'm using just
one (big) urls.txt file (just 2 nodes actually fetching).

Thanks in advance,
Roman

On Wed, Aug 6, 2008 at 11:51 PM, brainstorm <br...@gmail.com> wrote:
> On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>> brainstorm wrote:
>>>
>>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>>> values 2 and 1 respectively *in the past*, same results. Right now, I
>>> have 32 for both: same results as those settings are just a hint for
>>> nutch.
>>>
>>> Regarding number of threads *per host* I tried with 10 and 20 in the
>>> past, same results.
>>
>> Indeed, the default number of maps and reduces can be changed for any
>> particular job - the number of maps is adjusted according to the number of
>> input splits (InputFormat.getSplits()), and the number of reduces can be
>> adjusted programmatically in the application.
>
>
>
> For now, my focus is on using nutch commandline tool:
>
> $ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5
>
> I assume (perhaps incorrectly), that nutch will determine the number
> of maps & reduces dynamically. Is it true or should I switch to a
> custom coded crawler using nutch API ?
>
> Btw, having a look at "getSplits", I suspect that Fetcher does
> precisely what I don't want it to do: does not split inputs... then,
> the less input splits, the less maps will be spread on nodes on fetch
> phase, am I wrong ?:
>
> public class Fetcher
> (...)
>
>    /** Don't split inputs, to keep things polite. */
>    public InputSplit[] getSplits(JobConf job, int nSplits)
>      throws IOException {
>      Path[] files = listPaths(job);
>      FileSystem fs = FileSystem.get(job);
>      InputSplit[] splits = new InputSplit[files.length];
>      for (int i = 0; i < files.length; i++) {
>        splits[i] = new FileSplit(files[i], 0,
>            fs.getFileStatus(files[i]).getLen(), (String[])null);
>      }
>      return splits;
>    }
>  }
>
>
> Thanks in advance,
> Roman
>
>
>
>> Back to your issue: I suspect that your fetchlist is highly homogenous, i.e.
>> contains urls from a single host. Nutch makes sure that all urls from a
>> single host end up in a single map task, to ensure the politeness settings,
>> so that's probably why you see only a single map task fetching all urls.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>

Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> brainstorm wrote:
>>
>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>> values 2 and 1 respectively *in the past*, same results. Right now, I
>> have 32 for both: same results as those settings are just a hint for
>> nutch.
>>
>> Regarding number of threads *per host* I tried with 10 and 20 in the
>> past, same results.
>
> Indeed, the default number of maps and reduces can be changed for any
> particular job - the number of maps is adjusted according to the number of
> input splits (InputFormat.getSplits()), and the number of reduces can be
> adjusted programmatically in the application.



For now, my focus is on using nutch commandline tool:

$ bin/nutch crawl $URL_DIR_DFS -dir $CRAWL_DIR -depth 5

I assume (perhaps incorrectly), that nutch will determine the number
of maps & reduces dynamically. Is it true or should I switch to a
custom coded crawler using nutch API ?

Btw, having a look at "getSplits", I suspect that Fetcher does
precisely what I don't want it to do: does not split inputs... then,
the less input splits, the less maps will be spread on nodes on fetch
phase, am I wrong ?:

public class Fetcher
(...)

    /** Don't split inputs, to keep things polite. */
    public InputSplit[] getSplits(JobConf job, int nSplits)
      throws IOException {
      Path[] files = listPaths(job);
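      // Note: one FileSplit per input file, i.e. (as Andrzej explains) one map
      // task per part-NNNNN partition written by Generator; splits are never
      // subdivided further.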
      FileSystem fs = FileSystem.get(job);
      InputSplit[] splits = new InputSplit[files.length];
      for (int i = 0; i < files.length; i++) {
        splits[i] = new FileSplit(files[i], 0,
            fs.getFileStatus(files[i]).getLen(), (String[])null);
      }
      return splits;
    }
  }


Thanks in advance,
Roman



> Back to your issue: I suspect that your fetchlist is highly homogenous, i.e.
> contains urls from a single host. Nutch makes sure that all urls from a
> single host end up in a single map task, to ensure the politeness settings,
> so that's probably why you see only a single map task fetching all urls.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Distributed fetching only happening in one node ?

Posted by Jordan Mendler <jm...@ucla.edu>.
I am looking to do the same thing. If anyone finds a way, please post here.

Thanks,
Jordan

On Sun, Aug 10, 2008 at 11:31 AM, soila <sp...@ece.cmu.edu> wrote:

>
> Hi Andrzej,
>
> I am experiencing similar problems distributing the fetch across multiple
> nodes. I am crawling a single host in an intranet and I would like to know
> how I can modify nutch's behavior so that it distributes the search over
> multiple nodes.
>
> Soila
>
> Andrzej Bialecki wrote:
> >
> > brainstorm wrote:
> >> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
> >> values 2 and 1 respectively *in the past*, same results. Right now, I
> >> have 32 for both: same results as those settings are just a hint for
> >> nutch.
> >>
> >> Regarding number of threads *per host* I tried with 10 and 20 in the
> >> past, same results.
> >
> > Indeed, the default number of maps and reduces can be changed for any
> > particular job - the number of maps is adjusted according to the number
> > of input splits (InputFormat.getSplits()), and the number of reduces can
> > be adjusted programmatically in the application.
> >
> > Back to your issue: I suspect that your fetchlist is highly homogenous,
> > i.e. contains urls from a single host. Nutch makes sure that all urls
> > from a single host end up in a single map task, to ensure the politeness
> > settings, so that's probably why you see only a single map task fetching
> > all urls.
> >
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >   ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Distributed-fetching-only-happening-in-one-node---tp18429531p18915705.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>

Re: Distributed fetching only happening in one node ?

Posted by soila <sp...@ece.cmu.edu>.
Hi Andrzej,

I am experiencing similar problems distributing the fetch across multiple
nodes. I am crawling a single host in an intranet and I would like to know
how I can modify nutch's behavior so that it distributes the search over
multiple nodes.

Soila

Andrzej Bialecki wrote:
> 
> brainstorm wrote:
>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>> values 2 and 1 respectively *in the past*, same results. Right now, I
>> have 32 for both: same results as those settings are just a hint for
>> nutch.
>> 
>> Regarding number of threads *per host* I tried with 10 and 20 in the
>> past, same results.
> 
> Indeed, the default number of maps and reduces can be changed for any 
> particular job - the number of maps is adjusted according to the number 
> of input splits (InputFormat.getSplits()), and the number of reduces can 
> be adjusted programmatically in the application.
> 
> Back to your issue: I suspect that your fetchlist is highly homogenous, 
> i.e. contains urls from a single host. Nutch makes sure that all urls 
> from a single host end up in a single map task, to ensure the politeness 
> settings, so that's probably why you see only a single map task fetching 
> all urls.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Distributed-fetching-only-happening-in-one-node---tp18429531p18915705.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
Andrzej, thanks for your advice... I was using a 20MB url list
provided by our customers; I'll have to write a script to determine how
homogeneous the input seed urls file is.

As a preliminary test, I've run a crawl using the integrated Nutch DMOZ
parser (as suggested in the official Nutch tutorial), which I assume
chooses urls in a more heterogeneous fashion. Is the resulting url
list a random enough sample ? ... In fact, since DMOZ is a directory, the
number of repeated urls should be low, shouldn't it ?

The bad news is that I'm getting the same results, just two nodes[1] are
actually fetching :_( So I guess the problem is somewhere else (I
already left the number of maps & reduces at 2 and 1 as suggested in
this thread).

Any further ideas/tests/fixes ?

Thanks a lot for your patience and support,
Roman

[1] one of them being the frontend (invariably) and the other one, a
random node on each new crawl.

On Tue, Aug 5, 2008 at 12:07 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> brainstorm wrote:
>>
>> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
>> values 2 and 1 respectively *in the past*, same results. Right now, I
>> have 32 for both: same results as those settings are just a hint for
>> nutch.
>>
>> Regarding number of threads *per host* I tried with 10 and 20 in the
>> past, same results.
>
> Indeed, the default number of maps and reduces can be changed for any
> particular job - the number of maps is adjusted according to the number of
> input splits (InputFormat.getSplits()), and the number of reduces can be
> adjusted programmatically in the application.
>
> Back to your issue: I suspect that your fetchlist is highly homogenous, i.e.
> contains urls from a single host. Nutch makes sure that all urls from a
> single host end up in a single map task, to ensure the politeness settings,
> so that's probably why you see only a single map task fetching all urls.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Distributed fetching only happening in one node ?

Posted by Andrzej Bialecki <ab...@getopt.org>.
brainstorm wrote:
> Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
> values 2 and 1 respectively *in the past*, same results. Right now, I
> have 32 for both: same results as those settings are just a hint for
> nutch.
> 
> Regarding number of threads *per host* I tried with 10 and 20 in the
> past, same results.

Indeed, the default number of maps and reduces can be changed for any 
particular job - the number of maps is adjusted according to the number 
of input splits (InputFormat.getSplits()), and the number of reduces can 
be adjusted programmatically in the application.
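
For illustration only (a hypothetical job setup, not Nutch's own code, with
example values from this thread), that looks roughly like:

// Hypothetical job setup, shown only to illustrate the point above.
import org.apache.hadoop.mapred.JobConf;

public class JobSetupSketch {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    job.setNumMapTasks(23);    // only a hint; the real count follows InputFormat.getSplits()
    job.setNumReduceTasks(11); // honoured as given
  }
}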

Back to your issue: I suspect that your fetchlist is highly homogenous, 
i.e. contains urls from a single host. Nutch makes sure that all urls 
from a single host end up in a single map task, to ensure the politeness 
settings, so that's probably why you see only a single map task fetching 
all urls.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
Sure, I tried with mapred.map.tasks and mapred.reduce.tasks with
values 2 and 1 respectively *in the past*, same results. Right now, I
have 32 for both: same results as those settings are just a hint for
nutch.

Regarding number of threads *per host* I tried with 10 and 20 in the
past, same results.

I appreciate your support Alexander, thank you :)

On Tue, Aug 5, 2008 at 9:17 AM, Alexander Aristov
<al...@gmail.com> wrote:
> Still not clear.
>
> What values for mapred.map.tasks and mapred.reduce.tasks do you have now?
> Check the hadoop-site.xml file as it may affect your configuration also.
>
> Alexander
>
> 2008/8/5 brainstorm <br...@gmail.com>
>
>> Correction: Only 2 nodes doing map operation on fetch (nodes 7 and 2).
>>
>> On Tue, Aug 5, 2008 at 9:11 AM, brainstorm <br...@gmail.com> wrote:
>> > Right, I've checked before with mapred.map.tasks to 2 and
>> > mapred.reduce.tasks to 1.
>> >
>> > I've also played with several values on the following settings:
>> >
>> > <property>
>> >  <name>fetcher.server.delay</name>
>> >  <value>1.5</value>
>> >  <description>The number of seconds the fetcher will delay between
>> >   successive requests to the same server.</description>
>> > </property>
>> >
>> > <property>
>> >  <name>http.max.delays</name>
>> >  <value>3</value>
>> >  <description>The number of times a thread will delay when trying to
>> >  fetch a page.  Each time it finds that a host is busy, it will wait
>> >  fetcher.server.delay.  After http.max.delays attempts, it will give
>> >  up on the page for now.</description>
>> > </property>
>> >
>> > Only one node executes the fetch phase anyway :_(
>> >
>> > Thanks for the hint anyway... more ideas ?
>> >
>> > On Tue, Aug 5, 2008 at 8:04 AM, Alexander Aristov
>> > <al...@gmail.com> wrote:
>> >> Hi
>> >>
>> >> 1. You should have set
>> >> mapred.map.tasks
>> >> and
>> >> mapred.reduce.tasks parameters They are set to 2 and 1 by default.
>> >>
>> >> 2. You can specify number of threads to perform fetching. Also there is
>> a
>> >> parameter that slows down fetching from one URL,so called polite
>> fetching to
>> >> not DOS the site.
>> >>
>> >> So check you configuration.
>> >>
>> >> Alex
>> >>
>> >> 2008/8/5 brainstorm <br...@gmail.com>
>> >>
>> >>> Ok, DFS warnings problem solved, seems that hadoop-0.17.1 patch fixes
>> >>> the warnings... BUT, on a 7-node nutch cluster:
>> >>>
>> >>> 1) Fetching is only happening on *one* node despite several values
>> >>> tested on settings:
>> >>> mapred.tasktracker.map.tasks.maximum
>> >>> mapred.tasktracker.reduce.tasks.maximum
>> >>> export HADOOP_HEAPSIZE
>> >>>
>> >>> I've played with mapreduce (hadoop-site.xml) settings as advised on:
>> >>>
>> >>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>> >>>
>> >>> But nutch keeps crawling only using one node, instead of seven
>> >>> nodes... anybody knows why ?
>> >>>
>> >>> I've had a look at the code, searching for:
>> >>>
>> >>> conf.setNumMapTasks(int num), but found none: so I guess that the
>> >>> number of mappers & reducers is not limited programmatically.
>> >>>
>> >>> 2) Even on a single node, the fetching is really slow: 1 url or page
>> >>> per second, at most.
>> >>>
>> >>> Can anybody shed some light into this ? Pointing which class/code I
>> >>> should look into to modify this behaviour will help also.
>> >>>
>> >>> Anybody has a distributed nutch crawling cluster working with all
>> >>> nodes fetching at fetch phase ?
>> >>>
>> >>> I even did some numbers using wordcount example using 7 nodes at 100%
>> >>> cpu usage using a 425MB parsedtext file:
>> >>>
>> >>> maps    reduces heapsize        time
>> >>> 2       2       500     3m43.049s
>> >>> 4       4       500     4m41.846s
>> >>> 8       8       500     4m29.344s
>> >>> 16      16      500     3m43.672s
>> >>> 32      32      500     3m41.367s
>> >>> 64      64      500     4m27.275s
>> >>> 128     128     500     4m35.233s
>> >>> 256     256     500     3m41.916s
>> >>>
>> >>>
>> >>> 2       2       2000    4m31.434s
>> >>> 4       4       2000
>> >>> 8       8       2000
>> >>> 16      16      2000    4m32.213s
>> >>> 32      32      2000
>> >>> 64      64      2000
>> >>> 128     128     2000
>> >>> 256     256     2000    4m38.310s
>> >>>
>> >>> Thanks in advance,
>> >>> Roman
>> >>>
>> >>> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <br...@gmail.com>
>> wrote:
>> >>> > While seeing DFS wireshark trace (and the corresponding RST's), the
>> >>> > crawl continued to next step... seems that this WARNING is actually
>> >>> > slowing down the whole crawling process (it took 36 minutes to
>> >>> > complete the previous fetch) with just 3 urls seed file :-!!!
>> >>> >
>> >>> > I just posted a couple of exceptions/questions regarding DFS on
>> hadoop
>> >>> > core mailing list.
>> >>> >
>> >>> > PD: As a side note, the following error caught my attention:
>> >>> >
>> >>> > Fetcher: starting
>> >>> > Fetcher: segment: crawl-ecxi/segments/20080715172458
>> >>> > Too many fetch-failures
>> >>> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
>> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
>> >>> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
>> >>> > with: org.apache.nutch.protocol.http.api.HttpException:
>> >>> > java.net.UnknownHostException: upc.cat
>> >>> >
>> >>> > Unknown host ?¿ Just try "http://upc.cat" on your browser, it *does*
>> >>> > exist, it just gets redirected to www.upc.cat :-/
>> >>> >
>> >>> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <br...@gmail.com>
>> wrote:
>> >>> >> Yep, I know about wireshark, and wanted to avoid it to debug this
>> >>> >> issue (perhaps there was a simple solution/known bug/issue)...
>> >>> >>
>> >>> >> I just launched wireshark on frontend with filter tcp.port == 50010,
>> >>> >> and now I'm diving on the tcp stream... let's see if I see the light
>> >>> >> (RST flag somewhere ?), thanks anyway for replying ;)
>> >>> >>
>> >>> >> Just for the record, the phase that stalls is fetcher during reduce:
>> >>> >>
>> >>> >> Jobid   User    Name    Map % Complete  Map Total       Maps
>> Completed
>> >>>  Reduce %
>> >>> >> Complete        Reduce Total    Reduces Completed
>> >>> >> job_200807151723_0005   hadoop  fetch
>> crawl-ecxi/segments/20080715172458
>> >>>        100.00%
>> >>> >>        2       2       16.66%
>> >>> >>
>> >>> >>        1       0
>> >>> >>
>> >>> >> It's stuck on 16%, no traffic, no crawling, but still "running".
>> >>> >>
>> >>> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
>> >>> >> <pm...@sim-gtech.com> wrote:
>> >>> >>> Hi brain,
>> >>> >>>        If I were you, I would download wireshark
>> >>> >>> (http://www.wireshark.org/download.html) to see what is happening
>> at
>> >>> the
>> >>> >>> network layer and see if that provides any clues.  A socket
>> exception
>> >>> >>> that you don't expect is usually due to one side of the
>> conversation
>> >>> not
>> >>> >>> understanding the other side.  If you have 4 machines, then you
>> have 4
>> >>> >>> possible places where default firewall rules could be causing an
>> issue.
>> >>> >>> If it is not the firewall rules, the NAT rules could be a potential
>> >>> >>> source of error.  Also, even a router hardware error could cause a
>> >>> >>> problem.
>> >>> >>>        If you understand TCP, just make sure that you see all the
>> >>> >>> correct TCP stuff happening in wireshark.  If you don't understand
>> >>> >>> wireshark's display, let me know, and I'll pass on some quickstart
>> >>> >>> information.
>> >>> >>>
>> >>> >>>        If you already know all of this, I don't have any way to
>> help
>> >>> >>> you, as it looks like you're trying to accomplish something
>> trickier
>> >>> >>> with nutch than I have ever attempted.
>> >>> >>>
>> >>> >>> Patrick
>> >>> >>>
>> >>> >>> -----Original Message-----
>> >>> >>> From: brainstorm [mailto:braincode@gmail.com]
>> >>> >>> Sent: Tuesday, July 15, 2008 10:08 AM
>> >>> >>> To: nutch-user@lucene.apache.org
>> >>> >>> Subject: Re: Distributed fetching only happening in one node ?
>> >>> >>>
>> >>> >>> Boiling down the problem I'm stuck on this:
>> >>> >>>
>> >>> >>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
>> >>> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>> >>> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>> >>> >>>        at
>> >>> >>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>> >>> >>>        at
>> >>> >>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>> >>> >>>        at
>> >>> >>>
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> >>> >>>        at
>> >>> >>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>> >>> >>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >>> >>>        at
>> >>> >>>
>> >>>
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>> >>> >>>        at
>> >>> >>>
>> >>>
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>> >>> >>>        at
>> >>> >>> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>> >>> >>>        at java.lang.Thread.run(Thread.java:595)
>> >>> >>>
>> >>> >>> Checked that firewall settings between node & frontend were not
>> >>> >>> blocking packets, and they don't... anyone knows why is this ? If
>> not,
>> >>> >>> could you provide a convenient way to debug it ?
>> >>> >>>
>> >>> >>> Thanks !
>> >>> >>>
>> >>> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <br...@gmail.com>
>> >>> wrote:
>> >>> >>>> Hi,
>> >>> >>>>
>> >>> >>>> I'm running nutch+hadoop from trunk (rev) on a 4 machine rocks
>> >>> >>>> cluster: 1 frontend doing NAT to 3 leaf nodes. I know it's not the
>> >>> >>>> best suited network topology for inet crawling (frontend being a
>> net
>> >>> >>>> bottleneck), but I think it's fine for testing purposes.
>> >>> >>>>
>> >>> >>>> I'm having issues with fetch mapreduce job:
>> >>> >>>>
>> >>> >>>> According to ganglia monitoring (network traffic), and hadoop
>> >>> >>>> administrative interfaces, fetch phase is only being executed in
>> the
>> >>> >>>> frontend node, where I launched "nutch crawl". Previous nutch
>> phases
>> >>> >>>> were executed neatly distributed on all nodes:
>> >>> >>>>
>> >>> >>>> job_200807131223_0001   hadoop  inject urls     100.00%
>> >>> >>>>        2       2       100.00%
>> >>> >>>>        1       1
>> >>> >>>> job_200807131223_0002   hadoop  crawldb crawl-ecxi/crawldb
>> >>> >>> 100.00%
>> >>> >>>>        3       3       100.00%
>> >>> >>>>        1       1
>> >>> >>>> job_200807131223_0003   hadoop  generate: select
>> >>> >>>> crawl-ecxi/segments/20080713123547      100.00%
>> >>> >>>>        3       3       100.00%
>> >>> >>>>        1       1
>> >>> >>>> job_200807131223_0004   hadoop  generate: partition
>> >>> >>>> crawl-ecxi/segments/20080713123547      100.00%
>> >>> >>>>        4       4       100.00%
>> >>> >>>>        2       2
>> >>> >>>>
>> >>> >>>> I've checked that:
>> >>> >>>>
>> >>> >>>> 1) Nodes have inet connectivity, firewall settings
>> >>> >>>> 2) There's enough space on local discs
>> >>> >>>> 3) Proper processes are running on nodes
>> >>> >>>>
>> >>> >>>> frontend-node:
>> >>> >>>> ==========
>> >>> >>>>
>> >>> >>>> [root@cluster ~]# jps
>> >>> >>>> 29232 NameNode
>> >>> >>>> 29489 DataNode
>> >>> >>>> 29860 JobTracker
>> >>> >>>> 29778 SecondaryNameNode
>> >>> >>>> 31122 Crawl
>> >>> >>>> 30137 TaskTracker
>> >>> >>>> 10989 Jps
>> >>> >>>> 1818 TaskTracker$Child
>> >>> >>>>
>> >>> >>>> leaf nodes:
>> >>> >>>> ========
>> >>> >>>>
>> >>> >>>> [root@cluster ~]# cluster-fork jps
>> >>> >>>> compute-0-1:
>> >>> >>>> 23929 Jps
>> >>> >>>> 15568 TaskTracker
>> >>> >>>> 15361 DataNode
>> >>> >>>> compute-0-2:
>> >>> >>>> 32272 TaskTracker
>> >>> >>>> 32065 DataNode
>> >>> >>>> 7197 Jps
>> >>> >>>> 2397 TaskTracker$Child
>> >>> >>>> compute-0-3:
>> >>> >>>> 12054 DataNode
>> >>> >>>> 19584 Jps
>> >>> >>>> 14824 TaskTracker$Child
>> >>> >>>> 12261 TaskTracker
>> >>> >>>>
>> >>> >>>> 4) Logs only show fetching process (taking place only in the head
>> >>> >>> node):
>> >>> >>>>
>> >>> >>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
>> >>> >>>> http://valleycycles.net/
>> >>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>> >>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>> >>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>> >>>>
>> >>> >>>> What am I missing ? Why there are no fetching instances on nodes ?
>> I
>> >>> >>>> used the following custom script to launch a pristine crawl each
>> time:
>> >>> >>>>
>> >>> >>>> #!/bin/sh
>> >>> >>>>
>> >>> >>>> # 1) Stops hadoop daemons
>> >>> >>>> # 2) Overwrites new url list on HDFS
>> >>> >>>> # 3) Starts hadoop daemons
>> >>> >>>> # 4) Performs a clean crawl
>> >>> >>>>
>> >>> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>> >>> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>> >>> >>>>
>> >>> >>>> CRAWL_DIR=crawl-ecxi || $1
>> >>> >>>> URL_DIR=urls || $2
>> >>> >>>>
>> >>> >>>> echo $CRAWL_DIR
>> >>> >>>> echo $URL_DIR
>> >>> >>>>
>> >>> >>>> echo "Leaving safe mode..."
>> >>> >>>> ./hadoop dfsadmin -safemode leave
>> >>> >>>>
>> >>> >>>> echo "Removing seed urls directory and previous crawled
>> content..."
>> >>> >>>> ./hadoop dfs -rmr $URL_DIR
>> >>> >>>> ./hadoop dfs -rmr $CRAWL_DIR
>> >>> >>>>
>> >>> >>>> echo "Removing past logs"
>> >>> >>>>
>> >>> >>>> rm -rf ../logs/*
>> >>> >>>>
>> >>> >>>> echo "Uploading seed urls..."
>> >>> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>> >>> >>>>
>> >>> >>>> #echo "Entering safe mode..."
>> >>> >>>> #./hadoop dfsadmin -safemode enter
>> >>> >>>>
>> >>> >>>> echo "******************"
>> >>> >>>> echo "* STARTING CRAWL *"
>> >>> >>>> echo "******************"
>> >>> >>>>
>> >>> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> Next step I'm thinking on to fix the problem is to install
>> >>> >>>> nutch+hadoop as specified in this past nutch-user mail:
>> >>> >>>>
>> >>> >>>>
>> >>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10225.html
>> >>> >>>>
>> >>> >>>> As I don't know if it's current practice on trunk (archived mail
>> is
>> >>> >>>> from Wed, 02 Jan 2008), I wanted to ask if there's another way to
>> fix
>> >>> >>>> it or if it's being worked on by someone... I haven't found a
>> matching
>> >>> >>>> bug on JIRA :_/
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards
>> >> Alexander Aristov
>> >>
>> >
>>
>
>
>
> --
> Best Regards
> Alexander Aristov
>

Re: Distributed fetching only happening in one node ?

Posted by Alexander Aristov <al...@gmail.com>.
Still not clear.

What values for mapred.map.tasks and mapred.reduce.tasks do you have now?
Check the hadoop-site.xml file too, as it may also affect your configuration.
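
For reference, those two are per-job totals and normally live inside the
<configuration> element of hadoop-site.xml, roughly like this (the values
below are purely illustrative for a 7-node cluster, not a recommendation):

<property>
  <name>mapred.map.tasks</name>
  <value>14</value>
  <description>Default number of map tasks per job (a hint; the actual
  number also depends on how the input is split).</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
  <description>Default number of reduce tasks per job.</description>
</property>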

Alexander

2008/8/5 brainstorm <br...@gmail.com>

> Correction: Only 2 nodes doing map operation on fetch (nodes 7 and 2).
>
> On Tue, Aug 5, 2008 at 9:11 AM, brainstorm <br...@gmail.com> wrote:
> > Right, I've checked before with mapred.map.tasks to 2 and
> > mapred.reduce.tasks to 1.
> >
> > I've also played with several values on the following settings:
> >
> > <property>
> >  <name>fetcher.server.delay</name>
> >  <value>1.5</value>
> >  <description>The number of seconds the fetcher will delay between
> >   successive requests to the same server.</description>
> > </property>
> >
> > <property>
> >  <name>http.max.delays</name>
> >  <value>3</value>
> >  <description>The number of times a thread will delay when trying to
> >  fetch a page.  Each time it finds that a host is busy, it will wait
> >  fetcher.server.delay.  After http.max.delays attepts, it will give
> >  up on the page for now.</description>
> > </property>
> >
> > Only one node executes the fetch phase anyway :_(
> >
> > Thanks for the hint anyway... more ideas ?
> >
> > On Tue, Aug 5, 2008 at 8:04 AM, Alexander Aristov
> > <al...@gmail.com> wrote:
> >> Hi
> >>
> >> 1. You should have set
> >> mapred.map.tasks
> >> and
> >> mapred.reduce.tasks parameters They are set to 2 and 1 by default.
> >>
> >> 2. You can specify number of threads to perform fetching. Also there is
> a
> >> parameter that slows down fetching from one URL,so called polite
> fetching to
> >> not DOS the site.
> >>
> >>> So check your configuration.
> >>
> >> Alex
> >>
> >> 2008/8/5 brainstorm <br...@gmail.com>
> >>
> >>> Ok, DFS warnings problem solved, seems that hadoop-0.17.1 patch fixes
> >>> the warnings... BUT, on a 7-node nutch cluster:
> >>>
> >>> 1) Fetching is only happening on *one* node despite several values
> >>> tested on settings:
> >>> mapred.tasktracker.map.tasks.maximum
> >>> mapred.tasktracker.reduce.tasks.maximum
> >>> export HADOOP_HEAPSIZE
> >>>
> >>> I've played with mapreduce (hadoop-site.xml) settings as advised on:
> >>>
> >>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
> >>>
> >>> But nutch keeps crawling only using one node, instead of seven
> >>> nodes... anybody knows why ?
> >>>
> >>> I've had a look at the code, searching for:
> >>>
> >>> conf.setNumMapTasks(int num), but found none: so I guess that the
> >>> number of mappers & reducers are not limited programatically.
> >>>
> >>> 2) Even on a single node, the fetching is really slow: 1 url or page
> >>> per second, at most.
> >>>
> >>> Can anybody shed some light into this ? Pointing which class/code I
> >>> should look into to modify this behaviour will help also.
> >>>
> >>> Anybody has a distributed nutch crawling cluster working with all
> >>> nodes fetching at fetch phase ?
> >>>
> >>> I even did some numbers using wordcount example using 7 nodes at 100%
> >>> cpu usage using a 425MB parsedtext file:
> >>>
> >>> maps    reduces heapsize        time
> >>> 2       2       500     3m43.049s
> >>> 4       4       500     4m41.846s
> >>> 8       8       500     4m29.344s
> >>> 16      16      500     3m43.672s
> >>> 32      32      500     3m41.367s
> >>> 64      64      500     4m27.275s
> >>> 128     128     500     4m35.233s
> >>> 256     256     500     3m41.916s
> >>>
> >>>
> >>> 2       2       2000    4m31.434s
> >>> 4       4       2000
> >>> 8       8       2000
> >>> 16      16      2000    4m32.213s
> >>> 32      32      2000
> >>> 64      64      2000
> >>> 128     128     2000
> >>> 256     256     2000    4m38.310s
> >>>
> >>> Thanks in advance,
> >>> Roman
> >>>
> >>> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <br...@gmail.com>
> wrote:
> >>> > While seeing DFS wireshark trace (and the corresponding RST's), the
> >>> > crawl continued to next step... seems that this WARNING is actually
> >>> > slowing down the whole crawling process (it took 36 minutes to
> >>> > complete the previous fetch) with just 3 urls seed file :-!!!
> >>> >
> >>> > I just posted a couple of exceptions/questions regarding DFS on
> hadoop
> >>> > core mailing list.
> >>> >
> >>> > PD: As a side note, the following error caught my attention:
> >>> >
> >>> > Fetcher: starting
> >>> > Fetcher: segment: crawl-ecxi/segments/20080715172458
> >>> > Too many fetch-failures
> >>> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
> >>> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
> >>> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
> >>> > with: org.apache.nutch.protocol.http.api.HttpException:
> >>> > java.net.UnknownHostException: upc.cat
> >>> >
> >>> > Unknown host ?¿ Just try "http://upc.cat" on your browser, it *does*
> >>> > exist, it just gets redirected to www.upc.cat :-/
> >>> >
> >>> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <br...@gmail.com>
> wrote:
> >>> >> Yep, I know about wireshark, and wanted to avoid it to debug this
> >>> >> issue (perhaps there was a simple solution/known bug/issue)...
> >>> >>
> >>> >> I just launched wireshark on frontend with filter tcp.port == 50010,
> >>> >> and now I'm diving on the tcp stream... let's see if I see the light
> >>> >> (RST flag somewhere ?), thanks anyway for replying ;)
> >>> >>
> >>> >> Just for the record, the phase that stalls is fetcher during reduce:
> >>> >>
> >>> >> Jobid   User    Name    Map % Complete  Map Total       Maps
> Completed
> >>>  Reduce %
> >>> >> Complete        Reduce Total    Reduces Completed
> >>> >> job_200807151723_0005   hadoop  fetch
> crawl-ecxi/segments/20080715172458
> >>>        100.00%
> >>> >>        2       2       16.66%
> >>> >>
> >>> >>        1       0
> >>> >>
> >>> >> It's stuck on 16%, no traffic, no crawling, but still "running".
> >>> >>
> >>> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
> >>> >> <pm...@sim-gtech.com> wrote:
> >>> >>> Hi brain,
> >>> >>>        If I were you, I would download wireshark
> >>> >>> (http://www.wireshark.org/download.html) to see what is happening
> at
> >>> the
> >>> >>> network layer and see if that provides any clues.  A socket
> exception
> >>> >>> that you don't expect is usually due to one side of the
> conversation
> >>> not
> >>> >>> understanding the other side.  If you have 4 machines, then you
> have 4
> >>> >>> possible places where default firewall rules could be causing an
> issue.
> >>> >>> If it is not the firewall rules, the NAT rules could be a potential
> >>> >>> source of error.  Also, even a router hardware error could cause a
> >>> >>> problem.
> >>> >>>        If you understand TCP, just make sure that you see all the
> >>> >>> correct TCP stuff happening in wireshark.  If you don't understand
> >>> >>> wireshark's display, let me know, and I'll pass on some quickstart
> >>> >>> information.
> >>> >>>
> >>> >>>        If you already know all of this, I don't have any way to
> help
> >>> >>> you, as it looks like you're trying to accomplish something
> trickier
> >>> >>> with nutch than I have ever attempted.
> >>> >>>
> >>> >>> Patrick
> >>> >>>
> >>> >>> -----Original Message-----
> >>> >>> From: brainstorm [mailto:braincode@gmail.com]
> >>> >>> Sent: Tuesday, July 15, 2008 10:08 AM
> >>> >>> To: nutch-user@lucene.apache.org
> >>> >>> Subject: Re: Distributed fetching only happening in one node ?
> >>> >>>
> >>> >>> Boiling down the problem I'm stuck on this:
> >>> >>>
> >>> >>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
> >>> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
> >>> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
> >>> >>>        at
> >>> >>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
> >>> >>>        at
> >>> >>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >>> >>>        at
> >>> >>>
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> >>> >>>        at
> >>> >>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
> >>> >>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>> >>>        at
> >>> >>>
> >>>
> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
> >>> >>>        at
> >>> >>>
> >>>
> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
> >>> >>>        at
> >>> >>> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
> >>> >>>        at java.lang.Thread.run(Thread.java:595)
> >>> >>>
> >>> >>> Checked that firewall settings between node & frontend were not
> >>> >>> blocking packets, and they don't... anyone knows why is this ? If
> not,
> >>> >>> could you provide a convenient way to debug it ?
> >>> >>>
> >>> >>> Thanks !
> >>> >>>
> >>> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <br...@gmail.com>
> >>> wrote:
> >>> >>>> Hi,
> >>> >>>>
> >>> >>>> I'm running nutch+hadoop from trunk (rev) on a 4 machine rocks
> >>> >>>> cluster: 1 frontend doing NAT to 3 leaf nodes. I know it's not the
> >>> >>>> best suited network topology for inet crawling (frontend being a
> net
> >>> >>>> bottleneck), but I think it's fine for testing purposes.
> >>> >>>>
> >>> >>>> I'm having issues with fetch mapreduce job:
> >>> >>>>
> >>> >>>> According to ganglia monitoring (network traffic), and hadoop
> >>> >>>> administrative interfaces, fetch phase is only being executed in
> the
> >>> >>>> frontend node, where I launched "nutch crawl". Previous nutch
> phases
> >>> >>>> were executed neatly distributed on all nodes:
> >>> >>>>
> >>> >>>> job_200807131223_0001   hadoop  inject urls     100.00%
> >>> >>>>        2       2       100.00%
> >>> >>>>        1       1
> >>> >>>> job_200807131223_0002   hadoop  crawldb crawl-ecxi/crawldb
> >>> >>> 100.00%
> >>> >>>>        3       3       100.00%
> >>> >>>>        1       1
> >>> >>>> job_200807131223_0003   hadoop  generate: select
> >>> >>>> crawl-ecxi/segments/20080713123547      100.00%
> >>> >>>>        3       3       100.00%
> >>> >>>>        1       1
> >>> >>>> job_200807131223_0004   hadoop  generate: partition
> >>> >>>> crawl-ecxi/segments/20080713123547      100.00%
> >>> >>>>        4       4       100.00%
> >>> >>>>        2       2
> >>> >>>>
> >>> >>>> I've checked that:
> >>> >>>>
> >>> >>>> 1) Nodes have inet connectivity, firewall settings
> >>> >>>> 2) There's enough space on local discs
> >>> >>>> 3) Proper processes are running on nodes
> >>> >>>>
> >>> >>>> frontend-node:
> >>> >>>> ==========
> >>> >>>>
> >>> >>>> [root@cluster ~]# jps
> >>> >>>> 29232 NameNode
> >>> >>>> 29489 DataNode
> >>> >>>> 29860 JobTracker
> >>> >>>> 29778 SecondaryNameNode
> >>> >>>> 31122 Crawl
> >>> >>>> 30137 TaskTracker
> >>> >>>> 10989 Jps
> >>> >>>> 1818 TaskTracker$Child
> >>> >>>>
> >>> >>>> leaf nodes:
> >>> >>>> ========
> >>> >>>>
> >>> >>>> [root@cluster ~]# cluster-fork jps
> >>> >>>> compute-0-1:
> >>> >>>> 23929 Jps
> >>> >>>> 15568 TaskTracker
> >>> >>>> 15361 DataNode
> >>> >>>> compute-0-2:
> >>> >>>> 32272 TaskTracker
> >>> >>>> 32065 DataNode
> >>> >>>> 7197 Jps
> >>> >>>> 2397 TaskTracker$Child
> >>> >>>> compute-0-3:
> >>> >>>> 12054 DataNode
> >>> >>>> 19584 Jps
> >>> >>>> 14824 TaskTracker$Child
> >>> >>>> 12261 TaskTracker
> >>> >>>>
> >>> >>>> 4) Logs only show fetching process (taking place only in the head
> >>> >>> node):
> >>> >>>>
> >>> >>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
> >>> >>>> http://valleycycles.net/
> >>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> >>> >>>> robots.txt for http://www.getting-forward.org/:
> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> >>> >>>> robots.txt for http://www.getting-forward.org/:
> >>> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>> >>>>
> >>> >>>> What am I missing ? Why there are no fetching instances on nodes ?
> I
> >>> >>>> used the following custom script to launch a pristine crawl each
> time:
> >>> >>>>
> >>> >>>> #!/bin/sh
> >>> >>>>
> >>> >>>> # 1) Stops hadoop daemons
> >>> >>>> # 2) Overwrites new url list on HDFS
> >>> >>>> # 3) Starts hadoop daemons
> >>> >>>> # 4) Performs a clean crawl
> >>> >>>>
> >>> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
> >>> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
> >>> >>>>
> >>> >>>> CRAWL_DIR=crawl-ecxi || $1
> >>> >>>> URL_DIR=urls || $2
> >>> >>>>
> >>> >>>> echo $CRAWL_DIR
> >>> >>>> echo $URL_DIR
> >>> >>>>
> >>> >>>> echo "Leaving safe mode..."
> >>> >>>> ./hadoop dfsadmin -safemode leave
> >>> >>>>
> >>> >>>> echo "Removing seed urls directory and previous crawled
> content..."
> >>> >>>> ./hadoop dfs -rmr $URL_DIR
> >>> >>>> ./hadoop dfs -rmr $CRAWL_DIR
> >>> >>>>
> >>> >>>> echo "Removing past logs"
> >>> >>>>
> >>> >>>> rm -rf ../logs/*
> >>> >>>>
> >>> >>>> echo "Uploading seed urls..."
> >>> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
> >>> >>>>
> >>> >>>> #echo "Entering safe mode..."
> >>> >>>> #./hadoop dfsadmin -safemode enter
> >>> >>>>
> >>> >>>> echo "******************"
> >>> >>>> echo "* STARTING CRAWL *"
> >>> >>>> echo "******************"
> >>> >>>>
> >>> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
> >>> >>>>
> >>> >>>>
> >>> >>>> Next step I'm thinking on to fix the problem is to install
> >>> >>>> nutch+hadoop as specified in this past nutch-user mail:
> >>> >>>>
> >>> >>>>
> >>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10225.html
> >>> >>>>
> >>> >>>> As I don't know if it's current practice on trunk (archived mail
> is
> >>> >>>> from Wed, 02 Jan 2008), I wanted to ask if there's another way to
> fix
> >>> >>>> it or if it's being worked on by someone... I haven't found a
> matching
> >>> >>>> bug on JIRA :_/
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >
> >>>
> >>
> >>
> >>
> >> --
> >> Best Regards
> >> Alexander Aristov
> >>
> >
>



-- 
Best Regards
Alexander Aristov

Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
Correction: Only 2 nodes doing map operation on fetch (nodes 7 and 2).

On Tue, Aug 5, 2008 at 9:11 AM, brainstorm <br...@gmail.com> wrote:
> Right, I've checked before with mapred.map.tasks to 2 and
> mapred.reduce.tasks to 1.
>
> I've also played with several values on the following settings:
>
> <property>
>  <name>fetcher.server.delay</name>
>  <value>1.5</value>
>  <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
>
> <property>
>  <name>http.max.delays</name>
>  <value>3</value>
>  <description>The number of times a thread will delay when trying to
>  fetch a page.  Each time it finds that a host is busy, it will wait
>  fetcher.server.delay.  After http.max.delays attepts, it will give
>  up on the page for now.</description>
> </property>
>
> Only one node executes the fetch phase anyway :_(
>
> Thanks for the hint anyway... more ideas ?
>
> On Tue, Aug 5, 2008 at 8:04 AM, Alexander Aristov
> <al...@gmail.com> wrote:
>> Hi
>>
>> 1. You should have set
>> mapred.map.tasks
>> and
>> mapred.reduce.tasks parameters They are set to 2 and 1 by default.
>>
>> 2. You can specify number of threads to perform fetching. Also there is a
>> parameter that slows down fetching from one URL,so called polite fetching to
>> not DOS the site.
>>
>> So check your configuration.
>>
>> Alex
>>
>> 2008/8/5 brainstorm <br...@gmail.com>
>>
>>> Ok, DFS warnings problem solved, seems that hadoop-0.17.1 patch fixes
>>> the warnings... BUT, on a 7-node nutch cluster:
>>>
>>> 1) Fetching is only happening on *one* node despite several values
>>> tested on settings:
>>> mapred.tasktracker.map.tasks.maximum
>>> mapred.tasktracker.reduce.tasks.maximum
>>> export HADOOP_HEAPSIZE
>>>
>>> I've played with mapreduce (hadoop-site.xml) settings as advised on:
>>>
>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>
>>> But nutch keeps crawling only using one node, instead of seven
>>> nodes... anybody knows why ?
>>>
>>> I've had a look at the code, searching for:
>>>
>>> conf.setNumMapTasks(int num), but found none: so I guess that the
>>> number of mappers & reducers are not limited programatically.
>>>
>>> 2) Even on a single node, the fetching is really slow: 1 url or page
>>> per second, at most.
>>>
>>> Can anybody shed some light into this ? Pointing which class/code I
>>> should look into to modify this behaviour will help also.
>>>
>>> Anybody has a distributed nutch crawling cluster working with all
>>> nodes fetching at fetch phase ?
>>>
>>> I even did some numbers using wordcount example using 7 nodes at 100%
>>> cpu usage using a 425MB parsedtext file:
>>>
>>> maps    reduces heapsize        time
>>> 2       2       500     3m43.049s
>>> 4       4       500     4m41.846s
>>> 8       8       500     4m29.344s
>>> 16      16      500     3m43.672s
>>> 32      32      500     3m41.367s
>>> 64      64      500     4m27.275s
>>> 128     128     500     4m35.233s
>>> 256     256     500     3m41.916s
>>>
>>>
>>> 2       2       2000    4m31.434s
>>> 4       4       2000
>>> 8       8       2000
>>> 16      16      2000    4m32.213s
>>> 32      32      2000
>>> 64      64      2000
>>> 128     128     2000
>>> 256     256     2000    4m38.310s
>>>
>>> Thanks in advance,
>>> Roman
>>>
>>> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <br...@gmail.com> wrote:
>>> > While seeing DFS wireshark trace (and the corresponding RST's), the
>>> > crawl continued to next step... seems that this WARNING is actually
>>> > slowing down the whole crawling process (it took 36 minutes to
>>> > complete the previous fetch) with just 3 urls seed file :-!!!
>>> >
>>> > I just posted a couple of exceptions/questions regarding DFS on hadoop
>>> > core mailing list.
>>> >
>>> > PD: As a side note, the following error caught my attention:
>>> >
>>> > Fetcher: starting
>>> > Fetcher: segment: crawl-ecxi/segments/20080715172458
>>> > Too many fetch-failures
>>> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
>>> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
>>> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
>>> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
>>> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
>>> > with: org.apache.nutch.protocol.http.api.HttpException:
>>> > java.net.UnknownHostException: upc.cat
>>> >
>>> > Unknown host ?¿ Just try "http://upc.cat" on your browser, it *does*
>>> > exist, it just gets redirected to www.upc.cat :-/
>>> >
>>> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <br...@gmail.com> wrote:
>>> >> Yep, I know about wireshark, and wanted to avoid it to debug this
>>> >> issue (perhaps there was a simple solution/known bug/issue)...
>>> >>
>>> >> I just launched wireshark on frontend with filter tcp.port == 50010,
>>> >> and now I'm diving on the tcp stream... let's see if I see the light
>>> >> (RST flag somewhere ?), thanks anyway for replying ;)
>>> >>
>>> >> Just for the record, the phase that stalls is fetcher during reduce:
>>> >>
>>> >> Jobid   User    Name    Map % Complete  Map Total       Maps Completed
>>>  Reduce %
>>> >> Complete        Reduce Total    Reduces Completed
>>> >> job_200807151723_0005   hadoop  fetch crawl-ecxi/segments/20080715172458
>>>        100.00%
>>> >>        2       2       16.66%
>>> >>
>>> >>        1       0
>>> >>
>>> >> It's stuck on 16%, no traffic, no crawling, but still "running".
>>> >>
>>> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
>>> >> <pm...@sim-gtech.com> wrote:
>>> >>> Hi brain,
>>> >>>        If I were you, I would download wireshark
>>> >>> (http://www.wireshark.org/download.html) to see what is happening at
>>> the
>>> >>> network layer and see if that provides any clues.  A socket exception
>>> >>> that you don't expect is usually due to one side of the conversation
>>> not
>>> >>> understanding the other side.  If you have 4 machines, then you have 4
>>> >>> possible places where default firewall rules could be causing an issue.
>>> >>> If it is not the firewall rules, the NAT rules could be a potential
>>> >>> source of error.  Also, even a router hardware error could cause a
>>> >>> problem.
>>> >>>        If you understand TCP, just make sure that you see all the
>>> >>> correct TCP stuff happening in wireshark.  If you don't understand
>>> >>> wireshark's display, let me know, and I'll pass on some quickstart
>>> >>> information.
>>> >>>
>>> >>>        If you already know all of this, I don't have any way to help
>>> >>> you, as it looks like you're trying to accomplish something trickier
>>> >>> with nutch than I have ever attempted.
>>> >>>
>>> >>> Patrick
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: brainstorm [mailto:braincode@gmail.com]
>>> >>> Sent: Tuesday, July 15, 2008 10:08 AM
>>> >>> To: nutch-user@lucene.apache.org
>>> >>> Subject: Re: Distributed fetching only happening in one node ?
>>> >>>
>>> >>> Boiling down the problem I'm stuck on this:
>>> >>>
>>> >>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
>>> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>>> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>>> >>>        at
>>> >>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>>> >>>        at
>>> >>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>>> >>>        at
>>> >>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>>> >>>        at
>>> >>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>>> >>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>> >>>        at
>>> >>>
>>> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>>> >>>        at
>>> >>>
>>> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>>> >>>        at
>>> >>> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>>> >>>        at java.lang.Thread.run(Thread.java:595)
>>> >>>
>>> >>> Checked that firewall settings between node & frontend were not
>>> >>> blocking packets, and they don't... anyone knows why is this ? If not,
>>> >>> could you provide a convenient way to debug it ?
>>> >>>
>>> >>> Thanks !
>>> >>>
>>> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <br...@gmail.com>
>>> wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> I'm running nutch+hadoop from trunk (rev) on a 4 machine rocks
>>> >>>> cluster: 1 frontend doing NAT to 3 leaf nodes. I know it's not the
>>> >>>> best suited network topology for inet crawling (frontend being a net
>>> >>>> bottleneck), but I think it's fine for testing purposes.
>>> >>>>
>>> >>>> I'm having issues with fetch mapreduce job:
>>> >>>>
>>> >>>> According to ganglia monitoring (network traffic), and hadoop
>>> >>>> administrative interfaces, fetch phase is only being executed in the
>>> >>>> frontend node, where I launched "nutch crawl". Previous nutch phases
>>> >>>> were executed neatly distributed on all nodes:
>>> >>>>
>>> >>>> job_200807131223_0001   hadoop  inject urls     100.00%
>>> >>>>        2       2       100.00%
>>> >>>>        1       1
>>> >>>> job_200807131223_0002   hadoop  crawldb crawl-ecxi/crawldb
>>> >>> 100.00%
>>> >>>>        3       3       100.00%
>>> >>>>        1       1
>>> >>>> job_200807131223_0003   hadoop  generate: select
>>> >>>> crawl-ecxi/segments/20080713123547      100.00%
>>> >>>>        3       3       100.00%
>>> >>>>        1       1
>>> >>>> job_200807131223_0004   hadoop  generate: partition
>>> >>>> crawl-ecxi/segments/20080713123547      100.00%
>>> >>>>        4       4       100.00%
>>> >>>>        2       2
>>> >>>>
>>> >>>> I've checked that:
>>> >>>>
>>> >>>> 1) Nodes have inet connectivity, firewall settings
>>> >>>> 2) There's enough space on local discs
>>> >>>> 3) Proper processes are running on nodes
>>> >>>>
>>> >>>> frontend-node:
>>> >>>> ==========
>>> >>>>
>>> >>>> [root@cluster ~]# jps
>>> >>>> 29232 NameNode
>>> >>>> 29489 DataNode
>>> >>>> 29860 JobTracker
>>> >>>> 29778 SecondaryNameNode
>>> >>>> 31122 Crawl
>>> >>>> 30137 TaskTracker
>>> >>>> 10989 Jps
>>> >>>> 1818 TaskTracker$Child
>>> >>>>
>>> >>>> leaf nodes:
>>> >>>> ========
>>> >>>>
>>> >>>> [root@cluster ~]# cluster-fork jps
>>> >>>> compute-0-1:
>>> >>>> 23929 Jps
>>> >>>> 15568 TaskTracker
>>> >>>> 15361 DataNode
>>> >>>> compute-0-2:
>>> >>>> 32272 TaskTracker
>>> >>>> 32065 DataNode
>>> >>>> 7197 Jps
>>> >>>> 2397 TaskTracker$Child
>>> >>>> compute-0-3:
>>> >>>> 12054 DataNode
>>> >>>> 19584 Jps
>>> >>>> 14824 TaskTracker$Child
>>> >>>> 12261 TaskTracker
>>> >>>>
>>> >>>> 4) Logs only show fetching process (taking place only in the head
>>> >>> node):
>>> >>>>
>>> >>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
>>> >>>> http://valleycycles.net/
>>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>> >>>> robots.txt for http://www.getting-forward.org/:
>>> >>>> java.net.UnknownHostException: www.getting-forward.org
>>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>>> >>>> robots.txt for http://www.getting-forward.org/:
>>> >>>> java.net.UnknownHostException: www.getting-forward.org
>>> >>>>
>>> >>>> What am I missing ? Why there are no fetching instances on nodes ? I
>>> >>>> used the following custom script to launch a pristine crawl each time:
>>> >>>>
>>> >>>> #!/bin/sh
>>> >>>>
>>> >>>> # 1) Stops hadoop daemons
>>> >>>> # 2) Overwrites new url list on HDFS
>>> >>>> # 3) Starts hadoop daemons
>>> >>>> # 4) Performs a clean crawl
>>> >>>>
>>> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>>> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>>> >>>>
>>> >>>> CRAWL_DIR=crawl-ecxi || $1
>>> >>>> URL_DIR=urls || $2
>>> >>>>
>>> >>>> echo $CRAWL_DIR
>>> >>>> echo $URL_DIR
>>> >>>>
>>> >>>> echo "Leaving safe mode..."
>>> >>>> ./hadoop dfsadmin -safemode leave
>>> >>>>
>>> >>>> echo "Removing seed urls directory and previous crawled content..."
>>> >>>> ./hadoop dfs -rmr $URL_DIR
>>> >>>> ./hadoop dfs -rmr $CRAWL_DIR
>>> >>>>
>>> >>>> echo "Removing past logs"
>>> >>>>
>>> >>>> rm -rf ../logs/*
>>> >>>>
>>> >>>> echo "Uploading seed urls..."
>>> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>>> >>>>
>>> >>>> #echo "Entering safe mode..."
>>> >>>> #./hadoop dfsadmin -safemode enter
>>> >>>>
>>> >>>> echo "******************"
>>> >>>> echo "* STARTING CRAWL *"
>>> >>>> echo "******************"
>>> >>>>
>>> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>>> >>>>
>>> >>>>
>>> >>>> Next step I'm thinking on to fix the problem is to install
>>> >>>> nutch+hadoop as specified in this past nutch-user mail:
>>> >>>>
>>> >>>>
>>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10225.html
>>> >>>>
>>> >>>> As I don't know if it's current practice on trunk (archived mail is
>>> >>>> from Wed, 02 Jan 2008), I wanted to ask if there's another way to fix
>>> >>>> it or if it's being worked on by someone... I haven't found a matching
>>> >>>> bug on JIRA :_/
>>> >>>>
>>> >>>
>>> >>
>>> >
>>>
>>
>>
>>
>> --
>> Best Regards
>> Alexander Aristov
>>
>

Re: Distributed fetching only happening in one node ?

Posted by brainstorm <br...@gmail.com>.
Right, I had already checked with mapred.map.tasks set to 2 and
mapred.reduce.tasks set to 1.

I've also played with several values on the following settings:

<property>
  <name>fetcher.server.delay</name>
  <value>1.5</value>
  <description>The number of seconds the fetcher will delay between
   successive requests to the same server.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

Only one node executes the fetch phase anyway :_(
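
If I understand the code right, the fetch job gets one map task per fetch
list file produced by the generate step (the "generate: partition" job), so
if only one or two fetch lists were generated, only one or two nodes can
ever fetch, whatever the per-node tasktracker maximums say -- maybe that is
what limits it here? On the speed side, with only a handful of seed hosts,
polite fetching serializes requests to each host, which by itself could
explain a rate of about one page per second. For reference, these are the
thread-related knobs I am looking at (names as they appear in
nutch-default.xml if I recall them correctly; values are the defaults):

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>Total number of fetcher threads used by each fetch map
  task.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum number of threads fetching from the same host at
  once; together with fetcher.server.delay this is what keeps the fetcher
  polite.</description>
</property>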

Thanks for the hint anyway... more ideas ?

On Tue, Aug 5, 2008 at 8:04 AM, Alexander Aristov
<al...@gmail.com> wrote:
> Hi
>
> 1. You should have set
> mapred.map.tasks
> and
> mapred.reduce.tasks parameters They are set to 2 and 1 by default.
>
> 2. You can specify number of threads to perform fetching. Also there is a
> parameter that slows down fetching from one URL,so called polite fetching to
> not DOS the site.
>
> So check your configuration.
>
> Alex
>
> 2008/8/5 brainstorm <br...@gmail.com>
>
>> Ok, DFS warnings problem solved, seems that hadoop-0.17.1 patch fixes
>> the warnings... BUT, on a 7-node nutch cluster:
>>
>> 1) Fetching is only happening on *one* node despite several values
>> tested on settings:
>> mapred.tasktracker.map.tasks.maximum
>> mapred.tasktracker.reduce.tasks.maximum
>> export HADOOP_HEAPSIZE
>>
>> I've played with mapreduce (hadoop-site.xml) settings as advised on:
>>
>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>
>> But nutch keeps crawling only using one node, instead of seven
>> nodes... anybody knows why ?
>>
>> I've had a look at the code, searching for:
>>
>> conf.setNumMapTasks(int num), but found none: so I guess that the
>> number of mappers & reducers are not limited programatically.
>>
>> 2) Even on a single node, the fetching is really slow: 1 url or page
>> per second, at most.
>>
>> Can anybody shed some light into this ? Pointing which class/code I
>> should look into to modify this behaviour will help also.
>>
>> Anybody has a distributed nutch crawling cluster working with all
>> nodes fetching at fetch phase ?
>>
>> I even did some numbers using wordcount example using 7 nodes at 100%
>> cpu usage using a 425MB parsedtext file:
>>
>> maps    reduces heapsize        time
>> 2       2       500     3m43.049s
>> 4       4       500     4m41.846s
>> 8       8       500     4m29.344s
>> 16      16      500     3m43.672s
>> 32      32      500     3m41.367s
>> 64      64      500     4m27.275s
>> 128     128     500     4m35.233s
>> 256     256     500     3m41.916s
>>
>>
>> 2       2       2000    4m31.434s
>> 4       4       2000
>> 8       8       2000
>> 16      16      2000    4m32.213s
>> 32      32      2000
>> 64      64      2000
>> 128     128     2000
>> 256     256     2000    4m38.310s
>>
>> Thanks in advance,
>> Roman
>>
>> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <br...@gmail.com> wrote:
>> > While seeing DFS wireshark trace (and the corresponding RST's), the
>> > crawl continued to next step... seems that this WARNING is actually
>> > slowing down the whole crawling process (it took 36 minutes to
>> > complete the previous fetch) with just 3 urls seed file :-!!!
>> >
>> > I just posted a couple of exceptions/questions regarding DFS on hadoop
>> > core mailing list.
>> >
>> > PD: As a side note, the following error caught my attention:
>> >
>> > Fetcher: starting
>> > Fetcher: segment: crawl-ecxi/segments/20080715172458
>> > Too many fetch-failures
>> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
>> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
>> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
>> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
>> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
>> > with: org.apache.nutch.protocol.http.api.HttpException:
>> > java.net.UnknownHostException: upc.cat
>> >
>> > Unknown host ?¿ Just try "http://upc.cat" on your browser, it *does*
>> > exist, it just gets redirected to www.upc.cat :-/
>> >
>> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <br...@gmail.com> wrote:
>> >> Yep, I know about wireshark, and wanted to avoid it to debug this
>> >> issue (perhaps there was a simple solution/known bug/issue)...
>> >>
>> >> I just launched wireshark on frontend with filter tcp.port == 50010,
>> >> and now I'm diving on the tcp stream... let's see if I see the light
>> >> (RST flag somewhere ?), thanks anyway for replying ;)
>> >>
>> >> Just for the record, the phase that stalls is fetcher during reduce:
>> >>
>> >> Jobid   User    Name    Map % Complete  Map Total       Maps Completed
>>  Reduce %
>> >> Complete        Reduce Total    Reduces Completed
>> >> job_200807151723_0005   hadoop  fetch crawl-ecxi/segments/20080715172458
>>        100.00%
>> >>        2       2       16.66%
>> >>
>> >>        1       0
>> >>
>> >> It's stuck on 16%, no traffic, no crawling, but still "running".
>> >>
>> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
>> >> <pm...@sim-gtech.com> wrote:
>> >>> Hi brain,
>> >>>        If I were you, I would download wireshark
>> >>> (http://www.wireshark.org/download.html) to see what is happening at
>> the
>> >>> network layer and see if that provides any clues.  A socket exception
>> >>> that you don't expect is usually due to one side of the conversation
>> not
>> >>> understanding the other side.  If you have 4 machines, then you have 4
>> >>> possible places where default firewall rules could be causing an issue.
>> >>> If it is not the firewall rules, the NAT rules could be a potential
>> >>> source of error.  Also, even a router hardware error could cause a
>> >>> problem.
>> >>>        If you understand TCP, just make sure that you see all the
>> >>> correct TCP stuff happening in wireshark.  If you don't understand
>> >>> wireshark's display, let me know, and I'll pass on some quickstart
>> >>> information.
>> >>>
>> >>>        If you already know all of this, I don't have any way to help
>> >>> you, as it looks like you're trying to accomplish something trickier
>> >>> with nutch than I have ever attempted.
>> >>>
>> >>> Patrick
>> >>>
>> >>> -----Original Message-----
>> >>> From: brainstorm [mailto:braincode@gmail.com]
>> >>> Sent: Tuesday, July 15, 2008 10:08 AM
>> >>> To: nutch-user@lucene.apache.org
>> >>> Subject: Re: Distributed fetching only happening in one node ?
>> >>>
>> >>> Boiling down the problem I'm stuck on this:
>> >>>
>> >>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
>> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
>> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
>> >>>        at
>> >>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
>> >>>        at
>> >>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>> >>>        at
>> >>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>> >>>        at
>> >>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>> >>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
>> >>>        at
>> >>>
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
>> >>>        at
>> >>>
>> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
>> >>>        at
>> >>> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
>> >>>        at java.lang.Thread.run(Thread.java:595)
>> >>>
>> >>> Checked that firewall settings between node & frontend were not
>> >>> blocking packets, and they don't... anyone knows why is this ? If not,
>> >>> could you provide a convenient way to debug it ?
>> >>>
>> >>> Thanks !
>> >>>
>> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <br...@gmail.com>
>> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I'm running nutch+hadoop from trunk (rev) on a 4 machine rocks
>> >>>> cluster: 1 frontend doing NAT to 3 leaf nodes. I know it's not the
>> >>>> best suited network topology for inet crawling (frontend being a net
>> >>>> bottleneck), but I think it's fine for testing purposes.
>> >>>>
>> >>>> I'm having issues with fetch mapreduce job:
>> >>>>
>> >>>> According to ganglia monitoring (network traffic), and hadoop
>> >>>> administrative interfaces, fetch phase is only being executed in the
>> >>>> frontend node, where I launched "nutch crawl". Previous nutch phases
>> >>>> were executed neatly distributed on all nodes:
>> >>>>
>> >>>> job_200807131223_0001   hadoop  inject urls     100.00%
>> >>>>        2       2       100.00%
>> >>>>        1       1
>> >>>> job_200807131223_0002   hadoop  crawldb crawl-ecxi/crawldb
>> >>> 100.00%
>> >>>>        3       3       100.00%
>> >>>>        1       1
>> >>>> job_200807131223_0003   hadoop  generate: select
>> >>>> crawl-ecxi/segments/20080713123547      100.00%
>> >>>>        3       3       100.00%
>> >>>>        1       1
>> >>>> job_200807131223_0004   hadoop  generate: partition
>> >>>> crawl-ecxi/segments/20080713123547      100.00%
>> >>>>        4       4       100.00%
>> >>>>        2       2
>> >>>>
>> >>>> I've checked that:
>> >>>>
>> >>>> 1) Nodes have inet connectivity, firewall settings
>> >>>> 2) There's enough space on local discs
>> >>>> 3) Proper processes are running on nodes
>> >>>>
>> >>>> frontend-node:
>> >>>> ==========
>> >>>>
>> >>>> [root@cluster ~]# jps
>> >>>> 29232 NameNode
>> >>>> 29489 DataNode
>> >>>> 29860 JobTracker
>> >>>> 29778 SecondaryNameNode
>> >>>> 31122 Crawl
>> >>>> 30137 TaskTracker
>> >>>> 10989 Jps
>> >>>> 1818 TaskTracker$Child
>> >>>>
>> >>>> leaf nodes:
>> >>>> ========
>> >>>>
>> >>>> [root@cluster ~]# cluster-fork jps
>> >>>> compute-0-1:
>> >>>> 23929 Jps
>> >>>> 15568 TaskTracker
>> >>>> 15361 DataNode
>> >>>> compute-0-2:
>> >>>> 32272 TaskTracker
>> >>>> 32065 DataNode
>> >>>> 7197 Jps
>> >>>> 2397 TaskTracker$Child
>> >>>> compute-0-3:
>> >>>> 12054 DataNode
>> >>>> 19584 Jps
>> >>>> 14824 TaskTracker$Child
>> >>>> 12261 TaskTracker
>> >>>>
>> >>>> 4) Logs only show fetching process (taking place only in the head
>> >>> node):
>> >>>>
>> >>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
>> >>>> http://valleycycles.net/
>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
>> >>>> robots.txt for http://www.getting-forward.org/:
>> >>>> java.net.UnknownHostException: www.getting-forward.org
>> >>>>
>> >>>> What am I missing ? Why there are no fetching instances on nodes ? I
>> >>>> used the following custom script to launch a pristine crawl each time:
>> >>>>
>> >>>> #!/bin/sh
>> >>>>
>> >>>> # 1) Stops hadoop daemons
>> >>>> # 2) Overwrites new url list on HDFS
>> >>>> # 3) Starts hadoop daemons
>> >>>> # 4) Performs a clean crawl
>> >>>>
>> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
>> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
>> >>>>
>> >>>> CRAWL_DIR=crawl-ecxi || $1
>> >>>> URL_DIR=urls || $2
>> >>>>
>> >>>> echo $CRAWL_DIR
>> >>>> echo $URL_DIR
>> >>>>
>> >>>> echo "Leaving safe mode..."
>> >>>> ./hadoop dfsadmin -safemode leave
>> >>>>
>> >>>> echo "Removing seed urls directory and previous crawled content..."
>> >>>> ./hadoop dfs -rmr $URL_DIR
>> >>>> ./hadoop dfs -rmr $CRAWL_DIR
>> >>>>
>> >>>> echo "Removing past logs"
>> >>>>
>> >>>> rm -rf ../logs/*
>> >>>>
>> >>>> echo "Uploading seed urls..."
>> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
>> >>>>
>> >>>> #echo "Entering safe mode..."
>> >>>> #./hadoop dfsadmin -safemode enter
>> >>>>
>> >>>> echo "******************"
>> >>>> echo "* STARTING CRAWL *"
>> >>>> echo "******************"
>> >>>>
>> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
>> >>>>
>> >>>>
>> >>>> Next step I'm thinking on to fix the problem is to install
>> >>>> nutch+hadoop as specified in this past nutch-user mail:
>> >>>>
>> >>>>
>> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10225.html
>> >>>>
>> >>>> As I don't know if it's current practice on trunk (archived mail is
>> >>>> from Wed, 02 Jan 2008), I wanted to ask if there's another way to fix
>> >>>> it or if it's being worked on by someone... I haven't found a matching
>> >>>> bug on JIRA :_/
>> >>>>
>> >>>
>> >>
>> >
>>
>
>
>
> --
> Best Regards
> Alexander Aristov
>

Re: Distributed fetching only happening in one node ?

Posted by Alexander Aristov <al...@gmail.com>.
Hi

1. You should have set the mapred.map.tasks and mapred.reduce.tasks
parameters. They are set to 2 and 1 by default.

2. You can specify the number of threads to perform fetching. There is also a
parameter that slows down successive fetches from the same server (so-called
polite fetching), so that the crawler does not DoS the site.

So check your configuration.
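
One more distinction that may help: the mapred.tasktracker.*.maximum settings
tried earlier in the thread only cap how many tasks each individual node runs
concurrently; they do not create more tasks for the job. In hadoop-site.xml
they would look something like this (illustrative values for dual-core nodes):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>Per-tasktracker limit on concurrently running map tasks;
  raising it does not increase the number of map tasks in a job.</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>Per-tasktracker limit on concurrently running reduce
  tasks.</description>
</property>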

Alex

2008/8/5 brainstorm <br...@gmail.com>

> Ok, DFS warnings problem solved, seems that hadoop-0.17.1 patch fixes
> the warnings... BUT, on a 7-node nutch cluster:
>
> 1) Fetching is only happening on *one* node despite several values
> tested on settings:
> mapred.tasktracker.map.tasks.maximum
> mapred.tasktracker.reduce.tasks.maximum
> export HADOOP_HEAPSIZE
>
> I've played with mapreduce (hadoop-site.xml) settings as advised on:
>
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>
> But nutch keeps crawling only using one node, instead of seven
> nodes... anybody knows why ?
>
> I've had a look at the code, searching for:
>
> conf.setNumMapTasks(int num), but found none: so I guess that the
> number of mappers & reducers are not limited programatically.
>
> 2) Even on a single node, the fetching is really slow: 1 url or page
> per second, at most.
>
> Can anybody shed some light into this ? Pointing which class/code I
> should look into to modify this behaviour will help also.
>
> Anybody has a distributed nutch crawling cluster working with all
> nodes fetching at fetch phase ?
>
> I even did some numbers using wordcount example using 7 nodes at 100%
> cpu usage using a 425MB parsedtext file:
>
> maps    reduces heapsize        time
> 2       2       500     3m43.049s
> 4       4       500     4m41.846s
> 8       8       500     4m29.344s
> 16      16      500     3m43.672s
> 32      32      500     3m41.367s
> 64      64      500     4m27.275s
> 128     128     500     4m35.233s
> 256     256     500     3m41.916s
>
>
> 2       2       2000    4m31.434s
> 4       4       2000
> 8       8       2000
> 16      16      2000    4m32.213s
> 32      32      2000
> 64      64      2000
> 128     128     2000
> 256     256     2000    4m38.310s
>
> Thanks in advance,
> Roman
>
> On Tue, Jul 15, 2008 at 7:15 PM, brainstorm <br...@gmail.com> wrote:
> > While seeing DFS wireshark trace (and the corresponding RST's), the
> > crawl continued to next step... seems that this WARNING is actually
> > slowing down the whole crawling process (it took 36 minutes to
> > complete the previous fetch) with just 3 urls seed file :-!!!
> >
> > I just posted a couple of exceptions/questions regarding DFS on hadoop
> > core mailing list.
> >
> > PD: As a side note, the following error caught my attention:
> >
> > Fetcher: starting
> > Fetcher: segment: crawl-ecxi/segments/20080715172458
> > Too many fetch-failures
> > task_200807151723_0005_m_000000_0: Fetcher: threads: 10
> > task_200807151723_0005_m_000000_0: fetching http://upc.es/
> > task_200807151723_0005_m_000000_0: fetching http://upc.edu/
> > task_200807151723_0005_m_000000_0: fetching http://upc.cat/
> > task_200807151723_0005_m_000000_0: fetch of http://upc.cat/ failed
> > with: org.apache.nutch.protocol.http.api.HttpException:
> > java.net.UnknownHostException: upc.cat
> >
> > Unknown host ?¿ Just try "http://upc.cat" on your browser, it *does*
> > exist, it just gets redirected to www.upc.cat :-/
> >
> > On Tue, Jul 15, 2008 at 5:42 PM, brainstorm <br...@gmail.com> wrote:
> >> Yep, I know about wireshark, and wanted to avoid it to debug this
> >> issue (perhaps there was a simple solution/known bug/issue)...
> >>
> >> I just launched wireshark on frontend with filter tcp.port == 50010,
> >> and now I'm diving on the tcp stream... let's see if I see the light
> >> (RST flag somewhere ?), thanks anyway for replying ;)
> >>
> >> Just for the record, the phase that stalls is fetcher during reduce:
> >>
> >> Jobid   User    Name    Map % Complete  Map Total       Maps Completed
>  Reduce %
> >> Complete        Reduce Total    Reduces Completed
> >> job_200807151723_0005   hadoop  fetch crawl-ecxi/segments/20080715172458
>        100.00%
> >>        2       2       16.66%
> >>
> >>        1       0
> >>
> >> It's stuck on 16%, no traffic, no crawling, but still "running".
> >>
> >> On Tue, Jul 15, 2008 at 4:28 PM, Patrick Markiewicz
> >> <pm...@sim-gtech.com> wrote:
> >>> Hi brain,
> >>>        If I were you, I would download wireshark
> >>> (http://www.wireshark.org/download.html) to see what is happening at
> the
> >>> network layer and see if that provides any clues.  A socket exception
> >>> that you don't expect is usually due to one side of the conversation
> not
> >>> understanding the other side.  If you have 4 machines, then you have 4
> >>> possible places where default firewall rules could be causing an issue.
> >>> If it is not the firewall rules, the NAT rules could be a potential
> >>> source of error.  Also, even a router hardware error could cause a
> >>> problem.
> >>>        If you understand TCP, just make sure that you see all the
> >>> correct TCP stuff happening in wireshark.  If you don't understand
> >>> wireshark's display, let me know, and I'll pass on some quickstart
> >>> information.
> >>>
> >>>        If you already know all of this, I don't have any way to help
> >>> you, as it looks like you're trying to accomplish something trickier
> >>> with nutch than I have ever attempted.
> >>>
> >>> Patrick
> >>>
> >>> -----Original Message-----
> >>> From: brainstorm [mailto:braincode@gmail.com]
> >>> Sent: Tuesday, July 15, 2008 10:08 AM
> >>> To: nutch-user@lucene.apache.org
> >>> Subject: Re: Distributed fetching only happening in one node ?
> >>>
> >>> Boiling down the problem I'm stuck on this:
> >>>
> >>> 2008-07-14 16:43:24,976 WARN  dfs.DataNode -
> >>> 192.168.0.100:50010:Failed to transfer blk_-855404545666908011 to
> >>> 192.168.0.252:50010 got java.net.SocketException: Connection reset
> >>>        at
> >>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
> >>>        at
> >>> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >>>        at
> >>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> >>>        at
> >>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
> >>>        at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>>        at
> >>>
> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1602)
> >>>        at
> >>>
> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1636)
> >>>        at
> >>> org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2391)
> >>>        at java.lang.Thread.run(Thread.java:595)
> >>>
> >>> Checked that firewall settings between node & frontend were not
> >>> blocking packets, and they don't... anyone knows why is this ? If not,
> >>> could you provide a convenient way to debug it ?
> >>>
> >>> Thanks !
> >>>
> >>> On Sun, Jul 13, 2008 at 3:41 PM, brainstorm <br...@gmail.com>
> wrote:
> >>>> Hi,
> >>>>
> >>>> I'm running nutch+hadoop from trunk (rev) on a 4 machine rocks
> >>>> cluster: 1 frontend doing NAT to 3 leaf nodes. I know it's not the
> >>>> best suited network topology for inet crawling (frontend being a net
> >>>> bottleneck), but I think it's fine for testing purposes.
> >>>>
> >>>> I'm having issues with fetch mapreduce job:
> >>>>
> >>>> According to ganglia monitoring (network traffic), and hadoop
> >>>> administrative interfaces, fetch phase is only being executed in the
> >>>> frontend node, where I launched "nutch crawl". Previous nutch phases
> >>>> were executed neatly distributed on all nodes:
> >>>>
> >>>> job_200807131223_0001   hadoop  inject urls     100.00%
> >>>>        2       2       100.00%
> >>>>        1       1
> >>>> job_200807131223_0002   hadoop  crawldb crawl-ecxi/crawldb
> >>> 100.00%
> >>>>        3       3       100.00%
> >>>>        1       1
> >>>> job_200807131223_0003   hadoop  generate: select
> >>>> crawl-ecxi/segments/20080713123547      100.00%
> >>>>        3       3       100.00%
> >>>>        1       1
> >>>> job_200807131223_0004   hadoop  generate: partition
> >>>> crawl-ecxi/segments/20080713123547      100.00%
> >>>>        4       4       100.00%
> >>>>        2       2
> >>>>
> >>>> I've checked that:
> >>>>
> >>>> 1) Nodes have inet connectivity, firewall settings
> >>>> 2) There's enough space on local discs
> >>>> 3) Proper processes are running on nodes
> >>>>
> >>>> frontend-node:
> >>>> ==========
> >>>>
> >>>> [root@cluster ~]# jps
> >>>> 29232 NameNode
> >>>> 29489 DataNode
> >>>> 29860 JobTracker
> >>>> 29778 SecondaryNameNode
> >>>> 31122 Crawl
> >>>> 30137 TaskTracker
> >>>> 10989 Jps
> >>>> 1818 TaskTracker$Child
> >>>>
> >>>> leaf nodes:
> >>>> ========
> >>>>
> >>>> [root@cluster ~]# cluster-fork jps
> >>>> compute-0-1:
> >>>> 23929 Jps
> >>>> 15568 TaskTracker
> >>>> 15361 DataNode
> >>>> compute-0-2:
> >>>> 32272 TaskTracker
> >>>> 32065 DataNode
> >>>> 7197 Jps
> >>>> 2397 TaskTracker$Child
> >>>> compute-0-3:
> >>>> 12054 DataNode
> >>>> 19584 Jps
> >>>> 14824 TaskTracker$Child
> >>>> 12261 TaskTracker
> >>>>
> >>>> 4) Logs only show fetching process (taking place only in the head
> >>> node):
> >>>>
> >>>> 2008-07-13 13:33:22,306 INFO  fetcher.Fetcher - fetching
> >>>> http://valleycycles.net/
> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> >>>> robots.txt for http://www.getting-forward.org/:
> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>>> 2008-07-13 13:33:22,349 INFO  api.RobotRulesParser - Couldn't get
> >>>> robots.txt for http://www.getting-forward.org/:
> >>>> java.net.UnknownHostException: www.getting-forward.org
> >>>>
> >>>> What am I missing ? Why there are no fetching instances on nodes ? I
> >>>> used the following custom script to launch a pristine crawl each time:
> >>>>
> >>>> #!/bin/sh
> >>>>
> >>>> # 1) Stops hadoop daemons
> >>>> # 2) Overwrites new url list on HDFS
> >>>> # 3) Starts hadoop daemons
> >>>> # 4) Performs a clean crawl
> >>>>
> >>>> #export JAVA_HOME=/usr/lib/jvm/java-6-sun
> >>>> export JAVA_HOME=/usr/java/jdk1.5.0_10
> >>>>
> >>>> CRAWL_DIR=crawl-ecxi || $1
> >>>> URL_DIR=urls || $2
> >>>>
> >>>> echo $CRAWL_DIR
> >>>> echo $URL_DIR
> >>>>
> >>>> echo "Leaving safe mode..."
> >>>> ./hadoop dfsadmin -safemode leave
> >>>>
> >>>> echo "Removing seed urls directory and previous crawled content..."
> >>>> ./hadoop dfs -rmr $URL_DIR
> >>>> ./hadoop dfs -rmr $CRAWL_DIR
> >>>>
> >>>> echo "Removing past logs"
> >>>>
> >>>> rm -rf ../logs/*
> >>>>
> >>>> echo "Uploading seed urls..."
> >>>> ./hadoop dfs -put ../$URL_DIR $URL_DIR
> >>>>
> >>>> #echo "Entering safe mode..."
> >>>> #./hadoop dfsadmin -safemode enter
> >>>>
> >>>> echo "******************"
> >>>> echo "* STARTING CRAWL *"
> >>>> echo "******************"
> >>>>
> >>>> ./nutch crawl $URL_DIR -dir $CRAWL_DIR -depth 3
> >>>>
> >>>>
> >>>> Next step I'm thinking on to fix the problem is to install
> >>>> nutch+hadoop as specified in this past nutch-user mail:
> >>>>
> >>>>
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg10225.html
> >>>>
> >>>> As I don't know if it's current practice on trunk (archived mail is
> >>>> from Wed, 02 Jan 2008), I wanted to ask if there's another way to fix
> >>>> it or if it's being worked on by someone... I haven't found a matching
> >>>> bug on JIRA :_/
> >>>>
> >>>
> >>
> >
>



-- 
Best Regards
Alexander Aristov