Posted to user@nutch.apache.org by brainstorm <br...@gmail.com> on 2008/08/15 00:19:10 UTC

[SOLVED] Re: Distributed fetching only happening in one node ?

At last, fixed, thanks to Andrzej! Fellow nutchers, please review
your hadoop-site.xml file, especially these settings:

mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum

You should set these values to something like 4 maps and 2 reduces.

*and*

mapred.map.tasks
mapred.reduce.tasks

You should set these values to something like 23 maps and 13 reduces.

Assuming you have an 8-node cluster ;)
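
For reference, a minimal hadoop-site.xml sketch with the values above
(just a sketch -- adjust the numbers to your own cluster and hardware):

  <!-- Per-node limits: how many tasks each tasktracker runs at once -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>

  <!-- Per-job totals: how many map/reduce tasks each job is split into -->
  <property>
    <name>mapred.map.tasks</name>
    <value>23</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>13</value>
  </property>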

Regards,
Roman

On Thu, Aug 14, 2008 at 9:43 PM, brainstorm <br...@gmail.com> wrote:
> Sorry for the late reply... summer power outages in the building
> prevented me from running more tests on the cluster; now I'm back
> online... replying below.
>
> On Mon, Aug 11, 2008 at 5:59 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>> brainstorm wrote:
>>>
>>> On Mon, Aug 11, 2008 at 12:04 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
>>>>
>>>> brainstorm wrote:
>>>>
>>>>> This is one example crawled segment:
>>>>>
>>>>> /user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000
>>>>>
>>>>> As you can see, just one part-NNNNN file is generated... in the conf
>>>>> file (nutch-site.xml), mapred.map.tasks is set to 2 (the default
>>>>> value, as suggested in previous emails).
>>>>
>>>> First of all - for a 7-node cluster, mapred.map.tasks should be set to
>>>> at least something around 23 or 31, or even higher, and the number of
>>>> reduce tasks to e.g. 11.
>>>
>>>
>>>
>>> I see, now it makes more sense to me than just assigning 2 maps by
>>> default as suggested before... then, according to:
>>>
>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>
>>> Maps:
>>>
>>> Given:
>>> 64MB DFS blocks
>>> 500MB RAM per node
>>> 500MB for the hadoop-env.sh HEAPSIZE variable (otherwise
>>> OutOfMemoryError: Java heap space exceptions occur)
>>>
>>> 31 maps... we'll see if it works. It would be cool to have a more
>>> precise "formula" to calculate this number in the Nutch case. I assume
>>> that "23 to 31 or higher" was determined empirically by you: thanks for
>>> sharing your knowledge!
>>
>> That's already described on the Wiki page that you mention above ...
>>
>>
>>> Reduces:
>>> 1.75 * (nodes * mapred.tasktracker.tasks.maximum) = ceil(1.75 * 7 * 11) =
>>> 135
>>>
>>> Is this number the total number of reduces running across the cluster
>>> nodes?
>>
>> Hmm .. did you actually try running 11 simultaneous reduce tasks on each
>> node? It very much depends on the CPU, the amount of available RAM and the
>> heapsize of each task (mapred.child.java.opts). My experience is that it
>> takes beefy hardware to run more than ~4-5 reduce tasks per node - load
>> avg is above 20, CPU is pegged at 100% and disks are thrashing. YMMV of
>> course.
>>
>> Regarding the number - what you calculated is the upper bound of all
>> possible simultaneous tasks, assuming you have 7 nodes and each will run 11
>> tasks at the same time. This is not what I meant - I meant that you should
>> set the total number of reduces to 11 or so. What that page doesn't discuss
>> is that there is also some cost in job startup / finish, so there is a sweet
>> spot number somewhere that fits your current data size and your current
>> cluster. In other words, it's better not to run too many reduces, just the
>> right number so that individual sort operations run quickly, and tasks
>> occupy most of the available slots.
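
(A sketch of how the settings Andrzej describes might look in
hadoop-site.xml; the -Xmx512m heap for mapred.child.java.opts is an
assumed example value, not something quoted in this thread:)

  <!-- Run at most 4 reduce tasks concurrently on each node -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>

  <!-- Heap for each child task JVM; 512m is an assumed example -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>

  <!-- Total number of reduces for the whole job, not per node -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>11</value>
  </property>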
>>
>>
>>> In conclusion, as you predicted (and if the script is not horribly
>>> broken), the non-DMOZ sample is quite homogeneous (there are lots of
>>> URLs coming from auto-generated ad sites, for instance)... add to that
>>> the fact that *a lot* of them lead to "Unknown host" exceptions, and
>>> the crawl ends up being extremely slow.
>>>
>>> But that does not change the fact that only a few nodes are actually
>>> fetching on the DMOZ-based crawl. So the next thing to try is raising
>>> mapred.tasktracker.map.tasks.maximum as you suggested, which should
>>> fix my issues... I hope so :/
>>
>> I suggest that you first try a value of 4-5, #maps = 23, and #reduces = 7.
>>
>> Just to be sure ... are you sure you are running a distributed JobTracker?
>> Can you see the JobTracker UI in the browser?
>
>
>
> Yes, the distributed JobTracker is running (full cluster mode); I can
> see all the tasks via :50030... but I'm getting the same results with
> your maps/reduces values: just two nodes are fetching.
>
> Could it be that the DMOZ URL input file (31KB) is not being split
> across all the nodes due to the 64MB DFS block size? (just one block
> "slot" for a 31KB file)... just wondering :/
>
>
>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>