Posted to common-dev@hadoop.apache.org by Aishwarya Venkataraman <av...@cs.ucsd.edu> on 2011/10/14 05:12:25 UTC
Web crawler on hadoop becomes unresponsive
Hello,
I am trying to make my web crawling go faster with Hadoop. My mapper is just a few lines of shell and my reducer is an IdentityReducer:
while read -r line; do
    result=$(wget -O - --timeout=500 "http://$line" 2>&1)
    echo "$result"
done
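One way to keep a long fetch from tripping Hadoop's task timeout is to report progress from the mapper: Hadoop Streaming treats lines of the form "reporter:status:..." on stderr as liveness updates. Below is a minimal sketch along those lines; the function name crawl_mapper and the site<TAB>bytes output format are my own illustration, not from the original script.

```shell
#!/bin/bash
# Sketch of a Hadoop Streaming mapper (assumes one hostname per input line).
# The name crawl_mapper and the tab-separated output are illustrative choices.
crawl_mapper() {
    while IFS= read -r line; do
        # Streaming picks up "reporter:status:" lines on stderr, which lets
        # the framework see progress while a slow site is being fetched.
        echo "reporter:status:fetching $line" >&2
        # Bound the fetch so one dead site cannot hang the whole task.
        result=$(wget -O - --tries=1 --timeout=30 "http://$line" 2>/dev/null)
        # Emit site<TAB>bytes so failed fetches (0 bytes) stay visible
        # in the job output instead of disappearing silently.
        printf '%s\t%s\n' "$line" "${#result}"
    done
}
```

Run it as the streaming mapper, or locally as `crawl_mapper < sites.txt` for testing. Emitting a line per site even on failure also makes it possible to spot which URL a task was on when it stalled.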
I am crawling about 50,000 sites, but my mapper always seems to time out after some time. The crawler just becomes unresponsive, I guess.
I am not able to see which site is causing the problem, because the mapper deletes its output when the job fails. I am currently running a single-node Hadoop cluster. Is this the problem?
Did anyone else have a similar problem? I am not sure why this is happening. Can I prevent the mapper from deleting intermediate outputs?
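On keeping the outputs of failed tasks: the Hadoop configuration of that era has a `keep.failed.task.files` property that preserves a failed task's working files for inspection, and `mapred.task.timeout` (milliseconds) controls how long a task may go without reporting progress before it is killed. A sketch of passing both on a streaming job, assuming the usual `-D` generic-option syntax:

    hadoop jar hadoop-streaming.jar \
      -D keep.failed.task.files=true \
      -D mapred.task.timeout=1800000 \
      -input sites -output crawl-out \
      -mapper mapper.sh -reducer IdentityReducer

With the failed task's files kept on disk, the last line the mapper was processing usually points at the problem site.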
I tried running mapper against 10-20 sites as opposed to 50k sites and that
worked fine.
Thanks,
Aishwarya