Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2016/12/02 16:36:02 UTC
problem with nutch 1.12 and topN parameter

Hi all.
I need some help or suggestions with this interesting topic.
I am using Nutch 1.12 in local mode.
The problem is that, for some reason, Nutch always takes into account only approximately half of the URLs indicated by the topN parameter.
I am crawling the http://www.cubadebate.cu/ website and all its subdomains, such as:
          http://en.cubadebate.cu/,
          http://mesaredonda.cubadebate.cu/,
          http://razonesdecuba.cubadebate.cu/,

When I finish the first iteration, Nutch detects 506 outlinks from the root page and they are indexed correctly (see the attached cubadebate.cu),
but the stats from my crawldb show only about half of them; see below.

After the first iteration:

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:	255
retry 0:	255
min score:	0.001
avg score:	0.006909804
max score:	1.0
status 1 (db_unfetched):	254
status 2 (db_fetched):	1
CrawlDb statistics: done

Note that this is only approximately half of the 506 outlinks detected.
*******************************************
After the second iteration, these are my crawldb stats:

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:	1013
retry 0:	1007
retry 1:	6
min score:	0.0
avg score:	0.0022329714
max score:	1.003
status 1 (db_unfetched):	964
status 2 (db_fetched):	49
CrawlDb statistics: done
*******************************************
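
The figures in the two statistics dumps can be compared with a little shell arithmetic. This is only a sketch: the variable names are illustrative, and the numbers are copied by hand from the dumps above and from the 506 outlinks reported for the root page.

```shell
#!/bin/sh
# Numbers copied from the post; variable names are illustrative only.
outlinksDetected=506     # outlinks reported for the root page
urlsAfterFirst=255       # TOTAL urls after the first iteration
fetchedAfterFirst=1      # status 2 (db_fetched) after the first iteration
fetchedAfterSecond=49    # status 2 (db_fetched) after the second iteration

# URLs fetched during the second iteration alone
fetchedSecond=$((fetchedAfterSecond - fetchedAfterFirst))

# Integer percentage (shell arithmetic truncates the fraction)
pctInCrawlDb=$((urlsAfterFirst * 100 / outlinksDetected))

echo "fetched in second iteration: $fetchedSecond"
echo "crawldb urls vs detected outlinks: ${pctInCrawlDb}%"
```

Both results sit close to one half, which is the pattern described in this message.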

It is very curious that Nutch only visits 48 URLs (approximately half of 100) when the topN parameter is 100; see below.

#############################################
# MODIFY THE PARAMETERS BELOW TO YOUR NEEDS #
#############################################

# set the number of slaves nodes
numSlaves=1

# and the total number of available tasks
# sets Hadoop parameter "mapreduce.job.reduces"
numTasks=`expr $numSlaves \* 2`

# number of urls to fetch in one iteration
# 250K per task?
sizeFetchlist=`expr $numSlaves \* 100`

# time limit for feching
timeLimitFetch=600

# num threads for fetching
numThreads=100

#############################################
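
As a sanity check, the settings above can be evaluated by hand for numSlaves=1. The sketch below uses POSIX arithmetic expansion in place of expr and assumes that the crawl script passes sizeFetchlist to the generate step as topN; that assumption is worth verifying against your own copy of the script. The value 48 is the number of URLs fetched in the second iteration, taken from this message.

```shell
#!/bin/sh
# Same settings as in the excerpt, evaluated for a single node.
numSlaves=1
numTasks=$((numSlaves * 2))
sizeFetchlist=$((numSlaves * 100))   # assumed to be used as topN

# Observed value from this message (not produced by Nutch here).
fetchedSecondIteration=48

# Integer percentage of the requested fetch list that was actually fetched
pctOfTopN=$((fetchedSecondIteration * 100 / sizeFetchlist))

echo "numTasks=$numTasks"
echo "sizeFetchlist=$sizeFetchlist"
echo "fetched vs topN: ${pctOfTopN}%"
```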

Here are some of my properties:

seed file
-------------------------
http://www.cubadebate.cu/
-------------------------------


nutch-site.xml file
-------------------------------------
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>10000</value>
</property>

--------------------------------------

I have restricted the crawl by domain in domain-urlfilter.txt:
--------------------------
cubadebate.cu
--------------------------
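
To illustrate what this restriction is meant to allow, here is a small shell sketch that accepts a URL when its host is cubadebate.cu or ends in .cubadebate.cu. It only mimics a domain suffix match and is not the filter plugin's actual code; the function name matches_domain and the example.org URL are hypothetical.

```shell
#!/bin/sh
# Illustrative only: a suffix match on the host, NOT the Nutch plugin's code.
allowedDomain="cubadebate.cu"

matches_domain() {
  # Strip the scheme and the path to isolate the host
  host=${1#*://}
  host=${host%%/*}
  case "$host" in
    "$allowedDomain"|*."$allowedDomain") return 0 ;;
    *) return 1 ;;
  esac
}

for url in http://www.cubadebate.cu/ http://en.cubadebate.cu/ \
           http://mesaredonda.cubadebate.cu/ \
           http://razonesdecuba.cubadebate.cu/ \
           http://example.org/; do
  if matches_domain "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Under this reading, all four cubadebate.cu hosts listed earlier pass and only the unrelated host is rejected, so the domain restriction by itself would not be expected to halve the number of URLs.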

Please, I really need any help or suggestions. Am I missing something?





The @universidad_uci is Fidel. We, the young, will not fail.
#HastaSiempreComandante
#HastalaVictoriaSiempre