Posted to dev@nutch.apache.org by Rafi Iz <ra...@hotmail.com> on 2006/01/12 04:29:55 UTC

Should the data size in version 0.8 be much smaller than in version 0.7?

Hi,

I am running a few fetch cycles on Nutch 0.8, and I notice that the data 
size is much smaller than the data size I got with version 0.7 (running the 
same cycles at about the same time from different machines), about 5G after the 
third cycle, starting with about 72000 URLs.
All the processes ended successfully and everything seems to be fine, but I am 
afraid that I'm missing something.


Each cycle includes:
fetch segments/..
updatedb crawldb segments/..
generate crawldb segments

The configuration in nutch-site.xml is:
<property>
  <name>fs.default.name</name>
  <value>machine1:50000</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>machine1:50020</value>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/home/nutch_svn/nutch/trunk/ndfs/name</value>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/home/nutch_svn/nutch/trunk/ndfs/data</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/nutch_svn/nutch/trunk/mapred/local</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/home/nutch_svn/nutch/trunk/mapred/system</value>
</property>

<property>
  <name>mapred.temp.dir</name>
  <value>/home/nutch_svn/nutch/trunk/mapred/temp</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>12</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>6</value>
</property>

<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
</property>


Thanks,
-Rafi
