Posted to user@nutch.apache.org by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov> on 2012/01/01 22:47:59 UTC

Re: topN-help

Great to hear!

Happy New Year!

Cheers,
Chris

On Jan 1, 2012, at 1:15 PM, tahere ganjiyar wrote:

> Thanks for your answer. I set maxOutlink to -1, and now I can crawl every
> link in my sites. It works for me, thanks.
> 
> 
> 
> On 12/20/11, Mattmann, Chris A (388J) <ch...@jpl.nasa.gov> wrote:
>> Hi,
>> 
>> Try changing the properties related to max outlinks in the
>> nutch-default.xml. That should help.
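>> 
>> A minimal sketch of such an override, assuming the property meant here is
>> db.max.outlinks.per.page (the thread does not name it) and that the
>> override goes in nutch-site.xml rather than in nutch-default.xml itself:
>> 
>> <property>
>>   <!-- Assumed property name; check it against your nutch-default.xml. -->
>>   <name>db.max.outlinks.per.page</name>
>>   <!-- A negative value should mean no per-page cap on outlinks. -->
>>   <value>-1</value>
>> </property>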
>> 
>> Cheers,
>> Chris
>> 
>> On Dec 19, 2011, at 2:49 PM, tahere ganjiyar wrote:
>> 
>>> Hi, I am crawling a site that has 100 links at depth 1 and 100 links at
>>> depth 2, but Nutch only crawls 23 links from depth 1 and 30 from depth 2.
>>> How can I force Nutch to crawl all links at depths 1 and 2? I use Nutch 1.3.
>>> 
>>> topN=10000
>>> depth=2
>>> And in my nutch-site.xml:
>>> <property>
>>>   <name>http.content.limit</name>
>>>   <value>-1</value>
>>>   <description></description>
>>> </property>
>>> <property>
>>>   <name>http.agent.name</name>
>>>   <value>My Nutch Spider</value>
>>>   <description></description>
>>> </property>
>>> 
>>> 
>>> 
>> 
>> 
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattmann@nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 
>> 