Posted to user@nutch.apache.org by KRIS MUSSHORN <mu...@comcast.net> on 2016/09/09 18:20:40 UTC

nutch crawl everything

Executing this does NOT index everything in and under seed.txt. 

./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TEST_CORE urls/ crawl -1 

I have to run it multiple times to get all content. 

Is it possibly related to this setting in nutch-site.xml? 

<property> 
<name>db.max.outlinks.per.page</name> 
<value>-1</value> 
<description> 
allow unlimited outlinks with -1 
</description> 
</property> 

Thx, 

Kris

Re: nutch crawl everything

Posted by BlackIce <bl...@gmail.com>.

You will need to run Nutch several times in order to fetch everything.

If you have one URL in your seed.txt, the first run will only fetch and
index ONE page/file, i.e. the index.html of that URL. It then parses this
page and adds all the links it finds in index.html to the database. On the
next run it fetches the links it found in the first run, on the 3rd run it
fetches the links it found on the 2nd run, and so forth...
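
To make the round-by-round behaviour concrete, here is a small toy simulation in Python. It is only a sketch of the idea, not Nutch code: the link graph SITE, the page names and the function crawl_rounds are all made up for illustration, and the "database" is just a set of known URLs.

```python
def crawl_rounds(link_graph, seeds, num_rounds):
    """Toy model of round-based crawling (hypothetical, not the Nutch API).

    Each round fetches only the URLs that were already known when the
    round started; links discovered during a round wait for the next one.
    """
    known = set(seeds)   # stand-in for the crawl database
    fetched = []         # URLs fetched so far, in fetch order
    for _ in range(num_rounds):
        # Select known URLs that have not been fetched yet.
        to_fetch = sorted(known - set(fetched))
        if not to_fetch:
            break        # nothing left to fetch, stop early
        for url in to_fetch:
            fetched.append(url)
            # Record the outlinks of the fetched page for later rounds.
            known.update(link_graph.get(url, []))
    return fetched


# Hypothetical site: index.html links to a.html and b.html, and so on.
SITE = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": [],
    "c.html": ["d.html"],
    "d.html": [],
}

for rounds in (1, 2, 3, 4):
    print(rounds, crawl_rounds(SITE, ["index.html"], rounds))
```

With one round only index.html is fetched. Each extra round reaches one link level deeper, so this toy site needs four rounds before d.html is picked up.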

Have a great weekend, everyone!

Re: nutch crawl everything

Posted by Comcast <mu...@comcast.net>.

Tried that. Same result.


> On Sep 9, 2016, at 3:04 PM, BlackIce <bl...@gmail.com> wrote:
> 
> Change the -1 to a positive number like 5 or so.... (In the command)

Re: nutch crawl everything

Posted by BlackIce <bl...@gmail.com>.

Change the -1 at the end of the command to a positive number, like 5 or so.