Posted to user@nutch.apache.org by James Ford <si...@gmail.com> on 2012/03/01 23:28:52 UTC

Only fetching initial seedlist

Hello,

I am having a problem getting Nutch to crawl and fetch only the initial
seedlist. It seems like Nutch tends to skip some URLs, or perhaps it does
not parse some of them?

For example with the following seedlist:

http://www.domain.com/?_PageId=492&AreaId=441
http://www.domain.com/?_PageId=631&AreaId=11
http://www.domain.com/?_PageId=490&AreaId=19

Nutch does not fetch and parse all of these URLs. I am not that interested
in the outlinks; my goal is to crawl, fetch and parse the seedlist ONLY.

I am using the crawl command with a depth of 1 and infinite topN. I have
also tried injecting manually.

Thanks,
James Ford

--
View this message in context: http://lucene.472066.n3.nabble.com/Only-fetching-initial-seedlist-tp3791957p3791957.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Only fetching initial seedlist

Posted by Lewis John Mcgibbney <le...@gmail.com>.

What makes you think that?

On Fri, Mar 2, 2012 at 12:07 PM, James Ford <si...@gmail.com> wrote:

> But it seems that the solution to my problem is to set
> db.max.outlinks.per.page to 0?
>

Bear in mind that it is pretty difficult to provide help if this is not
mentioned initially.

Re: Only fetching initial seedlist

Posted by James Ford <si...@gmail.com>.

I am 100% sure that regex-urlfilter is not the problem; I am already familiar
with regex patterns. But it seems that the solution to my problem is to set
db.max.outlinks.per.page to 0?

Re: Only fetching initial seedlist

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi James,

Your seed URLs are more than likely being filtered out by your settings in
conf/regex-urlfilter.txt. Have a good read through the urlfilter
documentation [0] and the basic examples provided in the other urlfilters.
It might also help to do a bit of reading on regular expressions and how you
can make the most of them in Nutch urlfilters.

hth

[0]
http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?view=markup
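
For illustration, here is a minimal Python sketch of how a rule list in the
style of regex-urlfilter.txt treats the seed URLs from the original post. It
assumes the local file still carries the stock rule -[?*!@=] from the
template linked above; a customised file may differ. The helper names
(parse_rules, accepts) are hypothetical, and the first-matching-rule logic
only imitates the filter's behaviour; it is not Nutch's own Java code.

```python
import re

# Rules in the style of conf/regex-urlfilter.txt: '+' accepts, '-' rejects.
# Assumption: these lines mirror the stock template; a local file may differ.
RULES = """
# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# Seed URLs taken from the original post.
SEEDS = [
    "http://www.domain.com/?_PageId=492&AreaId=441",
    "http://www.domain.com/?_PageId=631&AreaId=11",
    "http://www.domain.com/?_PageId=490&AreaId=19",
]


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """The first rule whose pattern is found in the URL decides the outcome."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched, so the URL is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in SEEDS:
        print("accepted" if accepts(url, rules) else "rejected", url)
```

If the seeds come out as rejected under such a rule set, the '?' and '='
characters in the URLs are the likely reason. Commenting out the -[?*!@=]
line, or placing a '+' rule for the site above it, is one way to let them
through.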

-- 
*Lewis*

Re: Only fetching initial seedlist

Posted by James Ford <si...@gmail.com>.

Eh,

Can't you guys be a little more specific? I have searched the archives and
found nothing of value.

Re: Only fetching initial seedlist

Posted by Markus Jelsma <ma...@openindex.io>.

Indeed. Check urlfilters and plugins.

On Fri, 2 Mar 2012 05:59:20 +0200, remi tassing <ta...@gmail.com> wrote:
> This question comes up a lot; try searching the mailing list archive.

-- 
 Markus Jelsma - CTO - Openindex
 http://www.linkedin.com/in/markus17
 050-8536600 / 06-50258350

Re: Only fetching initial seedlist

Posted by remi tassing <ta...@gmail.com>.

This question comes up a lot; try searching the mailing list archive.

Re: Only fetching initial seedlist

Posted by Jean-François Gingras <je...@gmail.com>.

You can set db.ignore.external.links to "true" and
also db.max.outlinks.per.page to "0".
Hope this helps.
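
As a rough sketch, the two overrides could be written out as property entries
in the layout that conf/nutch-site.xml uses. The property names are the ones
given above; the helper name build_configuration is hypothetical, and the
output would still need to be merged by hand into an existing nutch-site.xml.
Note that these settings only limit which outlinks get recorded; they do not
change whether the seed URLs themselves pass the URL filters.

```python
import xml.etree.ElementTree as ET

# Property names come from the reply above; the values are the suggested ones.
OVERRIDES = {
    "db.ignore.external.links": "true",
    "db.max.outlinks.per.page": "0",
}


def build_configuration(overrides):
    """Build a <configuration> element holding one <property> per override."""
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


if __name__ == "__main__":
    # Print the fragment so it can be copied into conf/nutch-site.xml.
    print(ET.tostring(build_configuration(OVERRIDES), encoding="unicode"))
```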

-- 
Jean-François Gingras