Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/06 14:09:58 UTC

nutch infinite depth crawl

Hello,

I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3

and in my urls/sites file, I have two sites like:

http://www.mysite.com
http://www.mysite2.com

I would like to crawl those two sites to infinite depth and just
index all the pages on those sites. But I don't want the crawl to go to
external sites, like Facebook, if one of my sites links to them.

How do I do it? I know this is a basic question, but I have looked
through all the documentation and could not figure it out.

Best Regards,
C.B.

Re: nutch infinite depth crawl

Posted by Cam Bazz <ca...@gmail.com>.
Hello Lewis,

Thank you very much. I have indeed figured it out, and now Nutch is
indexing the way I want.

Best,
C.B.

On Wed, Jul 6, 2011 at 6:32 PM, lewis john mcgibbney
<le...@gmail.com> wrote:
> Hi C.B.,
>
> To fetch all the pages in your two domains, I would start with a crawl
> depth of one, i.e. -depth 1. This way, after every crawl you can
> evaluate how your crawldb and linkdb are looking with readdb and
> readlinkdb respectively, and operate in an incremental manner. This
> will also let you use Luke to check the quality of your index.
>
> Please note that you can use URL filters to filter out domains you do
> not want to crawl. Also have a look at the redirect properties, as
> well as db.ignore.external.links and db.ignore.internal.links in
> nutch-site.xml. Once you read the descriptions of these properties you
> will get an idea of which configuration settings will give the best
> results.
>
> On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <ca...@gmail.com> wrote:
>
>> Hello,
>>
>> I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
>>
>> and in my urls/sites file, I have two sites like:
>>
>> http://www.mysite.com
>> http://www.mysite2.com
>>
>> I would like to crawl those two sites to infinite depth and just
>> index all the pages on those sites. But I don't want the crawl to go to
>> external sites, like Facebook, if one of my sites links to them.
>>
>> How do I do it? I know this is a basic question, but I have looked
>> through all the documentation and could not figure it out.
>>
>> Best Regards,
>> C.B.
>>
>
>
>
> --
> *Lewis*
>

Re: nutch infinite depth crawl

Posted by Markus Jelsma <ma...@openindex.io>.
db.ignore.external.links will not necessarily keep you within one domain
when it comes to redirects. Try the domain URL filter or a regex URL
filter instead.
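
As a sketch (assuming the urlfilter-domain plugin is enabled via the
plugin.includes property), conf/domain-urlfilter.txt simply lists the
domains the crawl may stay within:

  # conf/domain-urlfilter.txt -- one domain per line;
  # URLs outside these domains are filtered out
  mysite.com
  mysite2.com

Alternatively, conf/regex-urlfilter.txt can whitelist the seed hosts and
reject everything else:

  # accept only the seed hosts (illustrative patterns)
  +^http://www\.mysite\.com/
  +^http://www\.mysite2\.com/
  # reject everything else
  -.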

On Wednesday 06 July 2011 17:32:53 lewis john mcgibbney wrote:
> Hi C.B.,
> 
> To fetch all the pages in your two domains, I would start with a crawl
> depth of one, i.e. -depth 1. This way, after every crawl you can
> evaluate how your crawldb and linkdb are looking with readdb and
> readlinkdb respectively, and operate in an incremental manner. This
> will also let you use Luke to check the quality of your index.
>
> Please note that you can use URL filters to filter out domains you do
> not want to crawl. Also have a look at the redirect properties, as
> well as db.ignore.external.links and db.ignore.internal.links in
> nutch-site.xml. Once you read the descriptions of these properties you
> will get an idea of which configuration settings will give the best
> results.
> 
> On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <ca...@gmail.com> wrote:
> > Hello,
> > 
> > I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
> > 
> > and in my urls/sites file, I have two sites like:
> > 
> > http://www.mysite.com
> > http://www.mysite2.com
> > 
> > I would like to crawl those two sites to infinite depth and just
> > index all the pages on those sites. But I don't want the crawl to go to
> > external sites, like Facebook, if one of my sites links to them.
> > 
> > How do I do it? I know this is a basic question, but I have looked
> > through all the documentation and could not figure it out.
> > 
> > Best Regards,
> > C.B.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: nutch infinite depth crawl

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi C.B.,

To fetch all the pages in your two domains, I would start with a crawl
depth of one, i.e. -depth 1. This way, after every crawl you can evaluate
how your crawldb and linkdb are looking with readdb and readlinkdb
respectively, and operate in an incremental manner. This will also let you
use Luke to check the quality of your index.
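
For example (paths assume the -dir crawl layout from your original
command; exact options can vary between Nutch versions):

  # summary statistics for the crawldb
  bin/nutch readdb crawl/crawldb -stats

  # dump the linkdb to a plain-text directory for inspection
  bin/nutch readlinkdb crawl/linkdb -dump crawl/linkdump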

Please note that you can use URL filters to filter out domains you do not
want to crawl. Also have a look at the redirect properties, as well as
db.ignore.external.links and db.ignore.internal.links in nutch-site.xml.
Once you read the descriptions of these properties you will get an idea of
which configuration settings will give the best results.
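
As a sketch of the corresponding nutch-site.xml override (the property is
defined in nutch-default.xml; the description below is paraphrased):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks that point to a different host than
    the page they were found on are ignored, keeping the crawl on the
    seed sites.</description>
  </property>

As Markus points out elsewhere in this thread, redirects can still lead
off-site, so a URL filter is the more robust way to fence the crawl in.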

On Wed, Jul 6, 2011 at 5:09 AM, Cam Bazz <ca...@gmail.com> wrote:

> Hello,
>
> I am running nutch with bin/nutch crawl urls -dir crawl -depth 3 -topN 3
>
> and in my urls/sites file, I have two sites like:
>
> http://www.mysite.com
> http://www.mysite2.com
>
> I would like to crawl those two sites to infinite depth and just
> index all the pages on those sites. But I don't want the crawl to go to
> external sites, like Facebook, if one of my sites links to them.
>
> How do I do it? I know this is a basic question, but I have looked
> through all the documentation and could not figure it out.
>
> Best Regards,
> C.B.
>



-- 
*Lewis*