Posted to user@nutch.apache.org by Shadi Saleh <pr...@gmail.com> on 2015/01/04 16:22:56 UTC

Depth option

Hello,

I want to check this point please.

I am using the crawl command to crawl www.example.com with the depth=1 option. If that
website contains a URL to another website, e.g. www.example2.com, Nutch will not crawl
it, correct? Is it enough to use the depth option, or should I use a URL filter?
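
For reference, this is roughly the invocation I mean (paths are placeholders;
this is the older one-step crawl command, while newer 1.x releases use the
bin/crawl script, where the number of rounds plays the role of depth):

  # Seed URLs live in urls/, crawl data goes to crawl/ (both placeholders).
  # With -depth 1 only the injected seed URLs are fetched; outlinks they
  # contain (including links to other sites) are recorded in the crawldb
  # but not fetched in this run.
  bin/nutch crawl urls -dir crawl -depth 1 -topN 1000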


Best


-- 




Shadi Saleh
Ph.D. Student
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague

16017 Prague 6 - Czech Republic
Mob +420773515578

Re: Depth option

Posted by "Meraj A. Khan" <me...@gmail.com>.
Shadi,

I am not sure what the case will be if example.com itself has external
links; I think it will fetch those with depth 1. But if you want to disable
the fetching of external links, just set the db.ignore.external.links
property to true; you don't need any URL filter set up if you do so.
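
For reference, a minimal sketch of what that looks like in conf/nutch-site.xml
(the usual Nutch 1.x property name; double-check it against your version's
conf/nutch-default.xml):

  <!-- Place inside the <configuration> element of conf/nutch-site.xml.
       When true, outlinks pointing to a host other than the one the page
       was fetched from are ignored during the crawldb update. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
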
On Jan 4, 2015 10:37 AM, "Shadi Saleh" <pr...@gmail.com> wrote:

> Thanks Adil,
>
> crawldb is not empty; it now contains "old" and "current" folders. Should I
> clean it before I start a new crawl? What is the proper way?
>
> Best
>
> On Sun, Jan 4, 2015 at 4:28 PM, Adil Ishaque Abbasi <ai...@gmail.com>
> wrote:
>
> > Yes, you are correct; there is no need to use the URL filter. But this will
> > work only if your crawldb remains empty.
> >
> > Regards
> > Adil I. Abbasi
> >
> > On Sun, Jan 4, 2015 at 8:22 PM, Shadi Saleh <pr...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I want to check this point please.
> > >
> > > I am using the crawl command to crawl www.example.com with the depth=1
> > > option. If that website contains a URL to another website, e.g.
> > > www.example2.com, Nutch will not crawl it, correct? Is it enough to use
> > > the depth option, or should I use a URL filter?
> > >
> > >
> > > Best
> > >
> > >
> > > --
> > >
> > >
> > >
> > >
> > > Shadi Saleh
> > > Ph.D. Student
> > > Institute of Formal and Applied Linguistics
> > > Faculty of Mathematics and Physics
> > > Charles University in Prague
> > >
> > > 16017 Prague 6 - Czech Republic
> > > Mob +420773515578
> > >
> >
>
>
>
> --
>
>
>
>
> Shadi Saleh
> Ph.D. Student
> Institute of Formal and Applied Linguistics
> Faculty of Mathematics and Physics
> Charles University in Prague
>
> 16017 Prague 6 - Czech Republic
> Mob +420773515578
>

Re: Depth option

Posted by Adil Ishaque Abbasi <ai...@gmail.com>.
I believe you need to clean it.
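
Roughly something like this (the crawl directory name is only a guess for your
setup, and this discards all previous crawl state, so move it aside rather than
deleting it if you are unsure):

  # Keep the old data around instead of deleting it outright:
  mv crawl crawl.bak
  # ...or, if you are sure you no longer need it, remove the crawldb together
  # with the linkdb and segments that were built from it:
  rm -r crawl/crawldb crawl/linkdb crawl/segments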

Regards
Adil I. Abbasi

On Sun, Jan 4, 2015 at 8:35 PM, Shadi Saleh <propatrio@gmail.com> wrote:

> Thanks Adil,
>
> crawldb is not empty; it now contains "old" and "current" folders. Should I
> clean it before I start a new crawl? What is the proper way?
>
> Best
>
> On Sun, Jan 4, 2015 at 4:28 PM, Adil Ishaque Abbasi <aiabbasi@gmail.com>
> wrote:
>
> > Yes, you are correct; there is no need to use the URL filter. But this will
> > work only if your crawldb remains empty.
> >
> > Regards
> > Adil I. Abbasi
> >
> > On Sun, Jan 4, 2015 at 8:22 PM, Shadi Saleh <propatrio@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I want to check this point please.
> > >
> > > I am using the crawl command to crawl www.example.com with the depth=1
> > > option. If that website contains a URL to another website, e.g.
> > > www.example2.com, Nutch will not crawl it, correct? Is it enough to use
> > > the depth option, or should I use a URL filter?
> > >
> > >
> > > Best
> > >
> > >
> > > --
> > >
> > >
> > >
> > >
> > > Shadi Saleh
> > > Ph.D. Student
> > > Institute of Formal and Applied Linguistics
> > > Faculty of Mathematics and Physics
> > > Charles University in Prague
> > >
> > > 16017 Prague 6 - Czech Republic
> > > Mob +420773515578
> > >
> >
>
>
>
> --
>
>
>
>
> Shadi Saleh
> Ph.D. Student
> Institute of Formal and Applied Linguistics
> Faculty of Mathematics and Physics
> Charles University in Prague
>
> 16017 Prague 6 - Czech Republic
> Mob +420773515578
>



-- 
Regards
Adil I. Abbasi

Re: Depth option

Posted by Shadi Saleh <pr...@gmail.com>.
Thanks Adil,

crawldb is not empty; it now contains "old" and "current" folders. Should I
clean it before I start a new crawl? What is the proper way?

Best

On Sun, Jan 4, 2015 at 4:28 PM, Adil Ishaque Abbasi <ai...@gmail.com>
wrote:

> Yes, you are correct; there is no need to use the URL filter. But this will
> work only if your crawldb remains empty.
>
> Regards
> Adil I. Abbasi
>
> On Sun, Jan 4, 2015 at 8:22 PM, Shadi Saleh <pr...@gmail.com> wrote:
>
> > Hello,
> >
> > I want to check this point please.
> >
> > I am using the crawl command to crawl www.example.com with the depth=1
> > option. If that website contains a URL to another website, e.g.
> > www.example2.com, Nutch will not crawl it, correct? Is it enough to use the
> > depth option, or should I use a URL filter?
> >
> >
> > Best
> >
> >
> > --
> >
> >
> >
> >
> > Shadi Saleh
> > Ph.D. Student
> > Institute of Formal and Applied Linguistics
> > Faculty of Mathematics and Physics
> > Charles University in Prague
> >
> > 16017 Prague 6 - Czech Republic
> > Mob +420773515578
> >
>



-- 




Shadi Saleh
Ph.D. Student
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague

16017 Prague 6 - Czech Republic
Mob +420773515578

Re: Depth option

Posted by Adil Ishaque Abbasi <ai...@gmail.com>.
Yes, you are correct; there is no need to use the URL filter. But this will
work only if your crawldb remains empty.
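
If you want to see what the crawldb already contains, something like this should
work (the path is only an example; point it at your own crawl directory):

  # Prints how many URLs the crawldb knows about, broken down by fetch status.
  bin/nutch readdb crawl/crawldb -stats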

Regards
Adil I. Abbasi

On Sun, Jan 4, 2015 at 8:22 PM, Shadi Saleh <pr...@gmail.com> wrote:

> Hello,
>
> I want to check this point please.
>
> I am using the crawl command to crawl www.example.com with the depth=1 option.
> If that website contains a URL to another website, e.g. www.example2.com,
> Nutch will not crawl it, correct? Is it enough to use the depth option, or
> should I use a URL filter?
>
>
> Best
>
>
> --
>
>
>
>
> Shadi Saleh
> Ph.D. Student
> Institute of Formal and Applied Linguistics
> Faculty of Mathematics and Physics
> Charles University in Prague
>
> 16017 Prague 6 - Czech Republic
> Mob +420773515578
>

RE: Depth option

Posted by Markus Jelsma <ma...@openindex.io>.
I would recommend using the domain-urlfilter; it is the most straightforward method of controlling the list of hosts in the crawldb.
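
A rough sketch of that setup, assuming the usual Nutch 1.x plugin and file names
(verify them against your version's conf/nutch-default.xml):

  # 1) Enable the plugin: append |urlfilter-domain to the value of the
  #    plugin.includes property in conf/nutch-site.xml.
  # 2) List the hosts/domains you want to keep, one per line:
  echo "example.com" > conf/domain-urlfilter.txt
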
M

 
 
-----Original message-----
> From:Shadi Saleh <pr...@gmail.com>
> Sent: Sunday 4th January 2015 16:23
> To: user <us...@nutch.apache.org>
> Subject: Depth option
> 
> Hello,
> 
> I want to check this point please.
> 
> I am using the crawl command to crawl www.example.com with the depth=1 option.
> If that website contains a URL to another website, e.g. www.example2.com,
> Nutch will not crawl it, correct? Is it enough to use the depth option, or
> should I use a URL filter?
> 
> 
> Best
> 
> 
> -- 
> 
> 
> 
> 
> Shadi Saleh
> Ph.D. Student
> Institute of Formal and Applied Linguistics
> Faculty of Mathematics and Physics
> Charles University in Prague
>
> 16017 Prague 6 - Czech Republic
> Mob +420773515578
>