Posted to user@nutch.apache.org by Jean-Luc <je...@eserver.hopto.org> on 2005/05/11 21:43:41 UTC

RE : Crawl some sites

Use this command line to inject URLs into your existing db:
nutch inject db -urlfile sites.txt

Works for me :)
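
In case it helps, here is a rough sketch of the full update cycle around
that inject, assuming the Nutch 0.6/0.7-era whole-web tools and the default
db/ and segments/ directories (the segment-path handling is illustrative;
check the tutorial for your exact version):

  # add the new seed URLs to the existing web db
  bin/nutch inject db -urlfile sites.txt
  # generate a fetchlist from the db into a new segment
  bin/nutch generate db segments
  # fetch the newest segment
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  # fold the fetched pages back into the db
  bin/nutch updatedb db $s
  # index the new segment so it shows up in search
  bin/nutch index $s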




-----Original Message-----
From: Ian Reardon [mailto:irnutch@gmail.com]
Sent: Wednesday, May 11, 2005 00:02
To: nutch-user@incubator.apache.org
Subject: Crawl some sites

I would like to crawl some specific sites with Nutch for content. I
will be manually looking for new sites all the time and would like to
add them to my index on a regular basis. So say I look around for sites
to crawl and add 1 or 2 a week. Can anyone pseudo-walk through this
with me?

I crawled some sites with Nutch by creating a flat file of URLs and
then running the crawl command; it created the directories/dbs, but
when I tried to add a new site after the crawl I got an error that the
directory or db already exists. Do I have to recrawl all my content
every time I add something? That is, delete the folder, add the new
site to my flat file, and crawl them all over again? Thanks.
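
For reference, my guess at what happened: the one-shot crawl tool creates
its output directory itself and refuses to run again against an existing
one, which is the error above. The directory name crawl.test is just an
example:

  # first run works: creates crawl.test with the db and segments inside
  bin/nutch crawl sites.txt -dir crawl.test -depth 3
  # a second run against the same -dir fails because the directory
  # already exists; the incremental route is the inject/generate/fetch
  # cycle instead of re-running crawl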




Re: [Nutch-general] RE : Crawl some sites

Posted by Zhou LiBing <zh...@gmail.com>.
If I want to crawl the whole WWW but I don't want to use the DMOZ data,
what should I do?
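
Not an authoritative answer, but the DMOZ dump in the tutorial is only
there to bootstrap a seed list, so any URL list of your own should work
in its place. A sketch, assuming the 0.6/0.7 whole-web tools (seeds.txt
is a made-up name):

  # the tutorial seeds from DMOZ roughly like this:
  #   bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 3000 > urls
  # instead, write your own seed URLs, one per line, and inject them:
  echo "http://example.com/" > seeds.txt
  bin/nutch inject db -urlfile seeds.txt
  # then repeat the generate/fetch/updatedb cycle, limiting each round
  # with -topN on generate as the db grows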
 

 On 5/12/05, Jean-Luc <je...@eserver.hopto.org> wrote: 
> 
> Use this command line to inject URLs into your existing db:
> nutch inject db -urlfile sites.txt
>
> Works for me :)



-- 
---Letter From your friend Blue at HUST CGCL---