Posted to user@nutch.apache.org by co...@complexityintelligence.com on 2011/12/02 16:23:53 UTC

Best strategy for boundary defined crawling

Hello,


   We want to crawl the DMOZ set of web sites, and only this set. What is
the best strategy to use with Nutch?


   I'm new to Nutch, and I'm comparing it with our in-house crawling
solution; we may switch to Nutch if this test goes well.


   I think the trivial solution is something like:


     - Parse the DMOZ content file and extract the seed URLs (all URLs)
     - Use a regex URL filter, adding one entry for each URL in the seed
file
     - I hope some option exists to limit the crawl to the space of each
domain, and of course to skip outbound links (to different domains).
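
The first step above could be sketched as follows. This assumes the DMOZ
content dump (content.rdf.u8) lists pages as `<ExternalPage about="URL">`
elements; a quick-and-dirty regex extraction is shown, though for the real
multi-gigabyte file a streaming XML parser (or Nutch's own DmozParser) would
be more appropriate:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: pull seed URLs out of DMOZ RDF content, assuming each
// page appears as an <ExternalPage about="..."> element.
public class DmozSeedExtractor {

    private static final Pattern EXTERNAL_PAGE =
        Pattern.compile("<ExternalPage\\s+about=\"([^\"]+)\"");

    public static List<String> extractSeeds(String rdf) {
        List<String> seeds = new ArrayList<>();
        Matcher m = EXTERNAL_PAGE.matcher(rdf);
        while (m.find()) {
            seeds.add(m.group(1));
        }
        return seeds;
    }

    public static void main(String[] args) {
        String sample =
            "<ExternalPage about=\"http://www.example.com/\">\n" +
            "  <d:Title>Example</d:Title>\n" +
            "</ExternalPage>";
        // Prints each seed URL found in the sample fragment.
        for (String url : extractSeeds(sample)) {
            System.out.println(url);
        }
    }
}
```

The extracted list can then be written one URL per line into the seed
directory.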



   I think a file-based regex URL filter is not a good solution. If I have
a database, even in Java, like HSQLDB or H2, holding all the regex URL
filter entries, can I use a DB instead of a file? Writing a plug-in is
not a problem, if needed.
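
The core of such a custom filter could look like the sketch below. This is
only the filtering logic: in a real Nutch plugin it would live in a class
implementing the org.apache.nutch.net.URLFilter extension point (whose
filter(String) method returns the URL to accept it or null to reject it),
and the allowed-host set would be loaded once from an H2/HSQLDB table
instead of being passed in:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

// Sketch of a host-whitelist URL filter. The plugin wiring and the DB
// lookup are omitted; only the accept/reject decision is shown.
public class HostWhitelistFilter {

    private final Set<String> allowedHosts;

    public HostWhitelistFilter(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    /** Returns the URL if its host is whitelisted, null otherwise. */
    public String filter(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) return null;
            // Treat "www.example.com" and "example.com" the same way.
            if (host.startsWith("www.")) host = host.substring(4);
            return allowedHosts.contains(host) ? url : null;
        } catch (URISyntaxException e) {
            return null; // malformed URL: reject
        }
    }
}
```

A set lookup per URL is also far cheaper than matching thousands of regex
entries in sequence.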



Thanks,
Alessio


Re: Best strategy for boundary defined crawling

Posted by Markus Jelsma <ma...@openindex.io>.
I'm not sure what you mean; you want to crawl a set of domains but never
add new domains to the DB? You can set db.ignore.external.links to true;
this is an easy way to restrict the DB to one or more specific domains.
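
A minimal override for conf/nutch-site.xml, assuming a 1.x-style
configuration where the property is named db.ignore.external.links in
nutch-default.xml:

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Ignore outlinks that point to a different host,
  keeping the crawl inside the seed domains.</description>
</property>
```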

On Friday 02 December 2011 16:23:53 contacts@complexityintelligence.com wrote:

-- 
Markus Jelsma - CTO - Openindex

Re: Best strategy for boundary defined crawling

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Contacts,

We also maintain a utility class for processing DMOZ material, which may
or may not be of use to you:

http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/tools/DmozParser.java

On Sat, Dec 3, 2011 at 6:53 PM, jotta <so...@gmail.com> wrote:




-- 
*Lewis*

RE: Best strategy for boundary defined crawling

Posted by jotta <so...@gmail.com>.
Hi contacts,

Why not simply put all the domains you want to crawl into the seed
directory, and then list them in the regex-urlfilter.txt file like this:
+http://(www\.)?exampledomain\.com.*

with something like this at the end (to disallow anything that does not
match the regular expressions you define):
-.*

Nutch will then crawl all the domains in your collection (and their
internal links), and nothing else.
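
Writing one such filter line per domain by hand doesn't scale to the full
DMOZ list, but they can be generated from the seed URLs. A small sketch
(the escaping here only handles the literal dots in host names):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: turn a collection of seed URLs into regex-urlfilter.txt entries
// of the form +^http://(www\.)?example\.com/ (one per distinct host).
public class RegexFilterGenerator {

    public static Set<String> toFilterLines(Iterable<String> seedUrls) {
        Set<String> lines = new LinkedHashSet<>();
        for (String url : seedUrls) {
            try {
                String host = new URI(url).getHost();
                if (host == null) continue;
                if (host.startsWith("www.")) host = host.substring(4);
                // Escape dots so they match literally in the regex.
                String escaped = host.replace(".", "\\.");
                lines.add("+^http://(www\\.)?" + escaped + "/");
            } catch (URISyntaxException e) {
                // Skip malformed seed URLs.
            }
        }
        return lines;
    }
}
```

The LinkedHashSet deduplicates hosts, so many seed URLs from the same
domain produce a single filter line.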

-----
Regards,
Jotta

PS. Sorry for my English :)
--
View this message in context: http://lucene.472066.n3.nabble.com/Best-strategy-for-boundary-defined-crawling-tp3554906p3557576.html
Sent from the Nutch - User mailing list archive at Nabble.com.