Posted to user@nutch.apache.org by Berlin Brown <be...@gmail.com> on 2006/03/18 19:01:04 UTC

Intranet Crawling and Whole-web Crawling

I noticed in the documentation that you can do both whole-web crawling
and intranet crawling.  My question: can you combine a database built
by crawling a provided set of URLs with the database you created
through whole-web crawling?

For example, here you create a directory crawl.test, which contains a database:

bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

bin/nutch admin new/db -create

Here, I am creating a database in the directory 'new'.  Can I add the
two databases together?  For example, say I run a whole-web crawl and
then want to crawl a set of URLs with the command below; can I add
those to the index?

bin/nutch crawl urls -dir new -depth 3 >& crawl.log

Re: Intranet Crawling and Whole-web Crawling

Posted by Berlin Brown <be...@gmail.com>.
Thanks a lot, that is exactly it.

On 3/19/06, TDLN <di...@gmail.com> wrote:
> You can merge the indexes together using the nutch merge command.
>
> Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir
> <workingdir>] outputIndex segments...
>
> Rgrds, Thomas

Re: Intranet Crawling and Whole-web Crawling

Posted by TDLN <di...@gmail.com>.
You can merge the indexes together using the nutch merge command.

Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...

Rgrds, Thomas
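
As a rough illustration of the usage line above, here is a minimal sketch of how the merge command might be assembled for the two crawl directories from the question. The output name merged.index, the helper merge_cmd, and the assumption that each crawl directory keeps its already-indexed segments under segments/ are all illustrative, not taken from the thread. The script only prints the command (a dry run), so check it against your Nutch version before running it.

```shell
#!/bin/sh
# Sketch only: build a `bin/nutch merge` command line from crawl directories.
# merge_cmd, merged.index, and the <crawl dir>/segments/* layout are
# assumptions made for illustration.

# Usage: merge_cmd OUTPUT_INDEX CRAWL_DIR...
merge_cmd() {
    output_index=$1
    shift
    # -local follows the usage line: merge on the local filesystem.
    cmd="bin/nutch merge -local $output_index"
    for crawl_dir in "$@"; do
        for seg in "$crawl_dir"/segments/*; do
            # Skip the unexpanded glob when a crawl dir has no segments.
            [ -d "$seg" ] || continue
            cmd="$cmd $seg"
        done
    done
    printf '%s\n' "$cmd"
}

# Dry run: print the command instead of executing it.
merge_cmd merged.index crawl.test new
```

Printing first makes it easy to confirm that every segment from both crawls is listed before any index is written.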

