Posted to user@nutch.apache.org by Meryl Silverburgh <si...@gmail.com> on 2007/04/06 21:08:15 UTC

Trying to setup Nutch

Hi,

I am trying to set up Nutch.
I set up one site in my urls file:
http://www.yahoo.com

Then I start the crawl with this command:
$bin/nutch crawl urls -dir crawl -depth 1 -topN 5

But I get "No URLs to fetch". Can you please tell me what I am missing?
$ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 1
topN = 5
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070406140513
Generator: filtering: false
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
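
A quick way to narrow a failure like this down, assuming the crawldb reader
tool that ships with Nutch 0.8 (bin/nutch readdb), is to dump the crawldb and
see whether the seed URL was accepted by the injector at all; if the dump is
empty, the URL was rejected by the URL filters at inject time. The dump
directory name below is arbitrary.

$ bin/nutch readdb crawl/crawldb -stats
$ bin/nutch readdb crawl/crawldb -dump crawldb-dump
$ cat crawldb-dump/*      # plain-text listing of every URL in the crawldb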

Re: Trying to setup Nutch

Posted by Meryl Silverburgh <si...@gmail.com>.
Thanks.

I followed the rest of the tutorial, and it says:
To fetch, we first generate a fetchlist from the database:

bin/nutch generate crawl/crawldb crawl/segments

This generates a fetchlist for all of the pages due to be fetched. The
fetchlist is placed in a newly created segment directory. The segment
directory is named by the time it's created. We save the name of this
segment in the shell variable s1:

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1



But how can I see the list of URLs (in human-readable format)
before I actually fetch them?
When I run more on that file, it is not readable:
crawl/segments/20070406202200/crawl_generate
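
One way to inspect that fetchlist, assuming the segment reader tool of Nutch
0.8 (bin/nutch readseg) and its suppression flags, is to dump the newest
segment to plain text while keeping only the generate data; the seg-dump
directory name is arbitrary.

$ s1=`ls -d crawl/segments/2* | tail -1`
$ bin/nutch readseg -dump $s1 seg-dump \
    -nocontent -nofetch -noparse -noparsedata -noparsetext
$ cat seg-dump/*      # human-readable listing of the URLs due to be fetched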



On 4/6/07, Xiangyu Zhang <zh...@live.com> wrote:
> Actually that command is for distribute configuration on multiple machines.
> The tutorial you refered to is for entry-level users who typically don't
> need distribute utility.
>
> According to your description, I guess you're using Nutch on a single
> machine which makes that command useless to you.
>
> But when you decide to deploy Nutch to multiple machines to do something
> big, you have much more to do than that tutorial tells you,including that
> command :)
>
> ----- Original Message -----
> From: "Meryl Silverburgh" <si...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Saturday, April 07, 2007 9:12 AM
> Subject: Re: Trying to setup Nutch
>
> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> >> So that's the problem : you have to replace MY.DOMAIN.NAME with domains
> >> you
> >> want to crawl.
> >> For your situation, that line should reads :
> >> +^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
> >> Check it out.
> >>
> >
> > Thanks for your help.
> > but from the documtation
> > http://lucene.apache.org/nutch/tutorial8.html, i don't need to do
> > this:
> > $bin/hadoop dfs -put urls urls
> >
> > but I should do this for crawling:
> >
> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >
> > Why do I need to do this, and what is that for?
> > $bin/hadoop dfs -put urls urls
> >
> >> ----- Original Message -----
> >> From: "Meryl Silverburgh" <si...@gmail.com>
> >> To: <nu...@lucene.apache.org>
> >> Sent: Saturday, April 07, 2007 9:02 AM
> >> Subject: Re: Trying to setup Nutch
> >>
> >> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> >> >> Have yuo checked your crawl-urlfilter.txt file ?
> >> >> Make sure you have replaced your accepted domain.
> >> >>
> >> >
> >> > I have this in my crawl-urlfilter.txt
> >> >
> >> > # accept hosts in MY.DOMAIN.NAME
> >> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >> >
> >> >
> >> > but lets' say I have
> >> > yahoo, cnn, amazon, msn, google
> >> > in my 'urls' files, what should my accepted domain to be?
> >> >
> >> >
> >> >> ----- Original Message -----
> >> >> From: "Meryl Silverburgh" <si...@gmail.com>
> >> >> To: <nu...@lucene.apache.org>
> >> >> Sent: Saturday, April 07, 2007 8:54 AM
> >> >> Subject: Re: Trying to setup Nutch
> >> >>
> >> >> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> >> >> >> After setup, you should put the urls you want to crawl into the
> >> >> >> HDFS
> >> >> >> by
> >> >> >> the
> >> >> >> command :
> >> >> >> $bin/hadoop dfs -put urls urls
> >> >> >>
> >> >> >> Maybe that's something you forgot to do and I hope it helps :)
> >> >> >>
> >> >> >
> >> >> > I try your command, but I get this error:
> >> >> > $ bin/hadoop dfs -put urls urls
> >> >> > put: Target urls already exists
> >> >> >
> >> >> >
> >> >> > I just have 1 line in my file 'urls':
> >> >> > $ more urls
> >> >> > http://www.yahoo.com
> >> >> >
> >> >> > Thanks for any help.
> >> >> >
> >> >> >
> >> >> >> ----- Original Message -----
> >> >> >> From: "Meryl Silverburgh" <si...@gmail.com>
> >> >> >> To: <nu...@lucene.apache.org>
> >> >> >> Sent: Saturday, April 07, 2007 3:08 AM
> >> >> >> Subject: Trying to setup Nutch
> >> >> >>
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > i am trying to setup Nutch.
> >> >> >> > I setup 1 site in my urls file:
> >> >> >> > http://www.yahoo.com
> >> >> >> >
> >> >> >> > And then I start crawl using this command:
> >> >> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> >> >
> >> >> >> > But I get this "No URLs to fecth", can you please tell me what am
> >> >> >> > i
> >> >> >> > missing?
> >> >> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> >> > crawl started in: crawl
> >> >> >> > rootUrlDir = urls
> >> >> >> > threads = 10
> >> >> >> > depth = 1
> >> >> >> > topN = 5
> >> >> >> > Injector: starting
> >> >> >> > Injector: crawlDb: crawl/crawldb
> >> >> >> > Injector: urlDir: urls
> >> >> >> > Injector: Converting injected urls to crawl db entries.
> >> >> >> > Injector: Merging injected urls into crawl db.
> >> >> >> > Injector: done
> >> >> >> > Generator: Selecting best-scoring urls due for fetch.
> >> >> >> > Generator: starting
> >> >> >> > Generator: segment: crawl/segments/20070406140513
> >> >> >> > Generator: filtering: false
> >> >> >> > Generator: topN: 5
> >> >> >> > Generator: jobtracker is 'local', generating exactly one
> >> >> >> > partition.
> >> >> >> > Generator: 0 records selected for fetching, exiting ...
> >> >> >> > Stopping at depth=0 - no more URLs to fetch.
> >> >> >> > No URLs to fetch - check your seed list and URL filters.
> >> >> >> > crawl finished: crawl
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>

Re: Trying to setup Nutch

Posted by Xiangyu Zhang <zh...@live.com>.
Actually, that command is for a distributed configuration on multiple machines.
The tutorial you referred to is for entry-level users, who typically don't
need the distributed setup.

From your description, I guess you're running Nutch on a single
machine, which makes that command unnecessary for you.

But when you decide to deploy Nutch across multiple machines to do something
big, you will have much more to do than that tutorial covers, including that
command :)
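
For anyone reading along, the distributed sequence would look roughly like the
sketch below (assuming the DFS shell commands of Hadoop 0.x); in a local,
single-machine setup the -put step is skipped and 'urls' is simply a directory
on the local filesystem.

$ bin/hadoop dfs -put urls urls    # copy the local seed directory into HDFS
$ bin/hadoop dfs -ls               # confirm that a 'urls' entry now exists
$ bin/nutch crawl urls -dir crawl -depth 1 -topN 5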

----- Original Message -----
From: "Meryl Silverburgh" <si...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Saturday, April 07, 2007 9:12 AM
Subject: Re: Trying to setup Nutch

> On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
>> So that's the problem : you have to replace MY.DOMAIN.NAME with domains 
>> you
>> want to crawl.
>> For your situation, that line should reads :
>> +^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
>> Check it out.
>>
>
> Thanks for your help.
> but from the documtation
> http://lucene.apache.org/nutch/tutorial8.html, i don't need to do
> this:
> $bin/hadoop dfs -put urls urls
>
> but I should do this for crawling:
>
> $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>
> Why do I need to do this, and what is that for?
> $bin/hadoop dfs -put urls urls
>
>> ----- Original Message -----
>> From: "Meryl Silverburgh" <si...@gmail.com>
>> To: <nu...@lucene.apache.org>
>> Sent: Saturday, April 07, 2007 9:02 AM
>> Subject: Re: Trying to setup Nutch
>>
>> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
>> >> Have yuo checked your crawl-urlfilter.txt file ?
>> >> Make sure you have replaced your accepted domain.
>> >>
>> >
>> > I have this in my crawl-urlfilter.txt
>> >
>> > # accept hosts in MY.DOMAIN.NAME
>> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>> >
>> >
>> > but lets' say I have
>> > yahoo, cnn, amazon, msn, google
>> > in my 'urls' files, what should my accepted domain to be?
>> >
>> >
>> >> ----- Original Message -----
>> >> From: "Meryl Silverburgh" <si...@gmail.com>
>> >> To: <nu...@lucene.apache.org>
>> >> Sent: Saturday, April 07, 2007 8:54 AM
>> >> Subject: Re: Trying to setup Nutch
>> >>
>> >> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
>> >> >> After setup, you should put the urls you want to crawl into the 
>> >> >> HDFS
>> >> >> by
>> >> >> the
>> >> >> command :
>> >> >> $bin/hadoop dfs -put urls urls
>> >> >>
>> >> >> Maybe that's something you forgot to do and I hope it helps :)
>> >> >>
>> >> >
>> >> > I try your command, but I get this error:
>> >> > $ bin/hadoop dfs -put urls urls
>> >> > put: Target urls already exists
>> >> >
>> >> >
>> >> > I just have 1 line in my file 'urls':
>> >> > $ more urls
>> >> > http://www.yahoo.com
>> >> >
>> >> > Thanks for any help.
>> >> >
>> >> >
>> >> >> ----- Original Message -----
>> >> >> From: "Meryl Silverburgh" <si...@gmail.com>
>> >> >> To: <nu...@lucene.apache.org>
>> >> >> Sent: Saturday, April 07, 2007 3:08 AM
>> >> >> Subject: Trying to setup Nutch
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > i am trying to setup Nutch.
>> >> >> > I setup 1 site in my urls file:
>> >> >> > http://www.yahoo.com
>> >> >> >
>> >> >> > And then I start crawl using this command:
>> >> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >> >
>> >> >> > But I get this "No URLs to fecth", can you please tell me what am 
>> >> >> > i
>> >> >> > missing?
>> >> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >> > crawl started in: crawl
>> >> >> > rootUrlDir = urls
>> >> >> > threads = 10
>> >> >> > depth = 1
>> >> >> > topN = 5
>> >> >> > Injector: starting
>> >> >> > Injector: crawlDb: crawl/crawldb
>> >> >> > Injector: urlDir: urls
>> >> >> > Injector: Converting injected urls to crawl db entries.
>> >> >> > Injector: Merging injected urls into crawl db.
>> >> >> > Injector: done
>> >> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> >> > Generator: starting
>> >> >> > Generator: segment: crawl/segments/20070406140513
>> >> >> > Generator: filtering: false
>> >> >> > Generator: topN: 5
>> >> >> > Generator: jobtracker is 'local', generating exactly one 
>> >> >> > partition.
>> >> >> > Generator: 0 records selected for fetching, exiting ...
>> >> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> >> > No URLs to fetch - check your seed list and URL filters.
>> >> >> > crawl finished: crawl
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
> 

Re: Trying to setup Nutch

Posted by Meryl Silverburgh <si...@gmail.com>.
On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> So that's the problem : you have to replace MY.DOMAIN.NAME with domains you
> want to crawl.
> For your situation, that line should reads :
> +^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
> Check it out.
>

Thanks for your help.
But according to the documentation at
http://lucene.apache.org/nutch/tutorial8.html, I don't need to do
this:
$bin/hadoop dfs -put urls urls

I only need to do this for crawling:

$bin/nutch crawl urls -dir crawl -depth 1 -topN 5

So why would I need that command, and what is it for?
$bin/hadoop dfs -put urls urls

> ----- Original Message -----
> From: "Meryl Silverburgh" <si...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Saturday, April 07, 2007 9:02 AM
> Subject: Re: Trying to setup Nutch
>
> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> >> Have yuo checked your crawl-urlfilter.txt file ?
> >> Make sure you have replaced your accepted domain.
> >>
> >
> > I have this in my crawl-urlfilter.txt
> >
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> >
> >
> > but lets' say I have
> > yahoo, cnn, amazon, msn, google
> > in my 'urls' files, what should my accepted domain to be?
> >
> >
> >> ----- Original Message -----
> >> From: "Meryl Silverburgh" <si...@gmail.com>
> >> To: <nu...@lucene.apache.org>
> >> Sent: Saturday, April 07, 2007 8:54 AM
> >> Subject: Re: Trying to setup Nutch
> >>
> >> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> >> >> After setup, you should put the urls you want to crawl into the HDFS
> >> >> by
> >> >> the
> >> >> command :
> >> >> $bin/hadoop dfs -put urls urls
> >> >>
> >> >> Maybe that's something you forgot to do and I hope it helps :)
> >> >>
> >> >
> >> > I try your command, but I get this error:
> >> > $ bin/hadoop dfs -put urls urls
> >> > put: Target urls already exists
> >> >
> >> >
> >> > I just have 1 line in my file 'urls':
> >> > $ more urls
> >> > http://www.yahoo.com
> >> >
> >> > Thanks for any help.
> >> >
> >> >
> >> >> ----- Original Message -----
> >> >> From: "Meryl Silverburgh" <si...@gmail.com>
> >> >> To: <nu...@lucene.apache.org>
> >> >> Sent: Saturday, April 07, 2007 3:08 AM
> >> >> Subject: Trying to setup Nutch
> >> >>
> >> >> > Hi,
> >> >> >
> >> >> > i am trying to setup Nutch.
> >> >> > I setup 1 site in my urls file:
> >> >> > http://www.yahoo.com
> >> >> >
> >> >> > And then I start crawl using this command:
> >> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> >
> >> >> > But I get this "No URLs to fecth", can you please tell me what am i
> >> >> > missing?
> >> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >> > crawl started in: crawl
> >> >> > rootUrlDir = urls
> >> >> > threads = 10
> >> >> > depth = 1
> >> >> > topN = 5
> >> >> > Injector: starting
> >> >> > Injector: crawlDb: crawl/crawldb
> >> >> > Injector: urlDir: urls
> >> >> > Injector: Converting injected urls to crawl db entries.
> >> >> > Injector: Merging injected urls into crawl db.
> >> >> > Injector: done
> >> >> > Generator: Selecting best-scoring urls due for fetch.
> >> >> > Generator: starting
> >> >> > Generator: segment: crawl/segments/20070406140513
> >> >> > Generator: filtering: false
> >> >> > Generator: topN: 5
> >> >> > Generator: jobtracker is 'local', generating exactly one partition.
> >> >> > Generator: 0 records selected for fetching, exiting ...
> >> >> > Stopping at depth=0 - no more URLs to fetch.
> >> >> > No URLs to fetch - check your seed list and URL filters.
> >> >> > crawl finished: crawl
> >> >> >
> >> >>
> >> >
> >>
> >
>

Re: Trying to setup Nutch

Posted by zh...@live.com.
So that's the problem: you have to replace MY.DOMAIN.NAME with the domains you
want to crawl.
For your situation, that line should read:
+^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/
Check it out.
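
A side note on that pattern: the unescaped dots are regex metacharacters, so
they match any character. The rule still admits the intended hosts, but a
stricter version would escape them:

+^http://([a-z0-9]*\.)*(yahoo\.com|cnn\.com|amazon\.com|msn\.com|google\.com)/

Also remember that the rules in crawl-urlfilter.txt are applied in order and
the first match wins, so this '+' line must appear before the catch-all '-.'
rule that typically ends the stock file.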

----- Original Message -----
From: "Meryl Silverburgh" <si...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Saturday, April 07, 2007 9:02 AM
Subject: Re: Trying to setup Nutch

> On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
>> Have yuo checked your crawl-urlfilter.txt file ?
>> Make sure you have replaced your accepted domain.
>>
>
> I have this in my crawl-urlfilter.txt
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
>
>
> but lets' say I have
> yahoo, cnn, amazon, msn, google
> in my 'urls' files, what should my accepted domain to be?
>
>
>> ----- Original Message -----
>> From: "Meryl Silverburgh" <si...@gmail.com>
>> To: <nu...@lucene.apache.org>
>> Sent: Saturday, April 07, 2007 8:54 AM
>> Subject: Re: Trying to setup Nutch
>>
>> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
>> >> After setup, you should put the urls you want to crawl into the HDFS 
>> >> by
>> >> the
>> >> command :
>> >> $bin/hadoop dfs -put urls urls
>> >>
>> >> Maybe that's something you forgot to do and I hope it helps :)
>> >>
>> >
>> > I try your command, but I get this error:
>> > $ bin/hadoop dfs -put urls urls
>> > put: Target urls already exists
>> >
>> >
>> > I just have 1 line in my file 'urls':
>> > $ more urls
>> > http://www.yahoo.com
>> >
>> > Thanks for any help.
>> >
>> >
>> >> ----- Original Message -----
>> >> From: "Meryl Silverburgh" <si...@gmail.com>
>> >> To: <nu...@lucene.apache.org>
>> >> Sent: Saturday, April 07, 2007 3:08 AM
>> >> Subject: Trying to setup Nutch
>> >>
>> >> > Hi,
>> >> >
>> >> > i am trying to setup Nutch.
>> >> > I setup 1 site in my urls file:
>> >> > http://www.yahoo.com
>> >> >
>> >> > And then I start crawl using this command:
>> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> >
>> >> > But I get this "No URLs to fecth", can you please tell me what am i
>> >> > missing?
>> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >> > crawl started in: crawl
>> >> > rootUrlDir = urls
>> >> > threads = 10
>> >> > depth = 1
>> >> > topN = 5
>> >> > Injector: starting
>> >> > Injector: crawlDb: crawl/crawldb
>> >> > Injector: urlDir: urls
>> >> > Injector: Converting injected urls to crawl db entries.
>> >> > Injector: Merging injected urls into crawl db.
>> >> > Injector: done
>> >> > Generator: Selecting best-scoring urls due for fetch.
>> >> > Generator: starting
>> >> > Generator: segment: crawl/segments/20070406140513
>> >> > Generator: filtering: false
>> >> > Generator: topN: 5
>> >> > Generator: jobtracker is 'local', generating exactly one partition.
>> >> > Generator: 0 records selected for fetching, exiting ...
>> >> > Stopping at depth=0 - no more URLs to fetch.
>> >> > No URLs to fetch - check your seed list and URL filters.
>> >> > crawl finished: crawl
>> >> >
>> >>
>> >
>>
> 

Re: Trying to setup Nutch

Posted by Meryl Silverburgh <si...@gmail.com>.
On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> Have yuo checked your crawl-urlfilter.txt file ?
> Make sure you have replaced your accepted domain.
>

I have this in my crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/


But let's say I have
yahoo, cnn, amazon, msn, and google
in my 'urls' file; what should my accepted domain be?


> ----- Original Message -----
> From: "Meryl Silverburgh" <si...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Saturday, April 07, 2007 8:54 AM
> Subject: Re: Trying to setup Nutch
>
> > On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> >> After setup, you should put the urls you want to crawl into the HDFS by
> >> the
> >> command :
> >> $bin/hadoop dfs -put urls urls
> >>
> >> Maybe that's something you forgot to do and I hope it helps :)
> >>
> >
> > I try your command, but I get this error:
> > $ bin/hadoop dfs -put urls urls
> > put: Target urls already exists
> >
> >
> > I just have 1 line in my file 'urls':
> > $ more urls
> > http://www.yahoo.com
> >
> > Thanks for any help.
> >
> >
> >> ----- Original Message -----
> >> From: "Meryl Silverburgh" <si...@gmail.com>
> >> To: <nu...@lucene.apache.org>
> >> Sent: Saturday, April 07, 2007 3:08 AM
> >> Subject: Trying to setup Nutch
> >>
> >> > Hi,
> >> >
> >> > i am trying to setup Nutch.
> >> > I setup 1 site in my urls file:
> >> > http://www.yahoo.com
> >> >
> >> > And then I start crawl using this command:
> >> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> >
> >> > But I get this "No URLs to fecth", can you please tell me what am i
> >> > missing?
> >> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >> > crawl started in: crawl
> >> > rootUrlDir = urls
> >> > threads = 10
> >> > depth = 1
> >> > topN = 5
> >> > Injector: starting
> >> > Injector: crawlDb: crawl/crawldb
> >> > Injector: urlDir: urls
> >> > Injector: Converting injected urls to crawl db entries.
> >> > Injector: Merging injected urls into crawl db.
> >> > Injector: done
> >> > Generator: Selecting best-scoring urls due for fetch.
> >> > Generator: starting
> >> > Generator: segment: crawl/segments/20070406140513
> >> > Generator: filtering: false
> >> > Generator: topN: 5
> >> > Generator: jobtracker is 'local', generating exactly one partition.
> >> > Generator: 0 records selected for fetching, exiting ...
> >> > Stopping at depth=0 - no more URLs to fetch.
> >> > No URLs to fetch - check your seed list and URL filters.
> >> > crawl finished: crawl
> >> >
> >>
> >
>

Re: Trying to setup Nutch

Posted by zh...@live.com.
Have you checked your crawl-urlfilter.txt file?
Make sure you have set your accepted domain in it.

----- Original Message -----
From: "Meryl Silverburgh" <si...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Saturday, April 07, 2007 8:54 AM
Subject: Re: Trying to setup Nutch

> On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
>> After setup, you should put the urls you want to crawl into the HDFS by 
>> the
>> command :
>> $bin/hadoop dfs -put urls urls
>>
>> Maybe that's something you forgot to do and I hope it helps :)
>>
>
> I try your command, but I get this error:
> $ bin/hadoop dfs -put urls urls
> put: Target urls already exists
>
>
> I just have 1 line in my file 'urls':
> $ more urls
> http://www.yahoo.com
>
> Thanks for any help.
>
>
>> ----- Original Message -----
>> From: "Meryl Silverburgh" <si...@gmail.com>
>> To: <nu...@lucene.apache.org>
>> Sent: Saturday, April 07, 2007 3:08 AM
>> Subject: Trying to setup Nutch
>>
>> > Hi,
>> >
>> > i am trying to setup Nutch.
>> > I setup 1 site in my urls file:
>> > http://www.yahoo.com
>> >
>> > And then I start crawl using this command:
>> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> >
>> > But I get this "No URLs to fecth", can you please tell me what am i
>> > missing?
>> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>> > crawl started in: crawl
>> > rootUrlDir = urls
>> > threads = 10
>> > depth = 1
>> > topN = 5
>> > Injector: starting
>> > Injector: crawlDb: crawl/crawldb
>> > Injector: urlDir: urls
>> > Injector: Converting injected urls to crawl db entries.
>> > Injector: Merging injected urls into crawl db.
>> > Injector: done
>> > Generator: Selecting best-scoring urls due for fetch.
>> > Generator: starting
>> > Generator: segment: crawl/segments/20070406140513
>> > Generator: filtering: false
>> > Generator: topN: 5
>> > Generator: jobtracker is 'local', generating exactly one partition.
>> > Generator: 0 records selected for fetching, exiting ...
>> > Stopping at depth=0 - no more URLs to fetch.
>> > No URLs to fetch - check your seed list and URL filters.
>> > crawl finished: crawl
>> >
>>
> 

Re: Trying to setup Nutch

Posted by Meryl Silverburgh <si...@gmail.com>.
On 4/6/07, zhangxy@live.com <zh...@live.com> wrote:
> After setup, you should put the urls you want to crawl into the HDFS by the
> command :
> $bin/hadoop dfs -put urls urls
>
> Maybe that's something you forgot to do and I hope it helps :)
>

I tried your command, but I get this error:
$ bin/hadoop dfs -put urls urls
put: Target urls already exists


I have just one line in my 'urls' file:
$ more urls
http://www.yahoo.com

Thanks for any help.
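
If that put step were actually needed (i.e. in a distributed setup), the error
would simply mean that a copy of 'urls' is already sitting in HDFS from an
earlier attempt. Something like the following would refresh it, assuming the
DFS shell of Hadoop 0.x; in a local, single-machine setup the step can be
skipped entirely, as explained elsewhere in this thread.

$ bin/hadoop dfs -rmr urls        # remove the stale copy from HDFS
$ bin/hadoop dfs -put urls urls   # upload the current seed directory again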


> ----- Original Message -----
> From: "Meryl Silverburgh" <si...@gmail.com>
> To: <nu...@lucene.apache.org>
> Sent: Saturday, April 07, 2007 3:08 AM
> Subject: Trying to setup Nutch
>
> > Hi,
> >
> > i am trying to setup Nutch.
> > I setup 1 site in my urls file:
> > http://www.yahoo.com
> >
> > And then I start crawl using this command:
> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >
> > But I get this "No URLs to fecth", can you please tell me what am i
> > missing?
> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 1
> > topN = 5
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070406140513
> > Generator: filtering: false
> > Generator: topN: 5
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> > crawl finished: crawl
> >
>

Re: Trying to setup Nutch

Posted by zh...@live.com.
After setup, you should put the urls you want to crawl into HDFS with the
command:
$bin/hadoop dfs -put urls urls

Maybe that's something you forgot to do and I hope it helps :)

----- Original Message -----
From: "Meryl Silverburgh" <si...@gmail.com>
To: <nu...@lucene.apache.org>
Sent: Saturday, April 07, 2007 3:08 AM
Subject: Trying to setup Nutch

> Hi,
>
> i am trying to setup Nutch.
> I setup 1 site in my urls file:
> http://www.yahoo.com
>
> And then I start crawl using this command:
> $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>
> But I get this "No URLs to fecth", can you please tell me what am i 
> missing?
> $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 1
> topN = 5
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070406140513
> Generator: filtering: false
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
> 

Re: Trying to setup Nutch

Posted by Meryl Silverburgh <si...@gmail.com>.
Yes, I have added this to my crawl-urlfilter.txt:


+^http://([a-z0-9]*\.)*(yahoo.com|cnn.com|amazon.com|msn.com|google.com)/


but I still have the problem that I mentioned in my previous mail.
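
Two further things worth checking, assuming the local single-machine setup
from the 0.8 tutorial. First, that tutorial treats 'urls' as a directory
containing one or more seed files rather than a flat file (if 'urls' is
currently a flat file, move it aside before creating the directory). Second, a
crawl directory left over from the earlier failed runs can get in the way,
since depending on the version the crawl tool either refuses to reuse it or
merges into the stale crawldb, so it is cleaner to start over. An illustrative
sequence, with the seed file name chosen arbitrarily:

$ rm -rf crawl                              # discard the partial crawl
$ mkdir -p urls
$ echo 'http://www.yahoo.com/' > urls/seed.txt
$ bin/nutch crawl urls -dir crawl -depth 1 -topN 5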

On 4/10/07, Michael Wechner <mi...@wyona.com> wrote:
> Meryl Silverburgh wrote:
>
> > Hi,
> >
> > i am trying to setup Nutch.
> > I setup 1 site in my urls file:
> > http://www.yahoo.com
>
>
> have yiu added it to the URL/Crawl filters?
>
> Cheers
>
> Michael
>
> >
> > And then I start crawl using this command:
> > $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> >
> > But I get this "No URLs to fecth", can you please tell me what am i
> > missing?
> > $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> > crawl started in: crawl
> > rootUrlDir = urls
> > threads = 10
> > depth = 1
> > topN = 5
> > Injector: starting
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injector: Merging injected urls into crawl db.
> > Injector: done
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: starting
> > Generator: segment: crawl/segments/20070406140513
> > Generator: filtering: false
> > Generator: topN: 5
> > Generator: jobtracker is 'local', generating exactly one partition.
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> > crawl finished: crawl
> >
>
>
> --
> Michael Wechner
> Wyona      -   Open Source Content Management   -    Apache Lenya
> http://www.wyona.com                      http://lenya.apache.org
> michael.wechner@wyona.com                        michi@apache.org
> +41 44 272 91 61
>
>

Re: Trying to setup Nutch

Posted by Michael Wechner <mi...@wyona.com>.
Meryl Silverburgh wrote:

> Hi,
>
> i am trying to setup Nutch.
> I setup 1 site in my urls file:
> http://www.yahoo.com


Have you added it to the URL/crawl filters?

Cheers

Michael

>
> And then I start crawl using this command:
> $bin/nutch crawl urls -dir crawl -depth 1 -topN 5
>
> But I get this "No URLs to fecth", can you please tell me what am i 
> missing?
> $ bin/nutch crawl urls -dir crawl -depth 1 -topN 5
> crawl started in: crawl
> rootUrlDir = urls
> threads = 10
> depth = 1
> topN = 5
> Injector: starting
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: crawl/segments/20070406140513
> Generator: filtering: false
> Generator: topN: 5
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: crawl
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61