Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2011/11/10 19:51:04 UTC

Continuous crawling

I've done some searching on this, but haven't found any real solutions.  Is
there an existing way to do a continuous crawl using Nutch?  I know I can
use the bin/nutch crawl command, but that stops after a certain number of
iterations.

Right now I'm working on a java class to do it, but I would assume it's a
problem that's been solved already.  Unfortunately I can't seem to find any
evidence of this.

Thanks.

Re: Continuous crawling

Posted by Markus Jelsma <ma...@openindex.io>.
Nutch already uses the Hadoop reporter in the Fetcher. After each job has
finished it prints counters like the ones below on stdout, and the same
numbers show up in the Hadoop GUI (map, reduce, and total columns).


counter                        map             reduce   total
exception                      30,154          0        30,154
access_denied                  380             0        380
gone                           3,159           0        3,159
moved                          18,601          0        18,601
robots_denied                  7,889           0        7,889
robots_denied_maxcrawldelay    167             0        167
hitByThrougputThreshold        5               0        5
bytes_downloaded               24,012,066,657  0        24,012,066,657
hitByTimeLimit                 3,020           0        3,020
notmodified                    30,223          0        30,223
temp_moved                     21,653          0        21,653
success                        433,955         0        433,955
notfound                       23,384          0        23,384


On Monday 28 November 2011 15:09:49 Bai Shen wrote:
> We looked at the hadoop reporter and aren't sure how to access it with
> nutch.  Is there a certain way it works?  Can you give me an example?
> Thanks.
-- 
Markus Jelsma - CTO - Openindex

Re: Continuous crawling

Posted by Bai Shen <ba...@gmail.com>.
Fixed it.  Turns out I'd copied the conf files to the wrong directory.

However, I'm having trouble running my java code.  Previously I put my jar
into the runtime/local/lib directory and then called bin/nutch myClass.  I
put my jar in the hadoop/lib directory, but I'm still getting a
ClassNotFoundException.
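For what it's worth, a rough sketch of one way that should get a custom
class onto the task classpath in deploy mode, relying on Hadoop adding jars
from the job file's internal lib/ directory to the classpath (the job file
version and the class name below are placeholders, not the real ones):

  # the .job file is a zip archive; jars under its lib/ ride along with it
  mkdir -p lib && cp /path/to/myclass.jar lib/
  jar uf runtime/deploy/apache-nutch-1.4.job lib/myclass.jar

  # then invoke the class through the deploy-mode script
  runtime/deploy/bin/nutch com.example.MyClass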

On Mon, Nov 28, 2011 at 3:38 PM, Bai Shen <ba...@gmail.com> wrote:

> I've changed nutch to use the pseudo-distributed mode, but it keeps
> erroring out that no agent is listed in the http.agent.name property.  I
> copied over my conf directory from local, but that didn't fix it.  What am
> I missing?

Re: Continuous crawling

Posted by Markus Jelsma <ma...@openindex.io>.
You can search the web for "Nutch Continuous crawling".

On Tuesday 29 November 2011 11:01:14 庄名洲 wrote:
> I'd like to know the details of continuous crawling, too.
> Could anyone forward me the original email? I'm new here. Thanks to all of
> you.
-- 
Markus Jelsma - CTO - Openindex

Re: Continuous crawling

Posted by 庄名洲 <mi...@gmail.com>.
I'd like to know the details of continuous crawling, too.
Could anyone forward me the original email? I'm new here. Thanks to all of
you.

2011/11/29 庄名洲 <mi...@gmail.com>

> > Regarding "no agent is listed in the http.agent.name property": I met
> > this before. Just rebuild with ant. You may also need .patch files to
> > fix the source. Good luck.



-- 
Best Regards :-)
mingzhou zhuang
Department of Computer Science & Technology, Tsinghua University, Beijing,
China

Re: Continuous crawling

Posted by 庄名洲 <mi...@gmail.com>.
Regarding "no agent is listed in the http.agent.name property": I met this
before. Just rebuild with ant. You may also need .patch files to fix the
source. Good luck.
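For completeness, a minimal sketch of the usual fix, assuming a Nutch
source checkout (the agent string is a placeholder): set the property in
conf/nutch-site.xml and rebuild, so the setting is baked into the job file
that gets shipped to Hadoop.

  <?xml version="1.0"?>
  <!-- conf/nutch-site.xml -->
  <configuration>
    <property>
      <name>http.agent.name</name>
      <value>MyTestCrawler</value> <!-- placeholder; any non-empty string -->
    </property>
  </configuration>

  # then rebuild the runtimes so runtime/local and runtime/deploy pick it up
  cd $NUTCH_HOME && ant runtime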

2011/11/29 Bai Shen <ba...@gmail.com>

> I've changed nutch to use the pseudo-distributed mode, but it keeps
> erroring out that no agent is listed in the http.agent.name property.  I
> copied over my conf directory from local, but that didn't fix it.  What am
> I missing?



-- 
Best Regards :-)
mingzhou zhuang
Department of Computer Science & Technology, Tsinghua University, Beijing,
China

Re: Continuous crawling

Posted by Bai Shen <ba...@gmail.com>.
I've changed nutch to use the pseudo-distributed mode, but it keeps
erroring out that no agent is listed in the http.agent.name property.  I
copied over my conf directory from local, but that didn't fix it.  What am
I missing?

On Mon, Nov 28, 2011 at 9:23 AM, Julien Nioche
<lists.digitalpebble@gmail.com> wrote:

> Simply run Nutch in pseudo-distributed mode. If you have no idea of what
> this means, then it would be a good idea to have a look at
> http://hadoop.apache.org/common/docs/stable/single_node_setup.html and in
> particular the section mentioning http://localhost:50030/jobtracker.jsp

Re: Continuous crawling

Posted by Julien Nioche <li...@gmail.com>.
Simply run Nutch in pseudo-distributed mode. If you have no idea of what
this means, then it would be a good idea to have a look at
http://hadoop.apache.org/common/docs/stable/single_node_setup.html and in
particular the section mentioning http://localhost:50030/jobtracker.jsp
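To make that concrete, a rough sketch of the moving parts, assuming Hadoop
is configured as on that page and a Nutch source checkout (the paths are
assumptions):

  # start the single-node (pseudo-distributed) cluster
  $HADOOP_HOME/bin/start-all.sh

  # submit Nutch jobs from the deploy runtime so they run on the cluster
  cd $NUTCH_HOME/runtime/deploy
  bin/nutch readdb crawl/crawldb -stats   # any job will do

  # then watch the running job and its counters at
  # http://localhost:50030/jobtracker.jsp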

On 28 November 2011 14:09, Bai Shen <ba...@gmail.com> wrote:

> We looked at the hadoop reporter and aren't sure how to access it with
> nutch.  Is there a certain way it works?  Can you give me an example?
> Thanks.



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Continuous crawling

Posted by Bai Shen <ba...@gmail.com>.
We looked at the hadoop reporter and aren't sure how to access it with
nutch.  Is there a certain way it works?  Can you give me an example?
Thanks.


Re: Continuous crawling

Posted by Markus Jelsma <ma...@openindex.io>.
> On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
>
> > > Interesting.  How do you tell if the segments have been fetched, etc?
> >
> > After a job the shell script waits for its completion and return code.
> > If it returns 0 all is fine and we move the segment to another queue. If
> > it is != 0 then there's an error and a report goes out by mail.
>
> Ah, okay.  I didn't realize it was returning an error code.
>
> > > How do you know if there are any urls that had problems?
> >
> > Hadoop reporter shows statistics. There are always many errors for many
> > reasons. This is normal because we crawl everything.
>
> How are you running Hadoop reporter?

You'll get it for free when operating a Hadoop cluster.

> > > Or fetch jobs that errored out, etc.
> >
> > The non-zero return code.

Re: Continuous crawling

Posted by Markus Jelsma <ma...@openindex.io>.
The most basic shell script would be to put the commands of the tutorial in
order and have a cron job execute that script at a fixed interval. For large
crawls you need to add some locking mechanism to avoid overlapping runs.

http://wiki.apache.org/nutch/NutchTutorial
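For the archives, a minimal sketch of such a script, assuming a Nutch 1.x
local runtime with an already-injected crawldb (the install path, topN and
the lock file are assumptions, not part of the tutorial):

  #!/bin/bash
  # crawl-round.sh: one generate/fetch/parse/updatedb round from the
  # tutorial, meant to be run from cron.
  set -e
  cd /opt/nutch/runtime/local          # assumed install location

  # locking: skip this run if the previous round is still going
  exec 200>/var/lock/nutch-crawl.lock
  flock -n 200 || exit 0

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"

A crontab entry along these lines would then run it every 2 hours:

  0 */2 * * * /opt/nutch/crawl-round.sh >> /var/log/nutch-crawl.log 2>&1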

On Monday 14 November 2011 01:55:15 xander wrote:
> Hi,
> I want to write a shell script which will crawl data and update the
> database for me every 2 hours. Can you help me write a shell script for
> it? I am new to this and would appreciate any sort of help. You can divert
> me to a useful link too.
>
> thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Continuous crawling

Posted by xander <fr...@gmail.com>.
Hi,
I want to write a shell script which will crawl data and update the
database for me every 2 hours. Can you help me write a shell script for it?
I am new to this and would appreciate any sort of help. You can divert me to
a useful link too.

thanks 


Re: Continuous crawling

Posted by Bai Shen <ba...@gmail.com>.
On Thu, Nov 10, 2011 at 3:32 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> > Interesting.  How do you tell if the segments have been fetched, etc?
>
> After a job the shell script waits for its completion and return code. If
> it returns 0 all is fine and we move the segment to another queue. If it
> is != 0 then there's an error and a report goes out by mail.

Ah, okay.  I didn't realize it was returning an error code.

> > How do you know if there are any urls that had problems?
>
> Hadoop reporter shows statistics. There are always many errors for many
> reasons. This is normal because we crawl everything.

How are you running Hadoop reporter?

> > Or fetch jobs that errored out, etc.
>
> The non-zero return code.

Re: Continuous crawling

Posted by Markus Jelsma <ma...@openindex.io>.
> Interesting.  How do you tell if the segments have been fetched, etc?

After a job the shell script waits for its completion and return code. If it
returns 0 all is fine and we move the segment to another queue. If it is != 0
then there's an error and a report goes out by mail.

> How
> do you know if there are any urls that had problems?

Hadoop reporter shows statistics. There are always many errors for many 
reasons. This is normal because we crawl everything.

> Or fetch jobs that
> errored out, etc.

The non-zero return code.
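As a sketch, the check around each job looks something like the following
(the mail address and the queue directory are placeholders, not the actual
setup):

  bin/nutch fetch "$SEGMENT"
  if [ $? -ne 0 ]; then
    # non-zero return code: report the failure by mail and stop
    echo "fetch failed for $SEGMENT" | mail -s "nutch crawl error" ops@example.com
    exit 1
  fi
  # return code 0: hand the segment over to the update queue
  mv "$SEGMENT" crawl/queue/fetched/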


Re: Continuous crawling

Posted by Bai Shen <ba...@gmail.com>.
Interesting.  How do you tell if the segments have been fetched, etc?  How
do you know if there are any urls that had problems?  Or fetch jobs that
errored out, etc.

On Thu, Nov 10, 2011 at 2:01 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> I prefer a suite of shell scripts and cron jobs. We simply generate many
> segments at once and have a cron job that checks for available segments we
> can fetch and fetches them. If all are fetched, the segments are moved to a
> queue directory for updating the DB. Once the DB has been updated the
> generators are triggered and the whole circus repeats.
>
> > I've done some searching on this, but haven't found any real solutions.
> > Is there an existing way to do a continuous crawl using Nutch?  I know I
> > can use the bin/nutch crawl command, but that stops after a certain
> > number of iterations.
> >
> > Right now I'm working on a java class to do it, but I would assume it's
> > a problem that's been solved already.  Unfortunately I can't seem to
> > find any evidence of this.
> >
> > Thanks.

Re: Continuous crawling

Posted by Markus Jelsma <ma...@openindex.io>.
I prefer a suite of shell scripts and cron jobs. We simply generate many
segments at once and have a cron job that checks for available segments we
can fetch and fetches them. If all are fetched, the segments are moved to a
queue directory for updating the DB. Once the DB has been updated the
generators are triggered and the whole circus repeats.
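To sketch the shape of that loop as one cron-driven pass (not the actual
scripts; the directory layout, topN and segment count are assumptions, and
the locking discussed elsewhere in this thread is left out):

  #!/bin/bash
  cd /opt/nutch/runtime/local          # assumed install location

  # fetch and parse every segment waiting in the generated queue
  for SEGMENT in crawl/queue/generated/*; do
    [ -d "$SEGMENT" ] || continue
    bin/nutch fetch "$SEGMENT" && bin/nutch parse "$SEGMENT" \
      && mv "$SEGMENT" crawl/queue/fetched/
  done

  # once segments are fetched, update the DB and trigger the generators
  if ls -d crawl/queue/fetched/* >/dev/null 2>&1; then
    bin/nutch updatedb crawl/crawldb crawl/queue/fetched/*
    mv crawl/queue/fetched/* crawl/queue/done/
    bin/nutch generate crawl/crawldb crawl/queue/generated \
      -topN 1000 -maxNumSegments 4   # several segments at once
  fi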


> I've done some searching on this, but haven't found any real solutions.  Is
> there an existing way to do a continuous crawl using Nutch?  I know I can
> use the bin/nutch crawl command, but that stops after a certain number of
> iterations.
> 
> Right now I'm working on a java class to do it, but I would assume it's a
> problem that's been solved already.  Unfortunately I can't seem to find any
> evidence of this.
> 
> Thanks.