Posted to user@nutch.apache.org by ".: Abhishek :." <ab...@gmail.com> on 2011/02/09 02:17:01 UTC

Running crawls between a specified time interval

Hi all,

 I am trying to figure out whether there is some way to restrict Nutch crawls
to a time interval, say crawl from 12:00 AM to 12:00 PM, and then start the
further processing (indexing and the other steps that follow the crawl) after
that.

 I think a Nutch job is tied to Hadoop's JobConf, and I am not sure how this
could be done there. Alternatively, if I use an external shell script for
this, how do I chain the crawl process and trigger the further processing
after the crawl?

Thanks,
Abi

Re: Running crawls between a specified time interval

Posted by ".: Abishek :." <ab...@gmail.com>.

Hi folks,

 I am planning to:

   1. Use the Quartz scheduler to run crawl and fetch (in a single job) for a
   day or two, then pause it.
   2. Copy the crawldb and segments folders to a separate temp folder.
   3. Do link inversion and indexing on this temp folder.
   4. Then resume step 1.

 Would this work? Has anyone done this before?
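
Steps 2 and 3 of the plan above could be sketched roughly as follows. This is
only an illustration under stated assumptions: the crawl data lives on the
local filesystem (not HDFS), no Nutch job is writing to it while the copy is
taken, and the function names and directory layout are hypothetical. The
`invertlinks` arguments follow the Nutch 1.x usage as far as I know it; check
them against the usage output of `bin/nutch` for your version. The indexing
command is left out because it depends on the indexer in use.

```python
import shutil
from pathlib import Path
from typing import List


def snapshot_crawl(crawl_dir: Path, snapshot_dir: Path) -> Path:
    """Copy crawldb and segments from crawl_dir into snapshot_dir (step 2).

    Hypothetical helper. It should only be called while no Nutch job is
    writing to crawl_dir, otherwise the copy may be inconsistent.
    """
    for name in ("crawldb", "segments"):
        # copytree creates the destination and raises if it already exists,
        # so an older snapshot is never silently overwritten.
        shutil.copytree(crawl_dir / name, snapshot_dir / name)
    return snapshot_dir


def invertlinks_command(snapshot_dir: Path) -> List[str]:
    """Build the link inversion command for the snapshot (step 3).

    Assumed form: bin/nutch invertlinks <linkdb> -dir <segments>.
    The command is only built here, not executed.
    """
    return [
        "bin/nutch", "invertlinks",
        str(snapshot_dir / "linkdb"),
        "-dir", str(snapshot_dir / "segments"),
    ]
```

Working on a copy keeps the indexing steps away from the directories the next
crawl round writes to, which is the point of step 2.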

Cheers,
Abi


On Thu, Feb 10, 2011 at 10:29 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> In Nutch 1.x you cannot abort and resume the fetch process.

Re: Running crawls between a specified time interval

Posted by Markus Jelsma <ma...@openindex.io>.

In Nutch 1.x you cannot abort and resume the fetch process.

On Thursday 10 February 2011 15:27:05 .: Abishek :. wrote:
> Thanks folks. Will try to do one of these...
> 
> Could I also pause crawling for a while, index the whole crawl up to the
> time it was paused (moving the indexes out to a different location), and
> then continue crawling from where it was paused?
> 
>  Just a simple pause - resume kind of thing

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Running crawls between a specified time interval

Posted by ".: Abishek :." <ab...@gmail.com>.

Thanks folks. Will try to do one of these...

Could I also pause crawling for a while, index the whole crawl up to the time
it was paused (moving the indexes out to a different location), and then
continue crawling from where it was paused?

 Just a simple pause - resume kind of thing

On Thu, Feb 10, 2011 at 10:11 PM, Alexander Aristov <
alexander.aristov@gmail.com> wrote:

> Hi
>
> You may put the separate crawling phases into separate scripts, something
> like
>
> inject.sh
> crawl.sh
> indexing.sh
>
> and configure these scripts to start at a certain time using any scheduling
> tool. For example, I find it very easy to use the Linux cron scheduler.
>
> But you cannot configure the crawl to run only between 12:00 and 13:00. A
> crawl keeps running for as long as it has unfetched resources in the queue,
> or until the max fetch limit is reached, and it takes as much time as it
> needs.
>
> Best Regards
> Alexander Aristov

Re: Running crawls between a specified time interval

Posted by Alexander Aristov <al...@gmail.com>.

Hi

You may put the separate crawling phases into separate scripts, something like

inject.sh
crawl.sh
indexing.sh

and configure these scripts to start at a certain time using any scheduling
tool. For example, I find it very easy to use the Linux cron scheduler.

But you cannot configure the crawl to run only between 12:00 and 13:00. A
crawl keeps running for as long as it has unfetched resources in the queue,
or until the max fetch limit is reached, and it takes as much time as it
needs.
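
One way to chain the phase scripts named above is a small wrapper that runs
them in order and stops at the first failure. The following is a minimal
sketch in Python, not something Nutch ships: the function name `run_phases`
is hypothetical, and the wrapper itself would be what the scheduler (cron, for
example) starts.

```python
import subprocess
import sys
from typing import List, Sequence


def run_phases(phases: Sequence[List[str]]) -> int:
    """Run each command in order and stop at the first failure.

    Returns the number of phases that completed successfully.
    """
    completed = 0
    for command in phases:
        # subprocess.run blocks until the command exits, so the next
        # phase never starts before the previous one has finished.
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"phase failed: {' '.join(command)}", file=sys.stderr)
            break
        completed += 1
    return completed
```

Calling `run_phases([["sh", "inject.sh"], ["sh", "crawl.sh"], ["sh",
"indexing.sh"]])` would then start indexing only after the crawl script has
exited successfully, however long the crawl took.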

Best Regards
Alexander Aristov


Re: Running crawls between a specified time interval

Posted by Sonal Goyal <so...@gmail.com>.

Abhishek,

You could take a look at Oozie or Azkaban. I am not sure they support running
a process between times x and y, but they definitely support scheduling a job.

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others
<https://github.com/sonalgoyal/hiho>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>

On Thu, Feb 10, 2011 at 4:31 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> I'm unsure about what Hadoop can do here, but with Nutch you can't. What you
> can do is create a run script that checks the current time before starting.
> Nutch jobs cannot always be aborted and resumed; beware of the fetch
> process.

Re: Running crawls between a specified time interval

Posted by Markus Jelsma <ma...@openindex.io>.

I'm unsure about what Hadoop can do here, but with Nutch you can't. What you
can do is create a run script that checks the current time before starting.
Nutch jobs cannot always be aborted and resumed; beware of the fetch process.
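
A minimal sketch of such a time check, in Python, assuming the crawl window is
given as two wall-clock times. The function name `in_crawl_window` and the
window values are hypothetical and not part of Nutch; a run script would call
the function before launching each crawl round.

```python
from datetime import datetime, time


def in_crawl_window(now: time, start: time, end: time) -> bool:
    """Return True if `now` falls inside the half-open window [start, end).

    Hypothetical helper, not part of Nutch. A window whose start is later
    than its end (for example 22:00 to 06:00) is treated as wrapping past
    midnight.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


if __name__ == "__main__":
    # Window from the original question: 12:00 AM to 12:00 PM.
    window_start, window_end = time(0, 0), time(12, 0)
    if in_crawl_window(datetime.now().time(), window_start, window_end):
        print("inside window: start the next crawl round")
    else:
        print("outside window: skip crawling, move on to indexing")
```

Because a running fetch cannot be aborted and resumed, the check only decides
whether to start another round. A round that is already running still runs to
completion, so the crawl can overshoot the end of the window.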
