Posted to user@nutch.apache.org by Jim Lamb <jl...@mail.com> on 2016/11/16 12:24:59 UTC

Automating Nutch 2.3.1 on Amazon EMR

Hello,
 
I am looking for a way to automate Nutch 2.3.1 crawls on Amazon EMR. I have seen lots of documentation and examples of SSHing to the master node in the cluster and running bin/crawl from there, but it would be much cleaner to be able to add a set of "steps" to the EMR create-cluster command, where the job file is called with the appropriate jar file name and parameters. That way, the cluster could be started from a script and would terminate once it had completed, having pushed its data into our external Solr index.
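To illustrate, the sort of invocation I have in mind is below. The bucket, paths, and main classes are made up, and I haven't verified the arguments each Nutch job class takes - it's just the shape of the thing:

```shell
# Hypothetical shape only - bucket, paths, and class names are invented,
# and the Args each Nutch job class takes are unverified.
aws emr create-cluster \
    --name "nutch-crawl" \
    --ami-version 3.11.0 \
    --auto-terminate \
    --instance-count 3 --instance-type m3.xlarge \
    --bootstrap-actions Path=s3://my-bucket/bootstrap/install-nutch-hbase.sh \
    --steps \
      Type=CUSTOM_JAR,Name=Inject,Jar=s3://my-bucket/apache-nutch-2.3.1.job,MainClass=org.apache.nutch.crawl.InjectorJob,Args=s3://my-bucket/urls/ \
      Type=CUSTOM_JAR,Name=Generate,Jar=s3://my-bucket/apache-nutch-2.3.1.job,MainClass=org.apache.nutch.crawl.GeneratorJob
```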
 
I could reverse engineer the bin/nutch script to get such a list of jar calls, but the one point that I cannot quite grasp is how to emulate the loop of rounds that the bin/crawl script performs. Since the removal of org.apache.nutch.crawl.Crawl, I can't see how to do more than one round, short of repeating the same set of steps (everything except inject) over and over in the create-cluster command.
 
At the moment, everything works on EMR (3.11.0) with Nutch 2.3.1 and HBase 0.98.0 both installed as bootstrap actions, but I have to look up the master node's IP address, SSH in, and run bin/crawl manually. I then have to keep checking whether it has finished so that I can terminate the cluster and not incur extra cost (a workaround is to append "&& init 0" to the command so that the master node dies and takes the cluster with it, albeit always showing as a failed cluster in EMR).
 
It would be very desirable to automate this, as we need to run many separate ad-hoc Nutch crawls, and all the config can be brought down from S3, leaving just one manual step.
 
Any help/pointers, particularly from anyone who has done this, would be appreciated.
 
Regards,
 
Jim

Re: Automating Nutch 2.3.1 on Amazon EMR

Posted by Jim Lamb <jl...@mail.com>.
Further to this, I have found that I can only submit a maximum of 256 steps to EMR. Some of our crawls take over 100 rounds, so defining an arbitrary number of (generate,fetch,parse,updatedb,index,solrdedup) rounds each with 6 steps isn't going to work either :-(

Has nobody automated this?

Thanks,

Jim
 

Sent: Thursday, November 17, 2016 at 11:30 AM
From: "Jim Lamb" <jl...@mail.com>
To: user@nutch.apache.org
Subject: Re: Automating Nutch 2.3.1 on Amazon EMR
Hi Sebastian,

Thanks for coming back to me.

> Adding
> set -x
> to bin/nutch and then running bin/crawl with a sample crawl which includes all steps
> should log all commands with a full list of arguments.

Yes, that's a great idea. Thanks.

> But on EMR it should be possible to directly reference the Nutch job file
> by a s3:// URL. (but haven't tried it this way)

Yes, that is possible. You add an S3 URL to the Jar= argument in your step definition of the create-cluster command.

> aws emr terminate-clusters ...

Ah, yes. I did wonder if the master instance had appropriate instance role privilege to do this. I'll try.

Unfortunately, it still doesn't solve the iteration issue. Short of defining many, many repeated sets of steps, I don't see how I would get multiple rounds. What am I missing?

Thanks,

Jim


Re: Automating Nutch 2.3.1 on Amazon EMR

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Jim,

> I could reverse engineer the bin/nutch script to get such a list of jar calls,

Adding
  set -x
to bin/nutch and then running bin/crawl with a sample crawl which includes all steps
should log all commands with a full list of arguments.
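One untested idea for the looping problem: instead of one EMR step per Nutch phase, submit a single step that runs a script performing the loop of rounds itself, so the number of rounds is no longer tied to the number of steps. A rough sketch - the install path and phase arguments are assumptions, so substitute the exact commands captured with set -x above:

```shell
#!/bin/sh
# Untested sketch: run the whole loop of rounds as ONE step/script.
# NUTCH path and phase arguments are assumptions - fill in the real ones
# from the `set -x` output of a sample bin/crawl run.

NUTCH="${NUTCH:-/opt/nutch/runtime/deploy/bin/nutch}"

crawl_rounds() {
  # Run the given number of (generate, fetch, parse, updatedb) rounds,
  # echoing each phase so progress shows up in the step log.
  rounds="$1"
  i=1
  while [ "$i" -le "$rounds" ]; do
    batch="batch-$i"
    for phase in "generate -batchId $batch" "fetch $batch" \
                 "parse $batch" "updatedb"; do
      echo "round $i: nutch $phase"
      $NUTCH $phase || return 1   # stop the crawl if any phase fails
    done
    i=$((i + 1))
  done
}

# The single step would then do roughly:
#   $NUTCH inject s3://my-bucket/urls/                   # hypothetical seeds
#   crawl_rounds 100
#   $NUTCH index -D solr.server.url=http://host:8983/solr -all  # hypothetical
```

On a real cluster each phase submits a MapReduce job, so a failed phase stops the loop instead of burning further rounds.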

> all the config can be brought down from S3

You could copy it via
  aws s3 cp ...
from S3 to the local filesystem of the master.
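For example (bucket and destination path are hypothetical):

```shell
# Pull the crawl config from S3 onto the master node
# (bucket and destination are made up):
aws s3 cp s3://my-bucket/nutch/conf/ /home/hadoop/nutch/conf/ --recursive
```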
But on EMR it should be possible to reference the Nutch job file
directly by an s3:// URL (though I haven't tried it this way).

> simply add "&& init 0"

  aws emr terminate-clusters ...
should do the job cleanly. Also have a look at the other subcommands of
  aws emr
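If you script the termination from the master node itself, the cluster id can be read from EMR's instance-info file. An untested sketch - the path and JSON key are what I'd expect on the master, and the instance profile needs permission to terminate job flows:

```shell
#!/bin/sh
# Sketch: self-terminate the EMR cluster as the last action of the crawl.
# Extract the jobFlowId ("j-XXXX") from EMR's instance-info JSON.
cluster_id_from_info() {
  grep -o '"jobFlowId":[^,}]*' "$1" | sed 's/.*"\(j-[A-Z0-9]*\)".*/\1/'
}

# On the master node this would be roughly:
#   aws emr terminate-clusters --cluster-ids \
#     "$(cluster_id_from_info /mnt/var/lib/info/job-flow.json)"
```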

Sebastian

P.S.: I haven't done this myself and am still shutting down the cluster on AWS manually
      - but that doesn't really matter since the crawling takes over a week.
