Posted to user@nutch.apache.org by Sebastian Nagel <wa...@googlemail.com> on 2017/02/01 10:04:56 UTC

Re: create and run a nutch crawler using aws emr on a schedule

Hi Srini,

> I will check it out.
Thanks, would like to see whether it works.

> I am curious how you approached shutting
> down the emr cluster for nutch
I'm running Nutch on Cloudera CDH. When the crawl is done
(which is manually checked), a script terminates all EC2
instances of the cluster (they are identified by a tag).
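
A minimal sketch of such a termination script, assuming the AWS CLI is
installed and configured and that every instance of the cluster carries
a common tag. The tag key/value (`cluster` / `nutch-crawl`) and the
function names are hypothetical, and this is not the exact script used
here, just an illustration in Python wrapping the AWS CLI:

```python
import subprocess


def describe_cmd(tag_key, tag_value):
    """Build the AWS CLI call that lists live instances carrying the tag."""
    return [
        "aws", "ec2", "describe-instances",
        "--filters",
        f"Name=tag:{tag_key},Values={tag_value}",
        "Name=instance-state-name,Values=pending,running",
        "--query", "Reservations[].Instances[].InstanceId",
        "--output", "text",
    ]


def terminate_cmd(instance_ids):
    """Build the AWS CLI call that terminates the given instances."""
    return ["aws", "ec2", "terminate-instances",
            "--instance-ids", *instance_ids]


def terminate_tagged(tag_key, tag_value, run=subprocess.run):
    """Terminate all live instances with the tag; return their IDs.

    `run` is injectable so the logic can be tried without touching AWS.
    """
    result = run(describe_cmd(tag_key, tag_value),
                 check=True, capture_output=True, text=True)
    # Text output is whitespace-separated; empty output yields no IDs.
    instance_ids = result.stdout.split()
    if instance_ids:
        run(terminate_cmd(instance_ids), check=True)
    return instance_ids


# Hypothetical usage once the crawl has been checked and found complete:
#   terminate_tagged("cluster", "nutch-crawl")
```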

Best,
Sebastian

On 01/26/2017 07:16 PM, Srinivasan Ramaswamy wrote:
> Thanks Sebastian. It's interesting to note that you have a patch to directly
> write to S3. I will check it out. I am curious how you approached shutting
> down the emr cluster for nutch. Did you do that using the shell script by
> listening to the exit status of the crawl command?
> 
> Will CloudFormation make my job easier, or will it lack the flexibility
> of a shell script? Has anyone tried that approach?
> 
> Thanks
> Srini
> 
> 
> 
> 
> On Thu, Jan 26, 2017 at 1:58 AM, Sebastian Nagel <wastl.nagel@googlemail.com> wrote:
> 
>> Hi,
>>
>>> I would like to export the crawled output to s3
>>> (already have the seed file stored in s3)
>>
>> Please, also have a look at
>>   https://issues.apache.org/jira/browse/NUTCH-2281
>> (would be great to have a second test for the patch / pull request)
>>
>> At first glance, all 3 approaches seem feasible.
>> Personally, I only have experience with shell scripting
>> and AWS CLI commands to launch the cluster. It's quite
>> flexible, but sometimes cumbersome.
>>
>> Best,
>> Sebastian
>>
>> On 01/26/2017 03:09 AM, Srinivasan Ramaswamy wrote:
>>> Hi Nutch users,
>>>
>>> I am trying to run a nutch crawler periodically on a schedule (like a cron
>>> job). I am running my nutch setup in AWS EMR to avoid setting up and
>>> maintaining infrastructure. I would like to export the crawled output to s3
>>> (already have the seed file stored in s3) and then terminate the EMR
>>> cluster, as my nutch job would not run for more than half a day (at least
>>> for now).
>>>
>>> Here is my question:
>>>
>>> How can I automate the AWS EMR cluster creation with nutch installed and
>>> my configurations (both emr and nutch) updated, and also terminate the
>>> cluster once nutch finishes?
>>>
>>> Here are some ideas I can think of, purely based on my reading; I have not
>>> tried any of them yet.
>>>
>>> - write a script using AWS CLI commands to create the emr cluster, run
>>>   the nutch job, and terminate once it's done
>>> - use cloudformation to create the emr cluster with the necessary
>>>   application (nutch in this case)
>>> - use AWS data pipeline and create a schedule and pipeline for this flow
>>>   (I don't know whether data pipeline can achieve what I want)
>>>
>>> I would be curious to hear how others approached a similar requirement.
>>>
>>> Thanks
>>> Srini
>>>
>>
>>
>
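
For the first idea in the original question (an AWS CLI script that
creates the EMR cluster, runs the crawl, and shuts down), a rough sketch
could look like the following. It only assembles an `aws emr
create-cluster` command line and is not a tested Nutch deployment. The
cluster name, S3 URIs, script paths, release label, and instance
settings are all placeholders, and it assumes a bootstrap action
(hypothetical) that installs Nutch and a crawl script on the master
node. With `--auto-terminate`, EMR shuts the cluster down once the last
step has finished, so no separate termination script is needed.

```python
import subprocess


def create_cluster_cmd(name, release_label, bootstrap_script, crawl_script,
                       log_uri, instance_type="m4.large", instance_count=3):
    """Build an `aws emr create-cluster` call for a one-shot crawl cluster.

    All argument values are supplied by the caller; nothing here is
    specific to a real Nutch installation.
    """
    # One step that runs the (hypothetical) crawl script on the master
    # node via command-runner.jar; a failing step terminates the cluster.
    step = (
        "Type=CUSTOM_JAR,Name=NutchCrawl,"
        "ActionOnFailure=TERMINATE_CLUSTER,"
        f"Jar=command-runner.jar,Args=[bash,{crawl_script}]"
    )
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Hadoop",
        "--use-default-roles",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--log-uri", log_uri,
        "--bootstrap-actions", f"Path={bootstrap_script}",
        "--steps", step,
        "--auto-terminate",
    ]


def launch(cmd, run=subprocess.run):
    """Run the command and return the CLI's output (includes the cluster ID)."""
    return run(cmd, check=True, capture_output=True, text=True).stdout
```

A cron entry calling a small wrapper around `launch(create_cluster_cmd(...))`
would then cover the scheduling part; exporting the crawl output to S3
would be left to the crawl script itself.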