Posted to users@apex.apache.org by Jim <ji...@facility.supplies> on 2016/02/25 21:02:31 UTC

EMR Configuration Settings

Good afternoon,

We are working on bringing up our first Apex application on AWS EMR to test and roll out to production in the near future.

Being brand new to Hadoop, YARN, and Apex, I really don't have a good feel for what the MapReduce, Hadoop, and Apex config settings should be.

Obviously, we want to optimize our memory use so we aren't provisioning boxes that are much larger than we need.

When I go through the Apex documentation, I don't see much about the correct way to set up all the config files to optimize memory.

I have three different applications, and in unit testing each one takes up:


1.)    3.5 GB

2.)    3.5 GB

3.)    2 GB

These applications read data in from AWS Kinesis or AWS SQS, transform it in memory, and update a database as new transactions come in.

Can someone provide some sample configs, and help me understand what to tweak to optimize my EMR system for DataTorrent / Apex only applications, with no batch Hadoop jobs being run on this setup?

Any help would be greatly appreciated!

Thanks,

Jim

Re: EMR Configuration Settings

Posted by Sasha Parfenov <sa...@datatorrent.com>.
Jim,

The fact that your applications are only using 2-3 GB given your DAG
shows you've already spent some time tuning the application. In case
you missed one or two of these, the key parameters to adjust include
(a config sketch follows the list):

* Operator memory - can be tuned for a single operator or for all
operators at once.
* Buffer server memory - used by an operator's output port to buffer
tuples being sent to another container; can be tuned per outgoing port.
* Application Master memory - defaults to 1024 MB, but may need to be
tuned based on the number of containers used in the app.
* YARN container memory - allocated automatically based on operators +
buffer servers, but may need adjustments in yarn-site.xml depending on
the values of yarn.scheduler.minimum-allocation-mb (default 1024) and
yarn.scheduler.maximum-allocation-mb (default 8192).
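
To make this concrete, here is a minimal sketch of the corresponding
dt-site.xml properties. The operator name "transform" and port name
"out" are made-up placeholders and the values are only illustrative;
double-check the attribute names against the configuration docs linked
below.

  <!-- heap for one operator; use dt.operator.*.attr.MEMORY_MB to set
       the same value for all operators at once -->
  <property>
    <name>dt.operator.transform.attr.MEMORY_MB</name>
    <value>1024</value>
  </property>

  <!-- buffer server memory for one outgoing port -->
  <property>
    <name>dt.operator.transform.port.out.attr.BUFFER_MEMORY_MB</name>
    <value>512</value>
  </property>

  <!-- Application Master container memory -->
  <property>
    <name>dt.attr.MASTER_MEMORY_MB</name>
    <value>2048</value>
  </property>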

For examples of how to tune these, see the configuration docs
<http://docs.datatorrent.com/troubleshooting/#configuration> or look at
the demo applications and their dt-site*.xml configuration files in the
apex-malhar <https://github.com/apache/incubator-apex-malhar/find/master>
repo.
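
On the YARN side, the allocation bounds mentioned above live in
yarn-site.xml. A minimal sketch with illustrative values (size them to
your instances):

  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>512</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <!-- total memory YARN may allocate per node; keep it at or above
       maximum-allocation-mb -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
  </property>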

In terms of your Hadoop EMR deployment, you can skip or shut down most
of the services found in a typical Hadoop installation, such as the
MapReduce history server, the HttpFS service, and many others (see the
sketch after the list). The only ones you need to keep are:

* yarn resourcemanager (master)
* hdfs namenode (master)
* yarn nodemanager
* hdfs datanode
* dtgateway
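
For example, on an EMR 4.x master node the extra services can be
stopped via upstart. The service names below are assumptions based on a
typical EMR layout, so list what is actually installed before stopping
anything:

  # see which Hadoop services are running on this node
  initctl list | grep -i -e hadoop -e hdfs -e yarn

  # stop the ones you don't need (names are assumptions; adjust to
  # what the listing shows)
  sudo stop hadoop-mapreduce-historyserver
  sudo stop hadoop-httpfs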

You can also tune all of these based on your application usage patterns,
such as the number of files written to HDFS, etc.

Hopefully this helps, and let us know if you run into any other
interesting parameters requiring adjustment.


Thanks,
Sasha


-- 
*Sasha Parfenov*
Software Engineer, DataTorrent Inc.
3200 Patrick Henry Drive, 2nd floor
Santa Clara, CA  95054
sasha@datatorrent.com
https://www.datatorrent.com/

Re: EMR Configuration Settings

Posted by Munagala Ramanath <ra...@datatorrent.com>.
Jim,

There is some discussion of configuring memory here:

http://docs.datatorrent.com/troubleshooting/#configuration
http://docs.datatorrent.com/tutorials/topnwords-c2/#step-iv-customize-the-application-and-operators
http://docs.datatorrent.com/tutorials/topnwords-c7/

Ram

