Posted to user@crunch.apache.org by Luke Hansen <lu...@wealthfront.com> on 2015/10/13 21:52:22 UTC

Hadoop Configuration from DoFn

Does anyone know if this is the right way to configure Hadoop from a Crunch
DoFn?  This didn't seem to affect anything.

Thanks!

@Override
public void configure(Configuration conf) {
  conf.set("mapreduce.map.java.opts", "-Xmx3900m");
  conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");

  conf.set("mapreduce.map.memory.mb", "4096");
  conf.set("mapreduce.reduce.memory.mb", "4096");
}

Re: Hadoop Configuration from DoFn

Posted by Luke Hansen <lu...@wealthfront.com>.
Thanks for the quick replies everyone!  Setting the configuration at the
pipeline level (as opposed to the DoFn level) worked.
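For anyone landing on this thread later, here is a minimal sketch of the fix described above: set the memory properties on the Configuration passed to the MRPipeline constructor instead of inside DoFn.configure(...). The class and pipeline names are placeholders, not from the original posts:

```java
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class PipelineLevelConfig {
  public static Pipeline build() {
    // Build the Configuration up front, before constructing the pipeline,
    // so the settings apply to every job the planner generates.
    Configuration conf = new Configuration();

    // JVM heap for map and reduce tasks (kept below the container size).
    conf.set("mapreduce.map.java.opts", "-Xmx3900m");
    conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");

    // YARN container memory for map and reduce tasks.
    conf.set("mapreduce.map.memory.mb", "4096");
    conf.set("mapreduce.reduce.memory.mb", "4096");

    return new MRPipeline(PipelineLevelConfig.class, "My Pipeline", conf);
  }
}
```

This requires Hadoop and Crunch on the classpath; it is a configuration sketch rather than a runnable program.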

On Tue, Oct 13, 2015 at 1:08 PM, Micah Whitacre <mk...@gmail.com>
wrote:

> Yeah, I was confusing it with the setContext(...) method, which provides
> the configuration when the job is actually running. [1]  Luke, you might
> look at generating a plan of your pipeline to see what other DoFns might be
> inside the same job and causing a conflict with your settings.
>
> We typically set things globally rather than tweaking each DoFn, simply
> because it means we don't have to worry about which DoFns get grouped into
> a single task and override each other.
>
> [1] -
> http://crunch.apache.org/apidocs/0.12.0/org/apache/crunch/DoFn.html#configure(org.apache.hadoop.conf.Configuration)
>
> On Tue, Oct 13, 2015 at 3:02 PM, Robinson, Landon - Landon <
> landon.t.robinson@lowes.com> wrote:
>
>> You can do it both ways: at the DoFn level or at the pipeline level.
>>
>> For global settings, go with the pipeline level. For individual
>> jobs/tasks, go DoFn Level.
>>
>> *Pipeline Level:*
>>
>> Configuration crunchConf = getConf();
>> crunchConf.set("mapred.job.queue.name", "batch");
>> Pipeline pipeline = new MRPipeline(TransformKronosMR.class, "My Pipeline", crunchConf);
>>
>>
>> *DoFn Level (as mentioned):*
>>
>> @Override
>> public void configure(Configuration conf) {
>>   conf.set("mapreduce.map.java.opts", "-Xmx3900m");
>>   conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");
>>
>>   conf.set("mapreduce.map.memory.mb", "4096");
>>   conf.set("mapreduce.reduce.memory.mb", "4096");
>> }
>>
>>
>>
>>
>> ---------------------------------------------------------------------------
>> Landon Robinson
>> Big Data/Hadoop Engineer
>> Lowe’s Companies Inc. | IT Business Intelligence
>>
>> ---------------------------------------------------------------------------
>>
>> From: Micah Whitacre <mk...@gmail.com>
>> Reply-To: "user@crunch.apache.org" <us...@crunch.apache.org>
>> Date: Tuesday, October 13, 2015 at 3:55 PM
>> To: "user@crunch.apache.org" <us...@crunch.apache.org>
>> Subject: Re: Hadoop Configuration from DoFn
>>
>> Luke,
>>   Generally that configuration should be set on the Configuration object
>> passed to Pipeline vs on the individual DoFns.  The configure(...) method
>> is called when re-instantiating the DoFn on the Map/Reduce task and at that
>> point those memory settings wouldn't be honored.
>>
>> On Tue, Oct 13, 2015 at 2:52 PM, Luke Hansen <lu...@wealthfront.com>
>> wrote:
>>
>>> Does anyone know if this is the right way to configure Hadoop from a
>>> Crunch DoFn?  This didn't seem to affect anything.
>>>
>>> Thanks!
>>>
>>> @Override
>>> public void configure(Configuration conf) {
>>>   conf.set("mapreduce.map.java.opts", "-Xmx3900m");
>>>   conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");
>>>
>>>   conf.set("mapreduce.map.memory.mb", "4096");
>>>   conf.set("mapreduce.reduce.memory.mb", "4096");
>>> }
>>>
>>>
>> NOTICE: All information in and attached to the e-mails below may be
>> proprietary, confidential, privileged and otherwise protected from improper
>> or erroneous disclosure. If you are not the sender's intended recipient,
>> you are not authorized to intercept, read, print, retain, copy, forward, or
>> disseminate this message. If you have erroneously received this
>> communication, please notify the sender immediately by phone
>> (704-758-1000) or by e-mail and destroy all copies of this message
>> electronic, paper, or otherwise.
>>
>> *By transmitting documents via this email: Users, Customers, Suppliers
>> and Vendors collectively acknowledge and agree the transmittal of
>> information via email is voluntary, is offered as a convenience, and is not
>> a secured method of communication; Not to transmit any payment information
>> E.G. credit card, debit card, checking account, wire transfer information,
>> passwords, or sensitive and personal information E.G. Driver's license,
>> DOB, social security, or any other information the user wishes to remain
>> confidential; To transmit only non-confidential information such as plans,
>> pictures and drawings and to assume all risk and liability for and
>> indemnify Lowe's from any claims, losses or damages that may arise from the
>> transmittal of documents or including non-confidential information in the
>> body of an email transmittal. Thank you. *
>>
>
>

Re: Hadoop Configuration from DoFn

Posted by Micah Whitacre <mk...@gmail.com>.
Yeah, I was confusing it with the setContext(...) method, which provides
the configuration when the job is actually running. [1]  Luke, you might
look at generating a plan of your pipeline to see what other DoFns might be
inside the same job and causing a conflict with your settings.

We typically set things globally rather than tweaking each DoFn, simply
because it means we don't have to worry about which DoFns get grouped into
a single task and override each other.

[1] -
http://crunch.apache.org/apidocs/0.12.0/org/apache/crunch/DoFn.html#configure(org.apache.hadoop.conf.Configuration)
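As a hedged sketch of the "generate a plan" suggestion above (API as of Crunch 0.12; the class name is a placeholder): runAsync() returns a PipelineExecution, and its getPlanDotFile() yields the planned job graph in DOT format, which shows how DoFns were fused into jobs:

```java
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineExecution;

public class PlanInspector {
  // Prints the planned job graph in DOT format; render it with Graphviz
  // to see which DoFns the planner fused into each MapReduce task.
  public static void printPlan(Pipeline pipeline) throws InterruptedException {
    PipelineExecution execution = pipeline.runAsync();
    System.out.println(execution.getPlanDotFile());
    execution.waitUntilDone();
  }
}
```

This assumes a fully wired pipeline (sources, DoFns, and targets already attached) and a cluster or local runner to execute against, so it is not runnable standalone.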

On Tue, Oct 13, 2015 at 3:02 PM, Robinson, Landon - Landon <
landon.t.robinson@lowes.com> wrote:

> You can do it both ways: at the DoFn level or at the pipeline level.
>
> For global settings, go with the pipeline level. For individual
> jobs/tasks, go DoFn Level.
>
> *Pipeline Level:*
>
> Configuration crunchConf = getConf();
> crunchConf.set("mapred.job.queue.name", "batch");
> Pipeline pipeline = new MRPipeline(TransformKronosMR.class, "My Pipeline", crunchConf);
>
>
> *DoFn Level (as mentioned):*
>
> @Override
> public void configure(Configuration conf) {
>   conf.set("mapreduce.map.java.opts", "-Xmx3900m");
>   conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");
>
>   conf.set("mapreduce.map.memory.mb", "4096");
>   conf.set("mapreduce.reduce.memory.mb", "4096");
> }
>
>
>
> ---------------------------------------------------------------------------
> Landon Robinson
> Big Data/Hadoop Engineer
> Lowe’s Companies Inc. | IT Business Intelligence
> ---------------------------------------------------------------------------
>
> From: Micah Whitacre <mk...@gmail.com>
> Reply-To: "user@crunch.apache.org" <us...@crunch.apache.org>
> Date: Tuesday, October 13, 2015 at 3:55 PM
> To: "user@crunch.apache.org" <us...@crunch.apache.org>
> Subject: Re: Hadoop Configuration from DoFn
>
> Luke,
>   Generally that configuration should be set on the Configuration object
> passed to Pipeline vs on the individual DoFns.  The configure(...) method
> is called when re-instantiating the DoFn on the Map/Reduce task and at that
> point those memory settings wouldn't be honored.
>
> On Tue, Oct 13, 2015 at 2:52 PM, Luke Hansen <lu...@wealthfront.com> wrote:
>
>> Does anyone know if this is the right way to configure Hadoop from a
>> Crunch DoFn?  This didn't seem to affect anything.
>>
>> Thanks!
>>
>> @Override
>> public void configure(Configuration conf) {
>>   conf.set("mapreduce.map.java.opts", "-Xmx3900m");
>>   conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");
>>
>>   conf.set("mapreduce.map.memory.mb", "4096");
>>   conf.set("mapreduce.reduce.memory.mb", "4096");
>> }
>>
>>

Re: Hadoop Configuration from DoFn

Posted by "Robinson, Landon - Landon" <la...@lowes.com>.
You can do it both ways: at the DoFn level or at the pipeline level.

For global settings, go with the pipeline level. For individual jobs/tasks, go DoFn Level.

Pipeline Level:


Configuration crunchConf = getConf();
crunchConf.set("mapred.job.queue.name", "batch");
Pipeline pipeline = new MRPipeline(TransformKronosMR.class, "My Pipeline", crunchConf);


DoFn Level (as mentioned):

@Override
public void configure(Configuration conf) {
  conf.set("mapreduce.map.java.opts", "-Xmx3900m");
  conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");

  conf.set("mapreduce.map.memory.mb", "4096");
  conf.set("mapreduce.reduce.memory.mb", "4096");
}


---------------------------------------------------------------------------
Landon Robinson
Big Data/Hadoop Engineer
Lowe’s Companies Inc. | IT Business Intelligence
---------------------------------------------------------------------------

From: Micah Whitacre <mk...@gmail.com>
Reply-To: "user@crunch.apache.org" <us...@crunch.apache.org>
Date: Tuesday, October 13, 2015 at 3:55 PM
To: "user@crunch.apache.org" <us...@crunch.apache.org>
Subject: Re: Hadoop Configuration from DoFn

Luke,
  Generally that configuration should be set on the Configuration object passed to Pipeline vs on the individual DoFns.  The configure(...) method is called when re-instantiating the DoFn on the Map/Reduce task and at that point those memory settings wouldn't be honored.

On Tue, Oct 13, 2015 at 2:52 PM, Luke Hansen <lu...@wealthfront.com> wrote:
Does anyone know if this is the right way to configure Hadoop from a Crunch DoFn?  This didn't seem to affect anything.

Thanks!

@Override
public void configure(Configuration conf) {
  conf.set("mapreduce.map.java.opts", "-Xmx3900m");
  conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");

  conf.set("mapreduce.map.memory.mb", "4096");
  conf.set("mapreduce.reduce.memory.mb", "4096");
}



Re: Hadoop Configuration from DoFn

Posted by David Ortiz <dp...@gmail.com>.
Micah,

      I have definitely had that approach work on CDH 5.4.4 for passing in
configuration/memory settings that were necessary for a specific DoFn but
would be inappropriate for the rest of the pipeline.  Not sure if something
changed after that version that would prevent this from working.

Thanks,
     Dave

On Tue, Oct 13, 2015 at 3:55 PM Micah Whitacre <mk...@gmail.com> wrote:

> Luke,
>   Generally that configuration should be set on the Configuration object
> passed to Pipeline vs on the individual DoFns.  The configure(...) method
> is called when re-instantiating the DoFn on the Map/Reduce task and at that
> point those memory settings wouldn't be honored.
>
> On Tue, Oct 13, 2015 at 2:52 PM, Luke Hansen <lu...@wealthfront.com> wrote:
>
>> Does anyone know if this is the right way to configure Hadoop from a
>> Crunch DoFn?  This didn't seem to affect anything.
>>
>> Thanks!
>>
>> @Override
>> public void configure(Configuration conf) {
>>   conf.set("mapreduce.map.java.opts", "-Xmx3900m");
>>   conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");
>>
>>   conf.set("mapreduce.map.memory.mb", "4096");
>>   conf.set("mapreduce.reduce.memory.mb", "4096");
>> }
>>
>>
>

Re: Hadoop Configuration from DoFn

Posted by Micah Whitacre <mk...@gmail.com>.
Luke,
  Generally that configuration should be set on the Configuration object
passed to Pipeline vs on the individual DoFns.  The configure(...) method
is called when re-instantiating the DoFn on the Map/Reduce task and at that
point those memory settings wouldn't be honored.

On Tue, Oct 13, 2015 at 2:52 PM, Luke Hansen <lu...@wealthfront.com> wrote:

> Does anyone know if this is the right way to configure Hadoop from a
> Crunch DoFn?  This didn't seem to affect anything.
>
> Thanks!
>
> @Override
> public void configure(Configuration conf) {
>   conf.set("mapreduce.map.java.opts", "-Xmx3900m");
>   conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");
>
>   conf.set("mapreduce.map.memory.mb", "4096");
>   conf.set("mapreduce.reduce.memory.mb", "4096");
> }
>
>

RE: Hadoop Configuration from DoFn

Posted by David Ortiz <do...@videologygroup.com>.
That is a correct way to do it, but the result can also be affected by settings declared in other DoFns that run in the same mapper or reducer.

From: Luke Hansen [mailto:luke@wealthfront.com]
Sent: Tuesday, October 13, 2015 3:52 PM
To: user@crunch.apache.org
Subject: Hadoop Configuration from DoFn

Does anyone know if this is the right way to configure Hadoop from a Crunch DoFn?  This didn't seem to affect anything.

Thanks!

@Override
public void configure(Configuration conf) {
  conf.set("mapreduce.map.java.opts", "-Xmx3900m");
  conf.set("mapreduce.reduce.java.opts", "-Xmx3900m");

  conf.set("mapreduce.map.memory.mb", "4096");
  conf.set("mapreduce.reduce.memory.mb", "4096");
}

This email is intended only for the use of the individual(s) to whom it is addressed. If you have received this communication in error, please immediately notify the sender and delete the original email.