Posted to user@nutch.apache.org by ha...@hsbc.com on 2018/12/07 14:08:59 UTC

mapred.child.java.opts

Hello,

While checking the Nutch (1.15) crawl bash script, I noticed at line 211 that 1000 MB is statically set for the Java heap: mapred.child.java.opts=-Xmx1000m

Any idea why? Can I change it? What will be the impact?
Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__________________________________________________________________

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.nasr@hsbc.com
__________________________________________________________________
Protect our environment - please only print this if you have to!



-----------------------------------------
SAVE PAPER - THINK BEFORE YOU PRINT!

This E-mail is confidential.  

It may also be legally privileged. If you are not the addressee you may not copy,
forward, disclose or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the sender immediately by
return E-mail.

Internet communications cannot be guaranteed to be timely secure, error or virus-free.
The sender does not accept liability for any errors or omissions.

Re: mapred.child.java.opts

Posted by Lewis John McGibbney <le...@apache.org>.
Hi Hany,
Yes, the parameter is set to 1 GB by default, but it should also be noted that this configuration key was actually deprecated some time ago. Since we are using the 'new' MapReduce API, I suspect we should use `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` instead, so that is something we need to update.
Can you please provide a patch for this and submit it against the 1.x branch?

Now to answer your question: essentially, these configuration parameters enable you to tune the heap size for the child JVMs of map and reduce tasks respectively. In the context of Nutch this might be useful if certain crawl phases consume more heap memory, e.g. parsing. This will ultimately be crawl-specific.
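For illustration, here is a minimal sketch of how the new-API keys might be set in a Hadoop site file (e.g. mapred-site.xml). The 2 GB values are examples only, not recommendations:

```xml
<!-- Hedged example: per-task JVM heap via the new-API keys.
     The -Xmx2g values are illustrative; size them to your crawl. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx2g</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx2g</value>
</property>
```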
HTH
Lewis

RE: mapred.child.java.opts

Posted by ha...@hsbc.com.
Thank you. It is really very helpful.




Re: mapred.child.java.opts

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

> The 1000 MB is a static value; will the crawl bash script respect NUTCH_HEAPSIZE?

Yes, in local mode it will respect the value of the environment variable NUTCH_HEAPSIZE.
More precisely, the script $NUTCH_HOME/bin/nutch, which is called by bin/crawl, respects it.

> How can I set NUTCH_HEAPSIZE?

It's an environment variable. How to set it might depend on the shell you're using.
E.g., for the bash shell:
  % export NUTCH_HEAPSIZE=2048
  % bin/crawl ...
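To make the variable's effect concrete, here is a sketch of the heap-size logic (illustrative, not a verbatim excerpt of bin/nutch; the variable name JAVA_HEAP_MAX is an assumption): the value is interpreted as megabytes and becomes the -Xmx flag of the local JVM.

```shell
# Sketch of how bin/nutch might derive the JVM heap flag (illustrative):
# NUTCH_HEAPSIZE is in megabytes; 1000m is the fallback default.
NUTCH_HEAPSIZE=2048
JAVA_HEAP_MAX="-Xmx${NUTCH_HEAPSIZE:-1000}m"
echo "$JAVA_HEAP_MAX"   # prints -Xmx2048m
```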

Best,
Sebastian



RE: mapred.child.java.opts

Posted by ha...@hsbc.com.
Thank you Sebastian.

I am using standalone Nutch with the crawl command; I didn't install a separate Hadoop cluster.

The 1000 MB is a static value; will the crawl bash script respect NUTCH_HEAPSIZE?
How can I set NUTCH_HEAPSIZE?






Re: mapred.child.java.opts

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

Yes, of course; the comment just one line above even encourages you to do so:

# note that some of the options listed here could be set in the
# corresponding hadoop site xml param file

For most use cases this value is fine. Only if you're using a parsing fetcher with many threads
may you need more Java heap memory. Note that this setting only applies to
(pseudo-)distributed mode (running on Hadoop). In local mode you can set the Java heap size via
the environment variable NUTCH_HEAPSIZE.
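A hedged sketch of the local-mode side of this, for concreteness (the export is real shell; the effect on the launched JVM is illustrative): NUTCH_HEAPSIZE is read as megabytes by the Nutch scripts and has no effect on Hadoop task JVMs.

```shell
# Local (standalone) mode only: NUTCH_HEAPSIZE (in megabytes) sizes the
# heap of the single local JVM; Hadoop task JVMs ignore it.
export NUTCH_HEAPSIZE=4096
# bin/crawl ... would now run with roughly -Xmx4096m (illustrative).
echo "local-mode heap: ${NUTCH_HEAPSIZE}m"
```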


> What will be the impact?

That depends mostly on your Hadoop cluster setup. Afaik, the properties mapreduce.map.java.opts
and mapreduce.reduce.java.opts override mapred.child.java.opts on Hadoop 2.x, so on a recently
configured Hadoop cluster there is usually zero impact.

There is also a Jira issue open to make the heap memory configurable in distributed mode, see
https://issues.apache.org/jira/browse/NUTCH-2501


Best,
Sebastian
