Posted to mapreduce-user@hadoop.apache.org by "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com> on 2011/07/15 00:14:29 UTC

Issue with MR code not scaling correctly with data sizes

All,

I have an MR program that takes a list of IDs and generates the unique comparison set as a result. Example: if I have the list {1,2,3,4,5} then the resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, i.e. (n^2 - n)/2 comparisons. My code works just fine on smaller sets (I can verify fewer than 1000 IDs fairly easily) but fails when I try to push the set to 10-20k IDs, which is annoying when the end goal is 1-10 million.

The flow of the program is:
        1) Partition the IDs evenly, based on the amount of output per value, into a set of keys equal to the number of reduce slots we currently have
        2) Use the distributed cache to push the ID file out to the various reducers
        3) In the setup of the reducer, populate an int array with the values from the ID file in the distributed cache
        4) Output a comparison only if the current ID from the values iterator is greater than the current value from the int array (a rough sketch of the reducer side follows this list)
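
To make steps 3 and 4 concrete, here is a minimal sketch of what the reducer side might look like, assuming the org.apache.hadoop.mapreduce API and the old DistributedCache helper that were current at the time. The class name, the one-ID-per-line cache file format, and the "AxB" output key are simplified stand-ins, not the real code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IdComparisonReducer
    extends Reducer<IntWritable, IntWritable, Text, NullWritable> {

  private int[] ids;  // full ID list, loaded once per reducer from the distributed cache

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Assumes the driver registered the ID file with DistributedCache.addCacheFile(...).
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    List<Integer> loaded = new ArrayList<Integer>();
    BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        loaded.add(Integer.parseInt(line.trim()));  // assumes one ID per line
      }
    } finally {
      reader.close();
    }
    ids = new int[loaded.size()];
    for (int i = 0; i < ids.length; i++) {
      ids[i] = loaded.get(i);
    }
  }

  @Override
  protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    for (IntWritable value : values) {
      int current = value.get();
      // Emit each unordered pair exactly once by pairing only with strictly smaller IDs.
      for (int other : ids) {
        if (current > other) {
          context.write(new Text(current + "x" + other), NullWritable.get());
        }
      }
    }
  }
}

Holding the IDs in a primitive int array rather than boxed Integers keeps the per-reducer footprint to roughly 4 bytes per ID, which matters once the list reaches millions of entries.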

I realize that this could be done many other ways, but this will be part of an Oozie workflow so it made sense to just do it in MR for now. My issue is that when I try the larger ID files it only outputs part of the resulting data set, and there are no errors to be found. Part of me thinks that I need to tweak some site configuration properties, due to the size of the data that is spilling to disk, but after scanning through all three *-site.xml files I am having issues pinpointing anything that I think could be causing this. I moved from reading the file from HDFS to using the distributed cache for the join read, thinking that might solve my problem, but there seems to be something else I am overlooking.
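
For reference, the driver-side half of the distributed cache move is only a couple of lines; the sketch below again assumes the 0.20-era API, and the HDFS path is a placeholder rather than the real location of the ID file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ComparisonJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "id-comparisons");
    job.setJarByClass(ComparisonJobDriver.class);
    // Ship the ID file to every task node; the path here is a placeholder.
    DistributedCache.addCacheFile(new Path("/user/matt/ids.txt").toUri(), job.getConfiguration());
    // ... input/output formats, mapper, reducer, etc. set as usual ...
  }
}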

Any advice is greatly appreciated!

Matt
This e-mail message may contain privileged and/or confidential information, and is intended to be received only by persons entitled
to receive such information. If you have received this e-mail in error, please notify the sender immediately. Please delete it and
all attachments from any servers, hard drives or any other media. Other use of this e-mail by you is strictly prohibited.

All e-mails and attachments sent and received are subject to monitoring, reading and archival by Monsanto, including its
subsidiaries. The recipient of this e-mail is solely responsible for checking for the presence of "Viruses" or other "Malware".
Monsanto, along with its subsidiaries, accepts no liability for any damage caused by any such code transmitted by or accompanying
this e-mail or any attachment.


The information contained in this email may be subject to the export control laws and regulations of the United States, potentially
including but not limited to the Export Administration Regulations (EAR) and sanctions regulations issued by the U.S. Department of
Treasury, Office of Foreign Asset Controls (OFAC).  As a recipient of this information you are obligated to comply with all
applicable U.S. export laws and regulations.


RE: Issue with MR code not scaling correctly with data sizes

Posted by "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com>.
Bobby,

I am sorry for the cross-post; I didn't realize that common-user was BCC'ed. Won't do it again :)

This morning I was able to resolve the issue after having a talk with our admin. It turns out that changing configuration parameters around the data directories without letting the devs know is a bad thing! Thank you again for the questions, as the counters confirmed for me that the job actually was outputting all of my data.

Matt

From: Robert Evans [mailto:evans@yahoo-inc.com]
Sent: Friday, July 15, 2011 9:56 AM
To: mapreduce-user@hadoop.apache.org
Cc: GOEKE, MATTHEW [AG/1000]
Subject: Re: Issue with MR code not scaling correctly with data sizes

Please don't cross-post. I put common-user in BCC.

I really don't know for sure what is happening, especially without the code or more to go on, and debugging something remotely over e-mail is extremely difficult. You are essentially doing a cross join, which is going to be very expensive no matter what you do. But I do have a few questions for you.

  1.  How large are the ID file(s) you are using?  Have you updated the amount of heap the JVM has, and the number of slots, to accommodate it?
  2.  How are you storing the IDs in RAM to do the join?
  3.  Have you tried adding logging to your map/reduce code to verify that the number of entries being loaded at each stage is what you expect?
  4.  Along with that, have you looked at the counters for your map/reduce program to verify that the number of records flowing through the system is what you expect?

--Bobby

On 7/14/11 5:14 PM, "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com> wrote:
All,

I have an MR program that takes a list of IDs and generates the unique comparison set as a result. Example: if I have the list {1,2,3,4,5} then the resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, i.e. (n^2 - n)/2 comparisons. My code works just fine on smaller sets (I can verify fewer than 1000 IDs fairly easily) but fails when I try to push the set to 10-20k IDs, which is annoying when the end goal is 1-10 million.

The flow of the program is:
        1) Partition the IDs evenly, based on the amount of output per value, into a set of keys equal to the number of reduce slots we currently have
        2) Use the distributed cache to push the ID file out to the various reducers
        3) In the setup of the reducer, populate an int array with the values from the ID file in the distributed cache
        4) Output a comparison only if the current ID from the values iterator is greater than the current value from the int array

I realize that this could be done many other ways, but this will be part of an Oozie workflow so it made sense to just do it in MR for now. My issue is that when I try the larger ID files it only outputs part of the resulting data set, and there are no errors to be found. Part of me thinks that I need to tweak some site configuration properties, due to the size of the data that is spilling to disk, but after scanning through all three *-site.xml files I am having issues pinpointing anything that I think could be causing this. I moved from reading the file from HDFS to using the distributed cache for the join read, thinking that might solve my problem, but there seems to be something else I am overlooking.

Any advice is greatly appreciated!

Matt


Re: Issue with MR code not scaling correctly with data sizes

Posted by Robert Evans <ev...@yahoo-inc.com>.
Please don't cross-post. I put common-user in BCC.

I really don't know for sure what is happening, especially without the code or more to go on, and debugging something remotely over e-mail is extremely difficult. You are essentially doing a cross join, which is going to be very expensive no matter what you do. But I do have a few questions for you.


 1.  How large are the ID file(s) you are using?  Have you updated the amount of heap the JVM has, and the number of slots, to accommodate it?
 2.  How are you storing the IDs in RAM to do the join?
 3.  Have you tried adding logging to your map/reduce code to verify that the number of entries being loaded at each stage is what you expect?
 4.  Along with that, have you looked at the counters for your map/reduce program to verify that the number of records flowing through the system is what you expect?  (A sketch of this kind of counter follows this list.)
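
Picking up the reducer sketch from the first message in this thread, a custom counter for question 4 might look like the fragment below; the enum and counter names are illustrative, not taken from the actual job.

// Added inside the IdComparisonReducer sketch above; counter names are hypothetical.
public static enum ComparisonCounters { PAIRS_EMITTED }

@Override
protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  for (IntWritable value : values) {
    int current = value.get();
    for (int other : ids) {
      if (current > other) {
        context.write(new Text(current + "x" + other), NullWritable.get());
        // The sum of this counter across all reducers should come out to
        // (n^2 - n) / 2 if no records are being dropped along the way.
        context.getCounter(ComparisonCounters.PAIRS_EMITTED).increment(1);
      }
    }
  }
}

For question 1, in 0.20-era Hadoop the task JVM heap is set via mapred.child.java.opts (e.g. -Xmx1024m) and the reduce slots per node via mapred.tasktracker.reduce.tasks.maximum, both in mapred-site.xml.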

--Bobby

On 7/14/11 5:14 PM, "GOEKE, MATTHEW (AG/1000)" <ma...@monsanto.com> wrote:

All,

I have an MR program that takes a list of IDs and generates the unique comparison set as a result. Example: if I have the list {1,2,3,4,5} then the resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1}, i.e. (n^2 - n)/2 comparisons. My code works just fine on smaller sets (I can verify fewer than 1000 IDs fairly easily) but fails when I try to push the set to 10-20k IDs, which is annoying when the end goal is 1-10 million.

The flow of the program is:
        1) Partition the IDs evenly, based on the amount of output per value, into a set of keys equal to the number of reduce slots we currently have
        2) Use the distributed cache to push the ID file out to the various reducers
        3) In the setup of the reducer, populate an int array with the values from the ID file in the distributed cache
        4) Output a comparison only if the current ID from the values iterator is greater than the current value from the int array

I realize that this could be done many other ways, but this will be part of an Oozie workflow so it made sense to just do it in MR for now. My issue is that when I try the larger ID files it only outputs part of the resulting data set, and there are no errors to be found. Part of me thinks that I need to tweak some site configuration properties, due to the size of the data that is spilling to disk, but after scanning through all three *-site.xml files I am having issues pinpointing anything that I think could be causing this. I moved from reading the file from HDFS to using the distributed cache for the join read, thinking that might solve my problem, but there seems to be something else I am overlooking.

Any advice is greatly appreciated!

Matt


