Posted to user@phoenix.apache.org by "Cox, Jonathan A" <ja...@sandia.gov> on 2015/12/18 01:46:41 UTC

Java Out of Memory Errors with CsvBulkLoadTool

I am trying to ingest a 575MB CSV file with 192,444 lines using the CsvBulkLoadTool MapReduce job. When running this job, I find that I have to boost the max Java heap space to 48GB (24GB fails with Java out of memory errors).
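
(For context, the job in question is launched with the standard
CsvBulkLoadTool invocation, roughly of this form; the jar name, table
name, and input path below are placeholders:)

  hadoop jar phoenix-<version>-client.jar \
      org.apache.phoenix.mapreduce.CsvBulkLoadTool \
      --table EXAMPLE_TABLE \
      --input /data/example.csv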

I'm concerned about scaling issues. It seems like it shouldn't require between 24-48GB of memory to ingest a 575MB file. However, I am pretty new to Hadoop/HBase/Phoenix, so maybe I am off base here.

Can anybody comment on this observation?

Thanks,
Jonathan

Re: Java Out of Memory Errors with CsvBulkLoadTool

Posted by Gabriel Reid <ga...@gmail.com>.
On Fri, Dec 18, 2015 at 4:31 PM, Riesland, Zack
<Za...@sensus.com> wrote:
> We are able to ingest MUCH larger sets of data (hundreds of GB) using the CSVBulkLoadTool.
>
> However, we have found it to be a huge memory hog.
>
> We dug into the source a bit and found that HFileOutputFormat.configureIncrementalLoad(), via TotalOrderPartitioner and KeyValueSortReducer, ultimately keeps a TreeSet of all the key/value pairs before finally writing the HFiles.
>
> So if the size of your data exceeds the memory allocated on the client calling the MapReduce job, it will eventually fail.


I think (or at least hope!) that the situation isn't quite as bad as that.

The HFileOutputFormat.configureIncrementalLoad call will load the
start keys of all regions and configure them for use by the
TotalOrderPartitioner. The memory used for this grows with the number
of regions in the output table.

The KeyValueSortReducer does indeed use a TreeSet to store KeyValues,
but there is one TreeSet per distinct row key. The size of each
TreeSet grows with the number of columns per row. The memory usage is
typically higher than expected because each single column value is
stored as a separate KeyValue, and each KeyValue carries a full copy
of the row key of its row.
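
For reference, here is a simplified sketch (paraphrased from memory,
not the actual HBase source; the class name below is made up) of the
reducer pattern being described:

import java.io.IOException;
import java.util.TreeSet;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the sort-reducer pattern: all KeyValues that belong to one
// row key are buffered in a sorted set before being written out, so
// per-row memory grows with the number of columns, and every KeyValue
// carries its own copy of the full row key.
public class RowKeySortSketch
    extends Reducer<ImmutableBytesWritable, KeyValue,
                    ImmutableBytesWritable, KeyValue> {

  @Override
  protected void reduce(ImmutableBytesWritable row, Iterable<KeyValue> kvs,
      Context context) throws IOException, InterruptedException {
    // One TreeSet per distinct row key, ordered the way HFiles expect.
    TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (KeyValue kv : kvs) {
      // Deep-copy: Hadoop reuses the KeyValue instance across iterations.
      byte[] copy = new byte[kv.getLength()];
      System.arraycopy(kv.getBuffer(), kv.getOffset(), copy, 0, kv.getLength());
      sorted.add(new KeyValue(copy, 0, copy.length));
    }
    for (KeyValue kv : sorted) {
      context.write(row, kv);
    }
  }
}

The important point is that the TreeSet only ever holds the KeyValues
of a single row, so memory per reduce call is bounded by the widest
row rather than by the whole data set.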

- Gabriel

RE: Java Out of Memory Errors with CsvBulkLoadTool

Posted by "Riesland, Zack" <Za...@sensus.com>.
We are able to ingest MUCH larger sets of data (hundreds of GB) using the CSVBulkLoadTool. 

However, we have found it to be a huge memory hog.

We dug into the source a bit and found that HFileOutputFormat.configureIncrementalLoad(), via TotalOrderPartitioner and KeyValueSortReducer, ultimately keeps a TreeSet of all the key/value pairs before finally writing the HFiles.

So if the size of your data exceeds the memory allocated on the client calling the MapReduce job, it will eventually fail.

That said, your data set doesn't seem anywhere near large enough to be an issue.

-----Original Message-----
From: Gabriel Reid [mailto:gabriel.reid@gmail.com] 
Sent: Friday, December 18, 2015 10:17 AM
To: user@phoenix.apache.org
Subject: Re: Java Out of Memory Errors with CsvBulkLoadTool

Hi Jonathan,

Sounds like something is very wrong here.

Are you running the job on an actual cluster, or are you using the local job tracker (i.e. running the import job on a single computer)?

Normally an import job, regardless of the size of the input, should run with map and reduce tasks that have a standard (e.g. 2GB) heap size per task (although there will typically be multiple tasks started on the cluster). There shouldn't be any need to have anything like a 48GB heap.

If you are running this on an actual cluster, could you elaborate on where/how you're setting the 48GB heap size?

- Gabriel


On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> I am trying to ingest a 575MB CSV file with 192,444 lines using the 
> CsvBulkLoadTool MapReduce job. When running this job, I find that I 
> have to boost the max Java heap space to 48GB (24GB fails with Java 
> out of memory errors).
>
>
>
> I’m concerned about scaling issues. It seems like it shouldn’t require 
> between 24-48GB of memory to ingest a 575MB file. However, I am pretty 
> new to Hadoop/HBase/Phoenix, so maybe I am off base here.
>
>
>
> Can anybody comment on this observation?
>
>
>
> Thanks,
>
> Jonathan

Re: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool

Posted by Gabriel Reid <ga...@gmail.com>.
On Fri, Dec 18, 2015 at 9:35 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:

>
> The Hadoop version is 2.6.2.
>

I'm assuming the reduce phase is what's failing with the OOME. Is that correct?

Could you run "jps -v" to see what the full set of JVM parameters are
for the JVM that is running the task that is failing? I can't imagine
a situation where anywhere near 48GB would be needed, so I'm assuming
that this is a question of your memory settings not being correctly
propagated to the task JVMs somehow.
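
For example (the grep pattern is only illustrative; in Hadoop 2.x the
task JVMs usually show up under the YarnChild main class):

  # list all local JVMs together with their full command-line arguments
  jps -v

  # optionally narrow the output down to the MapReduce task JVMs
  jps -v | grep -i yarnchild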

- Gabriel

Re: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool

Posted by "김영우 (Youngwoo Kim)" <wa...@gmail.com>.
Got the same problem here. My bulk load job failed due to a lack of memory
in the reduce phase: 15M rows each day into a Phoenix table with an
additional index.

In the end, I recreated my tables with salting. It helps a lot because the
bulk load job launches the same number of reducers as there are salt
buckets.
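
For reference, salting is declared in the table DDL; a minimal sketch
(table and column names here are made up) looks like this:

  CREATE TABLE EXAMPLE_EVENTS (
      EVENT_ID  BIGINT NOT NULL PRIMARY KEY,
      PAYLOAD   VARCHAR
  ) SALT_BUCKETS = 16;

Since a salted table is pre-split into one region per bucket, the bulk
load job then gets that many reducers to spread the work across.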

But I believe that if you're bulk loading a large dataset, it would still
fail. AFAIK, the current MR-based bulk loading requires a lot of memory for
writing the target files.

- Youngwoo

On Sat, Dec 19, 2015 at 5:35 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:

> Hi Gabriel,
>
> The Hadoop version is 2.6.2.
>
> -Jonathan
>
> -----Original Message-----
> From: Gabriel Reid [mailto:gabriel.reid@gmail.com]
> Sent: Friday, December 18, 2015 11:58 AM
> To: user@phoenix.apache.org
> Subject: Re: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool
>
> Hi Jonathan,
>
> Which Hadoop version are you using? I'm actually wondering if
> mapred.child.java.opts is still supported in Hadoop 2.x (I think it has
> been replaced by mapreduce.map.java.opts and mapreduce.reduce.java.opts).
>
> The HADOOP_CLIENT_OPTS won't make a difference if you're running in
> (pseudo) distributed mode, as separate JVMs will be started up for the
> tasks.
>
> - Gabriel
>
>
> On Fri, Dec 18, 2015 at 7:33 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> > Gabriel,
> >
> > I am running the job on a single machine in pseudo distributed mode.
> I've set the max Java heap size in two different ways (just to be sure):
> >
> > export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx48g"
> >
> > and also in mapred-site.xml:
> >   <property>
> >     <name>mapred.child.java.opts</name>
> >     <value>-Xmx48g</value>
> >   </property>
> >
> > -----Original Message-----
> > From: Gabriel Reid [mailto:gabriel.reid@gmail.com]
> > Sent: Friday, December 18, 2015 8:17 AM
> > To: user@phoenix.apache.org
> > Subject: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool
> >
> > Hi Jonathan,
> >
> > Sounds like something is very wrong here.
> >
> > Are you running the job on an actual cluster, or are you using the local
> job tracker (i.e. running the import job on a single computer)?
> >
> > Normally an import job, regardless of the size of the input, should run
> with map and reduce tasks that have a standard (e.g. 2GB) heap size per
> task (although there will typically be multiple tasks started on the
> cluster). There shouldn't be any need to have anything like a 48GB heap.
> >
> > If you are running this on an actual cluster, could you elaborate on
> where/how you're setting the 48GB heap size?
> >
> > - Gabriel
> >
> >
> > On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov>
> wrote:
> >> I am trying to ingest a 575MB CSV file with 192,444 lines using the
> >> CsvBulkLoadTool MapReduce job. When running this job, I find that I
> >> have to boost the max Java heap space to 48GB (24GB fails with Java
> >> out of memory errors).
> >>
> >>
> >>
> >> I’m concerned about scaling issues. It seems like it shouldn’t
> >> require between 24-48GB of memory to ingest a 575MB file. However, I
> >> am pretty new to Hadoop/HBase/Phoenix, so maybe I am off base here.
> >>
> >>
> >>
> >> Can anybody comment on this observation?
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Jonathan
>

RE: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool

Posted by "Cox, Jonathan A" <ja...@sandia.gov>.
Hi Gabriel,

The Hadoop version is 2.6.2.

-Jonathan

-----Original Message-----
From: Gabriel Reid [mailto:gabriel.reid@gmail.com] 
Sent: Friday, December 18, 2015 11:58 AM
To: user@phoenix.apache.org
Subject: Re: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool

Hi Jonathan,

Which Hadoop version are you using? I'm actually wondering if mapred.child.java.opts is still supported in Hadoop 2.x (I think it has been replaced by mapreduce.map.java.opts and mapreduce.reduce.java.opts).

The HADOOP_CLIENT_OPTS won't make a difference if you're running in
(pseudo) distributed mode, as separate JVMs will be started up for the tasks.

- Gabriel


On Fri, Dec 18, 2015 at 7:33 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> Gabriel,
>
> I am running the job on a single machine in pseudo distributed mode. I've set the max Java heap size in two different ways (just to be sure):
>
> export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx48g"
>
> and also in mapred-site.xml:
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx48g</value>
>   </property>
>
> -----Original Message-----
> From: Gabriel Reid [mailto:gabriel.reid@gmail.com]
> Sent: Friday, December 18, 2015 8:17 AM
> To: user@phoenix.apache.org
> Subject: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool
>
> Hi Jonathan,
>
> Sounds like something is very wrong here.
>
> Are you running the job on an actual cluster, or are you using the local job tracker (i.e. running the import job on a single computer)?
>
> Normally an import job, regardless of the size of the input, should run with map and reduce tasks that have a standard (e.g. 2GB) heap size per task (although there will typically be multiple tasks started on the cluster). There shouldn't be any need to have anything like a 48GB heap.
>
> If you are running this on an actual cluster, could you elaborate on where/how you're setting the 48GB heap size?
>
> - Gabriel
>
>
> On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
>> I am trying to ingest a 575MB CSV file with 192,444 lines using the 
>> CsvBulkLoadTool MapReduce job. When running this job, I find that I 
>> have to boost the max Java heap space to 48GB (24GB fails with Java 
>> out of memory errors).
>>
>>
>>
>> I’m concerned about scaling issues. It seems like it shouldn’t 
>> require between 24-48GB of memory to ingest a 575MB file. However, I 
>> am pretty new to Hadoop/HBase/Phoenix, so maybe I am off base here.
>>
>>
>>
>> Can anybody comment on this observation?
>>
>>
>>
>> Thanks,
>>
>> Jonathan

Re: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Jonathan,

Which Hadoop version are you using? I'm actually wondering if
mapred.child.java.opts is still supported in Hadoop 2.x (I think it
has been replaced by mapreduce.map.java.opts and
mapreduce.reduce.java.opts).
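
For example, the per-task heaps would normally be set with something
like this in mapred-site.xml (the values here are only illustrative):

  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx2048m</value>
  </property>

On YARN, the matching container sizes (mapreduce.map.memory.mb and
mapreduce.reduce.memory.mb) also need to be at least as large as those
heaps.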

The HADOOP_CLIENT_OPTS won't make a difference if you're running in
(pseudo) distributed mode, as separate JVMs will be started up for the
tasks.

- Gabriel


On Fri, Dec 18, 2015 at 7:33 PM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> Gabriel,
>
> I am running the job on a single machine in pseudo distributed mode. I've set the max Java heap size in two different ways (just to be sure):
>
> export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx48g"
>
> and also in mapred-site.xml:
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx48g</value>
>   </property>
>
> -----Original Message-----
> From: Gabriel Reid [mailto:gabriel.reid@gmail.com]
> Sent: Friday, December 18, 2015 8:17 AM
> To: user@phoenix.apache.org
> Subject: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool
>
> Hi Jonathan,
>
> Sounds like something is very wrong here.
>
> Are you running the job on an actual cluster, or are you using the local job tracker (i.e. running the import job on a single computer)?
>
> Normally an import job, regardless of the size of the input, should run with map and reduce tasks that have a standard (e.g. 2GB) heap size per task (although there will typically be multiple tasks started on the cluster). There shouldn't be any need to have anything like a 48GB heap.
>
> If you are running this on an actual cluster, could you elaborate on where/how you're setting the 48GB heap size?
>
> - Gabriel
>
>
> On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
>> I am trying to ingest a 575MB CSV file with 192,444 lines using the
>> CsvBulkLoadTool MapReduce job. When running this job, I find that I
>> have to boost the max Java heap space to 48GB (24GB fails with Java
>> out of memory errors).
>>
>>
>>
>> I’m concerned about scaling issues. It seems like it shouldn’t require
>> between 24-48GB of memory to ingest a 575MB file. However, I am pretty
>> new to Hadoop/HBase/Phoenix, so maybe I am off base here.
>>
>>
>>
>> Can anybody comment on this observation?
>>
>>
>>
>> Thanks,
>>
>> Jonathan

RE: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool

Posted by "Cox, Jonathan A" <ja...@sandia.gov>.
Gabriel,

I am running the job on a single machine in pseudo distributed mode. I've set the max Java heap size in two different ways (just to be sure):

export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx48g"

and also in mapred-site.xml:
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx48g</value>
  </property>

-----Original Message-----
From: Gabriel Reid [mailto:gabriel.reid@gmail.com] 
Sent: Friday, December 18, 2015 8:17 AM
To: user@phoenix.apache.org
Subject: [EXTERNAL] Re: Java Out of Memory Errors with CsvBulkLoadTool

Hi Jonathan,

Sounds like something is very wrong here.

Are you running the job on an actual cluster, or are you using the local job tracker (i.e. running the import job on a single computer)?

Normally an import job, regardless of the size of the input, should run with map and reduce tasks that have a standard (e.g. 2GB) heap size per task (although there will typically be multiple tasks started on the cluster). There shouldn't be any need to have anything like a 48GB heap.

If you are running this on an actual cluster, could you elaborate on where/how you're setting the 48GB heap size?

- Gabriel


On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> I am trying to ingest a 575MB CSV file with 192,444 lines using the 
> CsvBulkLoadTool MapReduce job. When running this job, I find that I 
> have to boost the max Java heap space to 48GB (24GB fails with Java 
> out of memory errors).
>
>
>
> I’m concerned about scaling issues. It seems like it shouldn’t require 
> between 24-48GB of memory to ingest a 575MB file. However, I am pretty 
> new to Hadoop/HBase/Phoenix, so maybe I am off base here.
>
>
>
> Can anybody comment on this observation?
>
>
>
> Thanks,
>
> Jonathan

Re: Java Out of Memory Errors with CsvBulkLoadTool

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Jonathan,

Sounds like something is very wrong here.

Are you running the job on an actual cluster, or are you using the
local job tracker (i.e. running the import job on a single computer)?

Normally an import job, regardless of the size of the input, should
run with map and reduce tasks that have a standard (e.g. 2GB) heap
size per task (although there will typically be multiple tasks started
on the cluster). There shouldn't be any need to have anything like a
48GB heap.

If you are running this on an actual cluster, could you elaborate on
where/how you're setting the 48GB heap size?

- Gabriel


On Fri, Dec 18, 2015 at 1:46 AM, Cox, Jonathan A <ja...@sandia.gov> wrote:
> I am trying to ingest a 575MB CSV file with 192,444 lines using the
> CsvBulkLoadTool MapReduce job. When running this job, I find that I have to
> boost the max Java heap space to 48GB (24GB fails with Java out of memory
> errors).
>
>
>
> I’m concerned about scaling issues. It seems like it shouldn’t require
> between 24-48GB of memory to ingest a 575MB file. However, I am pretty new
> to Hadoop/HBase/Phoenix, so maybe I am off base here.
>
>
>
> Can anybody comment on this observation?
>
>
>
> Thanks,
>
> Jonathan