Posted to user@hbase.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2014/04/14 13:50:45 UTC

How to generate a large dataset quickly.

I want to create a large dataset for HBase with different numbers of rows and
versions: about 10M rows with 100 versions each, to run some benchmarks.

What's the fastest way to create it? I'm generating the dataset with a
MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and comes
to around 7 GB. I don't know whether I could do it more quickly. The
bottleneck is when the mappers write their output and when that output is
transferred to the reducers.
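
(For context, each row gets its versions by giving every Put an explicit
timestamp. A simplified sketch of the idea, not the exact job; the table and
column family names are placeholders, and the family has to be created with a
high enough VERSIONS setting to keep them all:)

// Write several versions of one cell per row through the plain client API.
Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "benchmark_table");
byte[] family = Bytes.toBytes("f");      // family created with e.g. VERSIONS => 100
byte[] qualifier = Bytes.toBytes("q");
long baseTs = System.currentTimeMillis();
int numRows = 100000;
int numVersions = 10;
for (int row = 0; row < numRows; row++) {
    byte[] rowKey = Bytes.toBytes(String.format("row-%010d", row));
    for (int version = 0; version < numVersions; version++) {
        Put put = new Put(rowKey, baseTs + version);  // explicit timestamp per version
        put.add(family, qualifier, Bytes.toBytes("v" + version));
        table.put(put);
    }
}
table.close();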

Re: How to generate a large dataset quickly.

Posted by lars hofhansl <la...@apache.org>.
That is correct.
What Vladimir (I think) and I are suggesting is to go through the HBase front door (the normal client API). Only if that does not work or is too slow, use M/R.
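
A rough sketch of that path with the 0.94-era client API (the table name,
column family, and row counts are invented for illustration; one HTable per
thread, since HTable instances are not thread-safe):

// Several writer threads in one JVM, each with its own HTable instance.
final Configuration conf = HBaseConfiguration.create();
final int numThreads = 8;
final long rowsPerThread = 10000000L / numThreads;

ExecutorService pool = Executors.newFixedThreadPool(numThreads);
for (int t = 0; t < numThreads; t++) {
    final int threadId = t;
    pool.submit(new Runnable() {
        public void run() {
            try {
                HTable table = new HTable(conf, "benchmark_table");
                table.setAutoFlush(false);        // buffer puts client-side
                for (long i = 0; i < rowsPerThread; i++) {
                    Put put = new Put(Bytes.toBytes("t" + threadId + "-" + i));
                    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
                    table.put(put);               // lands in the write buffer
                }
                table.flushCommits();             // send whatever is still buffered
                table.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    });
}
pool.shutdown();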

-- Lars


________________________________
 From: Guillermo Ortiz <ko...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Monday, April 14, 2014 2:06 PM
Subject: Re: How to generate a large dataset quickly.
 

But if I'm using bulkLoad, I think this method bypasses the WAL, right?
I have no idea about autoFlush: is it still necessary to set it to false,
or does the bulkload do some kind of magic with that as well?

I could try to do the loads without bulkLoad, but I don't think that's the
problem; maybe it's just the time the cluster needs, although it seems like
too much time.




2014-04-14 22:51 GMT+02:00 lars hofhansl <la...@apache.org>:

> +1 to what Vladimir said.
> For the Puts in question you can also disable the write ahead log (WAL)
> and issue a flush on the table after your ingest.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Vladimir Rodionov <vr...@carrieriq.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Monday, April 14, 2014 11:15 AM
> Subject: RE: How to generate a large dataset quickly.
>
> There is no need to run M/R unless your cluster is large (very large)
> Single multithreaded client can easily ingest 10s of thousands rows per
> sec.
> Check YCSB benchmark tool, for example.
>
> Make sure you disable both region splitting and major compaction during
> data ingestion
> and pre-split regions accordingly to improve overall performance.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
>
> From: Ted Yu [yuzhihong@gmail.com]
> Sent: Monday, April 14, 2014 9:16 AM
> To: user@hbase.apache.org
> Subject: Re: How to generate a large dataset quickly.
>
> I looked at revision history for HFileOutputFormat.java
> There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clue.
>
> BTW Is upgrading to newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you ?
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <konstt2000@gmail.com
> >wrote:
>
> > I'm using 0.94.6-cdh4.4.0.
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems to take a really long time when it starts to execute the Puts
> > to HBase in the reduce phase.
> >
> >
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:
> >
> > > Which hbase release did you run mapreduce job ?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com>
> > wrote:
> > >
> > > > I want to create a large dataset for HBase with different numbers of
> > > > rows and versions: about 10M rows with 100 versions each, to run some
> > > > benchmarks.
> > > >
> > > > What's the fastest way to create it? I'm generating the dataset with
> > > > a MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes
> > > > and comes to around 7 GB. I don't know whether I could do it more
> > > > quickly. The bottleneck is when the mappers write their output and
> > > > when that output is transferred to the reducers.
> > >
> >
>

Re: How to generate a large dataset quickly.

Posted by Guillermo Ortiz <ko...@gmail.com>.
But if I'm using bulkLoad, I think this method bypasses the WAL, right?
I have no idea about autoFlush: is it still necessary to set it to false,
or does the bulkload do some kind of magic with that as well?

I could try to do the loads without bulkLoad, but I don't think that's the
problem; maybe it's just the time the cluster needs, although it seems like
too much time.



2014-04-14 22:51 GMT+02:00 lars hofhansl <la...@apache.org>:

> +1 to what Vladimir said.
> For the Puts in question you can also disable the write ahead log (WAL)
> and issue a flush on the table after your ingest.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Vladimir Rodionov <vr...@carrieriq.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Monday, April 14, 2014 11:15 AM
> Subject: RE: How to generate a large dataset quickly.
>
> There is no need to run M/R unless your cluster is large (very large)
> Single multithreaded client can easily ingest 10s of thousands rows per
> sec.
> Check YCSB benchmark tool, for example.
>
> Make sure you disable both region splitting and major compaction during
> data ingestion
> and pre-split regions accordingly to improve overall performance.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodionov@carrieriq.com
>
> ________________________________________
>
> From: Ted Yu [yuzhihong@gmail.com]
> Sent: Monday, April 14, 2014 9:16 AM
> To: user@hbase.apache.org
> Subject: Re: How to generate a large dataset quickly.
>
> I looked at revision history for HFileOutputFormat.java
> There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clue.
>
> BTW Is upgrading to newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you ?
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <konstt2000@gmail.com
> >wrote:
>
> > I'm using 0.94.6-cdh4.4.0.
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems to take a really long time when it starts to execute the Puts
> > to HBase in the reduce phase.
> >
> >
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:
> >
> > > Which hbase release did you run mapreduce job ?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com>
> > wrote:
> > >
> > > > I want to create a large dataset for HBase with different numbers of
> > > > rows and versions: about 10M rows with 100 versions each, to run some
> > > > benchmarks.
> > > >
> > > > What's the fastest way to create it? I'm generating the dataset with
> > > > a MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes
> > > > and comes to around 7 GB. I don't know whether I could do it more
> > > > quickly. The bottleneck is when the mappers write their output and
> > > > when that output is transferred to the reducers.
> > >
> >
>

Re: How to generate a large dataset quickly.

Posted by lars hofhansl <la...@apache.org>.
+1 to what Vladimir said.
For the Puts in question you can also disable the write ahead log (WAL) and issue a flush on the table after your ingest.
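
Roughly, with the 0.94-era client API (a sketch; the table, family, and row
key below are placeholders):

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "benchmark_table");
table.setAutoFlush(false);

Put put = new Put(Bytes.toBytes("row-0000000001"));
put.setWriteToWAL(false);                    // skip the WAL for this Put
put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
table.put(put);
// ... many more puts ...
table.flushCommits();
table.close();

// With the WAL skipped, nothing is durable until the memstores are flushed,
// so force a flush of the table once the ingest is done.
HBaseAdmin admin = new HBaseAdmin(conf);
admin.flush("benchmark_table");
admin.close();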

-- Lars


----- Original Message -----
From: Vladimir Rodionov <vr...@carrieriq.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Monday, April 14, 2014 11:15 AM
Subject: RE: How to generate a large dataset quickly.

There is no need to run M/R unless your cluster is large (very large)
Single multithreaded client can easily ingest 10s of thousands rows per sec.
Check YCSB benchmark tool, for example.

Make sure you disable both region splitting and major compaction during data ingestion
and pre-split regions accordingly to improve overall performance.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________

From: Ted Yu [yuzhihong@gmail.com]
Sent: Monday, April 14, 2014 9:16 AM
To: user@hbase.apache.org
Subject: Re: How to generate a large dataset quickly.

I looked at revision history for HFileOutputFormat.java
There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
affect throughput much.

If you can use ganglia (or some similar tool) to pinpoint what caused the
low ingest rate, that would give us more clue.

BTW Is upgrading to newer release, such as 0.98.1 (which contains
HBASE-8755), an option for you ?

Cheers


On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <ko...@gmail.com>wrote:

> I'm using 0.94.6-cdh4.4.0.
>
> I use the bulkload:
> FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> FileOutputFormat.setOutputPath(job, hbasePath);
> HTable table = new HTable(jConf, HBASE_TABLE);
> HFileOutputFormat.configureIncrementalLoad(job, table);
>
> It seems to take a really long time when it starts to execute the Puts
> to HBase in the reduce phase.
>
>
>
> 2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:
>
> > Which hbase release did you run mapreduce job ?
> >
> > Cheers
> >
> > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com>
> wrote:
> >
> > > I want to create a large dataset for HBase with different numbers of
> > > rows and versions: about 10M rows with 100 versions each, to run some
> > > benchmarks.
> > >
> > > What's the fastest way to create it? I'm generating the dataset with a
> > > MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and
> > > comes to around 7 GB. I don't know whether I could do it more quickly.
> > > The bottleneck is when the mappers write their output and when that
> > > output is transferred to the reducers.
> >
>


RE: How to generate a large dataset quickly.

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
There is no need to run M/R unless your cluster is large (very large).
A single multithreaded client can easily ingest tens of thousands of rows per
second. Check the YCSB benchmark tool, for example.

Make sure you disable both region splitting and major compaction during data
ingestion, and pre-split regions accordingly, to improve overall performance.
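
A sketch of that table setup with the 0.94-era admin API (the table name,
family, key range, and region count are all invented for illustration):

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());

HTableDescriptor desc = new HTableDescriptor("benchmark_table");
HColumnDescriptor family = new HColumnDescriptor("f");
family.setMaxVersions(100);                  // keep up to 100 versions per cell
desc.addFamily(family);

// Discourage splitting during the load by raising the split threshold far
// beyond the expected data size. Periodic major compactions are disabled
// cluster-wide with hbase.hregion.majorcompaction = 0 in hbase-site.xml.
desc.setMaxFileSize(100L * 1024 * 1024 * 1024);

// Pre-split into 32 regions across the expected key range so every region
// server takes writes from the start.
admin.createTable(desc,
    Bytes.toBytes("row-0000000000"),
    Bytes.toBytes("row-0009999999"),
    32);
admin.close();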

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodionov@carrieriq.com

________________________________________
From: Ted Yu [yuzhihong@gmail.com]
Sent: Monday, April 14, 2014 9:16 AM
To: user@hbase.apache.org
Subject: Re: How to generate a large dataset quickly.

I looked at revision history for HFileOutputFormat.java
There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
affect throughput much.

If you can use ganglia (or some similar tool) to pinpoint what caused the
low ingest rate, that would give us more clue.

BTW Is upgrading to newer release, such as 0.98.1 (which contains
HBASE-8755), an option for you ?

Cheers


On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <ko...@gmail.com>wrote:

> I'm using 0.94.6-cdh4.4.0.
>
> I use the bulkload:
> FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> FileOutputFormat.setOutputPath(job, hbasePath);
> HTable table = new HTable(jConf, HBASE_TABLE);
> HFileOutputFormat.configureIncrementalLoad(job, table);
>
> It seems to take a really long time when it starts to execute the Puts
> to HBase in the reduce phase.
>
>
>
> 2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:
>
> > Which hbase release did you run mapreduce job ?
> >
> > Cheers
> >
> > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com>
> wrote:
> >
> > > I want to create a large dataset for HBase with different numbers of
> > > rows and versions: about 10M rows with 100 versions each, to run some
> > > benchmarks.
> > >
> > > What's the fastest way to create it? I'm generating the dataset with a
> > > MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and
> > > comes to around 7 GB. I don't know whether I could do it more quickly.
> > > The bottleneck is when the mappers write their output and when that
> > > output is transferred to the reducers.
> >
>


Re: How to generate a large dataset quickly.

Posted by Doug Meil <do...@explorysmedical.com>.
 
re:  "So, I execute 3.2Mill of Put¹s in HBase."

There will be 3.2 million Puts, but they won¹t be sent over 1 at a time if
autoFlush on Htable is false.  By default, htable should be using a 2mb
write buffer, and then it groups the Puts by RegionServer.
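
In client code that looks something like this (0.94-era API; the table name
is a placeholder):

HTable table = new HTable(HBaseConfiguration.create(), "benchmark_table");
table.setAutoFlush(false);                   // buffer puts instead of sending each one
table.setWriteBufferSize(8 * 1024 * 1024);   // optional: raise the 2 MB default
// ... table.put(...) calls accumulate in the client-side write buffer and are
//     shipped in batches, grouped by region server ...
table.flushCommits();                        // send whatever is still buffered
table.close();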






On 4/14/14, 2:21 PM, "Guillermo Ortiz" <ko...@gmail.com> wrote:

>Are there any benchmarks on how long it should take to insert data into
>HBase, to have a reference?
>The output of my Mapper is 3.2 million records, so I execute 3.2 million
>Puts in HBase.
>
>Well, the data has to be copied and sent to the reducers, but with a 1 Gb
>network it shouldn't take too much time. I'll check Ganglia.
>
>
>2014-04-14 18:16 GMT+02:00 Ted Yu <yu...@gmail.com>:
>
>> I looked at revision history for HFileOutputFormat.java
>> There was one patch, HBASE-8949, which went into 0.94.11 but it
>>shouldn't
>> affect throughput much.
>>
>> If you can use ganglia (or some similar tool) to pinpoint what caused
>>the
>> low ingest rate, that would give us more clue.
>>
>> BTW Is upgrading to newer release, such as 0.98.1 (which contains
>> HBASE-8755), an option for you ?
>>
>> Cheers
>>
>>
>> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <konstt2000@gmail.com
>> >wrote:
>>
>> > I'm using 0.94.6-cdh4.4.0.
>> >
>> > I use the bulkload:
>> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
>> > FileOutputFormat.setOutputPath(job, hbasePath);
>> > HTable table = new HTable(jConf, HBASE_TABLE);
>> > HFileOutputFormat.configureIncrementalLoad(job, table);
>> >
>> > It seems to take a really long time when it starts to execute the Puts
>> > to HBase in the reduce phase.
>> >
>> >
>> >
>> > 2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:
>> >
>> > > Which hbase release did you run mapreduce job ?
>> > >
>> > > Cheers
>> > >
>> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com>
>> > wrote:
>> > >
>> > > > I want to create a large dataset for HBase with different numbers of
>> > > > rows and versions: about 10M rows with 100 versions each, to run
>> > > > some benchmarks.
>> > > >
>> > > > What's the fastest way to create it? I'm generating the dataset with
>> > > > a MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes
>> > > > and comes to around 7 GB. I don't know whether I could do it more
>> > > > quickly. The bottleneck is when the mappers write their output and
>> > > > when that output is transferred to the reducers.
>> > >
>> >
>>


Re: How to generate a large dataset quickly.

Posted by Guillermo Ortiz <ko...@gmail.com>.
Are there any benchmarks on how long it should take to insert data into
HBase, to have a reference?
The output of my Mapper is 3.2 million records, so I execute 3.2 million Puts
in HBase.

Well, the data has to be copied and sent to the reducers, but with a 1 Gb
network it shouldn't take too much time. I'll check Ganglia.
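
(As a rough sanity check, assuming the whole ~7 GB of output crosses a
1 Gbit/s link once: 7 GB * 8 bits/byte / 1 Gbit/s is about 56 seconds, so the
raw transfer alone should account for well under a minute of the 17.)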


2014-04-14 18:16 GMT+02:00 Ted Yu <yu...@gmail.com>:

> I looked at revision history for HFileOutputFormat.java
> There was one patch, HBASE-8949, which went into 0.94.11 but it shouldn't
> affect throughput much.
>
> If you can use ganglia (or some similar tool) to pinpoint what caused the
> low ingest rate, that would give us more clue.
>
> BTW Is upgrading to newer release, such as 0.98.1 (which contains
> HBASE-8755), an option for you ?
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <konstt2000@gmail.com
> >wrote:
>
> > I'm using 0.94.6-cdh4.4.0.
> >
> > I use the bulkload:
> > FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> > FileOutputFormat.setOutputPath(job, hbasePath);
> > HTable table = new HTable(jConf, HBASE_TABLE);
> > HFileOutputFormat.configureIncrementalLoad(job, table);
> >
> > It seems to take a really long time when it starts to execute the Puts
> > to HBase in the reduce phase.
> >
> >
> >
> > 2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:
> >
> > > Which hbase release did you run mapreduce job ?
> > >
> > > Cheers
> > >
> > > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com>
> > wrote:
> > >
> > > > I want to create a large dataset for HBase with different numbers of
> > > > rows and versions: about 10M rows with 100 versions each, to run some
> > > > benchmarks.
> > > >
> > > > What's the fastest way to create it? I'm generating the dataset with
> > > > a MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes
> > > > and comes to around 7 GB. I don't know whether I could do it more
> > > > quickly. The bottleneck is when the mappers write their output and
> > > > when that output is transferred to the reducers.
> > >
> >
>

Re: How to generate a large dataset quickly.

Posted by Ted Yu <yu...@gmail.com>.
I looked at the revision history for HFileOutputFormat.java.
There was one patch, HBASE-8949, which went into 0.94.11, but it shouldn't
affect throughput much.

If you can use Ganglia (or some similar tool) to pinpoint what caused the
low ingest rate, that would give us more clues.

BTW, is upgrading to a newer release, such as 0.98.1 (which contains
HBASE-8755), an option for you?

Cheers


On Mon, Apr 14, 2014 at 5:41 AM, Guillermo Ortiz <ko...@gmail.com>wrote:

> I'm using 0.94.6-cdh4.4.0.
>
> I use the bulkload:
> FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
> FileOutputFormat.setOutputPath(job, hbasePath);
> HTable table = new HTable(jConf, HBASE_TABLE);
> HFileOutputFormat.configureIncrementalLoad(job, table);
>
> It seems to take a really long time when it starts to execute the Puts
> to HBase in the reduce phase.
>
>
>
> 2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:
>
> > Which hbase release did you run mapreduce job ?
> >
> > Cheers
> >
> > On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com>
> wrote:
> >
> > > I want to create a large dataset for HBase with different numbers of
> > > rows and versions: about 10M rows with 100 versions each, to run some
> > > benchmarks.
> > >
> > > What's the fastest way to create it? I'm generating the dataset with a
> > > MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and
> > > comes to around 7 GB. I don't know whether I could do it more quickly.
> > > The bottleneck is when the mappers write their output and when that
> > > output is transferred to the reducers.
> >
>

Re: How to generate a large dataset quickly.

Posted by Guillermo Ortiz <ko...@gmail.com>.
I'm using 0.94.6-cdh4.4.0.

I use the bulk load:
FileInputFormat.addInputPath(job, new Path(INPUT_FOLDER));
FileOutputFormat.setOutputPath(job, hbasePath);
HTable table = new HTable(jConf, HBASE_TABLE);
HFileOutputFormat.configureIncrementalLoad(job, table);

It seems to take a really long time when it starts to execute the Puts
to HBase in the reduce phase.
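
For completeness: a job configured with configureIncrementalLoad() only
writes HFiles under hbasePath; a step along these lines (a sketch using
org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles, with variable names
matching the snippet above) is still needed to move them into the table:

if (job.waitForCompletion(true)) {
    // Moves the generated HFiles into the table's regions.
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(jConf);
    loader.doBulkLoad(hbasePath, table);
}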



2014-04-14 14:35 GMT+02:00 Ted Yu <yu...@gmail.com>:

> Which hbase release did you run mapreduce job ?
>
> Cheers
>
> On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com> wrote:
>
> > I want to create a large dataset for HBase with different numbers of rows
> > and versions: about 10M rows with 100 versions each, to run some
> > benchmarks.
> >
> > What's the fastest way to create it? I'm generating the dataset with a
> > MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and
> > comes to around 7 GB. I don't know whether I could do it more quickly.
> > The bottleneck is when the mappers write their output and when that
> > output is transferred to the reducers.
>

Re: How to generate a large dataset quickly.

Posted by Ted Yu <yu...@gmail.com>.
Which HBase release did you run the MapReduce job on?

Cheers

On Apr 14, 2014, at 4:50 AM, Guillermo Ortiz <ko...@gmail.com> wrote:

> I want to create a large dataset for HBase with different numbers of rows
> and versions: about 10M rows with 100 versions each, to run some benchmarks.
>
> What's the fastest way to create it? I'm generating the dataset with a
> MapReduce job of 100,000 rows and 10 versions. It takes 17 minutes and comes
> to around 7 GB. I don't know whether I could do it more quickly. The
> bottleneck is when the mappers write their output and when that output is
> transferred to the reducers.