Posted to user@hbase.apache.org by Vivek Krishna <vi...@gmail.com> on 2011/03/25 01:22:21 UTC

Yet another bulk import question

Data size: 20 GB.  The load took about an hour with the default HBase
settings, and after varying several parameters we were able to get it done in
~20 minutes.  This is still slow, and we are trying to improve it.

We wrote a Java client which essentially `put`s to HBase tables in
batches (a minimal sketch follows the list below).  Our tuning parameters
included:
1.  Disabling compaction.
2.  Varying the put batch size (tried 1,000, 5,000, 10,000, 20,000, and
40,000).
3.  Setting autoflush on/off.
4.  Varying the client-side write buffer (2 MB, 128 MB, 256 MB).
5.  Raising hbase.regionserver.handler.count to 100.
6.  Varying the region size from 128 MB to 256/512/1024 MB.
7.  Increasing the number of regions.
8.  Pre-creating regions with pre-specified split keys, so that clients hit
the right regions directly (see the second sketch below).
9.  Varying the number of clients (from 30 to 100).
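
For concreteness, here is a minimal sketch of the batching client against
the 0.90-era client API.  The table name, column family, key format, and
record count are hypothetical placeholders; the batch and buffer sizes shown
are just one combination we tried:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchLoader {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");  // hypothetical table name
        table.setAutoFlush(false);                     // buffer puts client-side
        table.setWriteBufferSize(2L * 1024 * 1024);    // 2 MB client write buffer

        List<Put> batch = new ArrayList<Put>(10000);
        for (long i = 0; i < 20000000L; i++) {         // ~20M x ~1 KB ~= 20 GB
          Put p = new Put(Bytes.toBytes(String.format("row%011d", i)));
          p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), new byte[1024]); // ~1 KB value
          batch.add(p);
          if (batch.size() == 10000) {                 // one batch size we tried
            table.put(batch);  // buffered client-side; flushed as the buffer fills
            batch.clear();
          }
        }
        if (!batch.isEmpty()) table.put(batch);
        table.flushCommits();                          // push anything still buffered
        table.close();
      }
    }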

The above was tested on a 38-node cluster with 2 regions per node.
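
For item 8 above, the pre-splitting was done along the lines of the
following sketch (hedged: the table name, family, and split-key scheme are
hypothetical, and evenly spaced keys only make sense if the row keys are
uniformly distributed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("testtable"); // hypothetical
        desc.addFamily(new HColumnDescriptor("f"));                // hypothetical

        // 75 split keys -> 76 regions, i.e. 38 nodes x 2 regions each.
        byte[][] splits = new byte[75][];
        long step = 20000000L / 76;   // key space matches the loader sketch above
        for (int i = 0; i < 75; i++) {
          splits[i] = Bytes.toBytes(String.format("row%011d", (i + 1) * step));
        }
        admin.createTable(desc, splits);  // table comes up already split
      }
    }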

We did not try disabling the WAL for fear of data loss (a sketch of what
that would involve follows, for reference).
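
Skipping the WAL in this client API is a per-Put flag; a hedged fragment
(the row key and value are hypothetical, and "table" is the HTable from the
batching sketch above):

    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    p.setWriteToWAL(false);  // faster writes, but edits still in the memstore
                             // are lost if the region server crashes
    table.put(p);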

Are there any other parameters that we missed during the process?


Viv

Re: Yet another bulk import question

Posted by Vivek Krishna <vi...@gmail.com>.
Is there a setting that limits or throttles bandwidth on the HBase nodes? I
know there is a knob to set in zoo.cfg to raise the number of incoming
connections.
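
The setting I have in mind is, I believe, maxClientCnxns in zoo.cfg, which
caps the number of concurrent connections ZooKeeper accepts from a single
client IP, e.g.:

    # zoo.cfg: raise the per-client-IP connection cap (the stock default is low)
    maxClientCnxns=100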

Though I am using a 15-gigabit Ethernet card, Ganglia shows only 50-100 MB/s
of transfer per node (from the clients).
Viv

Re: Yet another bulk import question

Posted by Ted Dunning <td...@maprtech.com>.
Something is just wrong.  You should be able to do 17,000 records per second
(20 GB of ~1 KB records is ~20 million records; loading them in 20 minutes
works out to ~17,000 records/s) from a few nodes with multiple threads
against a fairly small cluster.  You should be able to come close to that
from a single node into a dozen region servers.


Re: Yet another bulk import question

Posted by Vivek Krishna <vi...@gmail.com>.
I have a total of 10 client nodes, with 3-10 threads running on each node.
Record size is ~1 KB.

Viv

Re: Yet another bulk import question

Posted by Ted Dunning <td...@maprtech.com>.
Are you putting this data from a single host?  Is your sender
multi-threaded?

I note that 20 GB / 20 minutes is under 20 MB/s (about 17 MB/s), so you
aren't particularly stressing the network.  You would, however, likely be
stressing a single-threaded client pretty severely.

What is your record size?  It may be that you are bound up by the number of
records being inserted rather than the total data size.
