Posted to user@hbase.apache.org by Vivek Krishna <vi...@gmail.com> on 2011/04/11 20:20:18 UTC

Re: Yet another bulk import question

Is there a setting that limits or controls the bandwidth on HBase nodes? I
know there is a value to set in zoo.cfg to increase the number of incoming
connections.
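
The setting I have in mind is, I believe, ZooKeeper's maxClientCnxns, which
caps concurrent connections from a single client IP. In zoo.cfg that would be
something like (300 is only an illustrative value):

    # limit on concurrent connections per client IP
    maxClientCnxns=300

(With HBase-managed ZooKeeper the same knob is exposed as
hbase.zookeeper.property.maxClientCnxns in hbase-site.xml.)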

Though I am using a 15 Gigabit Ethernet card, I can see only 50-100 MB/s of
transfer per node (from clients) via Ganglia.
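
(For scale: 100 MB/s is only about 0.8 Gbit/s, so the card itself should be
nowhere near saturated.)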
Viv



On Thu, Mar 24, 2011 at 8:42 PM, Ted Dunning <td...@maprtech.com> wrote:

>
> Something is just wrong.  You should be able to do 17,000 records per
> second from a few nodes with multiple threads against a fairly small
> cluster.  You should be able to come close to that from a single node into
> a dozen region servers.
>
>
> On Thu, Mar 24, 2011 at 5:32 PM, Vivek Krishna <vi...@gmail.com> wrote:
>
>> I have a total of 10 client nodes with 3-10 threads running on each node.
>> Record size is ~1 KB.
>>
>> Viv
>>
>>
>>
>>
>> On Thu, Mar 24, 2011 at 8:28 PM, Ted Dunning <td...@maprtech.com> wrote:
>>
>>> Are you putting this data from a single host?  Is your sender
>>> multi-threaded?
>>>
>>> I note that 20 GB / 20 minutes < 20 MB/s, so you aren't particularly
>>> stressing the network.  You would likely be stressing a single-threaded
>>> client pretty severely.
>>>
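>>> (In numbers: 20 GB in 20 minutes is roughly 20,480 MB / 1,200 s, or about
>>> 17 MB/s; at ~1 KB per record that is on the order of 17,000 records per
>>> second.)
>>>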
>>> What is your record size?  It may be that you are bound up by the number
>>> of records being inserted rather than the total data size.
>>>
>>> On Thu, Mar 24, 2011 at 5:22 PM, Vivek Krishna <vi...@gmail.com> wrote:
>>>
>>>> Data size: 20 GB.  It took about an hour with the default HBase settings;
>>>> after varying several parameters, we were able to get this done in ~20
>>>> minutes.  This is still slow, and we are trying to improve it.
>>>>
>>>> We wrote a Java client which essentially issues batched `put`s to HBase
>>>> tables.  The parameters we tuned include:
>>>> 1.  Disabling compaction
>>>> 2.  Varying the put batch size (tried 1000, 5000, 10000, 20000, 40000)
>>>> 3.  Setting autoflush on/off (see the sketch after this list)
>>>> 4.  Varying the client-side write buffer: 2 MB, 128 MB, 256 MB
>>>> 5.  Setting hbase.regionserver.handler.count to 100
>>>> 6.  Varying the region size from 128 to 256/512/1024 MB
>>>> 7.  Increasing number of regions.
>>>> 8.  Creating regions with keys pre-specified (so that clients hit the
>>>> regions directly)
>>>> 9.  Varying number of clients (from 30 clients to 100 clients)
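>>>>
>>>> A minimal sketch of what such a batched-put client looks like with the
>>>> HTable API (the table name, column family, and the exact batch/buffer
>>>> values below are illustrative, not the precise ones we used):
>>>>
>>>>   import java.util.ArrayList;
>>>>   import java.util.List;
>>>>   import org.apache.hadoop.conf.Configuration;
>>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>   import org.apache.hadoop.hbase.client.HTable;
>>>>   import org.apache.hadoop.hbase.client.Put;
>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>>
>>>>   public class BulkPutClient {
>>>>     public static void main(String[] args) throws Exception {
>>>>       Configuration conf = HBaseConfiguration.create();
>>>>       HTable table = new HTable(conf, "test_table");
>>>>
>>>>       // Buffer puts on the client instead of one RPC per put.
>>>>       table.setAutoFlush(false);
>>>>       table.setWriteBufferSize(128L * 1024 * 1024);  // 128 MB
>>>>
>>>>       byte[] family = Bytes.toBytes("f");
>>>>       byte[] qualifier = Bytes.toBytes("q");
>>>>       byte[] value = new byte[1024];          // ~1 KB records
>>>>       int batchSize = 10000;                  // one of the sizes we tried
>>>>
>>>>       List<Put> batch = new ArrayList<Put>(batchSize);
>>>>       for (long i = 0; i < 1000000L; i++) {
>>>>         Put put = new Put(Bytes.toBytes(String.format("row-%012d", i)));
>>>>         put.add(family, qualifier, value);    // WAL stays enabled
>>>>         batch.add(put);
>>>>         if (batch.size() >= batchSize) {
>>>>           // Buffered locally; actually sent once the write buffer fills.
>>>>           table.put(batch);
>>>>           batch.clear();
>>>>         }
>>>>       }
>>>>       if (!batch.isEmpty()) {
>>>>         table.put(batch);
>>>>       }
>>>>       table.flushCommits();   // drain whatever is left in the write buffer
>>>>       table.close();
>>>>     }
>>>>   }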
>>>>
>>>> The above was tested on a 38-node cluster with 2 regions per node.
>>>>
>>>> We did not try disabling the WAL, fearing loss of data.
>>>>
>>>> Are there any other parameters that we missed during the process?
>>>>
>>>>
>>>> Viv
>>>>
>>>
>>>
>>
>