
Posted to user@hbase.apache.org by llpind <so...@hotmail.com> on 2009/06/07 01:55:22 UTC

Frequent changing rowkey - HBase insert

I'm doing an insert operation using the Java API.

When inserting data where the rowkey changes often, the inserts seem to go
really slowly.

Is there another method for doing inserts of this type?  (instead of
BatchUpdate).

Thanks
-- 
View this message in context: http://www.nabble.com/Frequent-changing-rowkey---HBase-insert-tp23906724p23906724.html
Sent from the HBase User mailing list archive at Nabble.com.

Re: Frequent changing rowkey - HBase insert

Posted by llpind <so...@hotmail.com>.

Hi Erik,

Yes, that sounds good.  Those are the type of calls I was looking for in the API.


Erik Holstad wrote:
> 
> Hi Ilpind!
> 
> On Mon, Jun 8, 2009 at 8:45 AM, llpind <so...@hotmail.com> wrote:
> 
>>
>> The insert works well for when I have a row key which is constant for a
>> long
>> period of time, and I can split it up into blocks.  But when the row key
>> changes often, then insert performance over time starts to suffer.  The
>> suggestion made by Ryan does help, and I was eventually able to get the
>> entire data set into HBase. ( ~120 Million records)
>>
>> Currently working on some analysis, and had a question about the java
>> api.
>> Is there a way to get record count given a row key?  something like: long
>> getColumnCount (rowkey).  So it doesn't bring down any data to client,
>> but
>> simply returns the size..?
>>
>>
> We have been talking about something similar to this for scanners. A call
> that
> just counts the number of rows between a start and a stop row and doesn't
> return
> any data.
> So that would make it 4 calls if I'm not mistaken :
> countRows(Scan scan)
> countFamilies(List<byte[]> families)
> countQualifiers(byte [] family)
> countVersions(byte[] family, byte[] qualifier, long minTime, long maxTime)
> 
> or maybe just keep it simple and use.:
> countRows(Scan scan)
> countFamilies(Get get)
> countQualifiers(Get get)
> countVersions(Get get)
> 
> We talked about just having a special serializer that doesn't return any
> data just the count.
> 
> How does that sound to you?
> 
> Erik
> 
> 


Re: Frequent changing rowkey - HBase insert

Posted by Erik Holstad <er...@gmail.com>.

Hi Ilpind!

On Mon, Jun 8, 2009 at 8:45 AM, llpind <so...@hotmail.com> wrote:

>
> The insert works well for when I have a row key which is constant for a
> long
> period of time, and I can split it up into blocks.  But when the row key
> changes often, then insert performance over time starts to suffer.  The
> suggestion made by Ryan does help, and I was eventually able to get the
> entire data set into HBase. ( ~120 Million records)
>
> Currently working on some analysis, and had a question about the java api.
> Is there a way to get record count given a row key?  something like: long
> getColumnCount (rowkey).  So it doesn't bring down any data to client, but
> simply returns the size..?
>
>
We have been talking about something similar to this for scanners. A call that
just counts the number of rows between a start and a stop row and doesn't
return any data.
So that would make it 4 calls if I'm not mistaken:
countRows(Scan scan)
countFamilies(List<byte[]> families)
countQualifiers(byte [] family)
countVersions(byte[] family, byte[] qualifier, long minTime, long maxTime)

or maybe just keep it simple and use:
countRows(Scan scan)
countFamilies(Get get)
countQualifiers(Get get)
countVersions(Get get)

We talked about just having a special serializer that doesn't return any
data, just the count.

How does that sound to you?

Erik

Re: Frequent changing rowkey - HBase insert

Posted by stack <st...@duboce.net>.

Is this TRUNK or 0.19 (sorry if you've said already)?  Can you figure out what
the slowdown is?  Is it lookups into .META. to find the region hosting the key?
St.Ack

On Mon, Jun 8, 2009 at 9:08 AM, llpind <so...@hotmail.com> wrote:

>
>
>
> stack-3 wrote:
> >
> >
> > Row key changes frequently?
> >
> > You mean you are filling rows with lots of columns and while you hold to a
> > single row, insert is fast?
> >
> > Your uploader encounters keys randomly or are they sorted?
> >
> >
>
> Yes.  Given a large dataset with frequently changing row keys, the overall
> process slows down after a few million records are inserted (again, it is much
> better now with auto flush off, etc.).  Row keys are sorted on insertion.
>
>
>

Re: Frequent changing rowkey - HBase insert

Posted by llpind <so...@hotmail.com>.

stack-3 wrote:
> 
> 
> Row key changes frequently?
> 
> You mean you are filling rows with lots of columns and while you hold to a
> single row, insert is fast?
> 
> Your uploader encounters keys randomly or are they sorted?
> 
> 

Yes.  Given a large dataset with frequently changing row keys, the overall
process slows down after a few million records are inserted (again, it is much
better now with auto flush off, etc.).  Row keys are sorted on insertion.


Re: Frequent changing rowkey - HBase insert

Posted by stack <st...@duboce.net>.

On Mon, Jun 8, 2009 at 8:45 AM, llpind <so...@hotmail.com> wrote:

>
> The insert works well for when I have a row key which is constant for a
> long
> period of time, and I can split it up into blocks.  But when the row key
> changes often, then insert performance over time starts to suffer.  The
> suggestion made by Ryan does help, and I was eventually able to get the
> entire data set into HBase. ( ~120 Million records)

Row key changes frequently?

You mean you are filling rows with lots of columns and while you hold to a
single row, insert is fast?

Your uploader encounters keys randomly or are they sorted?

>
>
> Currently working on some analysis, and had a question about the java api.
> Is there a way to get record count given a row key?  something like: long
> getColumnCount (rowkey).  So it doesn't bring down any data to client, but
> simply returns the size..?

No.  You have to scan currently.

St.Ack
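
For readers following along: counting by scan amounts to walking the sorted keys from a start row (inclusive) to a stop row (exclusive) and tallying them on the client. The Java sketch below models that against an in-memory sorted map standing in for a table. `ScanCountSketch`, `countRows`, and `countColumns` are hypothetical names used only for illustration; they are not HBase API calls, and the inclusive/exclusive bounds are an assumption of this sketch.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ScanCountSketch {
    // Hypothetical stand-in for a table: row key -> (column -> value), sorted by row key.
    private final NavigableMap<String, Map<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Counts rows with startRow <= key < stopRow by walking the keys, as a scan would.
    long countRows(String startRow, String stopRow) {
        long count = 0;
        for (String ignored : rows.subMap(startRow, true, stopRow, false).keySet()) {
            count++;
        }
        return count;
    }

    // Counts the columns stored under one row key; 0 if the row does not exist.
    long countColumns(String rowKey) {
        Map<String, String> columns = rows.get(rowKey);
        return columns == null ? 0 : columns.size();
    }

    public static void main(String[] args) {
        ScanCountSketch table = new ScanCountSketch();
        table.put("row-a", "cf:1", "x");
        table.put("row-a", "cf:2", "y");
        table.put("row-b", "cf:1", "z");
        table.put("row-c", "cf:1", "w");
        System.out.println("rows in [row-a, row-c): " + table.countRows("row-a", "row-c"));
        System.out.println("columns in row-a: " + table.countColumns("row-a"));
    }
}
```

The counting here still touches every key in the range; a server-side count that returns no data would need support in the database itself.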

Re: Frequent changing rowkey - HBase insert

Posted by llpind <so...@hotmail.com>.

The insert works well when I have a row key which is constant for a long
period of time, and I can split it up into blocks.  But when the row key
changes often, insert performance starts to suffer over time.  The
suggestion made by Ryan does help, and I was eventually able to get the
entire data set into HBase (~120 million records).

Currently working on some analysis, and had a question about the Java API.
Is there a way to get a record count given a row key?  Something like: long
getColumnCount(rowkey), so that it doesn't bring any data down to the client
but simply returns the size?

Thanks.

stack-3 wrote:
> 
> On Sat, Jun 6, 2009 at 6:13 PM, llpind <so...@hotmail.com> wrote:
> 
>>
>>
>> And it's inserting 1M in about 1 minute+ .   Not the best still.
> 
> 
> What are you looking for performance-wise?
> 
> Is your cluster working for you now?
> 
> Thanks,
> St.Ack
> 
> 


Re: Frequent changing rowkey - HBase insert

Posted by stack <st...@duboce.net>.

On Sat, Jun 6, 2009 at 6:13 PM, llpind <so...@hotmail.com> wrote:

>
>
> And it's inserting 1M in about 1 minute+ .   Not the best still.


What are you looking for performance-wise?

Is your cluster working for you now?

Thanks,
St.Ack

Re: Frequent changing rowkey - HBase insert

Posted by Billy Pearson <sa...@pearsonwholesale.com>.

I agree. In the 0.20 that I am testing, it seems my clients are now the
bottleneck, not the db.
I went from seeing 30K on 8 nodes to around 115k when I could keep all the
clients writing at the same time for a few secs.

Billy



"Ryan Rawson" <ry...@gmail.com> wrote in 
message news:78568af10906061819g5949eae8ye0f30653540fb535@mail.gmail.com...
> In 0.20 things should get faster.
>
> Generally speaking I find HBase's insert performance really good.  One of
> the best even.  Plus Just Add Servers (tm).
>
> -ryan
>

Re: Frequent changing rowkey - HBase insert

Posted by Ryan Rawson <ry...@gmail.com>.

In 0.20 things should get faster.

Generally speaking I find HBase's insert performance really good.  One of
the best even.  Plus Just Add Servers (tm).

-ryan

On Sat, Jun 6, 2009 at 6:13 PM, llpind <so...@hotmail.com> wrote:

>
> Thanks Ryan,
>
> Yeah that sped it up a bit.
>
> I set :
>                table.setAutoFlush(false);
>                table.setWriteBufferSize(1024*1024*12);
>
> And it's inserting 1M in about 1 minute+ .   Not the best still.
>
> 2009-06-06 18:06:54.894 ======PROCESSING RECORD: ====== @1000000
> 2009-06-06 18:08:07.725 ======PROCESSING RECORD: ====== @2000000
> 2009-06-06 18:09:24.992 ======PROCESSING RECORD: ====== @3000000
> 2009-06-06 18:11:13.279 ======PROCESSING RECORD: ====== @4000000
>

Re: Frequent changing rowkey - HBase insert

Posted by llpind <so...@hotmail.com>.

Thanks Ryan,

Yeah that sped it up a bit.  

I set:
		table.setAutoFlush(false);
		table.setWriteBufferSize(1024*1024*12);

And it's inserting 1M in a little over 1 minute.  Still not the best.

2009-06-06 18:06:54.894 ======PROCESSING RECORD: ====== @1000000
2009-06-06 18:08:07.725 ======PROCESSING RECORD: ====== @2000000
2009-06-06 18:09:24.992 ======PROCESSING RECORD: ====== @3000000
2009-06-06 18:11:13.279 ======PROCESSING RECORD: ====== @4000000
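
To put numbers on the log lines above, the per-interval insert rate can be worked out from the timestamps. The Java sketch below is only an illustration: it assumes each log line marks exactly 1,000,000 records since the previous one, and `InsertThroughput` and `recordsPerSecond` are hypothetical names, not part of any HBase API.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class InsertThroughput {
    // Matches the timestamp format used in the progress log above.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    // Records per second between two progress-log timestamps.
    static double recordsPerSecond(String start, String end, long records) {
        Duration elapsed = Duration.between(
                LocalDateTime.parse(start, FORMAT),
                LocalDateTime.parse(end, FORMAT));
        return records * 1000.0 / elapsed.toMillis();
    }

    public static void main(String[] args) {
        String[] stamps = {
            "2009-06-06 18:06:54.894",
            "2009-06-06 18:08:07.725",
            "2009-06-06 18:09:24.992",
            "2009-06-06 18:11:13.279",
        };
        // Assumption: 1,000,000 records were processed between consecutive lines.
        for (int i = 1; i < stamps.length; i++) {
            System.out.printf("interval %d: %.0f records/sec%n",
                    i, recordsPerSecond(stamps[i - 1], stamps[i], 1_000_000));
        }
    }
}
```

Under that assumption the three intervals come out at roughly 13.7k, 12.9k, and 9.2k records per second, so the rate is dropping as the load proceeds.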


Ryan Rawson wrote:
> 
> Don't use the thrift gateway for bulk import.
> 
> Use the Java API, and be sure to turn off auto flushing and use a
> reasonably
> sizable commit buffer. 1-12MB is probably ideal.
> 
> i can push a 20 node cluster past 180k inserts/sec using this.
> 


Re: Frequent changing rowkey - HBase insert

Posted by Ryan Rawson <ry...@gmail.com>.

Don't use the thrift gateway for bulk import.

Use the Java API, and be sure to turn off auto flushing and use a reasonably
sizable commit buffer. 1-12MB is probably ideal.

I can push a 20 node cluster past 180k inserts/sec using this.
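
As a rough illustration of why this helps: with auto flushing on, each edit is sent on its own, while a commit buffer lets many edits share one trip to the server. The Java sketch below is a self-contained simulation of that idea, not HBase client code. `WriteBufferSketch` and its methods are hypothetical names, and treating each flush as exactly one round trip is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class WriteBufferSketch {
    private final long bufferLimitBytes;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int roundTrips = 0;

    WriteBufferSketch(long bufferLimitBytes) {
        this.bufferLimitBytes = bufferLimitBytes;
    }

    // Queues one edit; sends the whole batch once the buffer limit is reached.
    void put(byte[] edit) {
        pending.add(edit);
        pendingBytes += edit.length;
        if (pendingBytes >= bufferLimitBytes) {
            flush();
        }
    }

    // Simulates one network round trip carrying every pending edit.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        roundTrips++;
        pending.clear();
        pendingBytes = 0;
    }

    int roundTrips() {
        return roundTrips;
    }

    public static void main(String[] args) {
        byte[] edit = new byte[100]; // assumed size of one edit, for illustration
        WriteBufferSketch unbuffered = new WriteBufferSketch(1);              // flushes on every put
        WriteBufferSketch buffered = new WriteBufferSketch(12L * 1024 * 1024); // 12MB buffer
        for (int i = 0; i < 1_000_000; i++) {
            unbuffered.put(edit);
            buffered.put(edit);
        }
        unbuffered.flush();
        buffered.flush(); // send whatever is left at the end of the load
        System.out.println("unbuffered round trips: " + unbuffered.roundTrips());
        System.out.println("buffered round trips:   " + buffered.roundTrips());
    }
}
```

Note the final `flush()` after the loop: with buffering on, edits still sitting in the buffer are not sent until something flushes them.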

On Sat, Jun 6, 2009 at 5:51 PM, llpind <so...@hotmail.com> wrote:

>
> Thanks Ryan, well done.
>
> I have no experience using Thrift gateway, could you please provide some
> actual code here or in your blog post?  I'd love to see how your method
> compares with mine.
>
> Last night I was able to do ~58 million records in ~1.6 hours using the
> HBase Java API directly.  But with this new data, I'm seeing much slower
> times.  After reading around, it appears it's because my row key now
> changes
> often, whereas before it was constant for some time (more columns).  Thanks
> again. :)
>
>

Re: Frequent changing rowkey - HBase insert

Posted by llpind <so...@hotmail.com>.

Thanks Ryan, well done.  

I have no experience using the Thrift gateway.  Could you please provide some
actual code here or in your blog post?  I'd love to see how your method
compares with mine.

Last night I was able to do ~58 million records in ~1.6 hours using the
HBase Java API directly.  But with this new data, I'm seeing much slower
times.  After reading around, it appears it's because my row key now changes
often, whereas before it was constant for some time (more columns).  Thanks
again. :)

Ryan Rawson wrote:
> 
> Have a look at:
> 
> http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> 
> -ryan
> 
> 


Re: Frequent changing rowkey - HBase insert

Posted by Ryan Rawson <ry...@gmail.com>.

Have a look at:

http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html

-ryan


On Sat, Jun 6, 2009 at 4:55 PM, llpind <so...@hotmail.com> wrote:

>
> I'm doing an insert operation using the java API.
>
> When inserting data where the rowkey changes often, it seems the inserts go
> really slow.
>
> Is there another method for doing inserts of this type?  (instead of
> BatchUpdate).
>
> Thanks
>
>