Posted to user@hbase.apache.org by Li Li <fa...@gmail.com> on 2014/04/28 12:27:02 UTC

question about threads count

Hi all,
   With the same read/write data, will the thread count affect performance?
   E.g. I have 10,000 write requests/second. I don't care much about the order.
   How many writer threads should I use to obtain maximum throughput?

Re: question about threads count

Posted by Li Li <fa...@gmail.com>.
I have also considered this method. But what about other columns
without a default value (status's default value is 0, so I can treat
absence as 0),
e.g. depth, insertTime, ...?
Anyway, if using put instead of checkAndPut makes it much faster,
I will consider this method.



Re: question about threads count

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Simply don't set your status to 0 when you first write it.

Absence means not read.
1 means read.
So there is no risk that someone tries to set 0 while someone else tries to set 1.

Will that be an option?
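
The scheme above can be sketched with an in-memory stand-in for the table. This is not the HBase client API: the class, its method names, and the column names "status" and "depth" are assumptions for illustration only. It shows that when the insert path never writes status, a repeated insert cannot reset a url that a worker has already marked as read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a table: row key (url) -> column name -> value.
// NOT the HBase client API; it only imitates "a put writes the columns
// it names and leaves the other columns of the row alone".
class StatusByAbsenceSketch {
    private final Map<String, Map<String, Integer>> rows = new ConcurrentHashMap<>();

    // Discovery path: writes depth only and never touches "status".
    void insertUrl(String url, int depth) {
        rows.computeIfAbsent(url, k -> new ConcurrentHashMap<>()).put("depth", depth);
    }

    // Worker path: marks the url as read by setting status to 1.
    void markProcessed(String url) {
        rows.computeIfAbsent(url, k -> new ConcurrentHashMap<>()).put("status", 1);
    }

    // A url is pending when its row exists and has no status column.
    boolean isPending(String url) {
        Map<String, Integer> row = rows.get(url);
        return row != null && !row.containsKey("status");
    }

    public static void main(String[] args) {
        StatusByAbsenceSketch table = new StatusByAbsenceSketch();
        table.insertUrl("http://example.com/page", 1);    // found via url1
        System.out.println(table.isPending("http://example.com/page")); // true
        table.markProcessed("http://example.com/page");   // worker handles it
        table.insertUrl("http://example.com/page", 2);    // found again via url2
        System.out.println(table.isPending("http://example.com/page")); // false
    }
}
```

Note that the second insert still overwrites depth, so columns other than status need their own handling.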



Re: question about threads count

Posted by Li Li <fa...@gmail.com>.
I am using HBase to store information for a web spider.
I have a table that stores information about each webpage. The rowkey is the url,
and there are other columns such as status (int) and depth (int).
In the beginning, the status is 0. A worker thread selects urls
whose status is 0, does something with them, and changes the status to 1, ...
More than one url can link to a given url,
e.g. url1->url, url2->url,
so url is inserted twice. If I do not use checkAndPut,
thread 1 inserts url, and the worker thread does something with url
and sets its status to 1. Then thread 2 inserts url again and resets
the status to 0, so the worker thread will do something again. That's
not what I want.


Re: question about threads count

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Why do you want to make sure the row is only inserted once? If you insert
the same row twice, the 2nd one will simply overwrite the first one and
HBase will take care of the versions.

Regarding the code fragments, I don't think the autoflush is going to make a
big difference compared to the cost of the check & put...



Re: question about threads count

Posted by Li Li <fa...@gmail.com>.
I must use checkAndPut to ensure a row is only inserted once.
If I have 1000 checkAndPut calls, will setAutoFlush(false) be useful?
Is there any performance difference between the following two code fragments?
1.
    table.setAutoFlush(false);
    for(int i=0;i<1000;i++){
         Put put=...
         table.checkAndPut(,....put);
    }
2.
    table.setAutoFlush(true);
    for(int i=0;i<1000;i++){
         Put put=...
         table.checkAndPut(,....put);
    }


Re: question about threads count

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
It depends. Batching a list of puts/gets will be way faster than checkAndPut,
but the result will not be the same... a batch of puts will not do any
check...
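
The difference in outcome can be illustrated without a cluster. In this sketch a ConcurrentHashMap stands in for the table, put plays the role of an unchecked write, and putIfAbsent plays the role of a checked one. The class and method names are hypothetical and none of this is HBase API; it only shows why swapping checked writes for a batch of plain puts changes the result.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Map-based illustration only: row key -> status value.
class PutVersusCheckedPutSketch {
    // Unconditional write: the last writer wins, like a plain put.
    static void plainPut(ConcurrentMap<String, Integer> table, String row, int status) {
        table.put(row, status);
    }

    // Conditional write: applied only if the row has no value yet.
    // Returns true when the write was applied.
    static boolean putIfRowAbsent(ConcurrentMap<String, Integer> table, String row, int status) {
        return table.putIfAbsent(row, status) == null;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Integer> table = new ConcurrentHashMap<>();
        putIfRowAbsent(table, "url", 0);                    // first insert, applied
        plainPut(table, "url", 1);                          // worker marks it as done
        boolean applied = putIfRowAbsent(table, "url", 0);  // second insert, rejected
        System.out.println(applied + " " + table.get("url")); // false 1
        plainPut(table, "url", 0);                          // unchecked insert resets status
        System.out.println(table.get("url"));               // 0
    }
}
```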



Re: question about threads count

Posted by Li Li <fa...@gmail.com>.
But I have many checkAndPut operations.
Would using batch be a better solution?


Re: question about threads count

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Hi Li Li,

Yes, threads will impact performance. If you send all your writes with
a single thread, a single HBase handler will take care of them, etc. HBase
does not provide a single handler for a single client connection. It's able
to handle multiple threads and clients.

However, it also all depends on the way you send your writes. If you send a
single puts(<10000>) call per second, it will not be better to use 10 000
threads with a single put each.

I recommend running some perf tests on your installation to find a
good number for your configuration.

JM
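
A thread-count test along these lines could start from a small harness like the following. It is a sketch under assumptions: simulatedWrite is a hypothetical placeholder that only bumps a counter, so the numbers it prints say nothing about HBase until it is replaced with a real client call against your own cluster.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

class WriterThreadSweepSketch {
    // Hypothetical stand-in for one write; replace with the real client call.
    static void simulatedWrite(AtomicLong sink) {
        sink.incrementAndGet();
    }

    // Splits totalWrites evenly across `threads` workers and returns
    // the observed writes per second.
    static double measure(int threads, int totalWrites) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong sink = new AtomicLong();
        CountDownLatch done = new CountDownLatch(threads);
        int perThread = totalWrites / threads;
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    simulatedWrite(sink);
                }
                done.countDown();
            });
        }
        try {
            done.await(); // wait until every worker has finished its share
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        } finally {
            pool.shutdown();
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return sink.get() / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) {
        // Sweep a few thread counts and compare the throughput of each.
        for (int threads : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("%d threads: %.0f writes/s%n", threads, measure(threads, 160_000));
        }
    }
}
```

With a real write in place of the placeholder, the sweep can be repeated for different batch sizes as well as thread counts, since both affect the result.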

