Posted to user@hbase.apache.org by Prakash Kadel <pr...@gmail.com> on 2013/02/18 01:48:26 UTC

coprocessor enabled put very slow, help please~~~

hi,
   I am trying to insert a few million documents into HBase with MapReduce. To enable quick searching of the docs I want some indexes, so I tried to use coprocessors, but they are slowing down my inserts. Aren't coprocessors supposed to not add latency?
my settings:
    3 region servers
   60 maps
each map inserts into the doc table (checkAndPut).
a RegionObserver coprocessor does a postCheckAndPut and inserts some rows into an index table (a rough sketch of this kind of observer follows below).


Sincerely,
Prakash
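
For readers unfamiliar with the pattern being described, here is a minimal sketch (not the poster's actual code) of a RegionObserver whose postCheckAndPut writes index rows into a second table. It assumes the 0.92-era coprocessor API and borrows the doc_content / doc_idx / count names from the code the poster shares later in the thread; note that every write to the index table is a separate, synchronous RPC, which is where the extra latency discussed below comes from.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DocIndexObserver extends BaseRegionObserver {

        @Override
        public boolean postCheckAndPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                byte[] row, byte[] family, byte[] qualifier, CompareOp compareOp,
                WritableByteArrayComparable comparator, Put put, boolean result)
                throws IOException {
            if (!result) {
                return result; // the checkAndPut did not go through, nothing to index
            }
            // Open the index table through the coprocessor environment.
            HTableInterface indexTable =
                    ctx.getEnvironment().getTable(Bytes.toBytes("doc_idx"));
            try {
                List<KeyValue> contentCells =
                        put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""));
                for (KeyValue cell : contentCells) {
                    for (String word : Bytes.toString(cell.getValue()).split("\\s+")) {
                        Increment inc = new Increment(Bytes.toBytes(word));
                        inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1L);
                        indexTable.increment(inc); // one synchronous RPC per word
                    }
                }
            } finally {
                indexTable.close();
            }
            return result;
        }
    }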

Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
I don't agree with Lars on the second half of his statement. 

Yes, there will be a performance hit when you go across regions because you're now going across the network to a second machine. 
However, I disagree that it defeats the performance purpose. 

In a Hadoop cluster, we tend to launch our jobs from an edge server. However, with HBase you can connect to the cluster from a remote client and still run queries against the data outside of a traditional M/R job. 

So doing something intra-cluster would be less expensive than doing a round trip back to the client. 

In addition, there is no concept of a transaction. All put()s are atomic. So you write to your base table: one atomic write. You write to your index table(s): each index update is its own atomic write. (This assumes you may have multiple indexes on your base table.)
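
To make that concrete, a small, purely illustrative sketch (hypothetical table and column names, 0.92-era client API): the base-table write and the index-table write are two separate operations, each atomic on its own, with nothing tying them together.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TwoAtomicWrites {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable docTable = new HTable(conf, "doc");      // hypothetical table names
            HTable idxTable = new HTable(conf, "doc_idx");
            try {
                // First atomic operation: the base-table write.
                Put basePut = new Put(Bytes.toBytes("doc_42"));
                basePut.add(Bytes.toBytes("doc_content"), Bytes.toBytes(""),
                        Bytes.toBytes("I am working. He is not working"));
                docTable.put(basePut);

                // Second, independent atomic operation: the index-table write.
                // There is no transaction spanning the two; if the client dies in
                // between, the index is simply missing this entry.
                Put idxPut = new Put(Bytes.toBytes("working"));
                idxPut.add(Bytes.toBytes("count"), Bytes.toBytes("doc_42"), Bytes.toBytes(""));
                idxTable.put(idxPut);
            } finally {
                docTable.close();
                idxTable.close();
            }
        }
    }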

It's important to remember that coprocessors are really, really new. As Andrew points out... they're not recommended for the novice. 


On Feb 17, 2013, at 8:31 PM, lars hofhansl <la...@apache.org> wrote:

> The main advantage of coprocessors is that they keep the logic local to the region server. Putting data into other region servers is supported, but defeats the performance purpose.
> 
> 
> 
> ________________________________
> From: Prakash Kadel <pr...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Sunday, February 17, 2013 5:26 PM
> Subject: Re: coprocessor enabled put very slow, help please~~~
> 
> thanks again,
>   i did try making indexes with the MR. dont have exact evaluation data, but inserting indexes directly with mapreduce does seem to be much much faster than making the indexes with the coprocessors. guess i am missing the point about the coprosessors. 
> my reason for trying out the coprocessor was to make the insertion code cleaner and efficient index creation.
> 
> Sincerely,
> Prakash Kadel
> 
> On Feb 18, 2013, at 10:17 AM, lars hofhansl <la...@apache.org> wrote:
> 
>> Index maintenance will always be slower. An interesting comparison would be to also update your indexes from the M/R and see whether that performs better.
>> 
>> 
>> 
>> ________________________________
>> From: Prakash Kadel <pr...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>> Sent: Sunday, February 17, 2013 5:13 PM
>> Subject: Re: coprocessor enabled put very slow, help please~~~
>> 
>> thank you lars,
>> That is my guess too. I am confused, isnt that something that cannot be controlled. Is this approach of creating some kind of index wrong?
>> 
>> Sincerely,
>> Prakash Kadel
>> 
>> On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:
>> 
>>> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Prakash Kadel <pr...@gmail.com>
>>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>>> Sent: Sunday, February 17, 2013 4:52 PM
>>> Subject: Re: coprocessor enabled put very slow, help please~~~
>>> 
>>> Forgot to mention. I am using 0.92.
>>> 
>>> Sincerely,
>>> Prakash
>>> 
>>> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
>>> 
>>>> hi,
>>>>      i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>>>> my settings:
>>>>       3 region servers
>>>>      60 maps
>>>> each map inserts to doc table.(checkAndPut)
>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>>>> 
>>>> 
>>>> Sincerely,
>>>> Prakash

Michael Segel  | (m) 312.755.9623

Segel and Associates



Re: coprocessor enabled put very slow, help please~~~

Posted by lars hofhansl <la...@apache.org>.
The main advantage of coprocessors is that they keep the logic local to the region server. Putting data into other region servers is supported, but defeats the performance purpose.



________________________________
 From: Prakash Kadel <pr...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Sunday, February 17, 2013 5:26 PM
Subject: Re: coprocessor enabled put very slow, help please~~~
 
thanks again,
  i did try making indexes with the MR. dont have exact evaluation data, but inserting indexes directly with mapreduce does seem to be much much faster than making the indexes with the coprocessors. guess i am missing the point about the coprosessors. 
my reason for trying out the coprocessor was to make the insertion code cleaner and efficient index creation.

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 10:17 AM, lars hofhansl <la...@apache.org> wrote:

> Index maintenance will always be slower. An interesting comparison would be to also update your indexes from the M/R and see whether that performs better.
> 
> 
> 
> ________________________________
> From: Prakash Kadel <pr...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Sunday, February 17, 2013 5:13 PM
> Subject: Re: coprocessor enabled put very slow, help please~~~
> 
> thank you lars,
> That is my guess too. I am confused, isnt that something that cannot be controlled. Is this approach of creating some kind of index wrong?
> 
> Sincerely,
> Prakash Kadel
> 
> On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:
> 
>> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
>> 
>> 
>> 
>> ________________________________
>> From: Prakash Kadel <pr...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>> Sent: Sunday, February 17, 2013 4:52 PM
>> Subject: Re: coprocessor enabled put very slow, help please~~~
>> 
>> Forgot to mention. I am using 0.92.
>> 
>> Sincerely,
>> Prakash
>> 
>> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
>> 
>>> hi,
>>>     i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>>> my settings:
>>>      3 region servers
>>>     60 maps
>>> each map inserts to doc table.(checkAndPut)
>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>>> 
>>> 
>>> Sincerely,
>>> Prakash

Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
thanks again,
  I did try making the indexes with MR. I don't have exact evaluation numbers, but inserting the indexes directly with MapReduce does seem to be much, much faster than building them with the coprocessor. I guess I am missing the point of coprocessors. 
My reason for trying the coprocessor was to make the insertion code cleaner and the index creation more efficient.

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 10:17 AM, lars hofhansl <la...@apache.org> wrote:

> Index maintenance will always be slower. An interesting comparison would be to also update your indexes from the M/R and see whether that performs better.
> 
> 
> 
> ________________________________
> From: Prakash Kadel <pr...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Sunday, February 17, 2013 5:13 PM
> Subject: Re: coprocessor enabled put very slow, help please~~~
> 
> thank you lars,
> That is my guess too. I am confused, isnt that something that cannot be controlled. Is this approach of creating some kind of index wrong?
> 
> Sincerely,
> Prakash Kadel
> 
> On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:
> 
>> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
>> 
>> 
>> 
>> ________________________________
>> From: Prakash Kadel <pr...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>> Sent: Sunday, February 17, 2013 4:52 PM
>> Subject: Re: coprocessor enabled put very slow, help please~~~
>> 
>> Forgot to mention. I am using 0.92.
>> 
>> Sincerely,
>> Prakash
>> 
>> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
>> 
>>> hi,
>>>     i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>>> my settings:
>>>      3 region servers
>>>     60 maps
>>> each map inserts to doc table.(checkAndPut)
>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>>> 
>>> 
>>> Sincerely,
>>> Prakash

Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
Is there a way to do asynchronous writes from a coprocessor triggered by Put operations?
thanks

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 10:31 AM, Michael Segel <mi...@hotmail.com> wrote:

> Hmmm. Can you have async writes using a coprocessor? 
> 
> 
> On Feb 17, 2013, at 7:17 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> Index maintenance will always be slower. An interesting comparison would be to also update your indexes from the M/R and see whether that performs better.
>> 
>> 
>> 
>> ________________________________
>> From: Prakash Kadel <pr...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>> Sent: Sunday, February 17, 2013 5:13 PM
>> Subject: Re: coprocessor enabled put very slow, help please~~~
>> 
>> thank you lars,
>> That is my guess too. I am confused, isnt that something that cannot be controlled. Is this approach of creating some kind of index wrong?
>> 
>> Sincerely,
>> Prakash Kadel
>> 
>> On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:
>> 
>>> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: Prakash Kadel <pr...@gmail.com>
>>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>>> Sent: Sunday, February 17, 2013 4:52 PM
>>> Subject: Re: coprocessor enabled put very slow, help please~~~
>>> 
>>> Forgot to mention. I am using 0.92.
>>> 
>>> Sincerely,
>>> Prakash
>>> 
>>> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
>>> 
>>>> hi,
>>>>    i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>>>> my settings:
>>>>     3 region servers
>>>>    60 maps
>>>> each map inserts to doc table.(checkAndPut)
>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>>>> 
>>>> 
>>>> Sincerely,
>>>> Prakash
> 
> Michael Segel  | (m) 312.755.9623
> 
> Segel and Associates
> 
> 

Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
Hmmm. Can you have async writes using a coprocessor? 


On Feb 17, 2013, at 7:17 PM, lars hofhansl <la...@apache.org> wrote:

> Index maintenance will always be slower. An interesting comparison would be to also update your indexes from the M/R and see whether that performs better.
> 
> 
> 
> ________________________________
> From: Prakash Kadel <pr...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Sunday, February 17, 2013 5:13 PM
> Subject: Re: coprocessor enabled put very slow, help please~~~
> 
> thank you lars,
> That is my guess too. I am confused, isnt that something that cannot be controlled. Is this approach of creating some kind of index wrong?
> 
> Sincerely,
> Prakash Kadel
> 
> On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:
> 
>> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
>> 
>> 
>> 
>> ________________________________
>> From: Prakash Kadel <pr...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>> Sent: Sunday, February 17, 2013 4:52 PM
>> Subject: Re: coprocessor enabled put very slow, help please~~~
>> 
>> Forgot to mention. I am using 0.92.
>> 
>> Sincerely,
>> Prakash
>> 
>> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
>> 
>>> hi,
>>>     i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>>> my settings:
>>>      3 region servers
>>>     60 maps
>>> each map inserts to doc table.(checkAndPut)
>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>>> 
>>> 
>>> Sincerely,
>>> Prakash

Michael Segel  | (m) 312.755.9623

Segel and Associates



Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
One more question.

Even if the coprocessor makes its insertions into a different region, since I use "postCheckAndPut", shouldn't the performance slowdown be small?

thanks

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 10:17 AM, lars hofhansl <la...@apache.org> wrote:

> Index maintenance will always be slower. An interesting comparison would be to also update your indexes from the M/R and see whether that performs better.
> 
> 
> 
> ________________________________
> From: Prakash Kadel <pr...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Sunday, February 17, 2013 5:13 PM
> Subject: Re: coprocessor enabled put very slow, help please~~~
> 
> thank you lars,
> That is my guess too. I am confused, isnt that something that cannot be controlled. Is this approach of creating some kind of index wrong?
> 
> Sincerely,
> Prakash Kadel
> 
> On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:
> 
>> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
>> 
>> 
>> 
>> ________________________________
>> From: Prakash Kadel <pr...@gmail.com>
>> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
>> Sent: Sunday, February 17, 2013 4:52 PM
>> Subject: Re: coprocessor enabled put very slow, help please~~~
>> 
>> Forgot to mention. I am using 0.92.
>> 
>> Sincerely,
>> Prakash
>> 
>> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
>> 
>>> hi,
>>>     i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>>> my settings:
>>>      3 region servers
>>>     60 maps
>>> each map inserts to doc table.(checkAndPut)
>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>>> 
>>> 
>>> Sincerely,
>>> Prakash

Re: coprocessor enabled put very slow, help please~~~

Posted by lars hofhansl <la...@apache.org>.
Index maintenance will always be slower. An interesting comparison would be to also update your indexes from the M/R and see whether that performs better.



________________________________
 From: Prakash Kadel <pr...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Sunday, February 17, 2013 5:13 PM
Subject: Re: coprocessor enabled put very slow, help please~~~
 
thank you lars,
That is my guess too. I am confused, isnt that something that cannot be controlled. Is this approach of creating some kind of index wrong?

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:

> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
> 
> 
> 
> ________________________________
> From: Prakash Kadel <pr...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Sunday, February 17, 2013 4:52 PM
> Subject: Re: coprocessor enabled put very slow, help please~~~
> 
> Forgot to mention. I am using 0.92.
> 
> Sincerely,
> Prakash
> 
> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
> 
>> hi,
>>    i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>> my settings:
>>     3 region servers
>>    60 maps
>> each map inserts to doc table.(checkAndPut)
>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>> 
>> 
>> Sincerely,
>> Prakash

Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
thank you lars,
That is my guess too. I am confused, though; isn't that something that cannot be controlled? Is this approach of creating some kind of index wrong?

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 10:07 AM, lars hofhansl <la...@apache.org> wrote:

> Presumably the coprocessor issues Puts to another region server in most cases, that could explain it being (much) slower.
> 
> 
> 
> ________________________________
> From: Prakash Kadel <pr...@gmail.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Sunday, February 17, 2013 4:52 PM
> Subject: Re: coprocessor enabled put very slow, help please~~~
> 
> Forgot to mention. I am using 0.92.
> 
> Sincerely,
> Prakash
> 
> On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:
> 
>> hi,
>>    i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
>> my settings:
>>     3 region servers
>>    60 maps
>> each map inserts to doc table.(checkAndPut)
>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
>> 
>> 
>> Sincerely,
>> Prakash

Re: coprocessor enabled put very slow, help please~~~

Posted by lars hofhansl <la...@apache.org>.
Presumably the coprocessor issues Puts to another region server in most cases; that could explain it being (much) slower.



________________________________
 From: Prakash Kadel <pr...@gmail.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org> 
Sent: Sunday, February 17, 2013 4:52 PM
Subject: Re: coprocessor enabled put very slow, help please~~~
 
Forgot to mention. I am using 0.92.

Sincerely,
Prakash

On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:

> hi,
>   i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
> my settings:
>    3 region servers
>   60 maps
> each map inserts to doc table.(checkAndPut)
> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
> 
> 
> Sincerely,
> Prakash

Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
Forgot to mention. I am using 0.92.

Sincerely,
Prakash

On Feb 18, 2013, at 9:48 AM, Prakash Kadel <pr...@gmail.com> wrote:

> hi,
>   i am trying to insert few million documents to hbase with mapreduce. To enable quick search of docs i want to have some indexes, so i tried to use the coprocessors, but they are slowing down my inserts. Arent the coprocessors not supposed to increase the latency? 
> my settings:
>    3 region servers
>   60 maps
> each map inserts to doc table.(checkAndPut)
> regionobserver coprocessor does a postCheckAndPut and inserts some rows to a index table.
> 
> 
> Sincerely,
> Prakash

Re: coprocessor enabled put very slow, help please~~~

Posted by yonghu <yo...@gmail.com>.
Forgot to say: I also tested MapReduce. It's faster than the coprocessor approach.

On Mon, Feb 18, 2013 at 10:01 AM, yonghu <yo...@gmail.com> wrote:
> Parkash,
>
> I have a six nodes cluster and met the same problem as you had. In my
> test, inserting one tuple using coprocessor is nearly 10 times slower
> than normal put operation. I think the main reason is what Lars
> pointed out, the main overhead is executing RPC.
>
> regards!
>
> Yong
>
> On Mon, Feb 18, 2013 at 6:52 AM, Wei Tan <wt...@us.ibm.com> wrote:
>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>> LSM, read is much slower compared to a write...
>>
>>
>> Best Regards,
>> Wei
>>
>>
>>
>>
>> From:   Prakash Kadel <pr...@gmail.com>
>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>> Date:   02/17/2013 07:49 PM
>> Subject:        coprocessor enabled put very slow, help please~~~
>>
>>
>>
>> hi,
>>    i am trying to insert few million documents to hbase with mapreduce. To
>> enable quick search of docs i want to have some indexes, so i tried to use
>> the coprocessors, but they are slowing down my inserts. Arent the
>> coprocessors not supposed to increase the latency?
>> my settings:
>>     3 region servers
>>    60 maps
>> each map inserts to doc table.(checkAndPut)
>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>> a index table.
>>
>>
>> Sincerely,
>> Prakash
>>

Re: coprocessor enabled put very slow, help please~~~

Posted by yonghu <yo...@gmail.com>.
Prakash,

I have a six-node cluster and ran into the same problem you did. In my
test, inserting one tuple using a coprocessor is nearly 10 times slower
than a normal put operation. I think the main reason is what Lars
pointed out: the main overhead is executing the extra RPC.

regards!

Yong

On Mon, Feb 18, 2013 at 6:52 AM, Wei Tan <wt...@us.ibm.com> wrote:
> Is your CheckAndPut involving a local or remote READ? Due to the nature of
> LSM, read is much slower compared to a write...
>
>
> Best Regards,
> Wei
>
>
>
>
> From:   Prakash Kadel <pr...@gmail.com>
> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
> Date:   02/17/2013 07:49 PM
> Subject:        coprocessor enabled put very slow, help please~~~
>
>
>
> hi,
>    i am trying to insert few million documents to hbase with mapreduce. To
> enable quick search of docs i want to have some indexes, so i tried to use
> the coprocessors, but they are slowing down my inserts. Arent the
> coprocessors not supposed to increase the latency?
> my settings:
>     3 region servers
>    60 maps
> each map inserts to doc table.(checkAndPut)
> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
> a index table.
>
>
> Sincerely,
> Prakash
>

Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
Well... 

if you look at the OP's code, there are a couple of things I think could be causing a bit of the overhead on the writes. 

Let's see if he fixes his code and whether there are any changes in performance. 

On Feb 18, 2013, at 11:56 AM, Wei Tan <wt...@us.ibm.com> wrote:

> Well, my experience shows:
> 
> 1. A local read can be >10ms and a remote put can be 1-2ms. Due to the 
> nature of LSM a read is always a scan to one or multiple files. A very 
> quick experiment can be, if you temporarily disable the check and only do 
> the put, will you see performance go up?
> 
> 2. In a lot of cases, RPC may NOT be the bottle neck. Remember a "local" 
> put also involves RPC -- during WAL to HDFS.
> 
> 
> Best Regards,
> Wei
> 
> Wei Tan 
> Research Staff Member 
> IBM T. J. Watson Research Center
> Yorktown Heights, NY 10598
> wtan@us.ibm.com; 914-945-4386
> 
> 
> 
> From:   Prakash Kadel <pr...@gmail.com>
> To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
> Date:   02/18/2013 04:04 AM
> Subject:        Re: coprocessor enabled put very slow, help please~~~
> 
> 
> 
> its a local read. i just check the last param of PostCheckAndPut 
> indicating if the Put succeeded. Incase if the put success, i insert a row 
> in another table
> 
> Sincerely,
> Prakash Kadel
> 
> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
> 
>> Is your CheckAndPut involving a local or remote READ? Due to the nature 
> of 
>> LSM, read is much slower compared to a write...
>> 
>> 
>> Best Regards,
>> Wei
>> 
>> 
>> 
>> 
>> From:   Prakash Kadel <pr...@gmail.com>
>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
>> Date:   02/17/2013 07:49 PM
>> Subject:        coprocessor enabled put very slow, help please~~~
>> 
>> 
>> 
>> hi,
>>  i am trying to insert few million documents to hbase with mapreduce. 
> To 
>> enable quick search of docs i want to have some indexes, so i tried to 
> use 
>> the coprocessors, but they are slowing down my inserts. Arent the 
>> coprocessors not supposed to increase the latency? 
>> my settings:
>>   3 region servers
>>  60 maps
>> each map inserts to doc table.(checkAndPut)
>> regionobserver coprocessor does a postCheckAndPut and inserts some rows 
> to 
>> a index table.
>> 
>> 
>> Sincerely,
>> Prakash
>> 
> 
> 


Re: coprocessor enabled put very slow, help please~~~

Posted by Wei Tan <wt...@us.ibm.com>.
Well, my experience shows:

1. A local read can be >10ms while a remote put can be 1-2ms. Due to the 
nature of LSM, a read is always a scan over one or multiple files. A very 
quick experiment: if you temporarily disable the check and only do the put, 
do you see performance go up? (A rough sketch of that experiment follows below.)

2. In a lot of cases, RPC may NOT be the bottleneck. Remember that a "local" 
put also involves an RPC -- writing the WAL to HDFS.
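
Here is that rough sketch, with hypothetical table and column names and the 0.92-era client API: time a batch of checkAndPut calls (which pay for a read before each write) against a batch of plain puts and compare.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CheckVsPlainPut {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable doc = new HTable(conf, "doc");          // hypothetical table name
            byte[] cf = Bytes.toBytes("doc_content");
            byte[] qual = Bytes.toBytes("");
            int n = 10000;
            try {
                // Variant A: checkAndPut -- each call does a read (possibly hitting
                // HFiles) before the write.
                long start = System.currentTimeMillis();
                for (int i = 0; i < n; i++) {
                    Put p = new Put(Bytes.toBytes("cap_" + i));
                    p.add(cf, qual, Bytes.toBytes("some document text"));
                    doc.checkAndPut(Bytes.toBytes("cap_" + i), cf, qual, null, p);
                }
                long checkAndPutMillis = System.currentTimeMillis() - start;

                // Variant B: plain put -- write-only (memstore + WAL), no read.
                start = System.currentTimeMillis();
                for (int i = 0; i < n; i++) {
                    Put p = new Put(Bytes.toBytes("put_" + i));
                    p.add(cf, qual, Bytes.toBytes("some document text"));
                    doc.put(p);
                }
                long plainPutMillis = System.currentTimeMillis() - start;

                System.out.println("checkAndPut: " + checkAndPutMillis
                        + " ms, plain put: " + plainPutMillis + " ms for " + n + " rows");
            } finally {
                doc.close();
            }
        }
    }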


Best Regards,
Wei

Wei Tan 
Research Staff Member 
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
wtan@us.ibm.com; 914-945-4386



From:   Prakash Kadel <pr...@gmail.com>
To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
Date:   02/18/2013 04:04 AM
Subject:        Re: coprocessor enabled put very slow, help please~~~



its a local read. i just check the last param of PostCheckAndPut 
indicating if the Put succeeded. Incase if the put success, i insert a row 
in another table

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:

> Is your CheckAndPut involving a local or remote READ? Due to the nature 
of 
> LSM, read is much slower compared to a write...
> 
> 
> Best Regards,
> Wei
> 
> 
> 
> 
> From:   Prakash Kadel <pr...@gmail.com>
> To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
> Date:   02/17/2013 07:49 PM
> Subject:        coprocessor enabled put very slow, help please~~~
> 
> 
> 
> hi,
>   i am trying to insert few million documents to hbase with mapreduce. 
To 
> enable quick search of docs i want to have some indexes, so i tried to 
use 
> the coprocessors, but they are slowing down my inserts. Arent the 
> coprocessors not supposed to increase the latency? 
> my settings:
>    3 region servers
>   60 maps
> each map inserts to doc table.(checkAndPut)
> regionobserver coprocessor does a postCheckAndPut and inserts some rows 
to 
> a index table.
> 
> 
> Sincerely,
> Prakash
> 



Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
I should follow up: I was asking why he was using an HTablePool, not saying that it was wrong. 

Still, I think the writes made through the pool shouldn't have to go to the WAL. 
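
For what it's worth, a sketch of that idea (hypothetical names, and assuming the index can always be rebuilt by a follow-up M/R job, as suggested earlier in the thread): the index-side increments skip the WAL via the writeToWAL flag on incrementColumnValue, while the base-table write keeps its WAL entry.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NoWalIndexWriter {
        // Skips the WAL for index-side increments only; the base-table write is untouched.
        public static void indexWords(HTablePool pool, String docContent) throws IOException {
            HTableInterface indexTable = pool.getTable("doc_idx"); // hypothetical index table
            try {
                for (String word : docContent.split("\\s+")) {
                    // writeToWAL = false: faster, but increments made since the last
                    // memstore flush are lost if the region server crashes.
                    indexTable.incrementColumnValue(Bytes.toBytes(word),
                            Bytes.toBytes("count"), Bytes.toBytes(""), 1L, false);
                }
            } finally {
                indexTable.close();
            }
        }
    }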


On Feb 19, 2013, at 10:01 AM, Michael Segel <mi...@hotmail.com> wrote:

> Good question.. 
> 
> You create a class MyRO. 
> 
> How many instances of  MyRO exist per RS?
> 
> How many queries can access the instance MyRO at the same time? 
> 
> 
> 
> 
> On Feb 19, 2013, at 9:15 AM, Wei Tan <wt...@us.ibm.com> wrote:
> 
>> A side question: if HTablePool is not encouraged to be used... how we 
>> handle the thread safeness in using HTable? Any replacement for 
>> HTablePool, in plan?
>> Thanks,
>> 
>> 
>> Best Regards,
>> Wei
>> 
>> 
>> 
>> 
>> From:   Michel Segel <mi...@hotmail.com>
>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
>> Date:   02/18/2013 09:23 AM
>> Subject:        Re: coprocessor enabled put very slow, help please~~~
>> 
>> 
>> 
>> Why are you using an HTable Pool?
>> Why are you closing the table after each iteration through?
>> 
>> Try using 1 HTable object. Turn off WAL
>> Initiate in start()
>> Close in Stop()
>> Surround the use in a try / catch
>> If exception caught, re instantiate new HTable connection.
>> 
>> Maybe want to flush the connection after puts. 
>> 
>> 
>> Again not sure why you are using check and put on the base table. Your 
>> count could be off.
>> 
>> As an example look at poem/rhyme 'Marry had a little lamb'.
>> Then check your word count.
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On Feb 18, 2013, at 7:21 AM, prakash kadel <pr...@gmail.com> 
>> wrote:
>> 
>>> Thank you guys for your replies,
>>> Michael,
>>> I think i didnt make it clear. Here is my use case,
>>> 
>>> I have text documents to insert in the hbase. (With possible duplicates)
>>> Suppose i have a document as : " I am working. He is not working"
>>> 
>>> I want to insert this document to a table in hbase, say table "doc"
>>> 
>>> =doc table=
>>> -----
>>> rowKey : doc_id
>>> cf: doc_content
>>> value: "I am working. He is not working"
>>> 
>>> Now, i to create another table that stores the word count, say "doc_idx"
>>> 
>>> doc_idx table
>>> ---
>>> rowKey : I, cf: count, value: 1
>>> rowKey : am, cf: count, value: 1
>>> rowKey : working, cf: count, value: 2
>>> rowKey : He, cf: count, value: 1
>>> rowKey : is, cf: count, value: 1
>>> rowKey : not, cf: count, value: 1
>>> 
>>> My MR job code:
>>> ==============
>>> 
>>> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>>  for(String word : doc_content.split("\\s+")) {
>>>     Increment inc = new Increment(Bytes.toBytes(word));
>>>     inc.addColumn("count", "", 1);
>>>  }
>>> }
>>> 
>>> Now, i wanted to do some experiments with coprocessors. So, i modified
>>> the code as follows.
>>> 
>>> My MR job code:
>>> ===============
>>> 
>>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>>> 
>>> Coprocessor code:
>>> ===============
>>> 
>>>  public void start(CoprocessorEnvironment env)  {
>>>      pool = new HTablePool(conf, 100);
>>>  }
>>> 
>>>  public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
>>> compareOp,     comparator,  put, result) {
>>> 
>>>              if(!result) return true; // check if the put succeeded
>>> 
>>>      HTableInterface table_idx = pool.getTable("doc_idx");
>>> 
>>>      try {
>>> 
>>>          for(KeyValue contentKV = put.get("doc_content", "")) {
>>>                          for(String word :
>>> contentKV.getValue().split("\\s+")) {
>>>                              Increment inc = new
>>> Increment(Bytes.toBytes(word));
>>>                              inc.addColumn("count", "", 1);
>>>                              table_idx.increment(inc);
>>>                          }
>>>                     }
>>>      } finally {
>>>          table_idx.close();
>>>      }
>>>      return true;
>>>  }
>>> 
>>>  public void stop(env) {
>>>      pool.close();
>>>  }
>>> 
>>> I am a newbee to HBASE. I am not sure this is the way to do.
>>> Given that, why is the cooprocessor enabled version much slower than
>>> the one without?
>>> 
>>> 
>>> Sincerely,
>>> Prakash Kadel
>>> 
>>> 
>>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>>> <mi...@hotmail.com> wrote:
>>>> 
>>>> The  issue I was talking about was the use of a check and put.
>>>> The OP wrote:
>>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some 
>> rows to
>>>>>>>> a index table.
>>>> 
>>>> My question is why does the OP use a checkAndPut, and the 
>> RegionObserver's postChecAndPut?
>>>> 
>>>> 
>>>> Here's a good example... 
>> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>> 
>>>> 
>>>> The OP doesn't really get in to the use case, so we don't know why the 
>> Check and Put in the M/R job.
>>>> He should just be using put() and then a postPut().
>>>> 
>>>> Another issue... since he's writing to  a different HTable... how? Does 
>> he create an HTable instance in the start() method of his RO object and 
>> then reference it later? Or does he create the instance of the HTable on 
>> the fly in each postCheckAndPut() ?
>>>> Without seeing his code, we don't know.
>>>> 
>>>> Note that this is synchronous set of writes. Your overall return from 
>> the M/R call to put will wait until the second row is inserted.
>>>> 
>>>> Interestingly enough, you may want to consider disabling the WAL on the 
>> write to the index.  You can always run a M/R job that rebuilds the index 
>> should something occur to the system where you might lose the data. 
>> Indexes *ARE* expendable. ;-)
>>>> 
>>>> Does that explain it?
>>>> 
>>>> -Mike
>>>> 
>>>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>>>> 
>>>>> Hi, Michael
>>>>> 
>>>>> I don't quite understand what do you mean by "round trip back to the
>>>>> client". In my understanding, as the RegionServer and TaskTracker can
>>>>> be the same node, MR don't have to pull data into client and then
>>>>> process.  And you also mention the "unnecessary overhead", can you
>>>>> explain a little bit what operations or data processing can be seen as
>>>>> "unnecessary overhead".
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> yong
>>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>>> <mi...@hotmail.com> wrote:
>>>>>> Why?
>>>>>> 
>>>>>> This seems like an unnecessary overhead.
>>>>>> 
>>>>>> You are writing code within the coprocessor on the server. 
>> Pessimistic code really isn't recommended if you are worried about 
>> performance.
>>>>>> 
>>>>>> I have to ask... by the time you have executed the code in your 
>> co-processor, what would cause the initial write to fail?
>>>>>> 
>>>>>> 
>>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> 
>> wrote:
>>>>>> 
>>>>>>> its a local read. i just check the last param of PostCheckAndPut 
>> indicating if the Put succeeded. Incase if the put success, i insert a row 
>> in another table
>>>>>>> 
>>>>>>> Sincerely,
>>>>>>> Prakash Kadel
>>>>>>> 
>>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>>>> 
>>>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the 
>> nature of
>>>>>>>> LSM, read is much slower compared to a write...
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best Regards,
>>>>>>>> Wei
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>>>> Date:   02/17/2013 07:49 PM
>>>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> hi,
>>>>>>>> i am trying to insert few million documents to hbase with 
>> mapreduce. To
>>>>>>>> enable quick search of docs i want to have some indexes, so i tried 
>> to use
>>>>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>>>>> coprocessors not supposed to increase the latency?
>>>>>>>> my settings:
>>>>>>>> 3 region servers
>>>>>>>> 60 maps
>>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some 
>> rows to
>>>>>>>> a index table.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Sincerely,
>>>>>>>> Prakash
>>>>>> 
>>>>>> Michael Segel  | (m) 312.755.9623
>>>>>> 
>>>>>> Segel and Associates
>>> 
>> 
>> 
> 
> 


Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
Good question. 

You create a class MyRO. 

How many instances of MyRO exist per RS?

How many queries can access that MyRO instance at the same time? 




On Feb 19, 2013, at 9:15 AM, Wei Tan <wt...@us.ibm.com> wrote:

> A side question: if HTablePool is not encouraged to be used... how we 
> handle the thread safeness in using HTable? Any replacement for 
> HTablePool, in plan?
> Thanks,
> 
> 
> Best Regards,
> Wei
> 
> 
> 
> 
> From:   Michel Segel <mi...@hotmail.com>
> To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
> Date:   02/18/2013 09:23 AM
> Subject:        Re: coprocessor enabled put very slow, help please~~~
> 
> 
> 
> Why are you using an HTable Pool?
> Why are you closing the table after each iteration through?
> 
> Try using 1 HTable object. Turn off WAL
> Initiate in start()
> Close in Stop()
> Surround the use in a try / catch
> If exception caught, re instantiate new HTable connection.
> 
> Maybe want to flush the connection after puts. 
> 
> 
> Again not sure why you are using check and put on the base table. Your 
> count could be off.
> 
> As an example look at poem/rhyme 'Marry had a little lamb'.
> Then check your word count.
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Feb 18, 2013, at 7:21 AM, prakash kadel <pr...@gmail.com> 
> wrote:
> 
>> Thank you guys for your replies,
>> Michael,
>>  I think i didnt make it clear. Here is my use case,
>> 
>> I have text documents to insert in the hbase. (With possible duplicates)
>> Suppose i have a document as : " I am working. He is not working"
>> 
>> I want to insert this document to a table in hbase, say table "doc"
>> 
>> =doc table=
>> -----
>> rowKey : doc_id
>> cf: doc_content
>> value: "I am working. He is not working"
>> 
>> Now, i to create another table that stores the word count, say "doc_idx"
>> 
>> doc_idx table
>> ---
>> rowKey : I, cf: count, value: 1
>> rowKey : am, cf: count, value: 1
>> rowKey : working, cf: count, value: 2
>> rowKey : He, cf: count, value: 1
>> rowKey : is, cf: count, value: 1
>> rowKey : not, cf: count, value: 1
>> 
>> My MR job code:
>> ==============
>> 
>> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>   for(String word : doc_content.split("\\s+")) {
>>      Increment inc = new Increment(Bytes.toBytes(word));
>>      inc.addColumn("count", "", 1);
>>   }
>> }
>> 
>> Now, i wanted to do some experiments with coprocessors. So, i modified
>> the code as follows.
>> 
>> My MR job code:
>> ===============
>> 
>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>> 
>> Coprocessor code:
>> ===============
>> 
>>   public void start(CoprocessorEnvironment env)  {
>>       pool = new HTablePool(conf, 100);
>>   }
>> 
>>   public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
>> compareOp,     comparator,  put, result) {
>> 
>>               if(!result) return true; // check if the put succeeded
>> 
>>       HTableInterface table_idx = pool.getTable("doc_idx");
>> 
>>       try {
>> 
>>           for(KeyValue contentKV = put.get("doc_content", "")) {
>>                           for(String word :
>> contentKV.getValue().split("\\s+")) {
>>                               Increment inc = new
>> Increment(Bytes.toBytes(word));
>>                               inc.addColumn("count", "", 1);
>>                               table_idx.increment(inc);
>>                           }
>>                      }
>>       } finally {
>>           table_idx.close();
>>       }
>>       return true;
>>   }
>> 
>>   public void stop(env) {
>>       pool.close();
>>   }
>> 
>> I am a newbee to HBASE. I am not sure this is the way to do.
>> Given that, why is the cooprocessor enabled version much slower than
>> the one without?
>> 
>> 
>> Sincerely,
>> Prakash Kadel
>> 
>> 
>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>> <mi...@hotmail.com> wrote:
>>> 
>>> The  issue I was talking about was the use of a check and put.
>>> The OP wrote:
>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some 
> rows to
>>>>>>> a index table.
>>> 
>>> My question is why does the OP use a checkAndPut, and the 
> RegionObserver's postChecAndPut?
>>> 
>>> 
>>> Here's a good example... 
> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
> 
>>> 
>>> The OP doesn't really get in to the use case, so we don't know why the 
> Check and Put in the M/R job.
>>> He should just be using put() and then a postPut().
>>> 
>>> Another issue... since he's writing to  a different HTable... how? Does 
> he create an HTable instance in the start() method of his RO object and 
> then reference it later? Or does he create the instance of the HTable on 
> the fly in each postCheckAndPut() ?
>>> Without seeing his code, we don't know.
>>> 
>>> Note that this is synchronous set of writes. Your overall return from 
> the M/R call to put will wait until the second row is inserted.
>>> 
>>> Interestingly enough, you may want to consider disabling the WAL on the 
> write to the index.  You can always run a M/R job that rebuilds the index 
> should something occur to the system where you might lose the data. 
> Indexes *ARE* expendable. ;-)
>>> 
>>> Does that explain it?
>>> 
>>> -Mike
>>> 
>>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>>> 
>>>> Hi, Michael
>>>> 
>>>> I don't quite understand what do you mean by "round trip back to the
>>>> client". In my understanding, as the RegionServer and TaskTracker can
>>>> be the same node, MR don't have to pull data into client and then
>>>> process.  And you also mention the "unnecessary overhead", can you
>>>> explain a little bit what operations or data processing can be seen as
>>>> "unnecessary overhead".
>>>> 
>>>> Thanks
>>>> 
>>>> yong
>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>> <mi...@hotmail.com> wrote:
>>>>> Why?
>>>>> 
>>>>> This seems like an unnecessary overhead.
>>>>> 
>>>>> You are writing code within the coprocessor on the server. 
> Pessimistic code really isn't recommended if you are worried about 
> performance.
>>>>> 
>>>>> I have to ask... by the time you have executed the code in your 
> co-processor, what would cause the initial write to fail?
>>>>> 
>>>>> 
>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> 
> wrote:
>>>>> 
>>>>>> its a local read. i just check the last param of PostCheckAndPut 
> indicating if the Put succeeded. Incase if the put success, i insert a row 
> in another table
>>>>>> 
>>>>>> Sincerely,
>>>>>> Prakash Kadel
>>>>>> 
>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>>> 
>>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the 
> nature of
>>>>>>> LSM, read is much slower compared to a write...
>>>>>>> 
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> Wei
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>>> Date:   02/17/2013 07:49 PM
>>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> hi,
>>>>>>> i am trying to insert few million documents to hbase with 
> mapreduce. To
>>>>>>> enable quick search of docs i want to have some indexes, so i tried 
> to use
>>>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>>>> coprocessors not supposed to increase the latency?
>>>>>>> my settings:
>>>>>>> 3 region servers
>>>>>>> 60 maps
>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some 
> rows to
>>>>>>> a index table.
>>>>>>> 
>>>>>>> 
>>>>>>> Sincerely,
>>>>>>> Prakash
>>>>> 
>>>>> Michael Segel  | (m) 312.755.9623
>>>>> 
>>>>> Segel and Associates
>> 
> 
> 


Re: coprocessor enabled put very slow, help please~~~

Posted by Andrew Purtell <ap...@apache.org>.
A coprocessor is some code running in a server process. The resources
available and the rules of the road are different from client-side programming.
HTablePool (and HTable in general) is problematic for server-side
programming in my opinion: http://search-hadoop.com/m/XtAi5Fogw32 Since
this comes up now and again, it seems like a lightweight alternative for
server-side IPC could be useful.
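
As a sketch of one lighter-weight alternative (a sketch only, not an official recommendation, and assuming the 0.92-era CoprocessorEnvironment#getTable call): hold a single handle obtained from the coprocessor environment for the observer's lifetime instead of building an HTablePool inside it. The table name is hypothetical, and whether one handle is safe depends on how concurrently the hooks run, since HTable itself is not thread-safe.

    import java.io.IOException;

    import org.apache.hadoop.hbase.CoprocessorEnvironment;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SharedTableObserver extends BaseRegionObserver {
        private HTableInterface indexTable; // hypothetical index table handle

        @Override
        public void start(CoprocessorEnvironment env) throws IOException {
            // Obtain the table through the environment rather than creating a raw
            // HTable or an HTablePool inside the observer.
            indexTable = env.getTable(Bytes.toBytes("doc_idx"));
        }

        @Override
        public void stop(CoprocessorEnvironment env) throws IOException {
            if (indexTable != null) {
                indexTable.close();
            }
        }
    }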


On Tue, Feb 19, 2013 at 7:15 AM, Wei Tan <wt...@us.ibm.com> wrote:

> A side question: if HTablePool is not encouraged to be used... how we
> handle the thread safeness in using HTable? Any replacement for
> HTablePool, in plan?
> Thanks,
>
>
> Best Regards,
> Wei
>
>
>
>
> From:   Michel Segel <mi...@hotmail.com>
> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
> Date:   02/18/2013 09:23 AM
> Subject:        Re: coprocessor enabled put very slow, help please~~~
>
>
>
> Why are you using an HTable Pool?
> Why are you closing the table after each iteration through?
>
> Try using 1 HTable object. Turn off WAL
> Initiate in start()
> Close in Stop()
> Surround the use in a try / catch
> If exception caught, re instantiate new HTable connection.
>
> Maybe want to flush the connection after puts.
>
>
> Again not sure why you are using check and put on the base table. Your
> count could be off.
>
> As an example look at poem/rhyme 'Marry had a little lamb'.
> Then check your word count.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 18, 2013, at 7:21 AM, prakash kadel <pr...@gmail.com>
> wrote:
>
> > Thank you guys for your replies,
> > Michael,
> >   I think i didnt make it clear. Here is my use case,
> >
> > I have text documents to insert in the hbase. (With possible duplicates)
> > Suppose i have a document as : " I am working. He is not working"
> >
> > I want to insert this document to a table in hbase, say table "doc"
> >
> > =doc table=
> > -----
> > rowKey : doc_id
> > cf: doc_content
> > value: "I am working. He is not working"
> >
> > Now, i to create another table that stores the word count, say "doc_idx"
> >
> > doc_idx table
> > ---
> > rowKey : I, cf: count, value: 1
> > rowKey : am, cf: count, value: 1
> > rowKey : working, cf: count, value: 2
> > rowKey : He, cf: count, value: 1
> > rowKey : is, cf: count, value: 1
> > rowKey : not, cf: count, value: 1
> >
> > My MR job code:
> > ==============
> >
> > if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
> >    for(String word : doc_content.split("\\s+")) {
> >       Increment inc = new Increment(Bytes.toBytes(word));
> >       inc.addColumn("count", "", 1);
> >    }
> > }
> >
> > Now, i wanted to do some experiments with coprocessors. So, i modified
> > the code as follows.
> >
> > My MR job code:
> > ===============
> >
> > doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
> >
> > Coprocessor code:
> > ===============
> >
> >    public void start(CoprocessorEnvironment env)  {
> >        pool = new HTablePool(conf, 100);
> >    }
> >
> >    public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
> > compareOp,     comparator,  put, result) {
> >
> >                if(!result) return true; // check if the put succeeded
> >
> >        HTableInterface table_idx = pool.getTable("doc_idx");
> >
> >        try {
> >
> >            for(KeyValue contentKV = put.get("doc_content", "")) {
> >                            for(String word :
> > contentKV.getValue().split("\\s+")) {
> >                                Increment inc = new
> > Increment(Bytes.toBytes(word));
> >                                inc.addColumn("count", "", 1);
> >                                table_idx.increment(inc);
> >                            }
> >                       }
> >        } finally {
> >            table_idx.close();
> >        }
> >        return true;
> >    }
> >
> >    public void stop(env) {
> >        pool.close();
> >    }
> >
> > I am a newbee to HBASE. I am not sure this is the way to do.
> > Given that, why is the cooprocessor enabled version much slower than
> > the one without?
> >
> >
> > Sincerely,
> > Prakash Kadel
> >
> >
> > On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
> > <mi...@hotmail.com> wrote:
> >>
> >> The  issue I was talking about was the use of a check and put.
> >> The OP wrote:
> >>>>>> each map inserts to doc table.(checkAndPut)
> >>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some
> rows to
> >>>>>> a index table.
> >>
> >> My question is why does the OP use a checkAndPut, and the
> RegionObserver's postChecAndPut?
> >>
> >>
> >> Here's a good example...
>
> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>
> >>
> >> The OP doesn't really get in to the use case, so we don't know why the
> Check and Put in the M/R job.
> >> He should just be using put() and then a postPut().
> >>
> >> Another issue... since he's writing to  a different HTable... how? Does
> he create an HTable instance in the start() method of his RO object and
> then reference it later? Or does he create the instance of the HTable on
> the fly in each postCheckAndPut() ?
> >> Without seeing his code, we don't know.
> >>
> >> Note that this is synchronous set of writes. Your overall return from
> the M/R call to put will wait until the second row is inserted.
> >>
> >> Interestingly enough, you may want to consider disabling the WAL on the
> write to the index.  You can always run a M/R job that rebuilds the index
> should something occur to the system where you might lose the data.
> Indexes *ARE* expendable. ;-)
> >>
> >> Does that explain it?
> >>
> >> -Mike
> >>
> >> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
> >>
> >>> Hi, Michael
> >>>
> >>> I don't quite understand what do you mean by "round trip back to the
> >>> client". In my understanding, as the RegionServer and TaskTracker can
> >>> be the same node, MR don't have to pull data into client and then
> >>> process.  And you also mention the "unnecessary overhead", can you
> >>> explain a little bit what operations or data processing can be seen as
> >>> "unnecessary overhead".
> >>>
> >>> Thanks
> >>>
> >>> yong
> >>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
> >>> <mi...@hotmail.com> wrote:
> >>>> Why?
> >>>>
> >>>> This seems like an unnecessary overhead.
> >>>>
> >>>> You are writing code within the coprocessor on the server.
> Pessimistic code really isn't recommended if you are worried about
> performance.
> >>>>
> >>>> I have to ask... by the time you have executed the code in your
> co-processor, what would cause the initial write to fail?
> >>>>
> >>>>
> >>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com>
> wrote:
> >>>>
> >>>>> its a local read. i just check the last param of PostCheckAndPut
> indicating if the Put succeeded. Incase if the put success, i insert a row
> in another table
> >>>>>
> >>>>> Sincerely,
> >>>>> Prakash Kadel
> >>>>>
> >>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
> >>>>>
> >>>>>> Is your CheckAndPut involving a local or remote READ? Due to the
> nature of
> >>>>>> LSM, read is much slower compared to a write...
> >>>>>>
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Wei
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> From:   Prakash Kadel <pr...@gmail.com>
> >>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
> >>>>>> Date:   02/17/2013 07:49 PM
> >>>>>> Subject:        coprocessor enabled put very slow, help please~~~
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> hi,
> >>>>>> i am trying to insert few million documents to hbase with
> mapreduce. To
> >>>>>> enable quick search of docs i want to have some indexes, so i tried
> to use
> >>>>>> the coprocessors, but they are slowing down my inserts. Arent the
> >>>>>> coprocessors not supposed to increase the latency?
> >>>>>> my settings:
> >>>>>> 3 region servers
> >>>>>> 60 maps
> >>>>>> each map inserts to doc table.(checkAndPut)
> >>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some
> rows to
> >>>>>> a index table.
> >>>>>>
> >>>>>>
> >>>>>> Sincerely,
> >>>>>> Prakash
> >>>>
> >>>> Michael Segel  | (m) 312.755.9623
> >>>>
> >>>> Segel and Associates
> >
>
>
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: coprocessor enabled put very slow, help please~~~

Posted by Wei Tan <wt...@us.ibm.com>.
A side question: if HTablePool is not encouraged to be used... how do we 
handle thread safety when using HTable? Is any replacement for 
HTablePool planned?
Thanks,


Best Regards,
Wei




From:   Michel Segel <mi...@hotmail.com>
To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
Date:   02/18/2013 09:23 AM
Subject:        Re: coprocessor enabled put very slow, help please~~~



Why are you using an HTable Pool?
Why are you closing the table after each iteration through?

Try using 1 HTable object. Turn off WAL
Initiate in start()
Close in Stop()
Surround the use in a try / catch
If exception caught, re instantiate new HTable connection.

Maybe want to flush the connection after puts. 


Again not sure why you are using check and put on the base table. Your 
count could be off.

As an example look at poem/rhyme 'Marry had a little lamb'.
Then check your word count.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 18, 2013, at 7:21 AM, prakash kadel <pr...@gmail.com> 
wrote:

> Thank you guys for your replies,
> Michael,
>   I think i didnt make it clear. Here is my use case,
> 
> I have text documents to insert in the hbase. (With possible duplicates)
> Suppose i have a document as : " I am working. He is not working"
> 
> I want to insert this document to a table in hbase, say table "doc"
> 
> =doc table=
> -----
> rowKey : doc_id
> cf: doc_content
> value: "I am working. He is not working"
> 
> Now, i to create another table that stores the word count, say "doc_idx"
> 
> doc_idx table
> ---
> rowKey : I, cf: count, value: 1
> rowKey : am, cf: count, value: 1
> rowKey : working, cf: count, value: 2
> rowKey : He, cf: count, value: 1
> rowKey : is, cf: count, value: 1
> rowKey : not, cf: count, value: 1
> 
> My MR job code:
> ==============
> 
> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>    for(String word : doc_content.split("\\s+")) {
>       Increment inc = new Increment(Bytes.toBytes(word));
>       inc.addColumn("count", "", 1);
>    }
> }
> 
> Now, i wanted to do some experiments with coprocessors. So, i modified
> the code as follows.
> 
> My MR job code:
> ===============
> 
> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
> 
> Coprocessor code:
> ===============
> 
>    public void start(CoprocessorEnvironment env)  {
>        pool = new HTablePool(conf, 100);
>    }
> 
>    public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
> compareOp,     comparator,  put, result) {
> 
>                if(!result) return true; // check if the put succeeded
> 
>        HTableInterface table_idx = pool.getTable("doc_idx");
> 
>        try {
> 
>            for(KeyValue contentKV = put.get("doc_content", "")) {
>                            for(String word :
> contentKV.getValue().split("\\s+")) {
>                                Increment inc = new
> Increment(Bytes.toBytes(word));
>                                inc.addColumn("count", "", 1);
>                                table_idx.increment(inc);
>                            }
>                       }
>        } finally {
>            table_idx.close();
>        }
>        return true;
>    }
> 
>    public void stop(env) {
>        pool.close();
>    }
> 
> I am a newbee to HBASE. I am not sure this is the way to do.
> Given that, why is the cooprocessor enabled version much slower than
> the one without?
> 
> 
> Sincerely,
> Prakash Kadel
> 
> 
> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
> <mi...@hotmail.com> wrote:
>> 
>> The  issue I was talking about was the use of a check and put.
>> The OP wrote:
>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some 
rows to
>>>>>> a index table.
>> 
>> My question is why does the OP use a checkAndPut, and the 
RegionObserver's postChecAndPut?
>> 
>> 
>> Here's a good example... 
http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put

>> 
>> The OP doesn't really get in to the use case, so we don't know why the 
Check and Put in the M/R job.
>> He should just be using put() and then a postPut().
>> 
>> Another issue... since he's writing to  a different HTable... how? Does 
he create an HTable instance in the start() method of his RO object and 
then reference it later? Or does he create the instance of the HTable on 
the fly in each postCheckAndPut() ?
>> Without seeing his code, we don't know.
>> 
>> Note that this is synchronous set of writes. Your overall return from 
the M/R call to put will wait until the second row is inserted.
>> 
>> Interestingly enough, you may want to consider disabling the WAL on the 
write to the index.  You can always run a M/R job that rebuilds the index 
should something occur to the system where you might lose the data. 
Indexes *ARE* expendable. ;-)
>> 
>> Does that explain it?
>> 
>> -Mike
>> 
>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>> 
>>> Hi, Michael
>>> 
>>> I don't quite understand what do you mean by "round trip back to the
>>> client". In my understanding, as the RegionServer and TaskTracker can
>>> be the same node, MR don't have to pull data into client and then
>>> process.  And you also mention the "unnecessary overhead", can you
>>> explain a little bit what operations or data processing can be seen as
>>> "unnecessary overhead".
>>> 
>>> Thanks
>>> 
>>> yong
>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>> <mi...@hotmail.com> wrote:
>>>> Why?
>>>> 
>>>> This seems like an unnecessary overhead.
>>>> 
>>>> You are writing code within the coprocessor on the server. 
Pessimistic code really isn't recommended if you are worried about 
performance.
>>>> 
>>>> I have to ask... by the time you have executed the code in your 
co-processor, what would cause the initial write to fail?
>>>> 
>>>> 
>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> 
wrote:
>>>> 
>>>>> its a local read. i just check the last param of PostCheckAndPut 
indicating if the Put succeeded. Incase if the put success, i insert a row 
in another table
>>>>> 
>>>>> Sincerely,
>>>>> Prakash Kadel
>>>>> 
>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>> 
>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the 
nature of
>>>>>> LSM, read is much slower compared to a write...
>>>>>> 
>>>>>> 
>>>>>> Best Regards,
>>>>>> Wei
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>> Date:   02/17/2013 07:49 PM
>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> hi,
>>>>>> i am trying to insert few million documents to hbase with 
mapreduce. To
>>>>>> enable quick search of docs i want to have some indexes, so i tried 
to use
>>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>>> coprocessors not supposed to increase the latency?
>>>>>> my settings:
>>>>>> 3 region servers
>>>>>> 60 maps
>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some 
rows to
>>>>>> a index table.
>>>>>> 
>>>>>> 
>>>>>> Sincerely,
>>>>>> Prakash
>>>> 
>>>> Michael Segel  | (m) 312.755.9623
>>>> 
>>>> Segel and Associates
> 



Re: coprocessor enabled put very slow, help please~~~

Posted by prakash kadel <pr...@gmail.com>.
Thanks,
   I am going to do some tests and let you know.



On Mon, Feb 18, 2013 at 11:13 PM, Michel Segel
<mi...@hotmail.com> wrote:
> Why are you using an HTable Pool?
> Why are you closing the table after each iteration through?
>
> Try using 1 HTable object. Turn off WAL
> Initiate in start()
> Close in Stop()
> Surround the use in a try / catch
> If exception caught, re instantiate new HTable connection.
>
> Maybe want to flush the connection after puts.
>
>
> Again not sure why you are using check and put on the base table. Your count could be off.
>
> As an example look at poem/rhyme 'Marry had a little lamb'.
> Then check your word count.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 18, 2013, at 7:21 AM, prakash kadel <pr...@gmail.com> wrote:
>
>> Thank you guys for your replies,
>> Michael,
>>   I think i didnt make it clear. Here is my use case,
>>
>> I have text documents to insert in the hbase. (With possible duplicates)
>> Suppose i have a document as : " I am working. He is not working"
>>
>> I want to insert this document to a table in hbase, say table "doc"
>>
>> =doc table=
>> -----
>> rowKey : doc_id
>> cf: doc_content
>> value: "I am working. He is not working"
>>
>> Now, i to create another table that stores the word count, say "doc_idx"
>>
>> doc_idx table
>> ---
>> rowKey : I, cf: count, value: 1
>> rowKey : am, cf: count, value: 1
>> rowKey : working, cf: count, value: 2
>> rowKey : He, cf: count, value: 1
>> rowKey : is, cf: count, value: 1
>> rowKey : not, cf: count, value: 1
>>
>> My MR job code:
>> ==============
>>
>> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>    for(String word : doc_content.split("\\s+")) {
>>       Increment inc = new Increment(Bytes.toBytes(word));
>>       inc.addColumn("count", "", 1);
>>    }
>> }
>>
>> Now, i wanted to do some experiments with coprocessors. So, i modified
>> the code as follows.
>>
>> My MR job code:
>> ===============
>>
>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>>
>> Coprocessor code:
>> ===============
>>
>>    public void start(CoprocessorEnvironment env)  {
>>        pool = new HTablePool(conf, 100);
>>    }
>>
>>    public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
>> compareOp,     comparator,  put, result) {
>>
>>                if(!result) return true; // check if the put succeeded
>>
>>        HTableInterface table_idx = pool.getTable("doc_idx");
>>
>>        try {
>>
>>            for(KeyValue contentKV = put.get("doc_content", "")) {
>>                            for(String word :
>> contentKV.getValue().split("\\s+")) {
>>                                Increment inc = new
>> Increment(Bytes.toBytes(word));
>>                                inc.addColumn("count", "", 1);
>>                                table_idx.increment(inc);
>>                            }
>>                       }
>>        } finally {
>>            table_idx.close();
>>        }
>>        return true;
>>    }
>>
>>    public void stop(env) {
>>        pool.close();
>>    }
>>
>> I am a newbee to HBASE. I am not sure this is the way to do.
>> Given that, why is the cooprocessor enabled version much slower than
>> the one without?
>>
>>
>> Sincerely,
>> Prakash Kadel
>>
>>
>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>> <mi...@hotmail.com> wrote:
>>>
>>> The  issue I was talking about was the use of a check and put.
>>> The OP wrote:
>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>>>> a index table.
>>>
>>> My question is why does the OP use a checkAndPut, and the RegionObserver's postChecAndPut?
>>>
>>>
>>> Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>>>
>>> The OP doesn't really get in to the use case, so we don't know why the Check and Put in the M/R job.
>>> He should just be using put() and then a postPut().
>>>
>>> Another issue... since he's writing to  a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the instance of the HTable on the fly in each postCheckAndPut() ?
>>> Without seeing his code, we don't know.
>>>
>>> Note that this is synchronous set of writes. Your overall return from the M/R call to put will wait until the second row is inserted.
>>>
>>> Interestingly enough, you may want to consider disabling the WAL on the write to the index.  You can always run a M/R job that rebuilds the index should something occur to the system where you might lose the data.  Indexes *ARE* expendable. ;-)
>>>
>>> Does that explain it?
>>>
>>> -Mike
>>>
>>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>>>
>>>> Hi, Michael
>>>>
>>>> I don't quite understand what do you mean by "round trip back to the
>>>> client". In my understanding, as the RegionServer and TaskTracker can
>>>> be the same node, MR don't have to pull data into client and then
>>>> process.  And you also mention the "unnecessary overhead", can you
>>>> explain a little bit what operations or data processing can be seen as
>>>> "unnecessary overhead".
>>>>
>>>> Thanks
>>>>
>>>> yong
>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>> <mi...@hotmail.com> wrote:
>>>>> Why?
>>>>>
>>>>> This seems like an unnecessary overhead.
>>>>>
>>>>> You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.
>>>>>
>>>>> I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail?
>>>>>
>>>>>
>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:
>>>>>
>>>>>> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
>>>>>>
>>>>>> Sincerely,
>>>>>> Prakash Kadel
>>>>>>
>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>>>
>>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>>>>>> LSM, read is much slower compared to a write...
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Wei
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>>> Date:   02/17/2013 07:49 PM
>>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> hi,
>>>>>>> i am trying to insert few million documents to hbase with mapreduce. To
>>>>>>> enable quick search of docs i want to have some indexes, so i tried to use
>>>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>>>> coprocessors not supposed to increase the latency?
>>>>>>> my settings:
>>>>>>> 3 region servers
>>>>>>> 60 maps
>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>>>> a index table.
>>>>>>>
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Prakash
>>>>>
>>>>> Michael Segel  | (m) 312.755.9623
>>>>>
>>>>> Segel and Associates
>>

Re: coprocessor enabled put very slow, help please~~~

Posted by Michel Segel <mi...@hotmail.com>.
Why are you using an HTablePool?
Why are you closing the table after each iteration through?

Try using one HTable object. Turn off the WAL.
Instantiate it in start().
Close it in stop().
Surround its use in a try / catch.
If an exception is caught, re-instantiate a new HTable connection.

You may also want to flush the connection after the puts.


Again, not sure why you are using checkAndPut on the base table. Your count could be off.

As an example, look at the poem/rhyme 'Mary had a little lamb'.
Then check your word count.
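
A rough sketch of that single-HTable pattern, assuming the 0.94-era coprocessor API and the doc_idx / count names used in this thread (illustrative only, not the poster's actual code):

import java.io.IOException;

import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
import org.apache.hadoop.hbase.util.Bytes;

public class DocIndexObserver extends BaseRegionObserver {

    private HTableInterface tableIdx;   // one handle for the index table, created once

    @Override
    public void start(CoprocessorEnvironment env) throws IOException {
        tableIdx = env.getTable(Bytes.toBytes("doc_idx"));   // instantiate in start()
    }

    @Override
    public void stop(CoprocessorEnvironment env) throws IOException {
        if (tableIdx != null) {
            tableIdx.close();                                 // close in stop(), not after every put
        }
    }

    @Override
    public boolean postCheckAndPut(ObserverContext<RegionCoprocessorEnvironment> c,
            byte[] row, byte[] family, byte[] qualifier, CompareOp compareOp,
            WritableByteArrayComparable comparator, Put put, boolean result)
            throws IOException {
        if (!result) {
            return true;                                      // the doc put did not go through, nothing to index
        }
        try {
            for (KeyValue kv : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
                for (String word : Bytes.toString(kv.getValue()).split("\\s+")) {
                    Increment inc = new Increment(Bytes.toBytes(word));
                    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1L);
                    inc.setWriteToWAL(false);                 // assumption: your version exposes setWriteToWAL on Increment
                    tableIdx.increment(inc);
                }
            }
        } catch (IOException e) {
            // on failure, re-instantiate the handle so the next call gets a fresh connection
            tableIdx = c.getEnvironment().getTable(Bytes.toBytes("doc_idx"));
            throw e;
        }
        return true;
    }
}

(One caveat on the design: a single HTable instance is not thread-safe, so a real implementation would need to guard access or keep one handle per handler thread.)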

Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 18, 2013, at 7:21 AM, prakash kadel <pr...@gmail.com> wrote:

> Thank you guys for your replies,
> Michael,
>   I think i didnt make it clear. Here is my use case,
> 
> I have text documents to insert in the hbase. (With possible duplicates)
> Suppose i have a document as : " I am working. He is not working"
> 
> I want to insert this document to a table in hbase, say table "doc"
> 
> =doc table=
> -----
> rowKey : doc_id
> cf: doc_content
> value: "I am working. He is not working"
> 
> Now, i to create another table that stores the word count, say "doc_idx"
> 
> doc_idx table
> ---
> rowKey : I, cf: count, value: 1
> rowKey : am, cf: count, value: 1
> rowKey : working, cf: count, value: 2
> rowKey : He, cf: count, value: 1
> rowKey : is, cf: count, value: 1
> rowKey : not, cf: count, value: 1
> 
> My MR job code:
> ==============
> 
> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>    for(String word : doc_content.split("\\s+")) {
>       Increment inc = new Increment(Bytes.toBytes(word));
>       inc.addColumn("count", "", 1);
>    }
> }
> 
> Now, i wanted to do some experiments with coprocessors. So, i modified
> the code as follows.
> 
> My MR job code:
> ===============
> 
> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
> 
> Coprocessor code:
> ===============
> 
>    public void start(CoprocessorEnvironment env)  {
>        pool = new HTablePool(conf, 100);
>    }
> 
>    public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
> compareOp,     comparator,  put, result) {
> 
>                if(!result) return true; // check if the put succeeded
> 
>        HTableInterface table_idx = pool.getTable("doc_idx");
> 
>        try {
> 
>            for(KeyValue contentKV = put.get("doc_content", "")) {
>                            for(String word :
> contentKV.getValue().split("\\s+")) {
>                                Increment inc = new
> Increment(Bytes.toBytes(word));
>                                inc.addColumn("count", "", 1);
>                                table_idx.increment(inc);
>                            }
>                       }
>        } finally {
>            table_idx.close();
>        }
>        return true;
>    }
> 
>    public void stop(env) {
>        pool.close();
>    }
> 
> I am a newbee to HBASE. I am not sure this is the way to do.
> Given that, why is the cooprocessor enabled version much slower than
> the one without?
> 
> 
> Sincerely,
> Prakash Kadel
> 
> 
> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
> <mi...@hotmail.com> wrote:
>> 
>> The  issue I was talking about was the use of a check and put.
>> The OP wrote:
>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>>> a index table.
>> 
>> My question is why does the OP use a checkAndPut, and the RegionObserver's postChecAndPut?
>> 
>> 
>> Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>> 
>> The OP doesn't really get in to the use case, so we don't know why the Check and Put in the M/R job.
>> He should just be using put() and then a postPut().
>> 
>> Another issue... since he's writing to  a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the instance of the HTable on the fly in each postCheckAndPut() ?
>> Without seeing his code, we don't know.
>> 
>> Note that this is synchronous set of writes. Your overall return from the M/R call to put will wait until the second row is inserted.
>> 
>> Interestingly enough, you may want to consider disabling the WAL on the write to the index.  You can always run a M/R job that rebuilds the index should something occur to the system where you might lose the data.  Indexes *ARE* expendable. ;-)
>> 
>> Does that explain it?
>> 
>> -Mike
>> 
>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>> 
>>> Hi, Michael
>>> 
>>> I don't quite understand what do you mean by "round trip back to the
>>> client". In my understanding, as the RegionServer and TaskTracker can
>>> be the same node, MR don't have to pull data into client and then
>>> process.  And you also mention the "unnecessary overhead", can you
>>> explain a little bit what operations or data processing can be seen as
>>> "unnecessary overhead".
>>> 
>>> Thanks
>>> 
>>> yong
>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>> <mi...@hotmail.com> wrote:
>>>> Why?
>>>> 
>>>> This seems like an unnecessary overhead.
>>>> 
>>>> You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.
>>>> 
>>>> I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail?
>>>> 
>>>> 
>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:
>>>> 
>>>>> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
>>>>> 
>>>>> Sincerely,
>>>>> Prakash Kadel
>>>>> 
>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>> 
>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>>>>> LSM, read is much slower compared to a write...
>>>>>> 
>>>>>> 
>>>>>> Best Regards,
>>>>>> Wei
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>> Date:   02/17/2013 07:49 PM
>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> hi,
>>>>>> i am trying to insert few million documents to hbase with mapreduce. To
>>>>>> enable quick search of docs i want to have some indexes, so i tried to use
>>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>>> coprocessors not supposed to increase the latency?
>>>>>> my settings:
>>>>>> 3 region servers
>>>>>> 60 maps
>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>>> a index table.
>>>>>> 
>>>>>> 
>>>>>> Sincerely,
>>>>>> Prakash
>>>> 
>>>> Michael Segel  | (m) 312.755.9623
>>>> 
>>>> Segel and Associates
> 

Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
Sorry for all these unclear queries.

I turned off the WAL on both the doc and the index table.

In my system all documents have a UUID (assigned before they come into the system), and I just use this UUID as the rowkey. So a duplicate basically means a document with the same id, even if the contents are the same.
For a poem like "Mary had a little lamb", the whole poem would probably be counted as a single document. When such a document comes in, the counts of the words in the poem are incremented by their number of occurrences in the poem.
If multiple docs have the same content but different ids, I just treat them as different docs and do the increments.


Sincerely,
Prakash Kadel

On Feb 20, 2013, at 11:14 PM, Michel Segel <mi...@hotmail.com> wrote:

> 
> What happens when you have a poem like Mary had a little lamb?
> 
> Did you turn off the WAL on both table inserts, or just the index?
> 
> If you want to avoid processing duplicate docs... You could do this a couple of ways. The simplest way is to record the doc ID and a check sum for the doc. If the doc you are processing matches... You can simply do NOOP for the lines in the doc. (This isn't the fastest, but its easy.)
> The other is to run a preprocess which removes duplicate doc from your directory and you then process the docs...
> 
> Third thing... Do a code review. Sloppy code will kill performance...
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Feb 20, 2013, at 5:26 AM, Prakash Kadel <pr...@gmail.com> wrote:
> 
>> michael, 
>>  infact i dont care about latency bw doc write and index write.
>> today i did some tests.
>> turns out turning off WAL does speed up the writes by about a factor of 2.
>> interestingly, enabling bloom filter did little to improve the checkandput.
>> 
>> earlier you mentioned
>>>>>> The OP doesn't really get in to the use case, so we don't know why the
>>>>> Check and Put in the M/R job.
>>>>>> He should just be using put() and then a postPut().
>> 
>> 
>> the main reason i use checkandput is to make sure the word count index doesnt get duplicate increments when duplicate documents come in. additionally i also need to dump dup free docs to hdfs for legacy system that we have in place.
>> is there some way to avoid chechandput?
>> 
>> 
>> Sincerely,
>> Prakash 
>> 
>> On Feb 20, 2013, at 10:00 PM, Michel Segel <mi...@hotmail.com> wrote:
>> 
>>> I was suggesting removing the write to WAL on your write to the index table only.
>>> 
>>> The thing you have to realize that true low latency systems use databases as a sink. It's the end of the line so to speak.
>>> 
>>> So if you're worried about a small latency between the writing to your doc table, and then the write of your index.. You are designing the wrong system.
>>> 
>>> Consider that it takes some time t to write the base record and then to write the indexes.
>>> For that period, you have a Schrödinger's cat problem as to if the row exists or not. Since HBase lacks transactions and ACID, trying to write a solution where you require the low latency... You are using the wrong tool.
>> 

Re: coprocessor enabled put very slow, help please~~~

Posted by Michel Segel <mi...@hotmail.com>.
What happens when you have a poem like Mary had a little lamb?

Did you turn off the WAL on both table inserts, or just the index?

If you want to avoid processing duplicate docs... you could do this a couple of ways. The simplest way is to record the doc ID and a checksum for the doc. If the doc you are processing matches, you can simply do a NOOP for the lines in the doc. (This isn't the fastest, but it's easy.)
The other is to run a preprocessing step which removes duplicate docs from your directory before you then process the docs...
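
A minimal client-side sketch of the doc-ID-plus-checksum idea, assuming the doc table layout from this thread; the "checksum" qualifier and the helper name are made up for illustration:

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: skip a doc whose id and content checksum were stored before.
boolean putIfNew(HTableInterface docTable, byte[] docId, byte[] content)
        throws IOException, NoSuchAlgorithmException {
    byte[] checksum = MessageDigest.getInstance("MD5").digest(content);

    Result existing = docTable.get(new Get(docId));
    byte[] stored = existing.getValue(Bytes.toBytes("doc_content"), Bytes.toBytes("checksum"));
    if (stored != null && Bytes.equals(stored, checksum)) {
        return false;                     // same doc id, same checksum -> NOOP
    }

    Put put = new Put(docId);
    put.add(Bytes.toBytes("doc_content"), Bytes.toBytes(""), content);
    put.add(Bytes.toBytes("doc_content"), Bytes.toBytes("checksum"), checksum);
    docTable.put(put);
    return true;                          // only index the words when this returns true
}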

Third thing... Do a code review. Sloppy code will kill performance...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 20, 2013, at 5:26 AM, Prakash Kadel <pr...@gmail.com> wrote:

> michael, 
>   infact i dont care about latency bw doc write and index write.
> today i did some tests.
> turns out turning off WAL does speed up the writes by about a factor of 2.
> interestingly, enabling bloom filter did little to improve the checkandput.
> 
> earlier you mentioned
>>>>> The OP doesn't really get in to the use case, so we don't know why the
>>>> Check and Put in the M/R job.
>>>>> He should just be using put() and then a postPut().
> 
> 
> the main reason i use checkandput is to make sure the word count index doesnt get duplicate increments when duplicate documents come in. additionally i also need to dump dup free docs to hdfs for legacy system that we have in place.
> is there some way to avoid chechandput?
> 
> 
> Sincerely,
> Prakash 
> 
> On Feb 20, 2013, at 10:00 PM, Michel Segel <mi...@hotmail.com> wrote:
> 
>> I was suggesting removing the write to WAL on your write to the index table only.
>> 
>> The thing you have to realize that true low latency systems use databases as a sink. It's the end of the line so to speak.
>> 
>> So if you're worried about a small latency between the writing to your doc table, and then the write of your index.. You are designing the wrong system.
>> 
>> Consider that it takes some time t to write the base record and then to write the indexes.
>> For that period, you have a Schrödinger's cat problem as to if the row exists or not. Since HBase lacks transactions and ACID, trying to write a solution where you require the low latency... You are using the wrong tool.
> 

Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
Michael,
   In fact I don't care about the latency between the doc write and the index write.
Today I did some tests.
It turns out that turning off the WAL does speed up the writes by about a factor of 2.
Interestingly, enabling the bloom filter did little to improve the checkAndPut.

earlier you mentioned
>>>> The OP doesn't really get in to the use case, so we don't know why the
>>> Check and Put in the M/R job.
>>>> He should just be using put() and then a postPut().


The main reason I use checkAndPut is to make sure the word count index doesn't get duplicate increments when duplicate documents come in. Additionally, I also need to dump the duplicate-free docs to HDFS for a legacy system that we have in place.
Is there some way to avoid checkAndPut?


Sincerely,
Prakash 

On Feb 20, 2013, at 10:00 PM, Michel Segel <mi...@hotmail.com> wrote:

> I was suggesting removing the write to WAL on your write to the index table only.
> 
> The thing you have to realize that true low latency systems use databases as a sink. It's the end of the line so to speak.
> 
> So if you're worried about a small latency between the writing to your doc table, and then the write of your index.. You are designing the wrong system.
> 
> Consider that it takes some time t to write the base record and then to write the indexes.
> For that period, you have a Schrödinger's cat problem as to if the row exists or not. Since HBase lacks transactions and ACID, trying to write a solution where you require the low latency... You are using the wrong tool.
> 
> Remember that HBase was designed as a distributed system for managing very large data sets. Your speed from using secondary indexes like an inverted table is in the read and not the write.
> 
> If you had append working, you could create an index if you could create a fixed sized key buffer. Or something down that path... Sorry, just thinking something out loud...
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Feb 19, 2013, at 1:53 PM, Asaf Mesika <as...@gmail.com> wrote:
> 
>> 1. Try batching your increment calls to a List<Row> and use batch() to
>> execute it. Should reduce RPC calls by 2 magnitudes.
>> 2. Combine batching with scanning more words, thus aggregating your count
>> for a certain word thus less Increment commands.
>> 3. Enable Bloom Filters. Should speed up Increment by a factor of 2 at
>> least.
>> 4. Don't use keyValue.getValue(). It does a System.arraycopy behind the
>> scenes. Use getBuffer() and getValueOffset() and getValueLength() and
>> iterate on the existing array. Write your own Split without going into
>> using String functions which goes through encoding (expensive). Just find
>> your delimiter by byte comparison.
>> 5. Enable BloomFilters on doc table. It should speed up the checkAndPut.
>> 6. I wouldn't give up WAL. It ain't your bottleneck IMO.
>> 
>> On Monday, February 18, 2013, prakash kadel wrote:
>> 
>>> Thank you guys for your replies,
>>> Michael,
>>>  I think i didnt make it clear. Here is my use case,
>>> 
>>> I have text documents to insert in the hbase. (With possible duplicates)
>>> Suppose i have a document as : " I am working. He is not working"
>>> 
>>> I want to insert this document to a table in hbase, say table "doc"
>>> 
>>> =doc table=
>>> -----
>>> rowKey : doc_id
>>> cf: doc_content
>>> value: "I am working. He is not working"
>>> 
>>> Now, i to create another table that stores the word count, say "doc_idx"
>>> 
>>> doc_idx table
>>> ---
>>> rowKey : I, cf: count, value: 1
>>> rowKey : am, cf: count, value: 1
>>> rowKey : working, cf: count, value: 2
>>> rowKey : He, cf: count, value: 1
>>> rowKey : is, cf: count, value: 1
>>> rowKey : not, cf: count, value: 1
>>> 
>>> My MR job code:
>>> ==============
>>> 
>>> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>>   for(String word : doc_content.split("\\s+")) {
>>>      Increment inc = new Increment(Bytes.toBytes(word));
>>>      inc.addColumn("count", "", 1);
>>>   }
>>> }
>>> 
>>> Now, i wanted to do some experiments with coprocessors. So, i modified
>>> the code as follows.
>>> 
>>> My MR job code:
>>> ===============
>>> 
>>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>>> 
>>> Coprocessor code:
>>> ===============
>>> 
>>>       public void start(CoprocessorEnvironment env)  {
>>>               pool = new HTablePool(conf, 100);
>>>       }
>>> 
>>>       public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
>>> compareOp,       comparator,  put, result) {
>>> 
>>>               if(!result) return true; // check if the put succeeded
>>> 
>>>               HTableInterface table_idx = pool.getTable("doc_idx");
>>> 
>>>               try {
>>> 
>>>                       for(KeyValue contentKV = put.get("doc_content",
>>> "")) {
>>>                           for(String word :
>>> contentKV.getValue().split("\\s+")) {
>>>                               Increment inc = new
>>> Increment(Bytes.toBytes(word));
>>>                               inc.addColumn("count", "", 1);
>>>                               table_idx.increment(inc);
>>>                           }
>>>                      }
>>>               } finally {
>>>                       table_idx.close();
>>>               }
>>>               return true;
>>>       }
>>> 
>>>       public void stop(env) {
>>>               pool.close();
>>>       }
>>> 
>>> I am a newbee to HBASE. I am not sure this is the way to do.
>>> Given that, why is the cooprocessor enabled version much slower than
>>> the one without?
>>> 
>>> 
>>> Sincerely,
>>> Prakash Kadel
>>> 
>>> 
>>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>>> <michael_segel@hotmail.com <javascript:;>> wrote:
>>>> 
>>>> The  issue I was talking about was the use of a check and put.
>>>> The OP wrote:
>>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some
>>> rows to
>>>>>>>> a index table.
>>>> 
>>>> My question is why does the OP use a checkAndPut, and the
>>> RegionObserver's postChecAndPut?
>>>> 
>>>> 
>>>> Here's a good example...
>>> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>>>> 
>>>> The OP doesn't really get in to the use case, so we don't know why the
>>> Check and Put in the M/R job.
>>>> He should just be using put() and then a postPut().
>>>> 
>>>> Another issue... since he's writing to  a different HTable... how? Does
>>> he create an HTable instance in the start() method of his RO object and
>>> then reference it later? Or does he create the instance of the HTable on
>>> the fly in each postCheckAndPut() ?
>>>> Without seeing his code, we don't know.
>>>> 
>>>> Note that this is synchronous set of writes. Your overall return from
>>> the M/R call to put will wait until the second row is inserted.
>>>> 
>>>> Interestingly enough, you may want to consider disabling the WAL on the
>>> write to the index.  You can always run a M/R job that rebuilds the index
>>> should something occur to the system where you might lose the data.
>>> Indexes *ARE* expendable. ;-)
>>>> 
>>>> Does that explain it?
>>>> 
>>>> -Mike
>>>> 
>>>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>>>> 
>>>>> Hi, Michael
>>>>> 
>>>>> I don't quite understand what do you mean by "round trip back to the
>>>>> client". In my understanding, as the RegionServer and TaskTracker can
>>>>> be the same node, MR don't have to pull data into client and then
>>>>> process.  And you also mention the "unnecessary overhead", can you
>>>>> explain a little bit what operations or data processing can be seen as
>>>>> "unnecessary overhead".
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> yong
>>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>>> <mi...@hotmail.com> wrote:
>>>>>> Why?
>>>>>> 
>>>>>> This seems like an unnecessary overhead.
>>>>>> 
>>>>>> You are writing code within the coprocessor on the server.
>>> Pessimistic code really isn't recommended if you are worried about
>>> performance.
>>>>>> 
>>>>>> I have to ask... by the time you have executed the code in your
>>> co-processor, what would cause the initial write to fail?
>>>>>> 
>>>>>> 
>>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>>> its a local read. i just check the last param of PostCheckAndPut
>>> indicating if the Put succeeded. Incase if the put success, i insert a row
>>> in another table
>>>>>>> 
>>>>>>> Sincerely,
>>>>>>> Prakash Kadel
>>>>>>> 
>>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>>>> 
>>>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the
>>> nature of
>>>>>>>> LSM, read is much slower compared to a write...
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Best Regards,
>>>>>>>> Wei
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>>>> Date:   02/17/2013 07:49 PM
>>>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> hi,
>>>>>>>> i am trying to insert few million documents

Re: coprocessor enabled put very slow, help please~~~

Posted by Michel Segel <mi...@hotmail.com>.
I was suggesting removing the write to WAL on your write to the index table only.

The thing you have to realize is that true low-latency systems use databases as a sink. It's the end of the line, so to speak.

So if you're worried about a small latency between the write to your doc table and the write to your index... you are designing the wrong system.

Consider that it takes some time t to write the base record and then to write the indexes.
For that period, you have a Schrödinger's cat problem as to whether the row exists or not. Since HBase lacks cross-row transactions and full ACID semantics, trying to write a solution where you require that low latency... you are using the wrong tool.

Remember that HBase was designed as a distributed system for managing very large data sets. Your speed gain from using secondary indexes like an inverted table is in the read, not the write.

If you had append working, you could create an index if you could create a fixed-size key buffer. Or something down that path... Sorry, just thinking out loud...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 19, 2013, at 1:53 PM, Asaf Mesika <as...@gmail.com> wrote:

> 1. Try batching your increment calls to a List<Row> and use batch() to
> execute it. Should reduce RPC calls by 2 magnitudes.
> 2. Combine batching with scanning more words, thus aggregating your count
> for a certain word thus less Increment commands.
> 3. Enable Bloom Filters. Should speed up Increment by a factor of 2 at
> least.
> 4. Don't use keyValue.getValue(). It does a System.arraycopy behind the
> scenes. Use getBuffer() and getValueOffset() and getValueLength() and
> iterate on the existing array. Write your own Split without going into
> using String functions which goes through encoding (expensive). Just find
> your delimiter by byte comparison.
> 5. Enable BloomFilters on doc table. It should speed up the checkAndPut.
> 6. I wouldn't give up WAL. It ain't your bottleneck IMO.
> 
> On Monday, February 18, 2013, prakash kadel wrote:
> 
>> Thank you guys for your replies,
>> Michael,
>>   I think i didnt make it clear. Here is my use case,
>> 
>> I have text documents to insert in the hbase. (With possible duplicates)
>> Suppose i have a document as : " I am working. He is not working"
>> 
>> I want to insert this document to a table in hbase, say table "doc"
>> 
>> =doc table=
>> -----
>> rowKey : doc_id
>> cf: doc_content
>> value: "I am working. He is not working"
>> 
>> Now, i to create another table that stores the word count, say "doc_idx"
>> 
>> doc_idx table
>> ---
>> rowKey : I, cf: count, value: 1
>> rowKey : am, cf: count, value: 1
>> rowKey : working, cf: count, value: 2
>> rowKey : He, cf: count, value: 1
>> rowKey : is, cf: count, value: 1
>> rowKey : not, cf: count, value: 1
>> 
>> My MR job code:
>> ==============
>> 
>> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>    for(String word : doc_content.split("\\s+")) {
>>       Increment inc = new Increment(Bytes.toBytes(word));
>>       inc.addColumn("count", "", 1);
>>    }
>> }
>> 
>> Now, i wanted to do some experiments with coprocessors. So, i modified
>> the code as follows.
>> 
>> My MR job code:
>> ===============
>> 
>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>> 
>> Coprocessor code:
>> ===============
>> 
>>        public void start(CoprocessorEnvironment env)  {
>>                pool = new HTablePool(conf, 100);
>>        }
>> 
>>        public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
>> compareOp,       comparator,  put, result) {
>> 
>>                if(!result) return true; // check if the put succeeded
>> 
>>                HTableInterface table_idx = pool.getTable("doc_idx");
>> 
>>                try {
>> 
>>                        for(KeyValue contentKV = put.get("doc_content",
>> "")) {
>>                            for(String word :
>> contentKV.getValue().split("\\s+")) {
>>                                Increment inc = new
>> Increment(Bytes.toBytes(word));
>>                                inc.addColumn("count", "", 1);
>>                                table_idx.increment(inc);
>>                            }
>>                       }
>>                } finally {
>>                        table_idx.close();
>>                }
>>                return true;
>>        }
>> 
>>        public void stop(env) {
>>                pool.close();
>>        }
>> 
>> I am a newbee to HBASE. I am not sure this is the way to do.
>> Given that, why is the cooprocessor enabled version much slower than
>> the one without?
>> 
>> 
>> Sincerely,
>> Prakash Kadel
>> 
>> 
>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>> <michael_segel@hotmail.com <javascript:;>> wrote:
>>> 
>>> The  issue I was talking about was the use of a check and put.
>>> The OP wrote:
>>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some
>> rows to
>>>>>>> a index table.
>>> 
>>> My question is why does the OP use a checkAndPut, and the
>> RegionObserver's postChecAndPut?
>>> 
>>> 
>>> Here's a good example...
>> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>>> 
>>> The OP doesn't really get in to the use case, so we don't know why the
>> Check and Put in the M/R job.
>>> He should just be using put() and then a postPut().
>>> 
>>> Another issue... since he's writing to  a different HTable... how? Does
>> he create an HTable instance in the start() method of his RO object and
>> then reference it later? Or does he create the instance of the HTable on
>> the fly in each postCheckAndPut() ?
>>> Without seeing his code, we don't know.
>>> 
>>> Note that this is synchronous set of writes. Your overall return from
>> the M/R call to put will wait until the second row is inserted.
>>> 
>>> Interestingly enough, you may want to consider disabling the WAL on the
>> write to the index.  You can always run a M/R job that rebuilds the index
>> should something occur to the system where you might lose the data.
>> Indexes *ARE* expendable. ;-)
>>> 
>>> Does that explain it?
>>> 
>>> -Mike
>>> 
>>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>>> 
>>>> Hi, Michael
>>>> 
>>>> I don't quite understand what do you mean by "round trip back to the
>>>> client". In my understanding, as the RegionServer and TaskTracker can
>>>> be the same node, MR don't have to pull data into client and then
>>>> process.  And you also mention the "unnecessary overhead", can you
>>>> explain a little bit what operations or data processing can be seen as
>>>> "unnecessary overhead".
>>>> 
>>>> Thanks
>>>> 
>>>> yong
>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>> <mi...@hotmail.com> wrote:
>>>>> Why?
>>>>> 
>>>>> This seems like an unnecessary overhead.
>>>>> 
>>>>> You are writing code within the coprocessor on the server.
>> Pessimistic code really isn't recommended if you are worried about
>> performance.
>>>>> 
>>>>> I have to ask... by the time you have executed the code in your
>> co-processor, what would cause the initial write to fail?
>>>>> 
>>>>> 
>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com>
>> wrote:
>>>>> 
>>>>>> its a local read. i just check the last param of PostCheckAndPut
>> indicating if the Put succeeded. Incase if the put success, i insert a row
>> in another table
>>>>>> 
>>>>>> Sincerely,
>>>>>> Prakash Kadel
>>>>>> 
>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>>> 
>>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the
>> nature of
>>>>>>> LSM, read is much slower compared to a write...
>>>>>>> 
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> Wei
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>>> Date:   02/17/2013 07:49 PM
>>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> hi,
>>>>>>> i am trying to insert few million documents

Re: coprocessor enabled put very slow, help please~~~

Posted by Asaf Mesika <as...@gmail.com>.
1. Try batching your increment calls into a List<Row> and use batch() to
execute them (see the sketch after this list). Should reduce RPC calls by two orders of magnitude.
2. Combine batching with scanning more words, aggregating the count
for a given word locally, so you issue fewer Increment commands.
3. Enable Bloom filters. Should speed up Increment by a factor of 2 at
least.
4. Don't use keyValue.getValue(). It does a System.arraycopy behind the
scenes. Use getBuffer(), getValueOffset() and getValueLength() and
iterate over the existing array. Write your own split without going
through String functions, which involve character encoding (expensive). Just find
your delimiter by byte comparison.
5. Enable Bloom filters on the doc table. They should speed up the checkAndPut.
6. I wouldn't give up the WAL. It ain't your bottleneck IMO.
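
A minimal sketch of points 1 and 4 inside the coprocessor, assuming the 0.94-era client API, the table_idx handle from this thread, and an HBase version whose batch() accepts Increment (i.e. Increment implements Row):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: one Increment per word, all sent in a single batch() round trip.
void indexWords(HTableInterface table_idx, Put put) throws IOException, InterruptedException {
    List<Row> actions = new ArrayList<Row>();
    for (KeyValue kv : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
        byte[] buf = kv.getBuffer();            // backing array, no copy (unlike getValue())
        int off = kv.getValueOffset();
        int end = off + kv.getValueLength();
        int wordStart = off;
        for (int i = off; i <= end; i++) {
            // split on spaces by plain byte comparison, no String decoding of the whole value
            if (i == end || buf[i] == ' ') {
                if (i > wordStart) {
                    Increment inc = new Increment(Arrays.copyOfRange(buf, wordStart, i));
                    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1L);
                    actions.add(inc);
                }
                wordStart = i + 1;
            }
        }
    }
    // one client call instead of one increment() RPC per word
    table_idx.batch(actions, new Object[actions.size()]);
}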

On Monday, February 18, 2013, prakash kadel wrote:

> Thank you guys for your replies,
> Michael,
>    I think i didnt make it clear. Here is my use case,
>
> I have text documents to insert in the hbase. (With possible duplicates)
> Suppose i have a document as : " I am working. He is not working"
>
> I want to insert this document to a table in hbase, say table "doc"
>
> =doc table=
> -----
> rowKey : doc_id
> cf: doc_content
> value: "I am working. He is not working"
>
> Now, i to create another table that stores the word count, say "doc_idx"
>
> doc_idx table
> ---
> rowKey : I, cf: count, value: 1
> rowKey : am, cf: count, value: 1
> rowKey : working, cf: count, value: 2
> rowKey : He, cf: count, value: 1
> rowKey : is, cf: count, value: 1
> rowKey : not, cf: count, value: 1
>
> My MR job code:
> ==============
>
> if(doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>     for(String word : doc_content.split("\\s+")) {
>        Increment inc = new Increment(Bytes.toBytes(word));
>        inc.addColumn("count", "", 1);
>     }
> }
>
> Now, i wanted to do some experiments with coprocessors. So, i modified
> the code as follows.
>
> My MR job code:
> ===============
>
> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>
> Coprocessor code:
> ===============
>
>         public void start(CoprocessorEnvironment env)  {
>                 pool = new HTablePool(conf, 100);
>         }
>
>         public boolean postCheckAndPut(c,  row,  family, byte[] qualifier,
> compareOp,       comparator,  put, result) {
>
>                 if(!result) return true; // check if the put succeeded
>
>                 HTableInterface table_idx = pool.getTable("doc_idx");
>
>                 try {
>
>                         for(KeyValue contentKV = put.get("doc_content",
> "")) {
>                             for(String word :
> contentKV.getValue().split("\\s+")) {
>                                 Increment inc = new
> Increment(Bytes.toBytes(word));
>                                 inc.addColumn("count", "", 1);
>                                 table_idx.increment(inc);
>                             }
>                        }
>                 } finally {
>                         table_idx.close();
>                 }
>                 return true;
>         }
>
>         public void stop(env) {
>                 pool.close();
>         }
>
> I am a newbee to HBASE. I am not sure this is the way to do.
> Given that, why is the cooprocessor enabled version much slower than
> the one without?
>
>
> Sincerely,
> Prakash Kadel
>
>
> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
> <michael_segel@hotmail.com <javascript:;>> wrote:
> >
> > The  issue I was talking about was the use of a check and put.
> > The OP wrote:
> >>>>> each map inserts to doc table.(checkAndPut)
> >>>>> regionobserver coprocessor does a postCheckAndPut and inserts some
> rows to
> >>>>> a index table.
> >
> > My question is why does the OP use a checkAndPut, and the
> RegionObserver's postChecAndPut?
> >
> >
> > Here's a good example...
> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
> >
> > The OP doesn't really get in to the use case, so we don't know why the
> Check and Put in the M/R job.
> > He should just be using put() and then a postPut().
> >
> > Another issue... since he's writing to  a different HTable... how? Does
> he create an HTable instance in the start() method of his RO object and
> then reference it later? Or does he create the instance of the HTable on
> the fly in each postCheckAndPut() ?
> > Without seeing his code, we don't know.
> >
> > Note that this is synchronous set of writes. Your overall return from
> the M/R call to put will wait until the second row is inserted.
> >
> > Interestingly enough, you may want to consider disabling the WAL on the
> write to the index.  You can always run a M/R job that rebuilds the index
> should something occur to the system where you might lose the data.
>  Indexes *ARE* expendable. ;-)
> >
> > Does that explain it?
> >
> > -Mike
> >
> > On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
> >
> >> Hi, Michael
> >>
> >> I don't quite understand what do you mean by "round trip back to the
> >> client". In my understanding, as the RegionServer and TaskTracker can
> >> be the same node, MR don't have to pull data into client and then
> >> process.  And you also mention the "unnecessary overhead", can you
> >> explain a little bit what operations or data processing can be seen as
> >> "unnecessary overhead".
> >>
> >> Thanks
> >>
> >> yong
> >> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
> >> <mi...@hotmail.com> wrote:
> >>> Why?
> >>>
> >>> This seems like an unnecessary overhead.
> >>>
> >>> You are writing code within the coprocessor on the server.
>  Pessimistic code really isn't recommended if you are worried about
> performance.
> >>>
> >>> I have to ask... by the time you have executed the code in your
> co-processor, what would cause the initial write to fail?
> >>>
> >>>
> >>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com>
> wrote:
> >>>
> >>>> its a local read. i just check the last param of PostCheckAndPut
> indicating if the Put succeeded. Incase if the put success, i insert a row
> in another table
> >>>>
> >>>> Sincerely,
> >>>> Prakash Kadel
> >>>>
> >>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
> >>>>
> >>>>> Is your CheckAndPut involving a local or remote READ? Due to the
> nature of
> >>>>> LSM, read is much slower compared to a write...
> >>>>>
> >>>>>
> >>>>> Best Regards,
> >>>>> Wei
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> From:   Prakash Kadel <pr...@gmail.com>
> >>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
> >>>>> Date:   02/17/2013 07:49 PM
> >>>>> Subject:        coprocessor enabled put very slow, help please~~~
> >>>>>
> >>>>>
> >>>>>
> >>>>> hi,
> >>>>> i am trying to insert few million documents

Re: coprocessor enabled put very slow, help please~~~

Posted by prakash kadel <pr...@gmail.com>.
Thank you guys for your replies,
Michael,
   I think I didn't make it clear. Here is my use case:

I have text documents to insert into HBase (with possible duplicates).
Suppose I have a document such as: "I am working. He is not working"

I want to insert this document into a table in HBase, say table "doc"

=doc table=
-----
rowKey : doc_id
cf: doc_content
value: "I am working. He is not working"

Now, I want to create another table that stores the word counts, say "doc_idx"

doc_idx table
---
rowKey : I, cf: count, value: 1
rowKey : am, cf: count, value: 1
rowKey : working, cf: count, value: 2
rowKey : He, cf: count, value: 1
rowKey : is, cf: count, value: 1
rowKey : not, cf: count, value: 1

My MR job code:
==============

if (doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
    for (String word : doc_content.split("\\s+")) {
        Increment inc = new Increment(Bytes.toBytes(word));
        inc.addColumn("count", "", 1);
        doc_idx.increment(inc);   // doc_idx is the HTable handle for the index table
    }
}

Now, i wanted to do some experiments with coprocessors. So, i modified
the code as follows.

My MR job code:
===============

doc.checkAndPut(rowKey, doc_content, "", null, putDoc);

Coprocessor code:
===============

    public void start(CoprocessorEnvironment env) {
        pool = new HTablePool(conf, 100);
    }

    public boolean postCheckAndPut(ObserverContext<RegionCoprocessorEnvironment> c,
            byte[] row, byte[] family, byte[] qualifier, CompareOp compareOp,
            WritableByteArrayComparable comparator, Put put, boolean result)
            throws IOException {

        if (!result) return true; // the doc put did not go through, nothing to index

        HTableInterface table_idx = pool.getTable("doc_idx");
        try {
            for (KeyValue contentKV : put.get(Bytes.toBytes("doc_content"), Bytes.toBytes(""))) {
                for (String word : Bytes.toString(contentKV.getValue()).split("\\s+")) {
                    Increment inc = new Increment(Bytes.toBytes(word));
                    inc.addColumn(Bytes.toBytes("count"), Bytes.toBytes(""), 1L);
                    table_idx.increment(inc); // one RPC to the index table per word
                }
            }
        } finally {
            table_idx.close();
        }
        return true;
    }

    public void stop(CoprocessorEnvironment env) {
        pool.close();
    }

I am a newbie to HBase. I am not sure this is the right way to do it.
Given that, why is the coprocessor-enabled version much slower than
the one without?


Sincerely,
Prakash Kadel


On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
<mi...@hotmail.com> wrote:
>
> The  issue I was talking about was the use of a check and put.
> The OP wrote:
>>>>> each map inserts to doc table.(checkAndPut)
>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>> a index table.
>
> My question is why does the OP use a checkAndPut, and the RegionObserver's postChecAndPut?
>
>
> Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>
> The OP doesn't really get in to the use case, so we don't know why the Check and Put in the M/R job.
> He should just be using put() and then a postPut().
>
> Another issue... since he's writing to  a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the instance of the HTable on the fly in each postCheckAndPut() ?
> Without seeing his code, we don't know.
>
> Note that this is synchronous set of writes. Your overall return from the M/R call to put will wait until the second row is inserted.
>
> Interestingly enough, you may want to consider disabling the WAL on the write to the index.  You can always run a M/R job that rebuilds the index should something occur to the system where you might lose the data.  Indexes *ARE* expendable. ;-)
>
> Does that explain it?
>
> -Mike
>
> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>
>> Hi, Michael
>>
>> I don't quite understand what do you mean by "round trip back to the
>> client". In my understanding, as the RegionServer and TaskTracker can
>> be the same node, MR don't have to pull data into client and then
>> process.  And you also mention the "unnecessary overhead", can you
>> explain a little bit what operations or data processing can be seen as
>> "unnecessary overhead".
>>
>> Thanks
>>
>> yong
>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>> <mi...@hotmail.com> wrote:
>>> Why?
>>>
>>> This seems like an unnecessary overhead.
>>>
>>> You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.
>>>
>>> I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail?
>>>
>>>
>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:
>>>
>>>> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
>>>>
>>>> Sincerely,
>>>> Prakash Kadel
>>>>
>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>
>>>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>>>> LSM, read is much slower compared to a write...
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Wei
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>> Date:   02/17/2013 07:49 PM
>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>
>>>>>
>>>>>
>>>>> hi,
>>>>> i am trying to insert few million documents to hbase with mapreduce. To
>>>>> enable quick search of docs i want to have some indexes, so i tried to use
>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>> coprocessors not supposed to increase the latency?
>>>>> my settings:
>>>>>  3 region servers
>>>>> 60 maps
>>>>> each map inserts to doc table.(checkAndPut)
>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>> a index table.
>>>>>
>>>>>
>>>>> Sincerely,
>>>>> Prakash
>>>>>
>>>>
>>>
>>> Michael Segel  | (m) 312.755.9623
>>>
>>> Segel and Associates
>>>
>>>
>>
>

Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
Well, it also goes back to the question of how the RO is writing to the second table.

I would imagine that if the M/R used Mapper.setup() to instantiate the HTable for the index writes and then wrote to the index table in Mapper.map(), it would be doing essentially the same work, so why would the co-processor take that much more time?

I think a code review would be in order.

On Feb 18, 2013, at 6:22 AM, yonghu <yo...@gmail.com> wrote:

> Ok. Now, I got your point. I didn't notice the "checkAndPut".
> 
> regards!
> 
> Yong
> 
> On Mon, Feb 18, 2013 at 1:11 PM, Michael Segel
> <mi...@hotmail.com> wrote:
>> 
>> The  issue I was talking about was the use of a check and put.
>> The OP wrote:
>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>>> a index table.
>> 
>> My question is why does the OP use a checkAndPut, and the RegionObserver's postChecAndPut?
>> 
>> 
>> Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>> 
>> The OP doesn't really get in to the use case, so we don't know why the Check and Put in the M/R job.
>> He should just be using put() and then a postPut().
>> 
>> Another issue... since he's writing to  a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the instance of the HTable on the fly in each postCheckAndPut() ?
>> Without seeing his code, we don't know.
>> 
>> Note that this is synchronous set of writes. Your overall return from the M/R call to put will wait until the second row is inserted.
>> 
>> Interestingly enough, you may want to consider disabling the WAL on the write to the index.  You can always run a M/R job that rebuilds the index should something occur to the system where you might lose the data.  Indexes *ARE* expendable. ;-)
>> 
>> Does that explain it?
>> 
>> -Mike
>> 
>> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>> 
>>> Hi, Michael
>>> 
>>> I don't quite understand what do you mean by "round trip back to the
>>> client". In my understanding, as the RegionServer and TaskTracker can
>>> be the same node, MR don't have to pull data into client and then
>>> process.  And you also mention the "unnecessary overhead", can you
>>> explain a little bit what operations or data processing can be seen as
>>> "unnecessary overhead".
>>> 
>>> Thanks
>>> 
>>> yong
>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>> <mi...@hotmail.com> wrote:
>>>> Why?
>>>> 
>>>> This seems like an unnecessary overhead.
>>>> 
>>>> You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.
>>>> 
>>>> I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail?
>>>> 
>>>> 
>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:
>>>> 
>>>>> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
>>>>> 
>>>>> Sincerely,
>>>>> Prakash Kadel
>>>>> 
>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>> 
>>>>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>>>>> LSM, read is much slower compared to a write...
>>>>>> 
>>>>>> 
>>>>>> Best Regards,
>>>>>> Wei
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>>> Date:   02/17/2013 07:49 PM
>>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> hi,
>>>>>> i am trying to insert few million documents to hbase with mapreduce. To
>>>>>> enable quick search of docs i want to have some indexes, so i tried to use
>>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>>> coprocessors not supposed to increase the latency?
>>>>>> my settings:
>>>>>> 3 region servers
>>>>>> 60 maps
>>>>>> each map inserts to doc table.(checkAndPut)
>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>>> a index table.
>>>>>> 
>>>>>> 
>>>>>> Sincerely,
>>>>>> Prakash
>>>>>> 
>>>>> 
>>>> 
>>>> Michael Segel  | (m) 312.755.9623
>>>> 
>>>> Segel and Associates
>>>> 
>>>> 
>>> 
>> 
> 


Re: coprocessor enabled put very slow, help please~~~

Posted by yonghu <yo...@gmail.com>.
Ok. Now, I got your point. I didn't notice the "checkAndPut".

regards!

Yong

On Mon, Feb 18, 2013 at 1:11 PM, Michael Segel
<mi...@hotmail.com> wrote:
>
> The  issue I was talking about was the use of a check and put.
> The OP wrote:
>>>>> each map inserts to doc table.(checkAndPut)
>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>> a index table.
>
> My question is why does the OP use a checkAndPut, and the RegionObserver's postChecAndPut?
>
>
> Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>
> The OP doesn't really get in to the use case, so we don't know why the Check and Put in the M/R job.
> He should just be using put() and then a postPut().
>
> Another issue... since he's writing to  a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the instance of the HTable on the fly in each postCheckAndPut() ?
> Without seeing his code, we don't know.
>
> Note that this is synchronous set of writes. Your overall return from the M/R call to put will wait until the second row is inserted.
>
> Interestingly enough, you may want to consider disabling the WAL on the write to the index.  You can always run a M/R job that rebuilds the index should something occur to the system where you might lose the data.  Indexes *ARE* expendable. ;-)
>
> Does that explain it?
>
> -Mike
>
> On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:
>
>> Hi, Michael
>>
>> I don't quite understand what do you mean by "round trip back to the
>> client". In my understanding, as the RegionServer and TaskTracker can
>> be the same node, MR don't have to pull data into client and then
>> process.  And you also mention the "unnecessary overhead", can you
>> explain a little bit what operations or data processing can be seen as
>> "unnecessary overhead".
>>
>> Thanks
>>
>> yong
>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>> <mi...@hotmail.com> wrote:
>>> Why?
>>>
>>> This seems like an unnecessary overhead.
>>>
>>> You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.
>>>
>>> I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail?
>>>
>>>
>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:
>>>
>>>> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
>>>>
>>>> Sincerely,
>>>> Prakash Kadel
>>>>
>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>>>
>>>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>>>> LSM, read is much slower compared to a write...
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Wei
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>>> Date:   02/17/2013 07:49 PM
>>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>>>
>>>>>
>>>>>
>>>>> hi,
>>>>> i am trying to insert few million documents to hbase with mapreduce. To
>>>>> enable quick search of docs i want to have some indexes, so i tried to use
>>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>>> coprocessors not supposed to increase the latency?
>>>>> my settings:
>>>>>  3 region servers
>>>>> 60 maps
>>>>> each map inserts to doc table.(checkAndPut)
>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>>> a index table.
>>>>>
>>>>>
>>>>> Sincerely,
>>>>> Prakash
>>>>>
>>>>
>>>
>>> Michael Segel  | (m) 312.755.9623
>>>
>>> Segel and Associates
>>>
>>>
>>
>

Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
The issue I was talking about was the use of a checkAndPut. 
The OP wrote:
>>>> each map inserts to doc table.(checkAndPut)
>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>> a index table.

My question is: why does the OP use a checkAndPut, and then the RegionObserver's postCheckAndPut?


Here's a good example... http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put

The OP doesn't really get into the use case, so we don't know why the checkAndPut is needed in the M/R job. 
He should just be using put() and then a postPut(). 

Another issue... since he's writing to a different HTable... how? Does he create an HTable instance in the start() method of his RO object and then reference it later? Or does he create the HTable instance on the fly in each postCheckAndPut()? 
Without seeing his code, we don't know. 
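
To make the suggestion concrete, here is a rough sketch of the postPut() plus start() pattern I mean. It assumes the 0.94-era coprocessor API, and the index table name, column family, and key scheme are invented for illustration.

    import java.io.IOException;
    import org.apache.hadoop.hbase.CoprocessorEnvironment;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DocIndexObserver extends BaseRegionObserver {

        private static final byte[] INDEX_TABLE = Bytes.toBytes("doc_index"); // hypothetical
        private HTableInterface indexTable;

        @Override
        public void start(CoprocessorEnvironment e) throws IOException {
            // Open the index table once when the observer is loaded,
            // not on every hook invocation.
            indexTable = e.getTable(INDEX_TABLE);
        }

        @Override
        public void stop(CoprocessorEnvironment e) throws IOException {
            if (indexTable != null) {
                indexTable.close();
            }
        }

        @Override
        public void postPut(ObserverContext<RegionCoprocessorEnvironment> c,
                            Put put, WALEdit edit, boolean writeToWAL) throws IOException {
            // Derive the index row from the doc row; the "idx|" prefix is made up here.
            Put idxPut = new Put(Bytes.add(Bytes.toBytes("idx|"), put.getRow()));
            idxPut.add(Bytes.toBytes("i"), Bytes.toBytes("doc"), put.getRow());
            indexTable.put(idxPut);
        }
    }

Keeping the HTable in a field like this avoids the per-call setup cost; the trade-offs are that the index write still happens synchronously inside the put path, and that concurrent hook invocations end up sharing that HTable, which also needs some thought in a real deployment.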

Note that this is a synchronous set of writes. Your overall return from the put call in the M/R job will wait until the second (index) row is inserted. 

Interestingly enough, you may want to consider disabling the WAL on the write to the index. You can always run an M/R job that rebuilds the index if something happens to the system and you lose data. Indexes *ARE* expendable. ;-) 
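
With the 0.94-era client API that is a one-line flag on the index Put before it is written, something like the following (setWriteToWAL was later superseded by setDurability; the variable names here are placeholders):

    // Skip the WAL for the index write only; the index can be rebuilt by M/R if it is lost.
    Put idxPut = new Put(indexRowKey);              // indexRowKey: whatever key scheme you use
    idxPut.add(indexFamily, indexQualifier, docRowKey);
    idxPut.setWriteToWAL(false);
    indexTable.put(idxPut);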

Does that explain it? 

-Mike

On Feb 18, 2013, at 4:57 AM, yonghu <yo...@gmail.com> wrote:

> Hi, Michael
> 
> I don't quite understand what do you mean by "round trip back to the
> client". In my understanding, as the RegionServer and TaskTracker can
> be the same node, MR don't have to pull data into client and then
> process.  And you also mention the "unnecessary overhead", can you
> explain a little bit what operations or data processing can be seen as
> "unnecessary overhead".
> 
> Thanks
> 
> yong
> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
> <mi...@hotmail.com> wrote:
>> Why?
>> 
>> This seems like an unnecessary overhead.
>> 
>> You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.
>> 
>> I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail?
>> 
>> 
>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:
>> 
>>> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
>>> 
>>> Sincerely,
>>> Prakash Kadel
>>> 
>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>> 
>>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>>> LSM, read is much slower compared to a write...
>>>> 
>>>> 
>>>> Best Regards,
>>>> Wei
>>>> 
>>>> 
>>>> 
>>>> 
>>>> From:   Prakash Kadel <pr...@gmail.com>
>>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>>> Date:   02/17/2013 07:49 PM
>>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>> 
>>>> 
>>>> 
>>>> hi,
>>>> i am trying to insert few million documents to hbase with mapreduce. To
>>>> enable quick search of docs i want to have some indexes, so i tried to use
>>>> the coprocessors, but they are slowing down my inserts. Arent the
>>>> coprocessors not supposed to increase the latency?
>>>> my settings:
>>>>  3 region servers
>>>> 60 maps
>>>> each map inserts to doc table.(checkAndPut)
>>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>>> a index table.
>>>> 
>>>> 
>>>> Sincerely,
>>>> Prakash
>>>> 
>>> 
>> 
>> Michael Segel  | (m) 312.755.9623
>> 
>> Segel and Associates
>> 
>> 
> 


Re: coprocessor enabled put very slow, help please~~~

Posted by yonghu <yo...@gmail.com>.
Hi, Michael

I don't quite understand what you mean by "round trip back to the
client". In my understanding, since the RegionServer and TaskTracker can
be on the same node, MR doesn't have to pull data into the client and then
process it. You also mention "unnecessary overhead"; can you explain a
little bit which operations or data processing count as "unnecessary
overhead"?

Thanks

yong
On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
<mi...@hotmail.com> wrote:
> Why?
>
> This seems like an unnecessary overhead.
>
> You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.
>
> I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail?
>
>
> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:
>
>> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
>>
>> Sincerely,
>> Prakash Kadel
>>
>> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
>>
>>> Is your CheckAndPut involving a local or remote READ? Due to the nature of
>>> LSM, read is much slower compared to a write...
>>>
>>>
>>> Best Regards,
>>> Wei
>>>
>>>
>>>
>>>
>>> From:   Prakash Kadel <pr...@gmail.com>
>>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>,
>>> Date:   02/17/2013 07:49 PM
>>> Subject:        coprocessor enabled put very slow, help please~~~
>>>
>>>
>>>
>>> hi,
>>>  i am trying to insert few million documents to hbase with mapreduce. To
>>> enable quick search of docs i want to have some indexes, so i tried to use
>>> the coprocessors, but they are slowing down my inserts. Arent the
>>> coprocessors not supposed to increase the latency?
>>> my settings:
>>>   3 region servers
>>>  60 maps
>>> each map inserts to doc table.(checkAndPut)
>>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to
>>> a index table.
>>>
>>>
>>> Sincerely,
>>> Prakash
>>>
>>
>
> Michael Segel  | (m) 312.755.9623
>
> Segel and Associates
>
>

Re: coprocessor enabled put very slow, help please~~~

Posted by Michael Segel <mi...@hotmail.com>.
Why? 

This seems like an unnecessary overhead. 

You are writing code within the coprocessor on the server.  Pessimistic code really isn't recommended if you are worried about performance.

I have to ask... by the time you have executed the code in your co-processor, what would cause the initial write to fail? 


On Feb 18, 2013, at 3:01 AM, Prakash Kadel <pr...@gmail.com> wrote:

> its a local read. i just check the last param of PostCheckAndPut indicating if the Put succeeded. Incase if the put success, i insert a row in another table
> 
> Sincerely,
> Prakash Kadel
> 
> On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:
> 
>> Is your CheckAndPut involving a local or remote READ? Due to the nature of 
>> LSM, read is much slower compared to a write...
>> 
>> 
>> Best Regards,
>> Wei
>> 
>> 
>> 
>> 
>> From:   Prakash Kadel <pr...@gmail.com>
>> To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
>> Date:   02/17/2013 07:49 PM
>> Subject:        coprocessor enabled put very slow, help please~~~
>> 
>> 
>> 
>> hi,
>>  i am trying to insert few million documents to hbase with mapreduce. To 
>> enable quick search of docs i want to have some indexes, so i tried to use 
>> the coprocessors, but they are slowing down my inserts. Arent the 
>> coprocessors not supposed to increase the latency? 
>> my settings:
>>   3 region servers
>>  60 maps
>> each map inserts to doc table.(checkAndPut)
>> regionobserver coprocessor does a postCheckAndPut and inserts some rows to 
>> a index table.
>> 
>> 
>> Sincerely,
>> Prakash
>> 
> 

Michael Segel  | (m) 312.755.9623

Segel and Associates



Re: coprocessor enabled put very slow, help please~~~

Posted by Prakash Kadel <pr...@gmail.com>.
It's a local read. I just check the last param of postCheckAndPut, which indicates whether the Put succeeded. If the put succeeded, I insert a row into another table.
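
In other words, the hook looks roughly like this (a minimal sketch, assuming roughly the 0.94-era RegionObserver signature; the index table name and key scheme are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexOnCheckAndPut extends BaseRegionObserver {
        @Override
        public boolean postCheckAndPut(ObserverContext<RegionCoprocessorEnvironment> c,
                byte[] row, byte[] family, byte[] qualifier, CompareOp compareOp,
                WritableByteArrayComparable comparator, Put put, boolean result)
                throws IOException {
            // 'result' (the last param) is true only if the check passed and the Put was applied.
            if (result) {
                HTableInterface indexTable =
                        c.getEnvironment().getTable(Bytes.toBytes("doc_index")); // placeholder name
                try {
                    Put idxPut = new Put(Bytes.add(Bytes.toBytes("idx|"), row));
                    idxPut.add(Bytes.toBytes("i"), Bytes.toBytes("doc"), row);
                    indexTable.put(idxPut);
                } finally {
                    indexTable.close();
                }
            }
            return result;
        }
    }

Opening the index table inside the hook, as above, is the simplest form; caching it when the observer starts, as discussed elsewhere in the thread, avoids the per-call cost.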

Sincerely,
Prakash Kadel

On Feb 18, 2013, at 2:52 PM, Wei Tan <wt...@us.ibm.com> wrote:

> Is your CheckAndPut involving a local or remote READ? Due to the nature of 
> LSM, read is much slower compared to a write...
> 
> 
> Best Regards,
> Wei
> 
> 
> 
> 
> From:   Prakash Kadel <pr...@gmail.com>
> To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
> Date:   02/17/2013 07:49 PM
> Subject:        coprocessor enabled put very slow, help please~~~
> 
> 
> 
> hi,
>   i am trying to insert few million documents to hbase with mapreduce. To 
> enable quick search of docs i want to have some indexes, so i tried to use 
> the coprocessors, but they are slowing down my inserts. Arent the 
> coprocessors not supposed to increase the latency? 
> my settings:
>    3 region servers
>   60 maps
> each map inserts to doc table.(checkAndPut)
> regionobserver coprocessor does a postCheckAndPut and inserts some rows to 
> a index table.
> 
> 
> Sincerely,
> Prakash
> 

Re: coprocessor enabled put very slow, help please~~~

Posted by Wei Tan <wt...@us.ibm.com>.
Does your checkAndPut involve a local or remote READ? Due to the nature of 
LSM, a read is much slower than a write...


Best Regards,
Wei




From:   Prakash Kadel <pr...@gmail.com>
To:     "user@hbase.apache.org" <us...@hbase.apache.org>, 
Date:   02/17/2013 07:49 PM
Subject:        coprocessor enabled put very slow, help please~~~



hi,
   i am trying to insert few million documents to hbase with mapreduce. To 
enable quick search of docs i want to have some indexes, so i tried to use 
the coprocessors, but they are slowing down my inserts. Arent the 
coprocessors not supposed to increase the latency? 
my settings:
    3 region servers
   60 maps
each map inserts to doc table.(checkAndPut)
regionobserver coprocessor does a postCheckAndPut and inserts some rows to 
a index table.


Sincerely,
Prakash