Posted to dev@tephra.apache.org by Micael Capitão <mi...@xpand-it.com> on 2017/05/31 08:49:09 UTC

TransactionCodec poor performance

Hi all,

I've been testing Tephra 0.11.0 for a project that may need transactions 
on top of HBase, and I find its performance, for instance for a bulk 
load, very poor. Let's not discuss why I am doing a bulk load with 
transactions.

In my use case I am generating batches of ~10000 elements and inserting 
them with the *put(List<Put> puts)* method. There are no concurrent 
writers or readers.
If I do the put without transactions it takes ~0.5s. If I use the 
*TransactionAwareHTable* it takes ~12s.
I've tracked down the performance killer to be the 
*addToOperation(OperationWithAttributes op, Transaction tx)*, more 
specifically the *txCodec.encode(tx)*.
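
For context, the write path I am measuring looks roughly like this (a 
trimmed sketch of my benchmark, not the exact code; "demo"/"f"/"q" are 
placeholder names and txClient is an already-configured 
TransactionSystemClient):

    // imports: java.util.*, org.apache.tephra.TransactionContext,
    //          org.apache.tephra.hbase.TransactionAwareHTable,
    //          org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes
    TransactionAwareHTable txTable =
        new TransactionAwareHTable(new HTable(conf, "demo"));
    TransactionContext txContext = new TransactionContext(txClient, txTable);

    List<Put> puts = new ArrayList<>(10000);
    for (int i = 0; i < 10000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
      puts.add(put);
    }

    txContext.start();
    long t0 = System.currentTimeMillis();
    txTable.put(puts);                  // ~0.5s with a bare HTable, ~12s transactional
    txContext.finish();
    System.out.println("batch took " + (System.currentTimeMillis() - t0) + " ms");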

I've created a TransactionAwareHTableFix with the *addToOperation(txPut, 
tx)* call commented out, and used it in my code, and each batch started 
to take ~0.5s.

I've noticed that inside the *TransactionCodec* you were instantiating a 
new TSerializer and TDeserializer on each call to encode/decode. I tried 
instantiating the ser/deser in the constructor, but even that way each 
of my batches would take the same ~12s.
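
What I tried was essentially this (a sketch; TTransaction is the 
generated Thrift struct the codec serializes, if I am reading the code 
right, and note these Thrift classes are not thread-safe, which is fine 
for my single writer):

    import org.apache.thrift.TDeserializer;
    import org.apache.thrift.TException;
    import org.apache.thrift.TSerializer;

    // Sketch: construct the Thrift ser/deser once instead of per call.
    // It made no measurable difference (~12s per batch), which suggests the
    // cost is the size of the serialized transaction, not object creation.
    public class ReusingCodec {
      private final TSerializer serializer = new TSerializer();
      private final TDeserializer deserializer = new TDeserializer();

      public byte[] encode(TTransaction thriftTx) throws TException {
        return serializer.serialize(thriftTx);
      }

      public void decode(byte[] bytes, TTransaction into) throws TException {
        deserializer.deserialize(into, bytes);
      }
    }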

Further investigation has shown me that the Transaction instance, after 
being encoded by the TransactionCodec, is 104171 bytes long. So in my 
10000-element batch, ~970MB (104171 bytes × 10000 puts) is metadata. Is 
that supposed to happen?


Regards,

Micael Capitão

Re: TransactionCodec poor performance

Posted by Andreas Neumann <an...@apache.org>.
Hi Micael,

The transaction state is kept in memory by the transaction manager, and 
its edits are written to a write-ahead log so that the state can be 
reconstructed after a failure.
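
(To the storage question further down: that state does not live in 
ZooKeeper. The snapshots and write-ahead logs are persisted under the 
configured snapshot directory, normally on HDFS. A minimal sketch, 
assuming the usual constant in TxConstants:)

    // Sketch: the transaction manager persists snapshots and WAL segments
    // under this directory (the property is "data.tx.snapshot.dir").
    Configuration conf = HBaseConfiguration.create();
    conf.set(TxConstants.Manager.CFG_TX_SNAPSHOT_DIR, "/tephra/snapshots");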

You are right that the transaction object does not need to be serialized 
for each put: I opened two improvement JIRAs (TEPHRA-233 and TEPHRA-234) 
to address this.
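
The gist of the improvement is to pay the serialization cost once per 
transaction instead of once per operation, along these lines (just a 
sketch of the idea, not the eventual patch):

    // Sketch, inside a TransactionAware table client: cache the encoded
    // transaction on startTx()/updateTx() and reuse it for every operation.
    private Transaction tx;
    private byte[] encodedTx;

    @Override
    public void startTx(Transaction tx) {
      this.tx = tx;
      try {
        this.encodedTx = txCodec.encode(tx);   // encode once per transaction
      } catch (IOException e) {
        throw new RuntimeException(e);         // sketch; handle properly
      }
    }

    protected void addToOperation(OperationWithAttributes op, Transaction tx)
        throws IOException {
      // reuse the cached bytes instead of calling txCodec.encode(tx) again
      op.setAttribute(TxConstants.TX_OPERATION_ATTRIBUTE_KEY, encodedTx);
    }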

Were you able to clean up the transaction state and rerun your benchmark?

Cheers -Andreas.

On Thu, Jun 8, 2017 at 2:02 AM, Micael Capitão <mi...@xpand-it.com>
wrote:

> Hi,
>
> (I inadvertently deleted the previous reply email, so this is a
> response to my own previous email)
>
> I probably have lots of invalidated transactions because of the first
> tests I ran, which were taking more than 30s per transaction. It is
> possible that the invalidated transactions have piled up.
>
> Below are the stats of the Transaction object. And yes, I have lots of
> invalid transactions, which explains the absurd size I am getting for the
> serialized representation. Where does Tephra store that? ZooKeeper?
>
> 2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:109 - startTx Encoded transaction size: 104203 bytes
> 2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:110 - inprogress Tx: 0
> 2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:111 - invalid Tx: 13015
> 2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:112 - checkpoint write pointers: 0
>
> Another question: can the Transaction object change outside the startTx
> and updateTx calls? I was wondering if it really needs to be serialized
> on each single operation.
>
>
> Regards.
>
>
> On 31/05/17 09:49, Micael Capitão wrote:
>
>> Hi all,
>>
>> I've been testing Tephra 0.11.0 for a project that may need transactions
>> on top of HBase, and I find its performance, for instance for a bulk load,
>> very poor. Let's not discuss why I am doing a bulk load with transactions.
>>
>> In my use case I am generating batches of ~10000 elements and inserting
>> them with the *put(List<Put> puts)* method. There are no concurrent
>> writers or readers.
>> If I do the put without transactions it takes ~0.5s. If I use the
>> *TransactionAwareHTable* it takes ~12s.
>> I've tracked down the performance killer to be the
>> *addToOperation(OperationWithAttributes op, Transaction tx)*, more
>> specifically the *txCodec.encode(tx)*.
>>
>> I've created a TransactionAwareHTableFix with the *addToOperation(txPut,
>> tx)* call commented out, and used it in my code, and each batch started
>> to take ~0.5s.
>>
>> I've noticed that inside the *TransactionCodec* you were instantiating a
>> new TSerializer and TDeserializer on each call to encode/decode. I tried
>> instantiating the ser/deser in the constructor, but even that way each of
>> my batches would take the same ~12s.
>>
>> Further investigation has shown me that the Transaction instance, after
>> being encoded by the TransactionCodec, is 104171 bytes long. So in my
>> 10000-element batch, ~970MB is metadata. Is that supposed to happen?
>>
>>
>> Regards,
>>
>> Micael Capitão
>>
>

Re: TransactionCodec poor performance

Posted by Micael Capitão <mi...@xpand-it.com>.
Hi,

(I inadvertently deleted the previous reply email, so this is a response 
to my own previous email)

I probably have lots of invalidated transactions because of the first 
tests I ran, which were taking more than 30s per transaction. It is 
possible that the invalidated transactions have piled up.

Below are the stats of the Transaction object. And yes, I have lots of 
invalid transactions, which explains the absurd size I am getting for 
the serialized representation. Where does Tephra store that? ZooKeeper?

2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:109 - startTx Encoded transaction size: 104203 bytes
2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:110 - inprogress Tx: 0
2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:111 - invalid Tx: 13015
2017-06-07 09:50:08 INFO  TransactionAwareHTableFix:112 - checkpoint write pointers: 0
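
For reference, the numbers above come from logging roughly this in my 
startTx (getter names as in the Transaction class, if I am reading it 
correctly):

    // Added to TransactionAwareHTableFix.startTx(Transaction tx):
    byte[] encoded = txCodec.encode(tx);
    LOG.info("startTx Encoded transaction size: {} bytes", encoded.length);
    LOG.info("inprogress Tx: {}", tx.getInProgress().length);
    LOG.info("invalid Tx: {}", tx.getInvalids().length);
    LOG.info("checkpoint write pointers: {}", tx.getCheckpointWritePointers().length);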

Another question: can the Transaction object change outside the startTx 
and updateTx calls? I was wondering if it really needs to be serialized 
on each single operation.


Regards.

On 31/05/17 09:49, Micael Capitão wrote:
> Hi all,
>
> I've been testing Tephra 0.11.0 for a project that may need 
> transactions on top of HBase, and I find its performance, for 
> instance for a bulk load, very poor. Let's not discuss why I am 
> doing a bulk load with transactions.
>
> In my use case I am generating batches of ~10000 elements and 
> inserting them with the *put(List<Put> puts)* method. There are no 
> concurrent writers or readers.
> If I do the put without transactions it takes ~0.5s. If I use the 
> *TransactionAwareHTable* it takes ~12s.
> I've tracked down the performance killer to be the 
> *addToOperation(OperationWithAttributes op, Transaction tx)*, more 
> specifically the *txCodec.encode(tx)*.
>
> I've created a TransactionAwareHTableFix with the 
> *addToOperation(txPut, tx)* call commented out, and used it in my 
> code, and each batch started to take ~0.5s.
>
> I've noticed that inside the *TransactionCodec* you were instantiating 
> a new TSerializer and TDeserializer on each call to encode/decode. I 
> tried instantiating the ser/deser in the constructor, but even that 
> way each of my batches would take the same ~12s.
>
> Further investigation has shown me that the Transaction instance, 
> after being encoded by the TransactionCodec, is 104171 bytes long. 
> So in my 10000-element batch, ~970MB is metadata. Is that supposed 
> to happen?
>
>
> Regards,
>
> Micael Capitão

-- 

Micael Capitão
*BIG DATA ENGINEER*

*E-mail: *micael.capitao@xpand-it.com
*Mobile: *(+351) 91 260 94 27 | *Skype*: micaelcapitao

Xpand IT | Delivering Innovation and Technology
Phone: (+351) 21 896 71 50
Fax: (+351) 21 896 71 51
Site: www.xpand-it.com

Facebook <http://www.xpand-it.com/facebook> | Linkedin 
<http://www.xpand-it.com/linkedin> | Twitter 
<http://www.xpand-it.com/twitter> | Youtube <http://www.xpand-it.com/youtube>


Re: TransactionCodec poor performance

Posted by Andreas Neumann <an...@apache.org>.
Hi Micael,

If your transaction objects are that large, that indicates you have a
lot of either invalid or in-progress transactions. I am wondering how that
happens. Can you share your code?
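
If the invalid list has simply piled up from earlier aborted runs, you 
can also truncate it once those transactions are no longer needed for 
filtering, roughly like this (a sketch; txClient is a 
TransactionSystemClient, and this is only safe if any data written by 
those transactions has already been cleaned up):

    try {
      // Drop all invalid transactions that started before the given time.
      txClient.truncateInvalidTxBefore(System.currentTimeMillis());
    } catch (InvalidTruncateTimeException e) {
      // thrown if transactions started before the cutoff are still in progress
    }
    System.out.println("invalid list size now: " + txClient.getInvalidSize());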

Also, are you subscribed to this list? It would be good to do so in order
to receive the responses; they normally go only to the list.

Cheers -Andreas.

On Wed, May 31, 2017 at 2:29 AM, Terence Yim <ch...@gmail.com> wrote:

> Hi Micael,
>
> Do you know if the invalid tx list inside the Transaction object is large?
>
> Terence
>
> > On May 31, 2017, at 1:49 AM, Micael Capitão <mi...@xpand-it.com>
> > wrote:
> >
> > Hi all,
> >
> > I've been testing Tephra 0.11.0 for a project that may need transactions
> > on top of HBase, and I find its performance, for instance for a bulk
> > load, very poor. Let's not discuss why I am doing a bulk load with
> > transactions.
> >
> > In my use case I am generating batches of ~10000 elements and inserting
> > them with the *put(List<Put> puts)* method. There are no concurrent
> > writers or readers.
> > If I do the put without transactions it takes ~0.5s. If I use the
> > *TransactionAwareHTable* it takes ~12s.
> > I've tracked down the performance killer to be the
> > *addToOperation(OperationWithAttributes op, Transaction tx)*, more
> > specifically the *txCodec.encode(tx)*.
> >
> > I've created a TransactionAwareHTableFix with the *addToOperation(txPut,
> > tx)* call commented out, and used it in my code, and each batch started
> > to take ~0.5s.
> >
> > I've noticed that inside the *TransactionCodec* you were instantiating a
> > new TSerializer and TDeserializer on each call to encode/decode. I tried
> > instantiating the ser/deser in the constructor, but even that way each
> > of my batches would take the same ~12s.
> >
> > Further investigation has shown me that the Transaction instance, after
> > being encoded by the TransactionCodec, is 104171 bytes long. So in my
> > 10000-element batch, ~970MB is metadata. Is that supposed to happen?
> >
> >
> > Regards,
> >
> > Micael Capitão
>
>

Re: TransactionCodec poor performance

Posted by Terence Yim <ch...@gmail.com>.
Hi Micael,

Do you know if the invalid tx list inside the Transaction object is large?

Terence

> On May 31, 2017, at 1:49 AM, Micael Capitão <mi...@xpand-it.com> wrote:
> 
> Hi all,
> 
> I've been testing Tephra 0.11.0 for a project that may need transactions on top of HBase, and I find its performance, for instance for a bulk load, very poor. Let's not discuss why I am doing a bulk load with transactions.
> 
> In my use case I am generating batches of ~10000 elements and inserting them with the *put(List<Put> puts)* method. There are no concurrent writers or readers.
> If I do the put without transactions it takes ~0.5s. If I use the *TransactionAwareHTable* it takes ~12s.
> I've tracked down the performance killer to be the *addToOperation(OperationWithAttributes op, Transaction tx)*, more specifically the *txCodec.encode(tx)*.
> 
> I've created a TransactionAwareHTableFix with the *addToOperation(txPut, tx)* call commented out, and used it in my code, and each batch started to take ~0.5s.
> 
> I've noticed that inside the *TransactionCodec* you were instantiating a new TSerializer and TDeserializer on each call to encode/decode. I tried instantiating the ser/deser in the constructor, but even that way each of my batches would take the same ~12s.
> 
> Further investigation has shown me that the Transaction instance, after being encoded by the TransactionCodec, is 104171 bytes long. So in my 10000-element batch, ~970MB is metadata. Is that supposed to happen?
> 
> 
> Regards,
> 
> Micael Capitão