Posted to user@hbase.apache.org by Robert Hamilton <rh...@whalesharkmedia.com> on 2013/04/01 23:31:09 UTC

Flume EventSerializer vs hbase coprocessor

I have a calculation that I'm doing in a custom AsyncHbaseEventSerializer. I want to do the calculation in real time, but it looks like it could be done either here or in a coprocessor. I'm just doing it in the serializer for now because the code is simple that way, and data will only ever come in through Flume anyway.

But is this good practice?  I would welcome any advice or guidance.

A simplified version of the calculation: 

Every row has a groupID and a data timestamp field; each groupID represents a distinct group of rows, and the timestamp distinguishes individual rows within the group. We can assume the combination is always unique. So I construct the rowkey as the concatenation of groupID, '.', and the reverse timestamp.
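
A minimal sketch of that rowkey construction, in plain Java. The class and method names are hypothetical, and it assumes "reverse timestamp" means Long.MAX_VALUE minus a non-negative millisecond timestamp, zero-padded so that string order matches numeric order. The real serializer may encode the key differently (for example as raw bytes).

```java
// Hypothetical helper; names and the padding scheme are illustrative assumptions.
final class RowKeySketch {

    // Long.MAX_VALUE has 19 decimal digits, so padding to 19 digits keeps
    // lexicographic order identical to numeric order for non-negative timestamps.
    static String rowKey(String groupId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return String.format("%s.%019d", groupId, reversed);
    }

    public static void main(String[] args) {
        String older = rowKey("gggg", 123400L);
        String newer = rowKey("gggg", 123456L);
        // The newer event gets the smaller key, so a forward scan over
        // the "gggg." prefix would return it first.
        System.out.println(newer + " sorts before " + older + ": "
                + (newer.compareTo(older) < 0));
    }
}
```

With this encoding, the most recent row of a group is always the first key at or after the "groupID." prefix.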

The task is this: for each row to be inserted into HBase, find the latest row already inserted with the same groupID (based on the timestamp part of the key), and add a column to the new row holding the difference between its time and that of the previous record.

For each row the serializer sees, it looks up the previous row with a scan and takes the first row the scan returns (that's why I'm using the reverse timestamp). It then computes the difference and adds it to the list of PutRequests.
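
The lookup-then-insert logic can be sketched without HBase by using a sorted map as a stand-in for the table. This is only an illustration under stated assumptions: the TreeMap replaces the real scan, DeltaSketch and its methods are hypothetical names, the key encoding is the padded Long.MAX_VALUE minus timestamp form, and events are assumed to arrive in time order (an out-of-order event would produce a negative dt here).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the serializer's calculation; not HBase code.
final class DeltaSketch {

    // Stand-in for the table: rowkey -> event timestamp, kept in
    // lexicographic key order, as HBase keeps its rows.
    private final TreeMap<String, Long> table = new TreeMap<>();

    static String rowKey(String groupId, long ts) {
        return String.format("%s.%019d", groupId, Long.MAX_VALUE - ts);
    }

    // Inserts a row and returns its dt, or null when the group has no earlier row.
    Long insert(String groupId, long ts) {
        String prefix = groupId + ".";
        // Equivalent of "scan from the group prefix and take the first row":
        // with reverse timestamps that row is the latest one in the group.
        Map.Entry<String, Long> first = table.ceilingEntry(prefix);
        Long dt = null;
        if (first != null && first.getKey().startsWith(prefix)) {
            dt = ts - first.getValue();
        }
        table.put(rowKey(groupId, ts), ts);
        return dt;
    }

    public static void main(String[] args) {
        DeltaSketch sketch = new DeltaSketch();
        System.out.println(sketch.insert("gggg", 1000L)); // first row of the group
        System.out.println(sketch.insert("gggg", 1056L)); // 56 ms after the first
    }
}
```

The startsWith check matters because the first key at or after the prefix may belong to a different group when the current group is still empty.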

Example: input data with two rows looks like this:

gggg,123456, 'hello'
gggg,123400, 'there'

The result in HBase would look like this:

Row: gggg.123456
        cf:v = 'hello'
        cf:dt = null             <-- no previous row, so dt is null

Row: gggg.123400
        cf:v = 'there'
        cf:dt = 56               <-- dt is 56 ms, from 123456 - 123400


As shown, I've calculated the dt field from the previous record.  The dt=56 means this record came from an event that was logged 56 ms later than the first one.

Is this a common practice, or am I crazy to be doing this in the serializer? Are there performance or reliability issues that I should be considering?
