You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by Lars George <la...@gmail.com> on 2016/09/18 16:37:18 UTC

DiffKeyDeltaEncoder not fully efficient?

Hi,

Reading the key encoder code and noticed this in the DiffKeyDeltaEncoder:

if ((flag & FLAG_TIMESTAMP_IS_DIFF) == 0) {
  ByteBufferUtils.putLong(out, timestamp, timestampFitsInBytes);
} else {
  ByteBufferUtils.putLong(out, diffTimestamp, diffTimestampFitsInBytes);
}

So if the timestamp is the same as the one from the previous cell, we
store it again in its full fidelity? This makes no sense as we omit
otherwise all identical fields. I would assume this is a mistake?

In the decoding we do this:

// handle timestamp
int timestampFitsInBytes =
    ((flag & MASK_TIMESTAMP_LENGTH) >>> SHIFT_TIMESTAMP_LENGTH) + 1;
long timestamp = ByteBufferUtils.readLong(source, timestampFitsInBytes);
if ((flag & FLAG_TIMESTAMP_SIGN) != 0) {
  timestamp = -timestamp;
}
if ((flag & FLAG_TIMESTAMP_IS_DIFF) != 0) {
  timestamp = state.timestamp - timestamp;
}
buffer.putLong(timestamp);

This could be changed to simply use the state.timestamp if the value
is the same, no? Right now we would add six bytes for the cell
timestamp when they are identical.

If they are not, we store delta time, so I guess in practice this does
not happen too often, that we have the same time, unless they are
client supplied, say for a row to emulate transactions.

Should I open a ticket?

Cheers,
Lars

Re: DiffKeyDeltaEncoder not fully efficient?

Posted by Lars George <la...@gmail.com>.
You are right Ted, I misread that. Thanks for the second set of eyes!
Disregard this email :)

On Sun, Sep 18, 2016 at 7:09 PM, Ted Yu <yu...@gmail.com> wrote:
> In compressSingleKeyValue(), we compute the number of bytes to store
> timestamp:
>
>       timestampFitsInBytes = ByteBufferUtils.longFitsIn(timestamp);
>
> Then the number of bytes to store the diff between the timestamp of
> previous cell and current timestamp is computed:
>
>       diffTimestampFitsInBytes = ByteBufferUtils.longFitsIn(diffTimestamp);
>
> After that FLAG_TIMESTAMP_IS_DIFF would be set if it is more efficient to
> store the diff:
>
>       if (diffTimestampFitsInBytes < timestampFitsInBytes) {
>
>         flag |= (diffTimestampFitsInBytes - 1) << SHIFT_TIMESTAMP_LENGTH;
>
>         flag |= FLAG_TIMESTAMP_IS_DIFF;
>
> This should achieve good efficiency.
>
> Cheers
>
> On Sun, Sep 18, 2016 at 9:37 AM, Lars George <la...@gmail.com> wrote:
>
>> Hi,
>>
>> Reading the key encoder code and noticed this in the DiffKeyDeltaEncoder:
>>
>> if ((flag & FLAG_TIMESTAMP_IS_DIFF) == 0) {
>>   ByteBufferUtils.putLong(out, timestamp, timestampFitsInBytes);
>> } else {
>>   ByteBufferUtils.putLong(out, diffTimestamp, diffTimestampFitsInBytes);
>> }
>>
>> So if the timestamp is the same as the one from the previous cell, we
>> store it again in its full fidelity? This makes no sense as we omit
>> otherwise all identical fields. I would assume this is a mistake?
>>
>> In the decoding we do this:
>>
>> // handle timestamp
>> int timestampFitsInBytes =
>>     ((flag & MASK_TIMESTAMP_LENGTH) >>> SHIFT_TIMESTAMP_LENGTH) + 1;
>> long timestamp = ByteBufferUtils.readLong(source, timestampFitsInBytes);
>> if ((flag & FLAG_TIMESTAMP_SIGN) != 0) {
>>   timestamp = -timestamp;
>> }
>> if ((flag & FLAG_TIMESTAMP_IS_DIFF) != 0) {
>>   timestamp = state.timestamp - timestamp;
>> }
>> buffer.putLong(timestamp);
>>
>> This could be changed to simply use the state.timestamp if the value
>> is the same, no? Right now we would add six bytes for the cell
>> timestamp when they are identical.
>>
>> If they are not, we store delta time, so I guess in practice this does
>> not happen too often, that we have the same time, unless they are
>> client supplied, say for a row to emulate transactions.
>>
>> Should I open a ticket?
>>
>> Cheers,
>> Lars
>>

Re: DiffKeyDeltaEncoder not fully efficient?

Posted by Ted Yu <yu...@gmail.com>.
In compressSingleKeyValue(), we compute the number of bytes to store
timestamp:

      timestampFitsInBytes = ByteBufferUtils.longFitsIn(timestamp);

Then the number of bytes to store the diff between the timestamp of
previous cell and current timestamp is computed:

      diffTimestampFitsInBytes = ByteBufferUtils.longFitsIn(diffTimestamp);

After that FLAG_TIMESTAMP_IS_DIFF would be set if it is more efficient to
store the diff:

      if (diffTimestampFitsInBytes < timestampFitsInBytes) {

        flag |= (diffTimestampFitsInBytes - 1) << SHIFT_TIMESTAMP_LENGTH;

        flag |= FLAG_TIMESTAMP_IS_DIFF;

This should achieve good efficiency.

Cheers

On Sun, Sep 18, 2016 at 9:37 AM, Lars George <la...@gmail.com> wrote:

> Hi,
>
> Reading the key encoder code and noticed this in the DiffKeyDeltaEncoder:
>
> if ((flag & FLAG_TIMESTAMP_IS_DIFF) == 0) {
>   ByteBufferUtils.putLong(out, timestamp, timestampFitsInBytes);
> } else {
>   ByteBufferUtils.putLong(out, diffTimestamp, diffTimestampFitsInBytes);
> }
>
> So if the timestamp is the same as the one from the previous cell, we
> store it again in its full fidelity? This makes no sense as we omit
> otherwise all identical fields. I would assume this is a mistake?
>
> In the decoding we do this:
>
> // handle timestamp
> int timestampFitsInBytes =
>     ((flag & MASK_TIMESTAMP_LENGTH) >>> SHIFT_TIMESTAMP_LENGTH) + 1;
> long timestamp = ByteBufferUtils.readLong(source, timestampFitsInBytes);
> if ((flag & FLAG_TIMESTAMP_SIGN) != 0) {
>   timestamp = -timestamp;
> }
> if ((flag & FLAG_TIMESTAMP_IS_DIFF) != 0) {
>   timestamp = state.timestamp - timestamp;
> }
> buffer.putLong(timestamp);
>
> This could be changed to simply use the state.timestamp if the value
> is the same, no? Right now we would add six bytes for the cell
> timestamp when they are identical.
>
> If they are not, we store delta time, so I guess in practice this does
> not happen too often, that we have the same time, unless they are
> client supplied, say for a row to emulate transactions.
>
> Should I open a ticket?
>
> Cheers,
> Lars
>