Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2013/10/17 20:04:44 UTC

[jira] [Comment Edited] (HBASE-9794) KeyValues / cells backed by buffer fragments

    [ https://issues.apache.org/jira/browse/HBASE-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798190#comment-13798190 ] 

Andrew Purtell edited comment on HBASE-9794 at 10/17/13 6:04 PM:
-----------------------------------------------------------------

I am considering alternatives to the single contiguous buffer we have now, but only where needed. The current approach should remain the default.

The kinds of KeyValue manipulation desired here are analogous to those performed by network stacks in operating systems.

The BSD mbuf structure is a good example, although it carries many details specific to network stacks.

Back when the dinosaurs roamed the earth I worked with a research OS called Scout, which had (I thought) a particularly nice tree-based structure for composing and decomposing packet buffers. It could serve as inspiration. From foggy memory, it had an API like:

{code}
// Return the current length of the message
extern size_t msgLength (Msg m);

// Replace the contents of message 'm' with those of 'other', without an ownership transfer to 'm'; any changes will have COW semantics
extern void msgAssign (Msg m, Msg other);

// Replace the contents of message 'm' with the union of 'left' and 'right', without an ownership transfer to 'm'; any changes will have COW semantics
extern void msgJoin (Msg m, Msg left, Msg right);

// Remove 'len' bytes from the head of message 'm' into message 'other'
extern void msgBreak (Msg m, Msg other, size_t len);

// Get a contiguous view over 'len' bytes of new storage at the tail of message 'm'. May cause internal tree manipulations and allocations in order to provide it
extern void * msgPush (Msg m, size_t len);

// Get a contiguous view over 'len' bytes at the head of message 'm' and remove those bytes from the message. May cause internal tree manipulations and allocations in order to provide the view
extern void * msgPop (Msg m, size_t len);

// Get a contiguous view over 'len' bytes at the head of message 'm' without removing those bytes from the message. May cause internal tree manipulations and allocations in order to provide the view
extern void * msgPeek (Msg m, size_t len);

// Discard 'len' bytes from the head of the message
extern void msgDiscard (Msg m, size_t len);

// Discard 'len' bytes from the tail of the message
extern void msgTruncate (Msg m, size_t len);

// Initialize state for a walk over the tree of buffers for message 'm'
extern void msgWalkInit (MsgWalk w, Msg m);

// Return a view over the contents of the next buffer for message 'm', or the first buffer upon first invocation. Does not trigger any tree manipulations or allocations
extern void * msgWalkNext (MsgWalk w, size_t *lenp);

// Clean up walk state
extern void msgWalkDone (MsgWalk w);
{code}
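
As a rough illustration of how the walk calls might be consumed, here is a minimal C sketch. It is not Scout's implementation and not a proposal for HBase: the tree is replaced by a fixed-size fragment array, and every name in it (FragMsg, FragWalk, fragMsgAppend, fragMsgFlatten, MAX_FRAGS) is hypothetical. It only shows that composing a message from borrowed fragments moves no bytes until a caller asks for one contiguous copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 8  /* arbitrary cap so the sketch needs no allocation */

/* One borrowed, read-only slice of a buffer owned by someone else. */
typedef struct {
    const unsigned char *base;
    size_t len;
} Frag;

/* Hypothetical stand-in for 'Msg': an ordered list of fragments. */
typedef struct {
    Frag frags[MAX_FRAGS];
    size_t nfrags;
} FragMsg;

/* Hypothetical stand-in for 'MsgWalk': a cursor over the fragment list. */
typedef struct {
    const FragMsg *msg;
    size_t next;
} FragWalk;

void fragMsgInit(FragMsg *m) { m->nfrags = 0; }

/* Add a view of 'len' bytes at 'base' to the tail of 'm'. No bytes are
 * copied. Returns 0 on success, -1 if the fragment array is full. */
int fragMsgAppend(FragMsg *m, const void *base, size_t len) {
    if (m->nfrags == MAX_FRAGS) return -1;
    m->frags[m->nfrags].base = base;
    m->frags[m->nfrags].len = len;
    m->nfrags++;
    return 0;
}

/* Total length of the message: the sum of all fragment lengths. */
size_t fragMsgLength(const FragMsg *m) {
    size_t total = 0;
    for (size_t i = 0; i < m->nfrags; i++) total += m->frags[i].len;
    return total;
}

void fragWalkInit(FragWalk *w, const FragMsg *m) { w->msg = m; w->next = 0; }

/* Return the next fragment and store its length in '*lenp', or return
 * NULL (with '*lenp' set to 0) once the walk is exhausted. */
const void *fragWalkNext(FragWalk *w, size_t *lenp) {
    if (w->next >= w->msg->nfrags) { *lenp = 0; return NULL; }
    const Frag *f = &w->msg->frags[w->next++];
    *lenp = f->len;
    return f->base;
}

/* Gather every fragment into 'dst' (capacity 'cap') and return the number
 * of bytes written. This is the only place where data is copied. */
size_t fragMsgFlatten(const FragMsg *m, unsigned char *dst, size_t cap) {
    FragWalk w;
    size_t len, off = 0;
    const void *p;
    fragWalkInit(&w, m);
    while ((p = fragWalkNext(&w, &len)) != NULL) {
        assert(off + len <= cap);  /* caller must supply enough room */
        memcpy(dst + off, p, len);
        off += len;
    }
    return off;
}
```

For example, appending three fragments (say a key, an old value, and an appended suffix) and then calling fragMsgFlatten yields their concatenation, while fragWalkNext hands back the original pointers untouched. A real design would also need ownership tracking and copy-on-write, which this sketch leaves out.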


> KeyValues / cells backed by buffer fragments
> --------------------------------------------
>
>                 Key: HBASE-9794
>                 URL: https://issues.apache.org/jira/browse/HBASE-9794
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Andrew Purtell
>
> There are various places in the code where we see comments to the effect "would be great if we had a scatter gather API for KV", appearing at places where we rewrite KVs on the server, for example in HRegion where we process appends and increments.
> KeyValues are stored in buffers of fixed length. This approach has performance advantages for the common case where KVs are not manipulated on their way from disk to RPC. The disadvantage is that any manipulation of the KV internals requires creating a new buffer to hold the result and copying the KV data into it. Appends and increments are typically a small percentage of the overall workload, so this has been fine up to now.
>  
> KeyValues can now carry metadata known as tags. Tags are stored contiguously with the rest of the KeyValue. Applications wishing to use tags (such as per-cell security) change the equation by wanting to rewrite KVs significantly more often.
> We should consider backing KeyValue with an alternative structure that can better support rewriting portions of its data, appends to existing buffers, scatter-gather copies, possibly even copy-on-write.
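
To make the copy-versus-gather trade-off in the description concrete, here is a small C sketch. It assumes a POSIX system and uses the standard writev() call to stand in for the scatter-gather idea. The cell is treated as opaque bytes (the real KeyValue layout has length fields that would also need updating), and the names appendByCopy, appendByGather, and gatherMatchesCopy are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Copying approach: allocate a new buffer and copy both parts into it.
 * Returns NULL on allocation failure; the caller frees the result. */
unsigned char *appendByCopy(const void *cell, size_t cellLen,
                            const void *extra, size_t extraLen) {
    unsigned char *out = malloc(cellLen + extraLen);
    if (out == NULL) return NULL;
    memcpy(out, cell, cellLen);
    memcpy(out + cellLen, extra, extraLen);
    return out;
}

/* Gather approach: describe the two fragments and let writev() emit them
 * back to back. No intermediate buffer is allocated in user space. */
ssize_t appendByGather(int fd, const void *cell, size_t cellLen,
                       const void *extra, size_t extraLen) {
    struct iovec iov[2];
    iov[0].iov_base = (void *)cell;   /* writev only reads; cast drops const */
    iov[0].iov_len = cellLen;
    iov[1].iov_base = (void *)extra;
    iov[1].iov_len = extraLen;
    return writev(fd, iov, 2);
}

/* Self-check: push the gathered bytes through a pipe and compare them with
 * the copied buffer. Returns 1 if both approaches produce the same bytes. */
int gatherMatchesCopy(void) {
    const char cell[] = "row1/cf:q/value";   /* made-up cell bytes */
    const char extra[] = "|tag=secret";      /* made-up tag bytes */
    size_t cellLen = strlen(cell), extraLen = strlen(extra);
    size_t total = cellLen + extraLen;
    unsigned char received[64];
    int fds[2];
    int same = 0;

    assert(total <= sizeof received);
    unsigned char *copied = appendByCopy(cell, cellLen, extra, extraLen);
    if (copied == NULL || pipe(fds) != 0) { free(copied); return 0; }

    if (appendByGather(fds[1], cell, cellLen, extra, extraLen) == (ssize_t)total &&
        read(fds[0], received, sizeof received) == (ssize_t)total) {
        same = memcmp(copied, received, total) == 0;
    }
    close(fds[0]);
    close(fds[1]);
    free(copied);
    return same;
}
```

Both paths put the same bytes on the wire; the difference is that the gather path never materializes the combined cell in user memory. Whether that wins in practice would depend on fragment sizes and on how often cells are actually rewritten.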


