
Posted to dev@drill.apache.org by Jacques Nadeau <ja...@apache.org> on 2013/08/07 03:53:46 UTC

Quick overview of HyperBatch concept.

Someone was asking me about the HyperBatch concept that a recent
commit introduced. The idea is pretty simple. We currently have a
two-byte selection vector that we can use to mask a portion of a
columnar record batch before we rewrite it. This is to help in
situations where the rewrite would be unwarranted given the subsequent
operator. This works great for non-blocking operators.
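As a rough illustration of the two-byte selection vector idea (a minimal sketch, not Drill's actual code; the class Sv2Sketch and its methods are hypothetical names invented for this example): a filter records the 16-bit positions of the surviving records, and a downstream operator reads the untouched column through those positions instead of rewriting it.

```java
import java.util.Arrays;
import java.util.function.IntPredicate;

// Hypothetical sketch: a plain int[] stands in for one column of a record batch.
final class Sv2Sketch {
    // Build a two-byte selection vector: one unsigned 16-bit index (a Java char)
    // for each record that passes the filter. The column itself is not copied.
    static char[] select(int[] column, IntPredicate keep) {
        if (column.length > 65536) {
            throw new IllegalArgumentException("a two-byte index addresses at most 65536 records");
        }
        char[] tmp = new char[column.length];
        int n = 0;
        for (int i = 0; i < column.length; i++) {
            if (keep.test(column[i])) {
                tmp[n++] = (char) i;
            }
        }
        return Arrays.copyOf(tmp, n);
    }

    // A downstream operator visits only the selected records via the vector.
    static long sum(int[] column, char[] sv2) {
        long total = 0;
        for (char idx : sv2) {
            total += column[idx];
        }
        return total;
    }
}
```

For example, Sv2Sketch.select(new int[] {5, 12, 7, 30}, v -> v > 6) yields the indexes 1, 2 and 3, and the original array is left as it was.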

In the case of blocking operators such as sort, this becomes a bit
harder. (Especially in the case of schema changes, which I won't
discuss here.) One solution is to generate this new thing called a
hyperbatch. It looks kind of like a batch but it carries a
SelectionVector4 with it. The SV4 describes not only the valid
records, but also their location within a set of multiple supporting
record batches. Each entry is encoded as two unsigned bytes for the
record batch index followed by two unsigned bytes for the individual
record within that batch (64K batches of 64K records each, so roughly
4B records max). In these cases, a (hyper)batch doesn't hold a
ValueVector for each field but rather an indexed array of
ValueVectors. This allows a pointer sort to be completed without
rewriting the column-oriented data until required (typically when
writing to disk or socket). In the meantime, some additional
operators can be pipelined with only small modifications. If we get
to the point that a particular operator no longer supports an SV4 input
batch, we insert a SelectionVectorRemover to rewrite the data to the
more standard record batch format.
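The four-byte encoding and the pointer sort can be sketched roughly as follows. This is only an illustration under assumptions: the class Sv4Sketch and its methods are hypothetical, an int[][] stands in for the indexed array of ValueVectors of a single field, and the batch index is assumed to sit in the upper 16 bits of each entry with the record index in the lower 16 bits.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: batches[b][r] is the value of record r in batch b.
final class Sv4Sketch {
    // Pack a batch index and a record index into one four-byte entry.
    static int encode(int batchIndex, int recordIndex) {
        if (batchIndex < 0 || batchIndex > 0xFFFF || recordIndex < 0 || recordIndex > 0xFFFF) {
            throw new IllegalArgumentException("index out of 16-bit range");
        }
        return (batchIndex << 16) | recordIndex;
    }

    static int batchIndex(int entry) {
        return entry >>> 16;      // unsigned shift keeps the upper half positive
    }

    static int recordIndex(int entry) {
        return entry & 0xFFFF;
    }

    // Pointer sort: only the four-byte entries are reordered, never the data.
    static int[] sortedPointers(int[][] batches) {
        int total = 0;
        for (int[] batch : batches) {
            total += batch.length;
        }
        Integer[] entries = new Integer[total];
        int n = 0;
        for (int b = 0; b < batches.length; b++) {
            for (int r = 0; r < batches[b].length; r++) {
                entries[n++] = encode(b, r);
            }
        }
        Arrays.sort(entries,
            Comparator.comparingInt((Integer e) -> batches[batchIndex(e)][recordIndex(e)]));
        int[] sv4 = new int[total];
        for (int i = 0; i < total; i++) {
            sv4[i] = entries[i];
        }
        return sv4;
    }

    // Roughly what a remover step does: copy the values out in SV4 order
    // into one flat, ordinary column.
    static int[] materialize(int[][] batches, int[] sv4) {
        int[] out = new int[sv4.length];
        for (int i = 0; i < sv4.length; i++) {
            out[i] = batches[batchIndex(sv4[i])][recordIndex(sv4[i])];
        }
        return out;
    }
}
```

Operators that understand the entries can keep reading through sortedPointers(...) without any copy. The materialize step is only paid once a consumer needs the standard layout.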

You can see an example of the interaction at line 68 of this file:
https://github.com/apache/incubator-drill/blob/db3afaa854fc8475592907dba97162ecf869f9df/sandbox/prototype/exec/java-exec/src/main/java/org/apache/drill/exec/expr/CodeGenerator.java

thanks,
Jacques

Re: Quick overview of HyperBatch concept.

Posted by Jacques Nadeau <ja...@apache.org>.

The selection vector is the same. I'm not sure whether either of the
others has something like the hyperbatch or whether it's new to Drill.

J
On Aug 6, 2013 7:27 PM, "Timothy Chen" <tn...@gmail.com> wrote:

> Ah gotcha, it's the same concept as in MonetDB and what the Hive batch
> query engine is using too. Didn't know they call it HyperBatch (unless
> you invented it?)
>
> Tim

Re: Quick overview of HyperBatch concept.

Posted by Timothy Chen <tn...@gmail.com>.

Ah gotcha, it's the same concept as in MonetDB and what the Hive batch
query engine is using too. Didn't know they call it HyperBatch (unless
you invented it?)

Tim

