
Posted to dev@drill.apache.org by Jinfeng Ni <ji...@gmail.com> on 2013/12/04 00:35:25 UTC

buffer allocation of cast into var length type

Hi all,

I'm working on explicit cast support in Drill. So far, I have prototyped
the implementation for the first 3 categories below, and would like your
input on how to deal with buffer allocation for casts from a fixed-length
type into a var-length type.

1. cast from fixed-length type to fixed-length type
eg: float4 -> int
    int -> float4

2. cast from var-length type to fixed-length type
eg: varchar -> int
    varbinary -> int
(Still need to figure out how to handle overflow during the cast)

3. cast from fixed-length type to var-length type
eg: int -> varchar
    bigint -> varbinary

4. cast from var-length type to var-length type
eg: varchar -> varchar
    varbinary -> varchar

The 3rd category, i.e. from fixed-length to var-length type, causes a
problem for the current implementation in terms of buffer allocation.

For fixed-length types, Drill uses a Java primitive type in the ValueHolder.
For instance, IntHolder.value is an int. But for var-length types, Drill
uses a buffer to hold the value. When casting from int into varchar,
the buffer for the VarCharHolder is not allocated, so we have to figure
out a way to do the allocation before the cast.

There seem to be 2 options:
Option 1: allocate a buffer in the function template's setup() method. The
buffer will be used in the eval() method.
Problems with this option:
1) It needs two copies: the first from the fixed-length input into the
buffer allocated in setup(), the second from that buffer into the buffer in
the target vector.
2) It needs a cleanup() method added to the function template to release the
allocated buffer; no such method exists in the code base currently.

Option 2: the consumer of the cast function's output is responsible for
pre-allocating the buffer in the target ValueVector for all the
VarCharHolders. The cast function simply does the conversion and copies
the result into the pre-allocated buffer in the target ValueVector.
The advantage of this option is that it requires only 1 copy.
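
To make the copy counts concrete, here is a minimal sketch of the two
options, using a plain byte[] as a stand-in for the data buffer of the
target vector. The names CastCopySketch, writeDigits, castViaScratch and
castDirect are hypothetical and are not part of Drill's function template
API; the sketch only shows where the extra copy in option 1 comes from.

```java
final class CastCopySketch {
    // Writes the base-10 ASCII form of value into dst starting at offset.
    // Returns the number of bytes written. The caller must provide enough room.
    static int writeDigits(int value, byte[] dst, int offset) {
        long v = value;                 // widen so Integer.MIN_VALUE can be negated
        int pos = offset;
        if (v < 0) {
            dst[pos++] = '-';
            v = -v;
        }
        int len = 1;                    // count the digits
        for (long t = v; t >= 10; t /= 10) {
            len++;
        }
        for (int i = pos + len - 1; i >= pos; i--) {
            dst[i] = (byte) ('0' + (v % 10));   // fill from the last digit backwards
            v /= 10;
        }
        return pos + len - offset;
    }

    // Option 1 (hypothetical): convert into a scratch buffer, then copy again.
    static int castViaScratch(int value, byte[] scratch, byte[] target, int targetOffset) {
        int n = writeDigits(value, scratch, 0);                 // copy 1
        System.arraycopy(scratch, 0, target, targetOffset, n);  // copy 2
        return n;
    }

    // Option 2 (hypothetical): convert straight into the pre-allocated target.
    static int castDirect(int value, byte[] target, int targetOffset) {
        return writeDigits(value, target, targetOffset);        // the only copy
    }
}
```

In both cases the caller still has to know where in the target buffer the
value starts and how much room is left, which is the allocation question
raised above.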

I have prototyped the 1st option, and have not figured out how to implement
the 2nd approach yet. But I would like your suggestions regarding these 2
options before I proceed.

Thanks!

Re: buffer allocation of cast into var length type

Posted by Jinfeng Ni <ji...@gmail.com>.

Hi Jason,




On Tue, Dec 3, 2013 at 7:27 PM, Jason Altekruse <al...@gmail.com> wrote:

>
> In regards to the more involved case where you need to convert an integer
> to its ASCII representation, how would the consumer know how big a buffer
> to allocate? Would there be a pre-processing step where you determine the
> number of digits needed to represent the integers/doubles in base 10? For
> integers I guess we could zero fill them all to the same length, but that
> seems like it wouldn't be worth it for the little time we would save
> scanning through the dataset.
>
>
For the cast function, it seems simple: the user would specify the max
length of the target VARCHAR type, e.g. VARCHAR(10).
If the length is not big enough, truncation would happen during the
conversion, and a warning might be raised.
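
As a rough illustration of that behaviour, the sketch below casts an int to
VARCHAR(maxLength) and reports whether truncation happened, so that the
caller could raise a warning. The class VarcharTruncationSketch, its Result
type and the method castIntToVarchar are hypothetical names for this
example only, and keeping the leading characters is an assumption about
what truncation would mean here.

```java
final class VarcharTruncationSketch {
    // The (possibly truncated) text plus a flag the caller can turn into a warning.
    static final class Result {
        final String text;
        final boolean truncated;

        Result(String text, boolean truncated) {
            this.text = text;
            this.truncated = truncated;
        }
    }

    // Casts value to VARCHAR(maxLength), keeping only the first maxLength characters.
    static Result castIntToVarchar(int value, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        String full = Integer.toString(value);
        if (full.length() <= maxLength) {
            return new Result(full, false);
        }
        return new Result(full.substring(0, maxLength), true);
    }
}
```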

However, as far as I know, the current Drill code allocates a
pre-determined length for VarCharVector / VarBinaryVector (correct me if
I'm wrong). This makes sense when reading a schemaless parquet file, since
the parquet reader does not know the actual length of each column. But in
the cast case, we know the max length of the target type. In that
sense, I feel that VarCharVector / VarBinaryVector need a way to specify
the max length when we know the target type.

Another issue with a pre-determined length is that the buffer may not be big
enough to hold all the incoming data. Casts from fixed-length input do not
have a serious problem here, since we know the max length. But for other
functions, like string concat, this pre-determined length may be an
issue.
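
For the fixed-length case, the bound can be computed up front. A minimal
sketch, assuming the digits are written as single-byte ASCII characters:
an int needs at most 11 characters ("-2147483648") and a bigint at most 20
("-9223372036854775808"). The class and method names are hypothetical.

```java
final class CastBufferSizing {
    static final int MAX_INT_CHARS = 11;     // length of "-2147483648"
    static final int MAX_BIGINT_CHARS = 20;  // length of "-9223372036854775808"

    // Upper bound, in bytes, on the data buffer needed to cast rowCount ints
    // to VARCHAR(declaredMax), whichever of the two limits is smaller per value.
    static long maxBytesForIntCast(int rowCount, int declaredMax) {
        int perValue = Math.min(MAX_INT_CHARS, declaredMax);
        return (long) rowCount * perValue;
    }
}
```

No such bound exists for something like string concat, where the output
length depends on the data.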


> Another option is that we could always over-allocate the buffers and then
> slice off the excess, but there is no really good way to avoid waste.
>
> Not sure if we want to open this can of worms, but there is another
> possible solution that is related to some thoughts I have around making the
> parquet reader faster. It is possible that we might have to break our
> design of a single column always being represented by a single buffer.
>
> In cases like this where it is hard to know the final buffer length, it
> might be easier to allocate a reasonable guess and then just tack on
> another buffer if we guessed wrong. I know that one of the main goals of
> value vectors is that they are random access, with minimal overhead for
> value extraction, but I think this might be a case where it would be worth
> breaking it.
>
> The simple implementation might look like the variable length vectors, with
> a metadata buffer sitting in front of the data to describe ranges of values
> held in each of the buffers, e.g. values 1-400 are in buffer 1 and 401-1000
> are in buffer 2. (I would assume we would never exceed 5 or so buffers, but
> it could provide extra flexibility.)
>

I'm trying to look at how to copy into the buffer in the outgoing
record batch directly, instead of copying into a temp buffer. This seems to
require a change in the code generator for the function. I'll look into it
and will keep you updated.

Thanks!


> To prevent the need for an extra step of indirection with each value
> extraction, we could change the interfaces on value vectors a bit to make
> them expose an iterator, rather than a get(index) method. This would allow
> for fetching the first buffer, reading all of its values with the same
> overhead as we have now, until we hit the end of the buffer, and then we
> could rely on an exception to indicate we ran out of values and at that
> time swap to the second buffer.
>
> -Jason
>
>

Re: buffer allocation of cast into var length type

Posted by Jason Altekruse <al...@gmail.com>.

Jinfeng,

I did not even think of actually turning integers into ASCII. While I know
it is part of SQL, it seems like such a crazy thing to do in a short-lived
query on a large dataset.

I would take a look at the code we are using for the project operator; that
is the last time I remember discussing passing buffers between different
value vectors. There we used it for simply changing the metadata for a
column where all that the project involved was a column name change, not a
mathematical operation.

In regards to the more involved case where you need to convert an integer
to its ASCII representation, how would the consumer know how big a buffer
to allocate? Would there be a pre-processing step where you determine the
number of digits needed to represent the integers/doubles in base 10? For
integers I guess we could zero fill them all to the same length, but that
seems like it wouldn't be worth it for the little time we would save
scanning through the dataset.
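
Such a pre-processing step could look roughly like the sketch below, which
counts the characters each int needs in base 10 (including a leading '-')
and sums them over a batch. The names DigitCountSketch, asciiLength and
totalAsciiLength are made up for this example; it costs one extra pass over
the values before any bytes are written.

```java
final class DigitCountSketch {
    // Number of ASCII characters in the base-10 form of value, sign included.
    static int asciiLength(int value) {
        long v = value;          // widen so Integer.MIN_VALUE can be negated
        int len = 1;
        if (v < 0) {
            len++;               // room for '-'
            v = -v;
        }
        while (v >= 10) {
            v /= 10;
            len++;
        }
        return len;
    }

    // Pre-processing pass: exact number of bytes needed for the whole batch.
    static long totalAsciiLength(int[] values) {
        long total = 0;
        for (int v : values) {
            total += asciiLength(v);
        }
        return total;
    }
}
```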

Another option is that we could always over-allocate the buffers and then
slice off the excess, but there is no really good way to avoid waste.

Not sure if we want to open this can of worms, but there is another
possible solution that is related to some thoughts I have around making the
parquet reader faster. It is possible that we might have to break our
design of a single column always being represented by a single buffer.

In cases like this where it is hard to know the final buffer length, it
might be easier to allocate a reasonable guess and then just tack on
another buffer if we guessed wrong. I know that one of the main goals of
value vectors is that they are random access, with minimal overhead for
value extraction, but I think this might be a case where it would be worth
breaking it.

The simple implementation might look like the variable length vectors, with
a metadata buffer sitting in front of the data to describe ranges of values
held in each of the buffers, e.g. values 1-400 are in buffer 1 and 401-1000
are in buffer 2. (I would assume we would never exceed 5 or so buffers, but
it could provide extra flexibility.)
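
A minimal sketch of that layout, using int[] chunks in place of real
buffers and a list of starting indexes as the range metadata. The class
ChunkedIntColumn is hypothetical, and indexes here are zero-based rather
than the 1-based ranges in the example above. With only a handful of
chunks, a linear scan over the metadata is enough to find the right one.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkedIntColumn {
    private final List<int[]> chunks = new ArrayList<>();
    // Metadata: global index of the first value held in each chunk.
    private final List<Integer> firstIndex = new ArrayList<>();
    private int size = 0;

    // Tacks another buffer onto the end of the column.
    void addChunk(int[] chunk) {
        firstIndex.add(size);
        chunks.add(chunk);
        size += chunk.length;
    }

    int size() {
        return size;
    }

    // Random access costs one extra lookup to find the chunk holding index.
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int c = chunks.size() - 1;
        while (firstIndex.get(c) > index) {
            c--;                 // walk back to the chunk whose range covers index
        }
        return chunks.get(c)[index - firstIndex.get(c)];
    }
}
```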

To prevent the need for an extra step of indirection with each value
extraction, we could change the interfaces on value vectors a bit to make
them expose an iterator, rather than a get(index) method. This would allow
for fetching the first buffer, reading all of its values with the same
overhead as we have now, until we hit the end of the buffer, and then we
could rely on an exception to indicate we ran out of values and at that
time swap to the second buffer.
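
A sketch of the iterator idea, again with int[] chunks standing in for
buffers. Note one deliberate difference: this version uses an explicit
bounds check to detect the end of a chunk instead of relying on an
exception, which is simpler to show in a small example. ChunkIterator is a
hypothetical name, not an existing value vector interface.

```java
import java.util.List;
import java.util.NoSuchElementException;

final class ChunkIterator {
    private final List<int[]> chunks;
    private int chunk = 0;   // which buffer we are reading
    private int pos = 0;     // position inside that buffer

    ChunkIterator(List<int[]> chunks) {
        this.chunks = chunks;
        skipExhausted();
    }

    // Moves to the next chunk that still has unread values, if any.
    private void skipExhausted() {
        while (chunk < chunks.size() && pos >= chunks.get(chunk).length) {
            chunk++;
            pos = 0;
        }
    }

    boolean hasNext() {
        return chunk < chunks.size();
    }

    // Reads sequentially; the swap to the next buffer happens only at chunk ends.
    int next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int value = chunks.get(chunk)[pos++];
        skipExhausted();
        return value;
    }
}
```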

-Jason


Re: buffer allocation of cast into var length type

Posted by Jinfeng Ni <ji...@gmail.com>.

Hi Jason,

Good question.

Actually, some type casts are *binary coercible*, which means there is no
need to do any conversion internally, for instance char -> varchar,
varchar -> varbinary, etc.

For other cases, some transformation is required, since the binary
representation of the source type is different from the binary
representation of the target type.
For instance, int -> varchar. The target type needs to keep each digit of
the integer, while the source type is a 4-byte representation.
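
To make the difference concrete, the sketch below produces both byte
layouts for the same int. For 1234 the fixed-length form is the 4 bytes
00 00 04 D2, while the varchar form is the 4 ASCII bytes 31 32 33 34; for
other values the two forms differ in length as well. The class name is
hypothetical, and big-endian byte order is an arbitrary choice for the
illustration, not a statement about Drill's buffers.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class IntRepresentationSketch {
    // Fixed-length form: always 4 bytes (big-endian, ByteBuffer's default order).
    static byte[] fixedLengthBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Var-length form: one ASCII byte per character of the decimal text.
    static byte[] varcharBytes(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.US_ASCII);
    }
}
```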

I will look into whether it's possible to use the buffer in the output
value vector directly, without copying into a new buffer.






Re: buffer allocation of cast into var length type

Posted by Jason Altekruse <al...@gmail.com>.

Hi Jinfeng,

This might be a dumb question, but is there any transformation being
performed when going from a fixed length type to a variable length type?
That is, are the bytes in the buffer coming in going to be the same as the
bytes coming out of the cast?

I understand that for casts like int -> long we need to add extra space
between each value, but is it possible that we could just hand the buffer
from one value vector type to the other without copying it into a new
buffer?

We would still have to create a new buffer with the offsets of the
"variable length" values, but it would save us some time if we could do
this.
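
As a sketch of what that offsets buffer would contain: given the byte
length of each value, offsets[i] marks where value i starts in the shared
data buffer and offsets[i + 1] where it ends. If the data bytes really were
unchanged by the cast, only this array would need to be built. The names
OffsetsSketch, buildOffsets and valueAt are hypothetical.

```java
import java.util.Arrays;

final class OffsetsSketch {
    // Builds the offsets for values laid out back to back in one data buffer.
    // The result has one more entry than there are values.
    static int[] buildOffsets(int[] lengths) {
        int[] offsets = new int[lengths.length + 1];
        for (int i = 0; i < lengths.length; i++) {
            offsets[i + 1] = offsets[i] + lengths[i];
        }
        return offsets;
    }

    // Extracts value i from the data buffer using the offsets.
    static byte[] valueAt(byte[] data, int[] offsets, int i) {
        return Arrays.copyOfRange(data, offsets[i], offsets[i + 1]);
    }
}
```

For three 4-byte values the offsets would be 0, 4, 8 and 12.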

-Jason Altekruse

