Posted to dev@drill.apache.org by Jacques Nadeau <ja...@apache.org> on 2013/04/26 06:06:42 UTC

B[yi]teSize execwork tasks someone could potentially help out with...

I'm working on the execwork stuff and if someone would like to help out,
here are a couple of things that need doing.  I figured I'd drop them here
and see if anyone wants to work on them in the next couple of days.  If so,
let me know; otherwise I'll be picking them up soon.

*RPC*
- RPC Layer Handshakes: Currently, I haven't implemented the handshake that
should happen in either the User <> Bit or the Bit <> Bit layer.  The plan
was to use an additional inserted event handler that removed itself from
the event pipeline after a successful handshake or disconnected the channel
on a failed handshake (with appropriate logging).  The main validation at
this point will be simply confirming that both endpoints are running on the
same protocol version.  The only other information that is currently
needed is that in the Bit <> Bit communication, the client should
inform the server of its DrillEndpoint so that the server can then map that
for future communication in the other direction.
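
A rough sketch of that handler against the Netty 4 API (all names here are
illustrative, not actual Drill code):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    public class HandshakeHandler extends ChannelInboundHandlerAdapter {
      private static final int PROTOCOL_VERSION = 1;  // assumed constant

      // Assumed decoded handshake message, for the sake of the sketch.
      static final class Handshake {
        final int version;
        Handshake(int version) { this.version = version; }
      }

      @Override
      public void channelRead(ChannelHandlerContext ctx, Object msg) {
        Handshake handshake = (Handshake) msg;
        if (handshake.version != PROTOCOL_VERSION) {
          // Failed handshake: log and drop the channel.
          System.err.println("Handshake failed, closing channel: got version "
              + handshake.version + ", expected " + PROTOCOL_VERSION);
          ctx.close();
          return;
        }
        // Bit <> Bit only: this is where the server would record the
        // client's DrillEndpoint for traffic in the other direction.
        ctx.pipeline().remove(this);  // success: leave the event pipeline
      }
    }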

*DataTypes*
- General Expansion: Currently, we have a hodgepodge of datatypes within
the org.apache.drill.common.expression.types.DataType.  We need to clean
this up.  There should be types that map to standard SQL types.  My
thinking is that we should actually have separate types for each of
nullable, non-nullable and repeated (required, optional and repeated in
protobuf vernacular) since we'll generally operate with those values
completely differently (and each type should reveal which it is).  We
should also have a relationship mapping from each to the other (e.g. how to
convert a signed 32 bit int into a nullable signed 32 bit int).
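
A minimal sketch of that shape (hypothetical names, not the current
DataType class):

    // One concrete type = a base type plus a cardinality mode.
    enum MinorType { INT, BIGINT, VARCHAR }
    enum DataMode { REQUIRED, OPTIONAL, REPEATED }  // protobuf vernacular

    final class MajorType {
      final MinorType type;
      final DataMode mode;

      MajorType(MinorType type, DataMode mode) {
        this.type = type;
        this.mode = mode;
      }

      // The relationship mapping: e.g. required INT -> nullable INT.
      MajorType asNullable() {
        return new MajorType(type, DataMode.OPTIONAL);
      }
    }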

- Map Types: We don't need a nullable variant but we will need different
map types: inline and fieldwise.  I think these will be useful for the
execution engine and will be leveraged depending on the particular needs--
for example
fieldwise will be a natural fit where we're operating on columnar data and
doing an explode or other fieldwise nested operation and inline will be
useful when we're doing things like sorting a complex field.  Inline will
also be appropriate where we have extremely sparse record sets.  We'll just
need transformation methods between the two variations.  In the case of a
fieldwise map type field, the field is virtual and only exists to contain
its child fields.
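
A sketch of the two layouts and the transformations between them (all
hypothetical):

    // Fieldwise: a virtual field whose children are stored column by
    // column; natural for explode and other fieldwise nested operations.
    interface FieldwiseMap {
      Object getChild(String name);
      InlineMap toInline();          // pack per-record for sorting etc.
    }

    // Inline: each record's map is a single contiguous encoded value;
    // good for sorting complex fields and for extremely sparse data.
    interface InlineMap {
      byte[] getRecord(int index);
      FieldwiseMap toFieldwise();    // explode back out for columnar work
    }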

- Non-static DataTypes: We need types that don't fit the static data
type model above.  Examples include fixed width types (e.g. 10 byte
string), polymorphic (inline encoded) types (number or string depending on
record) and repeated nested versions of our other types.  These are a
little more gnarly as we need to support canonicalization.  Optiq
has some methods for how to handle this kind of type system so it probably
makes sense to leverage that system.
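
For the canonicalization piece, an interning factory could work; a sketch
for fixed-width strings (hypothetical class):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // FixedChar.of(10) always returns the same instance, so canonicalized
    // types can be compared cheaply by identity.
    final class FixedChar {
      private static final ConcurrentMap<Integer, FixedChar> CANON =
          new ConcurrentHashMap<Integer, FixedChar>();
      final int width;

      private FixedChar(int width) { this.width = width; }

      static FixedChar of(int width) {
        FixedChar fresh = new FixedChar(width);
        FixedChar prior = CANON.putIfAbsent(width, fresh);
        return prior != null ? prior : fresh;
      }
    }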

*Expression Type Materialization*
- LogicalExpression type materialization: Right now, LogicalExpressions
include support for late type binding.  As part of the record batch
execution path, these need to get materialized with correct casting, etc
based on the actual found schema.  As such, we need to have a function
which takes a LogicalExpression tree, applies a materialized BatchSchema
and returns a new LogicalExpression tree with full type settings.  As part
of this process, all types need to be cast as necessary and full validation
of the tree should be done.  Timothy has pending validation work on a
pull request that would be a good piece of code to leverage for this
need.  We also have a visitor model for the expression tree
that should be able to aid in the updated LogicalExpression construction.
- LogicalExpression to Java expression conversion: We need to be able to
convert our logical expressions into Java code expressions.  Initially,
this should be done in a simplistic way, using things like implicit
boxing just to get something working.  This will likely be specialized
per major type (nullable, non-nullable and repeated) and a framework
might make the most sense actually just distinguishing the
LogicalExpression by these types.  (Sketches of both pieces follow below.)
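
A sketch of both pieces (hypothetical names; LogicalExpression and
BatchSchema are the existing classes):

    // 1) Materialization: resolve late-bound types against an actual
    //    schema, injecting casts and validating; returns a fully typed
    //    copy of the tree.
    interface ExpressionMaterializer {
      LogicalExpression materialize(LogicalExpression expr, BatchSchema schema);
    }

    // 2) Simplistic Java source generation for an add over two nullable
    //    ints; boxing keeps the null handling trivial for a first cut.
    static String addNullableInts(String a, String b) {
      return "((" + a + " == null || " + b + " == null) ? null : "
           + "Integer.valueOf(" + a + " + " + b + "))";
    }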

*JDBC*
- The Drill JDBC driver layer needs to be updated to leverage our ZooKeeper
coordination locations so that it can correctly find the cluster location
(see the sketch after this list).
- The Drill JDBC driver should also manage reconnects so that if it loses
its connection with a particular Drillbit partner, it will reconnect to
another available node in the cluster.
- Someone should point SQuirreL at Julian's latest work and see how things
go...
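
For the ZooKeeper discovery bullet, a sketch using Curator (the znode path
and class names are assumptions, not the actual driver code):

    import java.util.List;
    import java.util.Random;

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class DrillbitDiscovery {
      // Pick a live Drillbit out of ZooKeeper before opening a connection.
      public static String pickEndpoint(String zkConnect) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
            zkConnect, new ExponentialBackoffRetry(1000, 3));
        zk.start();
        try {
          List<String> bits = zk.getChildren().forPath("/drill/drillbits");
          return bits.get(new Random().nextInt(bits.size()));
        } finally {
          zk.close();
        }
      }
    }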

*ByteCode Engineering*
- We need to put together a concrete class materialization strategy.  My
thinking for relational operators and code generation is that in most
cases, we'll have an interface and a template class for a particular
relational operator.  We will build a template class that has all the
generic stuff implemented but will make calls to empty methods where it
expects lower level operations to occur.  This allows things like the
looping and certain types of null management to be fully materialized in
source code without having to deal with the complexities of ByteCode
generation.  It also eases testing complexity.  When a particular
implementation is required, the Drillbit will be responsible for generating
updated method bodies as required for the record-level expressions, marking
all the methods and class as final, then loading the implementation into
the query-level classloader.  Note that the production Drillbit will never
load the template class into the JVM and will simply utilize it in ByteCode
form.  I was hoping someone could take a look at trying to pull together a
cohesive approach to doing this using ASM and Janino (likely utilizing the
JDK commons-compiler mode).  The interface should be pretty simple: input
is an interface, a template class name, a set of (method_signature,
method_body_text) objects and a varargs of objects that are required for
object instantiation.  The return should be an instance of the interface.
The implementation should check things like each provided method_signature
matching an available method block, the method blocks being replaced being
empty, the object constructor matching the set of object arguments provided
by the instantiation request, etc.
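
Rough shape of that interface (every name below is a placeholder):

    import java.util.Collection;

    // A (method_signature, method_body_text) pair destined for Janino.
    final class MethodBody {
      final String signature;  // must match an empty method in the template
      final String bodyText;   // Java source for the replacement body

      MethodBody(String signature, String bodyText) {
        this.signature = signature;
        this.bodyText = bodyText;
      }
    }

    interface ClassMaterializer {
      // Rewrites the template's ByteCode (ASM), compiles the new method
      // bodies (Janino/commons-compiler), marks the class and methods
      // final, loads the result into the query-level classloader, and
      // returns a ready instance built with the given constructor args.
      <T> T materialize(Class<T> iface,
                        String templateClassName,
                        Collection<MethodBody> methods,
                        Object... constructorArgs);
    }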

*ByteBuf Improvements*
- Our BufferAllocator should support child allocators (getChild()) with
their own memory maximums and accounting (so we can determine the memory
overhead of particular queries).  We also need to be able to release entire
child allocations at once (see the sketches after this list).
- We need to create a number of primitive type specific wrapping classes
for ByteBuf.  These additions include fixed offset indexing for operations
(e.g. index 1 of an int buffer should be at 4 bytes), adding support for
unsigned values (my preference would be to leverage the work in Guava if
that makes sense) and modifying the hard bounds checks to softer assert
checks to increase production performance.  While we could do this
utilizing the ByteBuf interface, from everything I've experienced and read,
we need to minimize issues with inlining and performance so we really need
to be able to modify/refer to PooledUnsafeDirectByteBuf directly for the
wrapping classes.  Of course, it is a final package-private class.  Short
term that means we really need to create a number of specific buffer types
that wrap it and just put them in the io.netty.buffer package (or
alternatively create a Drill version or wrapper).
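
Two sketches for the above (hypothetical interfaces and names throughout):

    import io.netty.buffer.ByteBuf;

    // Child allocators carry their own cap and accounting; closing a
    // child releases its entire allocation at once.
    interface BufferAllocator extends AutoCloseable {
      ByteBuf buffer(int size);
      BufferAllocator getChild(long maxBytes);
      long getAllocatedMemory();   // per-query overhead accounting
      @Override
      void close();                // releases everything this child handed out
    }

    // Typed wrapper with fixed-offset indexing: index 1 of an int buffer
    // lives at byte offset 4.  Asserts replace hard bounds checks, so a
    // production run without -ea skips them entirely.
    final class IntBuf {
      private final ByteBuf buf;   // ideally PooledUnsafeDirectByteBuf underneath

      IntBuf(ByteBuf buf) { this.buf = buf; }

      int get(int index) {
        assert index >= 0 && (index + 1) * 4 <= buf.capacity();
        return buf.getInt(index * 4);
      }

      void set(int index, int value) {
        assert index >= 0 && (index + 1) * 4 <= buf.capacity();
        buf.setInt(index * 4, value);
      }
    }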

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by David Alves <da...@gmail.com>.
… btw thank you for all the work in laying this out.

Best
David

On Apr 25, 2013, at 11:10 PM, David Alves <da...@gmail.com> wrote:

> Hi Jacques
> 
> 	I can take the RPC stuff.
> 	Have you made any progress in Bit<>Bit comms?
> 
> Best
> David
> 
> On Apr 25, 2013, at 11:06 PM, Jacques Nadeau <ja...@apache.org> wrote:
> 
>> [original post snipped]


Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by Jacques Nadeau <ja...@apache.org>.
I've done some more work on the BitComImpl so you should probably constrain
your work to the Rpc base classes, User client and server and BitCom client
and server.

J

On Thu, Apr 25, 2013 at 9:10 PM, David Alves <da...@gmail.com> wrote:

> Hi Jacques
>
>         I can take the RPC stuff.
>         Have you made any progress in Bit<>Bit comms?
>
> Best
> David
>
> On Apr 25, 2013, at 11:06 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> > [original post snipped]

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by David Alves <da...@gmail.com>.
Hi Jacques

	I can take the RPC stuff.
	Have you made any progress in Bit<>Bit comms?

Best
David

On Apr 25, 2013, at 11:06 PM, Jacques Nadeau <ja...@apache.org> wrote:

> [original post snipped]


Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by Jacques Nadeau <ja...@apache.org>.
Great news!  Thanks for running that down.

J

On Sat, Apr 27, 2013 at 8:54 AM, kishore g <g....@gmail.com> wrote:
> Good news, the author of larray got back and he will add the apache license
> to the source.
>
> On Apr 26, 2013 11:13 AM, "kishore g" <g....@gmail.com> wrote:
>
>> I have interacted with the Author, let me know if you want me to check.
>> Good thing was that he is responsive and even added a few things for me.
>>
>> On Fri, Apr 26, 2013 at 10:27 AM, Timothy Chen <tn...@gmail.com> wrote:
>>
>>> Ya, just bringing that up again. Doubt it will be a blocker.
>>>
>>> Tim
>>>
>>> On Fri, Apr 26, 2013 at 10:12 AM, David Alves <da...@gmail.com> wrote:
>>>
>>>> good point, i'll try and ask the author.
>>>> it's a pretty recent lib so that might be an oversight…
>>>>
>>>> -david
>>>>
>>>> On Apr 26, 2013, at 12:04 PM, Timothy Chen <tn...@gmail.com> wrote:
>>>>
>>>>> Jacques I think this is the one I emailed you before that has no
>>>>> licensing info.
>>>>>
>>>>> Tim
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Apr 26, 2013, at 9:30 AM, David Alves <da...@gmail.com> wrote:
>>>>>
>>>>>> i've looked through it and looks like it can leverage shared memory,
>>>>>> which I was looking for anyway.
>>>>>> I also like the way garbage collection works (gc in java also clears
>>>>>> off-heap).
>>>>>> I'll take a deeper look during the weekend.
>>>>>>
>>>>>> -david
>>>>>>
>>>>>> On Apr 26, 2013, at 11:25 AM, Jacques Nadeau <ja...@apache.org> wrote:
>>>>>>
>>>>>>> I've looked at that in the past and think the idea of using it here
>>>>>>> is very good.  It seems like ByteBuf is nice as it has things like
>>>>>>> endianness capabilities, reference counting and management and Netty
>>>>>>> direct support.  On the flipside, larray is nice for its large array
>>>>>>> capabilities and better input/output interfaces.  The best approach
>>>>>>> might be to define a new ByteBuf implementation that leverages
>>>>>>> LArray.  I'll take a look at this in a few days if someone else
>>>>>>> doesn't want to.
>>>>>>>
>>>>>>> j
>>>>>>>
>>>>>>> On Fri, Apr 26, 2013 at 8:39 AM, kishore g <g....@gmail.com> wrote:
>>>>>>>
>>>>>>>> For *ByteBuf Improvements*, have you looked at LArrayJ
>>>>>>>> https://github.com/xerial/larray.  It has those wrappers and I
>>>>>>>> found it quite useful.  The same person has also written a java
>>>>>>>> version of snappy compression.  Not sure if you guys have plans to
>>>>>>>> add compression, but one of the nice things I could do was use the
>>>>>>>> memory offsets for source (compressed data) and dest (uncompressed
>>>>>>>> array) and do the decompression off-heap.  It supports the need for
>>>>>>>> looking up by index and has wrappers for most of the primitive data
>>>>>>>> types.
>>>>>>>>
>>>>>>>> Are you looking at something like this?
>>>>>>>>
>>>>>>>> thanks,
>>>>>>>> Kishore G
>>>>>>>>
>>>>>>>> On Fri, Apr 26, 2013 at 7:53 AM, Jacques Nadeau <jacques@apache.org> wrote:
>>>>>>>>
>>>>>>>>> They are on the list but the list is long :)
>>>>>>>>>
>>>>>>>>> Have a good weekend.
>>>>>>>>>
>>>>>>>>> On Thu, Apr 25, 2013 at 9:51 PM, Timothy Chen <tn...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> So if no one picks anything up you will be done with all the work
>>>>>>>>>> in the next couple of days? :)
>>>>>>>>>>
>>>>>>>>>> Would like to help out but I'm traveling to LA over the weekend.
>>>>>>>>>>
>>>>>>>>>> I'll sync with you Monday to see how I can help then.
>>>>>>>>>>
>>>>>>>>>> Tim
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> On Apr 25, 2013, at 9:06 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> [original post snipped]

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by kishore g <g....@gmail.com>.
Good news, the author of larray got back and he will add the apache license
to the source.
 On Apr 26, 2013 11:13 AM, "kishore g" <g....@gmail.com> wrote:

> [quoted thread snipped; see the messages above]
>> > really
>> > >>>>>> need
>> > >>>>>>> to be able to modify/refer to PooledUnsafeDirectByteBuf directly
>> > for
>> > >>>>> the
>> > >>>>>>> wrapping classes.  Of course, it is a final package private
>> class.
>> > >>>>> Short
>> > >>>>>>> term that means we really need to create a number of specific
>> > buffer
>> > >>>>>> types
>> > >>>>>>> that wrap it and just put them in the io.netty.buffer package
>> (or
>> > >>>>>>> alternatively create a Drill version or wrapper).
>> > >>
>> >
>> >
>>
>
>

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by kishore g <g....@gmail.com>.
I have interacted with the author; let me know if you want me to check.
The good thing is that he is responsive and even added a few things for me.


On Fri, Apr 26, 2013 at 10:27 AM, Timothy Chen <tn...@gmail.com> wrote:

> Ya, just bringing that up again. Doubt it will be a blocker.
>
> Tim

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by Timothy Chen <tn...@gmail.com>.
Ya, just bringing that up again. Doubt it will be a blocker.

Tim


On Fri, Apr 26, 2013 at 10:12 AM, David Alves <da...@gmail.com> wrote:

> Good point, I'll try and ask the author.
> It's a pretty recent lib, so that might be an oversight…
>
> -david

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by David Alves <da...@gmail.com>.
Good point, I'll try and ask the author.
It's a pretty recent lib, so that might be an oversight…

-david

On Apr 26, 2013, at 12:04 PM, Timothy Chen <tn...@gmail.com> wrote:

> Jacques, I think this is the one I emailed you about before that has no licensing info.
> 
> Tim


Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by Timothy Chen <tn...@gmail.com>.
Jacques, I think this is the one I emailed you about before that has no licensing info.

Tim

On Apr 26, 2013, at 9:30 AM, David Alves <da...@gmail.com> wrote:

> I've looked through it, and it looks like it can leverage shared memory, which I was looking for anyway.
> I also like the way garbage collection works (GC in Java also clears the off-heap memory).
> I'll take a deeper look during the weekend.
> 
> -david

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by David Alves <da...@gmail.com>.
I've looked through it, and it looks like it can leverage shared memory, which I was looking for anyway.
I also like the way garbage collection works (GC in Java also clears the off-heap memory).
I'll take a deeper look during the weekend.

-david
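
The shared-memory capability David mentions usually comes down to memory-mapped
files: processes that map the same file see the same bytes without copying.
A minimal Java sketch of that idea follows; the file path and size are made up,
and LArray's own API for this may differ.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SharedMemorySketch {
      public static void main(String[] args) throws Exception {
        // Map a 4 KB region of a (made-up) file; any process that maps the
        // same file sees writes to this region without an intermediate copy.
        try (RandomAccessFile file = new RandomAccessFile("/tmp/drill-shm", "rw");
             FileChannel channel = file.getChannel()) {
          MappedByteBuffer shared = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
          shared.putInt(0, 42);               // visible to other mappers of the file
          System.out.println(shared.getInt(0));
        }
      }
    }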

On Apr 26, 2013, at 11:25 AM, Jacques Nadeau <ja...@apache.org> wrote:

> I've looked at that in the past and think the idea of using it here is
> very good.  ByteBuf is nice as it has things like endianness support,
> reference counting and management, and direct Netty support.  On the
> flip side, LArray is nice for its large-array capabilities and better
> input/output interfaces.  The best approach might be to define a new
> ByteBuf implementation that leverages LArray.  I'll take a look at this
> in a few days if no one else wants to.
> 
> j
> 
> On Fri, Apr 26, 2013 at 8:39 AM, kishore g <g....@gmail.com> wrote:
> 
>> For *ByteBuf Improvements*, have you looked at LArrayJ
>> (https://github.com/xerial/larray)?  It has those wrappers and I found
>> it quite useful.  The same person has also written a Java version of
>> snappy compression.  Not sure if you guys have plans to add compression,
>> but one of the nice things I could do was use the memory offsets for the
>> source (compressed data) and the dest (uncompressed array) and do the
>> decompression off-heap.  It supports the need to look things up by index
>> and has wrappers for most of the primitive data types.
>> 
>> Are you looking at something like this?
>> 
>> thanks,
>> Kishore G
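
To make the off-heap decompression point concrete, here is a minimal sketch
of the same pattern using snappy-java's direct-ByteBuffer calls instead of
LArray's wrappers (an assumption for illustration; the LArray version would
pass its own offsets).  Both buffers must be direct, so the bytes never land
on the Java heap:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.xerial.snappy.Snappy;

    public class OffHeapDecompressSketch {
      // Decompress one off-heap (direct) buffer into another.  snappy-java
      // operates on the buffers' raw memory, so no on-heap copy is made.
      static ByteBuffer decompress(ByteBuffer compressed) throws IOException {
        int length = Snappy.uncompressedLength(compressed);
        ByteBuffer uncompressed = ByteBuffer.allocateDirect(length);
        Snappy.uncompress(compressed, uncompressed);
        return uncompressed;
      }
    }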
>>> My
>>>>> thinking for relational operators and code generation is that in most
>>>>> cases, we'll have an interface and a template class for a particular
>>>>> relational operator.  We will build a template class that has all the
>>>>> generic stuff implemented but will make calls to empty methods where
>> it
>>>>> expects lower level operations to occur.  This allows things like the
>>>>> looping and certain types of null management to be fully materialized
>>> in
>>>>> source code without having to deal with the complexities of ByteCode
>>>>> generation.  It also eases testing complexity.  When a particular
>>>>> implementation is required, the Drillbit will be responsible for
>>>> generating
>>>>> updated method bodies as required for the record-level expressions,
>>>> marking
>>>>> all the methods and class as final, then loading the implementation
>>> into
>>>>> the query-level classloader.  Note that the production Drillbit will
>>>> never
>>>>> load the template class into the JVM and will simply utilize it in
>>>> ByteCode
>>>>> form.  I was hoping someone can take a look at trying to pull
>> together
>>> a
>>>>> cohesive approach to doing this using ASM and Janino (likely
>> utilizing
>>>> the
>>>>> JDK commons-compiler mode).  The interface should be pretty simple:
>>> input
>>>>> is an interface, a template class name, a set of (method_signature,
>>>>> method_body_text) objects and a varargs of objects that are required
>>> for
>>>>> object instantiation.  The return should be an instance of the
>>> interface.
>>>>> The interface should check things like method_signature provided to
>>>>> available method blocks, the method blocks being replaced are empty,
>>> the
>>>>> object constructor matches the set of object argument provided by the
>>>>> object instantiation request, etc.
>>>>> 
>>>>> *ByteBuf Improvements*
>>>>> - Our BufferAllocator should support child allocators (getChild())
>> with
>>>>> their own memory maximums and accounting (so we can determine the
>>> memory
>>>>> overhead to particular queries).  We also need to be able to release
>>>> entire
>>>>> child allocations at once.
>>>>> - We need to create a number of primitive type specific wrapping
>>> classes
>>>>> for ByteBuf.  These additions include fixed offset indexing for
>>>> operations
>>>>> (e.g. index 1 of an int buffer should be at 4 bytes), adding support
>>> for
>>>>> unsigned values (my preference would be to leverage the work in Guava
>>> if
>>>>> that makes sense) and modifying the hard bounds checks to softer
>> assert
>>>>> checks to increase production performance.  While we could do this
>>>>> utilizing the ByteBuf interface, from everything I've experienced and
>>>> read,
>>>>> we need to minimize issues with inlining and performance so we really
>>>> need
>>>>> to be able to modify/refer to PooledUnsafeDirectByteBuf directly for
>>> the
>>>>> wrapping classes.  Of course, it is a final package private class.
>>> Short
>>>>> term that means we really need to create a number of specific buffer
>>>> types
>>>>> that wrap it and just put them in the io.netty.buffer package (or
>>>>> alternatively create a Drill version or wrapper).
>>>> 
>>> 
>> 


Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by Jacques Nadeau <ja...@apache.org>.
I've looked at LArray in the past and think the idea of using it here is
very good.  ByteBuf is nice as it has things like endianness support,
reference counting and management, and direct Netty support.  On the
flipside, LArray is nice for its large array capabilities and better
input/output interfaces.  The best approach might be to define a new
ByteBuf implementation that leverages LArray.  I'll take a look at this in
a few days if someone else doesn't want to.

j
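
[To make that direction concrete, a rough sketch follows; the OffHeapArray
handle below is a hypothetical stand-in rather than the real LArray API, and
a real implementation would have to override all of ByteBuf's abstract
methods, so the class is left abstract here.]

import io.netty.buffer.ByteBuf;

// Hypothetical minimal handle standing in for an LArray-style off-heap
// array; the real library exposes long-indexed get/put along these lines.
interface OffHeapArray {
  byte get(long index);
  void put(long index, byte value);
  long size();
}

// Sketch: expose an LArray-backed region through Netty's ByteBuf so the
// rest of the engine keeps endianness handling and reference counting.
public abstract class LArrayByteBuf extends ByteBuf {
  private final OffHeapArray array;

  protected LArrayByteBuf(OffHeapArray array) {
    this.array = array;
  }

  @Override
  public int capacity() {
    return (int) Math.min(array.size(), Integer.MAX_VALUE);
  }

  @Override
  public byte getByte(int index) {
    return array.get(index);
  }

  @Override
  public boolean isDirect() {
    return true;  // backed by off-heap memory
  }
}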

On Fri, Apr 26, 2013 at 8:39 AM, kishore g <g....@gmail.com> wrote:

> For *ByteBuf Improvements*, have you looked at LArrayJ
> (https://github.com/xerial/larray)? It has those wrappers and I found it
> quite useful. The same author has also written a Java version of Snappy
> compression. Not sure if you guys have plans to add compression, but one of
> the nice things I could do was use the memory offsets for source (compressed
> data) and dest (uncompressed array) and do the decompression off-heap. It
> supports the need for looking up by index and has wrappers for most of the
> primitive data types.
>
> Are you looking at something like this?
>
> thanks,
> Kishore G

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by kishore g <g....@gmail.com>.
For *ByteBuf Improvements*, have you looked at LArrayJ
(https://github.com/xerial/larray)? It has those wrappers and I found it
quite useful. The same author has also written a Java version of Snappy
compression. Not sure if you guys have plans to add compression, but one of
the nice things I could do was use the memory offsets for source (compressed
data) and dest (uncompressed array) and do the decompression off-heap. It
supports the need for looking up by index and has wrappers for most of the
primitive data types.

Are you looking at something like this?

thanks,
Kishore G
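
[In case a sketch helps, that pattern with snappy-java and direct buffers
looks roughly like this; the sizes and data are made-up example values.]

import java.nio.ByteBuffer;
import org.xerial.snappy.Snappy;

// Off-heap decompression: both source (compressed) and dest (uncompressed)
// are direct ByteBuffers, so nothing is copied onto the heap.
public class OffHeapSnappyExample {
  public static void main(String[] args) throws Exception {
    byte[] raw = "some example record data, repeated repeated repeated".getBytes();
    byte[] compressedBytes = Snappy.compress(raw);

    ByteBuffer compressed = ByteBuffer.allocateDirect(compressedBytes.length);
    compressed.put(compressedBytes);
    compressed.flip();  // ready the buffer for reading

    ByteBuffer uncompressed = ByteBuffer.allocateDirect(raw.length);

    int size = Snappy.uncompress(compressed, uncompressed);
    System.out.println("uncompressed " + size + " bytes off-heap");
  }
}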



On Fri, Apr 26, 2013 at 7:53 AM, Jacques Nadeau <ja...@apache.org> wrote:

> They are on the list but the list is long :)
>
> Have a good weekend.

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by Jacques Nadeau <ja...@apache.org>.
They are on the list but the list is long :)

Have a good weekend.

On Thu, Apr 25, 2013 at 9:51 PM, Timothy Chen <tn...@gmail.com> wrote:

> So if no one picks anything up, you will be done with all the work in the
> next couple of days? :)
>
> Would like to help out but I'm traveling to LA over the weekend.
>
> I'll sync with you Monday to see how I can help then.
>
> Tim
>
> Sent from my iPhone

Re: B[yi]teSize execwork tasks someone could potentially help out with...

Posted by Timothy Chen <tn...@gmail.com>.
So if no one picks anything up, you will be done with all the work in the next couple of days? :)

Would like to help out but I'm traveling to LA over the weekend.

I'll sync with you Monday to see how I can help then.

Tim

Sent from my iPhone

On Apr 25, 2013, at 9:06 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I'm working on the execwork stuff and if someone would like to help out,
> here are a couple of things that need doing.  I figured I'd drop them here
> and see if anyone wants to work on them in the next couple of days.  If so,
> let me know; otherwise I'll be picking them up soon.
> 
> *RPC*
> - RPC Layer Handshakes: Currently, I haven't implemented the handshake that
> should happen in either the User <> Bit or the Bit <> Bit layer.  The plan
> was to use an additional inserted event handler that removed itself from
> the event pipeline after a successful handshake or disconnected the channel
> on a failed handshake (with appropriate logging).  The main validation at
> this point will be simply confirming that both endpoints are running on the
> same protocol version.  The only other information that is currently
> needed is that in the Bit <> Bit communication, the client should
> inform the server of its DrillEndpoint so that the server can then map that
> for future communication in the other direction.
> 
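[A minimal sketch of the self-removing handler described above, assuming
Netty 4; HandshakeMessage and RPC_VERSION are placeholders for the real
protobuf message and version constant.]

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Placeholder for the real (protobuf) handshake message.
class HandshakeMessage {
  final int rpcVersion;
  HandshakeMessage(int rpcVersion) { this.rpcVersion = rpcVersion; }
}

// First handler in the pipeline: validates the handshake, then removes
// itself on success or closes the channel (with logging) on failure.
public class HandshakeHandler extends ChannelInboundHandlerAdapter {
  static final int RPC_VERSION = 1;  // placeholder protocol version

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    if (msg instanceof HandshakeMessage
        && ((HandshakeMessage) msg).rpcVersion == RPC_VERSION) {
      ctx.pipeline().remove(this);  // handshake done; get out of the way
    } else {
      // a real implementation would log the mismatch here
      ctx.close();
    }
  }
}
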
> *DataTypes*
> - General Expansion: Currently, we have a hodgepodge of datatypes within
> the org.apache.drill.common.expression.types.DataType.  We need to clean
> this up.  There should be types that map to standard SQL types.  My
> thinking is that we should actually have separate types for each of
> nullable, non-nullable and repeated (required, optional and repeated in
> protobuf vernacular) since we'll generally operate with those values
> completely differently (and that each type should reveal which it is).  We
> should also have a relationship mapping from each to the other (e.g. how to
> convert a signed 32 bit int into a nullable signed 32 bit int).
> 
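[For the required/optional/repeated split, the shape might be something like
the following; the names are illustrative, not an existing Drill API.]

// Cardinality as a first-class part of the type, plus a mapping from each
// mode to its nullable counterpart.
enum DataMode { REQUIRED, OPTIONAL, REPEATED }

final class MajorType {
  final String minorType;  // e.g. "INT", "VARCHAR" -- simplified to a name
  final DataMode mode;

  MajorType(String minorType, DataMode mode) {
    this.minorType = minorType;
    this.mode = mode;
  }

  // e.g. required INT -> nullable (optional) INT
  MajorType asNullable() {
    return new MajorType(minorType, DataMode.OPTIONAL);
  }
}
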
> - Map Types: We don't need nullable maps but we will need different map
> types: inline and fieldwise.  I think these will be useful for the
> execution engine and will be leveraged depending on particular needs--
> for example, fieldwise will be a natural fit where we're operating on
> columnar data and
> doing an explode or other fieldwise nested operation and inline will be
> useful when we're doing things like sorting a complex field.  Inline will
> also be appropriate where we have extremely sparse record sets.  We'll just
> need transformation methods between the two variations.  In the case of a
> fieldwise map type field, the field is virtual and only exists to contain
> its child fields.
> 
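[To make the inline/fieldwise distinction concrete, the contract could look
roughly like this; purely a sketch, no such interfaces exist yet.]

// Two physical layouts for the same logical map, each able to transform
// into the other depending on the operation at hand.
interface FieldwiseMap {
  // columnar: each child field stored as its own vector; a natural fit for
  // explode and other fieldwise nested operations
  InlineMap toInline();
}

interface InlineMap {
  // row-wise: the whole map encoded inline per record; useful for sorting
  // complex fields and for extremely sparse record sets
  FieldwiseMap toFieldwise();
}
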
> - Non-static DataTypes: We have a need for types that don't fit the static
> data type model above.  Examples include fixed width types (e.g. 10 byte
> string), polymorphic (inline encoded) types (number or string depending on
> record) and repeated nested versions of our other types.  These are a
> little more gnarly as we need to support canonicalization of them.  Optiq
> has some methods for how to handle this kind of type system so it probably
> makes sense to leverage that system.
> 
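[For the canonicalization requirement, Guava's Interner is one option; a
sketch, with FixedWidthStringType as a hypothetical non-static type.]

import com.google.common.collect.Interner;
import com.google.common.collect.Interners;

// Hypothetical non-static type: a fixed-width string parameterized by width.
final class FixedWidthStringType {
  final int width;
  FixedWidthStringType(int width) { this.width = width; }

  @Override public boolean equals(Object o) {
    return o instanceof FixedWidthStringType
        && ((FixedWidthStringType) o).width == width;
  }
  @Override public int hashCode() { return width; }
}

class TypeCanonicalizer {
  private static final Interner<FixedWidthStringType> INTERNER =
      Interners.newStrongInterner();

  // Equal widths always come back as the same canonical instance.
  static FixedWidthStringType canonicalize(FixedWidthStringType t) {
    return INTERNER.intern(t);
  }
}
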
> *Expression Type Materialization*
> - LogicalExpression type materialization: Right now, LogicalExpressions
> include support for late type binding.  As part of the record batch
> execution path, these need to get materialized with correct casting, etc
> based on the actual found schema.  As such, we need to have a function
> which takes a LogicalExpression tree, applies a materialized BatchSchema
> and returns a new LogicalExpression tree with full type settings.  As part
> of this process, all types need to be cast as necessary and full validation
> of the tree should be done.  Timothy has pending work on validation in a
> pull request that would be a good piece of code to leverage for that
> need.  We also have a visitor model for the expression tree that should
> be able to aid in the updated LogicalExpression construction.
> - LogicalExpression to Java expression conversion: We need to be able to
> convert our logical expressions into Java code expressions.  Initially,
> this should be done in a simplistic way, using something like implicit
> boxing and the like just to get something working.  This will likely be
> specialized per major type (nullable, non-nullable and repeated), and a
> framework that distinguishes LogicalExpressions by these types might
> actually make the most sense.
> 
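[The materialization entry point described above might look like this; the
expression, visitor and schema types are simplified stand-ins for the real
classes, with most node kinds elided.]

// Walk the tree with a visitor, look up each field's actual type in the
// materialized schema, and rebuild the tree with casts where needed.
interface LogicalExpression {
  <T> T accept(ExprVisitor<T> visitor);
}

interface ExprVisitor<T> {
  T visitFieldReference(String fieldName);
  // ... one visit method per expression node kind
}

interface BatchSchema {
  String typeOf(String fieldName);  // the actual type found at runtime
}

class Materializer {
  // Returns a new, fully typed tree; the input tree is left untouched.
  static LogicalExpression materialize(LogicalExpression expr,
                                       final BatchSchema schema) {
    return expr.accept(new ExprVisitor<LogicalExpression>() {
      @Override
      public LogicalExpression visitFieldReference(String fieldName) {
        String found = schema.typeOf(fieldName);
        // build a typed field reference, wrapping it in a cast node when
        // the declared and found types differ; elided in this sketch
        return null;
      }
    });
  }
}
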
> *JDBC*
> - The Drill JDBC driver layer needs to be updated to leverage our zookeeper
> coordination locations so that it can correctly find the cluster location.
> - The Drill JDBC driver should also manage reconnects so that if it loses
> connection with a particular Drillbit partner, that it will reconnect to
> another available node in the cluster.
> - Someone should point SQuirreL at Julian's latest work and see how things
> go...
> 
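[For the ZooKeeper lookup, a Curator-based sketch of what the driver could
do; the /drill/drillbits path is a made-up example, not the real layout.]

import java.util.List;
import java.util.Random;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Ask ZooKeeper which Drillbits are registered and pick one at random; on
// connection loss the driver would re-run this and reconnect elsewhere.
class DrillbitLocator {
  static String pickDrillbit(String zkConnect) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        zkConnect, new ExponentialBackoffRetry(1000, 3));
    client.start();
    try {
      List<String> bits = client.getChildren().forPath("/drill/drillbits");
      if (bits.isEmpty()) {
        throw new IllegalStateException("no drillbits registered");
      }
      return bits.get(new Random().nextInt(bits.size()));
    } finally {
      client.close();
    }
  }
}
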
> *ByteCode Engineering*
> - We need to put together a concrete class materialization strategy.  My
> thinking for relational operators and code generation is that in most
> cases, we'll have an interface and a template class for a particular
> relational operator.  We will build a template class that has all the
> generic stuff implemented but will make calls to empty methods where it
> expects lower level operations to occur.  This allows things like the
> looping and certain types of null management to be fully materialized in
> source code without having to deal with the complexities of ByteCode
> generation.  It also reduces testing complexity.  When a particular
> implementation is required, the Drillbit will be responsible for generating
> updated method bodies as required for the record-level expressions, marking
> all the methods and class as final, then loading the implementation into
> the query-level classloader.  Note that the production Drillbit will never
> load the template class into the JVM and will simply utilize it in ByteCode
> form.  I was hoping someone could take a look at trying to pull together a
> cohesive approach to doing this using ASM and Janino (likely utilizing the
> JDK commons-compiler mode).  The interface should be pretty simple: input
> is an interface, a template class name, a set of (method_signature,
> method_body_text) objects and a varargs of objects that are required for
> object instantiation.  The return should be an instance of the interface.
> The interface should check things like: each provided method_signature
> matches an available method block, the method blocks being replaced are
> empty, and the object constructor matches the set of object arguments
> provided by the
> object instantiation request, etc.
> 
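[For the compile-and-load half of that (leaving the ASM template merge
aside), Janino's SimpleCompiler is enough; a sketch, with Filterer as a
hypothetical operator interface and the generated source inlined. In a real
project the interface would live in its own file.]

import org.codehaus.janino.SimpleCompiler;

// Hypothetical operator interface the generated class must implement.
public interface Filterer {
  boolean include(int value);
}

class CodeLoader {
  // Compile generated source and hand back an instance typed as the
  // interface; the source would come from template plus method bodies.
  static Filterer compile(String source) throws Exception {
    SimpleCompiler compiler = new SimpleCompiler();
    compiler.setParentClassLoader(Filterer.class.getClassLoader());
    compiler.cook(source);
    Class<?> clazz = compiler.getClassLoader().loadClass("GeneratedFilterer");
    return (Filterer) clazz.newInstance();
  }

  public static void main(String[] args) throws Exception {
    Filterer f = compile(
        "public class GeneratedFilterer implements Filterer {"
      + "  public boolean include(int value) { return value > 10; }"
      + "}");
    System.out.println(f.include(42));  // true
  }
}
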
> *ByteBuf Improvements*
> - Our BufferAllocator should support child allocators (getChild()) with
> their own memory maximums and accounting (so we can determine the memory
> overhead to particular queries).  We also need to be able to release entire
> child allocations at once.
> - We need to create a number of primitive type specific wrapping classes
> for ByteBuf.  These additions include fixed offset indexing for operations
> (e.g. index 1 of an int buffer should be at 4 bytes), adding support for
> unsigned values (my preference would be to leverage the work in Guava if
> that makes sense) and modifying the hard bounds checks to softer assert
> checks to increase production performance.  While we could do this
> utilizing the ByteBuf interface, from everything I've experienced and read,
> we need to minimize issues with inlining and performance so we really need
> to be able to modify/refer to PooledUnsafeDirectByteBuf directly for the
> wrapping classes.  Of course, it is a final package-private class.  Short
> term that means we really need to create a number of specific buffer types
> that wrap it and just put them in the io.netty.buffer package (or
> alternatively create a Drill version or wrapper).
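
[A sketch of the kind of primitive wrapper meant here, with assert-based
bounds checks and Guava for the unsigned view; it is written against the
ByteBuf interface, while the PooledUnsafeDirectByteBuf-specialized versions
would look the same minus the interface indirection.]

import com.google.common.primitives.UnsignedInts;
import io.netty.buffer.ByteBuf;

// Fixed-offset int view over a ByteBuf: index 1 lives at byte offset 4.
// Bounds are asserted rather than hard-checked so production runs (with
// assertions disabled) skip the cost.
final class IntBufView {
  private final ByteBuf buf;
  private final int valueCount;

  IntBufView(ByteBuf buf) {
    this.buf = buf;
    this.valueCount = buf.capacity() / 4;
  }

  int get(int index) {
    assert index >= 0 && index < valueCount : "index out of bounds: " + index;
    return buf.getInt(index * 4);
  }

  void set(int index, int value) {
    assert index >= 0 && index < valueCount : "index out of bounds: " + index;
    buf.setInt(index * 4, value);
  }

  // Unsigned read, leaning on Guava rather than hand-rolled masking.
  long getUnsigned(int index) {
    return UnsignedInts.toLong(get(index));
  }
}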