Posted to dev@drill.apache.org by Jason Altekruse <al...@gmail.com> on 2015/06/08 20:12:45 UTC

[Discuss] Hive - Smallint and Tinyint

Hello Drillers,

I have been working on DRILL-3209, which aims to speed up reading from Hive
tables by re-planning them as native Drill reads in the case where the
tables are backed by files that have available native readers. This will
begin with Parquet and delimited text files.

To provide the same behavior as reading through the SerDe interface, I must
insert a cast above the read operation to provide the same types that the
Hive scan otherwise would.
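
As an illustration only (this is not the DRILL-3209 implementation), the cast
injection described above might be sketched as follows. The Scan and Project
classes, the string type names, and add_casts are hypothetical stand-ins for
plan nodes; they are not Drill or Calcite APIs.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Scan:
    # Hypothetical native scan: column name -> type the native reader produces.
    columns: Dict[str, str]


@dataclass
class Project:
    # Hypothetical project placed above a scan: column name -> cast target type.
    child: Scan
    casts: Dict[str, str]


def add_casts(native_scan: Scan, hive_schema: Dict[str, str]) -> Union[Scan, Project]:
    """Cast every column whose native type differs from the Hive scan's type."""
    casts = {
        name: hive_type
        for name, hive_type in hive_schema.items()
        if native_scan.columns.get(name) != hive_type
    }
    # No differences means the native scan can stand in for the Hive scan as-is.
    return Project(native_scan, casts) if casts else native_scan
```

In this sketch a column needs a cast only when the native reader's type
differs from the type the Hive scan would have produced.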

The issue I am seeing is that Hive appears to be reading into both the
tinyint and smallint types, which I believe are not fully supported
(currently my new injected project is failing to find a function to cast to
tinyint). See the unsupported note in the docs here [1] for smallint;
tinyint is not even listed.

I can simply add the function to provide the same type as we currently read
out of the scan, but I believe we will have other issues with trying to
support this right now as we have not thoroughly tested these other integer
types.

I would like to instead propose that we change the behavior of Hive to read
data of these types into regular integer columns for now and try to
remove any outstanding references to tinyint and smallint until we can
commit to fully supporting them.
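
A minimal sketch of the proposed widening, assuming types are represented as
lowercase strings. WIDENED_TYPES, RANGES, widen_schema, and fits are
hypothetical names used for illustration, not Drill code.

```python
# Hypothetical mapping: Hive integer types that would be read as a 32-bit int.
WIDENED_TYPES = {"tinyint": "int", "smallint": "int"}

# Signed ranges of 8-bit, 16-bit and 32-bit two's-complement integers.
RANGES = {
    "tinyint": (-2**7, 2**7 - 1),
    "smallint": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
}


def widen_schema(hive_schema):
    """Return a copy of the schema with tinyint/smallint replaced by int."""
    return {name: WIDENED_TYPES.get(t, t) for name, t in hive_schema.items()}


def fits(source_type, target_type):
    """True if every value of source_type is representable in target_type."""
    lo, hi = RANGES[source_type]
    target_lo, target_hi = RANGES[target_type]
    return target_lo <= lo and hi <= target_hi
```

Because both narrower ranges sit inside the 32-bit range, widening loses no
values; the cost is the extra space each value occupies.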

[1] http://drill.apache.org/docs/supported-data-types/

Re: [Discuss] Hive - Smallint and Tinyint

Posted by Jason Altekruse <al...@gmail.com>.
Thanks Daniel, I'll make some sub-JIRAs to try to fill out the task list.
This should be a good opportunity for a newbie contribution if someone
wants to get to know the Drill code.

On Mon, Jun 8, 2015 at 1:51 PM, Daniel Barclay <db...@maprtech.com>
wrote:

> Note DRILL-2470, "Implement SMALLINT and TINYINT [umbrella]".
>
> --
> Daniel Barclay
> MapR Technologies
>

Re: [Discuss] Hive - Smallint and Tinyint

Posted by Daniel Barclay <db...@maprtech.com>.
Note DRILL-2470, "Implement SMALLINT and TINYINT [umbrella]".

Jacques Nadeau wrote:
> I think it would be worthwhile to first open up a set of JIRAs associated
> with finishing support for these datatypes.  I'm guessing the scale of
> effort is less than one might initially guess.  Once those are opened, it
> would be easier to give feedback on the relative merit of that work versus
> the alternative solution you suggested.


-- 
Daniel Barclay
MapR Technologies

Re: [Discuss] Hive - Smallint and Tinyint

Posted by Jacques Nadeau <ja...@apache.org>.
Got it.  Should be fine, then.

On Mon, Jun 8, 2015 at 12:46 PM, Jason Altekruse <al...@gmail.com>
wrote:

> I was going to be changing them on the schema side as well. As I am
> currently implementing the feature as a rewrite rule, I have to match the
> schema of the relational tree I am replacing. To make it work in execution
> I have to cast to an integer (or add the tinyint cast). If I choose the
> former, planning will fail on mismatched types: the tinyint expected from
> the Hive scan differs from the integer coming out of the cast.

Re: [Discuss] Hive - Smallint and Tinyint

Posted by Jason Altekruse <al...@gmail.com>.
I was going to be changing them on the schema side as well. As I am
currently implementing the feature as a rewrite rule, I have to match the
schema of the relational tree I am replacing. To make it work in execution
I have to cast to an integer (or add the tinyint cast). If I choose the
former, planning will fail on mismatched types: the tinyint expected from
the Hive scan differs from the integer coming out of the cast.
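
To make the mismatch concrete, here is a rough sketch under the assumption
that a rewrite is accepted only when the replacement's row type equals the row
type of the tree it replaces. can_replace and the dictionaries are
hypothetical; the planner's actual check is not reproduced here.

```python
def can_replace(original_row_type, replacement_row_type):
    """Accept a rewrite only if both row types match column for column."""
    return original_row_type == replacement_row_type


# Row type the planner expects from the Hive scan today.
hive_scan_row_type = {"flag": "tinyint"}

# Row type coming out of the native read plus a cast to a regular integer.
native_plan_row_type = {"flag": "int"}

# Changing only the execution side leaves the two row types different.
execution_only = can_replace(hive_scan_row_type, native_plan_row_type)

# Changing the schema side as well makes the Hive scan report int too.
widened_hive_row_type = {"flag": "int"}
schema_and_execution = can_replace(widened_hive_row_type, native_plan_row_type)
```

Here execution_only is False and schema_and_execution is True, which is why
the types would be changed on the schema side as well.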

On Mon, Jun 8, 2015 at 12:43 PM, Jacques Nadeau <ja...@apache.org> wrote:

> The only concern I have around changing the types in execution is that it
> may cause strange behaviors.  Are you planning on changing them on the
> schema side as well?  That way Calcite wouldn't insert weird expression
> patterns that would cause other problems if you change the execution side.

Re: [Discuss] Hive - Smallint and Tinyint

Posted by Jacques Nadeau <ja...@apache.org>.
The only concern I have around changing the types in execution is that it
may cause strange behaviors.  Are you planning on changing them on the
schema side as well?  That way Calcite wouldn't insert weird expression
patterns that would cause other problems if you change the execution side.

On Mon, Jun 8, 2015 at 12:41 PM, Jason Altekruse <al...@gmail.com>
wrote:

> I am in support of opening JIRAs to enumerate the steps necessary to
> support these types. However, I think it would be
> good to get a fix into master for the functional bug that is in the code
> today. That fix is easy and the only overhead is taking a little more space
> for the data after it has been read into Drill.
>
> As we are looking to keep up with our near-monthly release schedule, I'm
> uncertain that we can have these types implemented and well tested by the
> next release, but I think we very realistically could start testing Hive
> more thoroughly after this small fix.

Re: [Discuss] Hive - Smallint and Tinyint

Posted by Jason Altekruse <al...@gmail.com>.
I am in support of opening JIRAs to enumerate the steps necessary to
support these types. However, I think it would be
good to get a fix into master for the functional bug that is in the code
today. That fix is easy and the only overhead is taking a little more space
for the data after it has been read into Drill.

As we are looking to keep up with our near-monthly release schedule, I'm
uncertain that we can have these types implemented and well tested by the
next release, but I think we very realistically could start testing Hive
more thoroughly after this small fix.

On Mon, Jun 8, 2015 at 12:29 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I think it would be worthwhile to first open up a set of JIRAs associated
> with finishing support for these datatypes.  I'm guessing the scale of
> effort is less than one might initially guess.  Once those are opened, it
> would be easier to give feedback on the relative merit of that work versus
> the alternative solution you suggested.

Re: [Discuss] Hive - Smallint and Tinyint

Posted by Jacques Nadeau <ja...@apache.org>.
I think it would be worthwhile to first open up a set of JIRAs associated
with finishing support for these datatypes.  I'm guessing the scale of
effort is less than one might initially guess.  Once those are opened, it
would be easier to give feedback on the relative merit of that work versus
the alternative solution you suggested.

On Mon, Jun 8, 2015 at 11:12 AM, Jason Altekruse <al...@gmail.com>
wrote:

> Hello Drillers,
>
> I have been working on DRILL-3209, which aims to speed up reading from Hive
> tables by re-planning them as native Drill reads in the case where the
> tables are backed by files that have available native readers. This will
> begin with Parquet and delimited text files.
>
> To provide the same behavior as reading through the Serde interface, I must
> insert a cast above the read operation to provide the same types that the
> Hive scan otherwise would.
>
> The issue I am seeing is that Hive appears to be reading into both the
> tinyint and smallint types, which I believe are not fully supported
> (currently my new injected project is failing to find a function to cast to
> tinyint). See the unsupported note in the docs here [1] for smallint;
> tinyint is not even listed.
>
> I can simply add the function to provide the same type as we currently read
> out of the scan, but I believe we will have other issues with trying to
> support this right now as we have not thoroughly tested these other integer
> types.
>
> I would like to instead propose that we change the behavior of Hive to read
> data of these types into regular integer columns for now and try to
> remove any outstanding references to tinyint and smallint until we can
> commit to fully supporting them.
>
> [1] http://drill.apache.org/docs/supported-data-types/
>