You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by Khurram Faraaz <kf...@maprtech.com> on 2015/06/08 23:47:21 UTC

Behavior of TIME type column in Drill

Hi All,

Can someone please review this ?

SQL standard (ISO/IEC 9075-1:2011(E)) specifies that a value of datatype
TIME comprises values of DATETIME fields HOUR, MINUTE and SECONDS.
(HH:MM:SS)

The question here is for the same input data Drill and Postgres return
different output/results, when we cast the column to TIME in Drill.

Drill 1.0 results below are with cast to TIME. Note that there are three
digits after the seconds. (HH:MM:SS.sss)

{code}

0: jdbc:drill:schema=dfs.tmp> SELECT cast( columns[8] as TIME ) FROM
`allData.csv` limit 10;

*+---------------+*

*| **   EXPR$0    ** |*

*+---------------+*

*| *08:16:08.580 * |*

*| *09:11:49.170 * |*

*| *18:44:25.430 * |*

*| *20:52:08.560 * |*

*| *19:46:10.420 * |*

*| *15:21:34.390 * |*

*| *10:31:55.240 * |*

*| *01:37:47.310 * |*

*| *01:50:05.110 * |*

*| *11:28:25.100 * |*

*+---------------+*

10 rows selected (0.173 seconds)


explain plan for above query


*| *00-00    Screen

00-01      SelectionVectorRemover

00-02        Limit(fetch=[10])

00-03          Project(EXPR$0=[CAST(ITEM($0, 8)):TIME(0)])

00-04            Scan(groupscan=[EasyGroupScan
[selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
files=[maprfs:///tmp/allData.csv]]])


{code}


Drill results below are without cast to TIME, these are same as Postgres
results. However, we should note that there are only two digits after SS in
the results, and this is not as per the SQL specification for TIME
datatype.


{code}

0: jdbc:drill:schema=dfs.tmp> SELECT columns[8] FROM `allData.csv` limit 10;

*+--------------+*

*| **   EXPR$0   ** |*

*+--------------+*

*| *8:16:8.58   * |*

*| *9:11:49.17  * |*

*| *18:44:25.43 * |*

*| *20:52:8.56  * |*

*| *19:46:10.42 * |*

*| *15:21:34.39 * |*

*| *10:31:55.24 * |*

*| *1:37:47.31  * |*

*| *1:50:5.11   * |*

*| *11:28:25.1  * |*

*+--------------+*

10 rows selected (0.264 seconds)


explain plan for query that does not cast to TIME


*| *00-00    Screen

00-01      SelectionVectorRemover

00-02        Limit(fetch=[10])

00-03          Project(EXPR$0=[ITEM($0, 8)])

00-04            Scan(groupscan=[EasyGroupScan
[selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
files=[maprfs:///tmp/allData.csv]]])

{code}


Postgres 9.3 results are


{code}

postgres=# SELECT col_tm FROM all_typs_tbl limit 10;

   col_tm

-------------

 08:16:08.58

 09:11:49.17

 18:44:25.43

 20:52:08.56

 19:46:10.42

 15:21:34.39

 10:31:55.24

 01:37:47.31

 01:50:05.11

 11:28:25.1

(10 rows)
{code}

Thanks,
Khurram

Re: Behavior of TIME type column in Drill

Posted by Khurram Faraaz <kf...@maprtech.com>.

Yes in the second query, SELECT columns[8] FROM `allData.csv` limit 10;
Drill reads and returns results as varchar (in the format that it was in
the input csv file)

and in the case where we cast the column to TIME we see three digits after
SS field.

Should we return HH:MM:SS.sss
or
Should we return HH:MM:SS (which is the SQL standard format for TIME), when
we cast columns to TIME datatype ?

Thanks,
Khurram

On Mon, Jun 8, 2015 at 3:00 PM, Abdel Hakim Deneche <ad...@maprtech.com>
wrote:

> for the 2nd query (without cast to time) I think Drill just displays the
> column as VARCHAR. You are only seeing 2 digits after SS. because that's
> how the data is stored in the file (I suppose).
>
> I don't know what's the standard is, but looking at SQL Server
> documentation, you can have up to 3 digits after the period:
>
> Milliseconds can be preceded by either a colon (:) or a period (.). If a
> > colon is used, the number means thousandths-of-a-second. If a period is
> > used, a single digit means tenths-of-a-second, two digits mean
> > hundredths-of-a-second, and three digits mean thousandths-of-a-second.
>
>
>
> On Mon, Jun 8, 2015 at 2:47 PM, Khurram Faraaz <kf...@maprtech.com>
> wrote:
>
> > Hi All,
> >
> > Can someone please review this ?
> >
> > SQL standard (ISO/IEC 9075-1:2011(E)) specifies that a value of datatype
> > TIME comprises values of DATETIME fields HOUR, MINUTE and SECONDS.
> > (HH:MM:SS)
> >
> > The question here is for the same input data Drill and Postgres return
> > different output/results, when we cast the column to TIME in Drill.
> >
> > Drill 1.0 results below are with cast to TIME. Note that there are three
> > digits after the seconds. (HH:MM:SS.sss)
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> SELECT cast( columns[8] as TIME ) FROM
> > `allData.csv` limit 10;
> >
> > *+---------------+*
> >
> > *| **   EXPR$0    ** |*
> >
> > *+---------------+*
> >
> > *| *08:16:08.580 * |*
> >
> > *| *09:11:49.170 * |*
> >
> > *| *18:44:25.430 * |*
> >
> > *| *20:52:08.560 * |*
> >
> > *| *19:46:10.420 * |*
> >
> > *| *15:21:34.390 * |*
> >
> > *| *10:31:55.240 * |*
> >
> > *| *01:37:47.310 * |*
> >
> > *| *01:50:05.110 * |*
> >
> > *| *11:28:25.100 * |*
> >
> > *+---------------+*
> >
> > 10 rows selected (0.173 seconds)
> >
> >
> > explain plan for above query
> >
> >
> > *| *00-00    Screen
> >
> > 00-01      SelectionVectorRemover
> >
> > 00-02        Limit(fetch=[10])
> >
> > 00-03          Project(EXPR$0=[CAST(ITEM($0, 8)):TIME(0)])
> >
> > 00-04            Scan(groupscan=[EasyGroupScan
> > [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
> > files=[maprfs:///tmp/allData.csv]]])
> >
> >
> > {code}
> >
> >
> > Drill results below are without cast to TIME, these are same as Postgres
> > results. However, we should note that there are only two digits after SS
> in
> > the results, and this is not as per the SQL specification for TIME
> > datatype.
> >
> >
> > {code}
> >
> > 0: jdbc:drill:schema=dfs.tmp> SELECT columns[8] FROM `allData.csv` limit
> > 10;
> >
> > *+--------------+*
> >
> > *| **   EXPR$0   ** |*
> >
> > *+--------------+*
> >
> > *| *8:16:8.58   * |*
> >
> > *| *9:11:49.17  * |*
> >
> > *| *18:44:25.43 * |*
> >
> > *| *20:52:8.56  * |*
> >
> > *| *19:46:10.42 * |*
> >
> > *| *15:21:34.39 * |*
> >
> > *| *10:31:55.24 * |*
> >
> > *| *1:37:47.31  * |*
> >
> > *| *1:50:5.11   * |*
> >
> > *| *11:28:25.1  * |*
> >
> > *+--------------+*
> >
> > 10 rows selected (0.264 seconds)
> >
> >
> > explain plan for query that does not cast to TIME
> >
> >
> > *| *00-00    Screen
> >
> > 00-01      SelectionVectorRemover
> >
> > 00-02        Limit(fetch=[10])
> >
> > 00-03          Project(EXPR$0=[ITEM($0, 8)])
> >
> > 00-04            Scan(groupscan=[EasyGroupScan
> > [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
> > files=[maprfs:///tmp/allData.csv]]])
> >
> > {code}
> >
> >
> > Postgres 9.3 results are
> >
> >
> > {code}
> >
> > postgres=# SELECT col_tm FROM all_typs_tbl limit 10;
> >
> >    col_tm
> >
> > -------------
> >
> >  08:16:08.58
> >
> >  09:11:49.17
> >
> >  18:44:25.43
> >
> >  20:52:08.56
> >
> >  19:46:10.42
> >
> >  15:21:34.39
> >
> >  10:31:55.24
> >
> >  01:37:47.31
> >
> >  01:50:05.11
> >
> >  11:28:25.1
> >
> > (10 rows)
> > {code}
> >
> > Thanks,
> > Khurram
> >
>
>
>
> --
>
> Abdelhakim Deneche
>
> Software Engineer
>
>   <http://www.mapr.com/>
>
>
> Now Available - Free Hadoop On-Demand Training
> <
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> >
>

Re: Behavior of TIME type column in Drill

Posted by Abdel Hakim Deneche <ad...@maprtech.com>.

for the 2nd query (without cast to time) I think Drill just displays the
column as VARCHAR. You are only seeing 2 digits after SS. because that's
how the data is stored in the file (I suppose).

I don't know what's the standard is, but looking at SQL Server
documentation, you can have up to 3 digits after the period:

Milliseconds can be preceded by either a colon (:) or a period (.). If a
> colon is used, the number means thousandths-of-a-second. If a period is
> used, a single digit means tenths-of-a-second, two digits mean
> hundredths-of-a-second, and three digits mean thousandths-of-a-second.



On Mon, Jun 8, 2015 at 2:47 PM, Khurram Faraaz <kf...@maprtech.com> wrote:

> Hi All,
>
> Can someone please review this ?
>
> SQL standard (ISO/IEC 9075-1:2011(E)) specifies that a value of datatype
> TIME comprises values of DATETIME fields HOUR, MINUTE and SECONDS.
> (HH:MM:SS)
>
> The question here is for the same input data Drill and Postgres return
> different output/results, when we cast the column to TIME in Drill.
>
> Drill 1.0 results below are with cast to TIME. Note that there are three
> digits after the seconds. (HH:MM:SS.sss)
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> SELECT cast( columns[8] as TIME ) FROM
> `allData.csv` limit 10;
>
> *+---------------+*
>
> *| **   EXPR$0    ** |*
>
> *+---------------+*
>
> *| *08:16:08.580 * |*
>
> *| *09:11:49.170 * |*
>
> *| *18:44:25.430 * |*
>
> *| *20:52:08.560 * |*
>
> *| *19:46:10.420 * |*
>
> *| *15:21:34.390 * |*
>
> *| *10:31:55.240 * |*
>
> *| *01:37:47.310 * |*
>
> *| *01:50:05.110 * |*
>
> *| *11:28:25.100 * |*
>
> *+---------------+*
>
> 10 rows selected (0.173 seconds)
>
>
> explain plan for above query
>
>
> *| *00-00    Screen
>
> 00-01      SelectionVectorRemover
>
> 00-02        Limit(fetch=[10])
>
> 00-03          Project(EXPR$0=[CAST(ITEM($0, 8)):TIME(0)])
>
> 00-04            Scan(groupscan=[EasyGroupScan
> [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
> files=[maprfs:///tmp/allData.csv]]])
>
>
> {code}
>
>
> Drill results below are without cast to TIME, these are same as Postgres
> results. However, we should note that there are only two digits after SS in
> the results, and this is not as per the SQL specification for TIME
> datatype.
>
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> SELECT columns[8] FROM `allData.csv` limit
> 10;
>
> *+--------------+*
>
> *| **   EXPR$0   ** |*
>
> *+--------------+*
>
> *| *8:16:8.58   * |*
>
> *| *9:11:49.17  * |*
>
> *| *18:44:25.43 * |*
>
> *| *20:52:8.56  * |*
>
> *| *19:46:10.42 * |*
>
> *| *15:21:34.39 * |*
>
> *| *10:31:55.24 * |*
>
> *| *1:37:47.31  * |*
>
> *| *1:50:5.11   * |*
>
> *| *11:28:25.1  * |*
>
> *+--------------+*
>
> 10 rows selected (0.264 seconds)
>
>
> explain plan for query that does not cast to TIME
>
>
> *| *00-00    Screen
>
> 00-01      SelectionVectorRemover
>
> 00-02        Limit(fetch=[10])
>
> 00-03          Project(EXPR$0=[ITEM($0, 8)])
>
> 00-04            Scan(groupscan=[EasyGroupScan
> [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
> files=[maprfs:///tmp/allData.csv]]])
>
> {code}
>
>
> Postgres 9.3 results are
>
>
> {code}
>
> postgres=# SELECT col_tm FROM all_typs_tbl limit 10;
>
>    col_tm
>
> -------------
>
>  08:16:08.58
>
>  09:11:49.17
>
>  18:44:25.43
>
>  20:52:08.56
>
>  19:46:10.42
>
>  15:21:34.39
>
>  10:31:55.24
>
>  01:37:47.31
>
>  01:50:05.11
>
>  11:28:25.1
>
> (10 rows)
> {code}
>
> Thanks,
> Khurram
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>

Re: Behavior of TIME type column in Drill

Posted by Khurram Faraaz <kf...@maprtech.com>.

Thanks Daniel.

On Wed, Jun 10, 2015 at 3:08 PM, Daniel Barclay <db...@maprtech.com>
wrote:

> Khurram Faraaz wrote:
>
>> Hi All,
>>
>> Can someone please review this ?
>>
>> SQL standard (ISO/IEC 9075-1:2011(E)) specifies that a value of datatype
>> TIME comprises values of DATETIME fields HOUR, MINUTE and SECONDS.
>> (HH:MM:SS)
>>
> Note that the SECOND field can have a fractional seconds precision.
>
> (Your mention of "HH:MM:SS" suggests that maybe you were thinking that
> seconds can't be fractional.)
>
> Daniel
>
>  The question here is for the same input data Drill and Postgres return
>> different output/results, when we cast the column to TIME in Drill.
>>
>> Drill 1.0 results below are with cast to TIME. Note that there are three
>> digits after the seconds. (HH:MM:SS.sss)
>>
>> {code}
>>
>> 0: jdbc:drill:schema=dfs.tmp> SELECT cast( columns[8] as TIME ) FROM
>> `allData.csv` limit 10;
>>
>> *+---------------+*
>>
>> *| **   EXPR$0    ** |*
>>
>> *+---------------+*
>>
>> *| *08:16:08.580 * |*
>>
>> *| *09:11:49.170 * |*
>>
>> *| *18:44:25.430 * |*
>>
>> *| *20:52:08.560 * |*
>>
>> *| *19:46:10.420 * |*
>>
>> *| *15:21:34.390 * |*
>>
>> *| *10:31:55.240 * |*
>>
>> *| *01:37:47.310 * |*
>>
>> *| *01:50:05.110 * |*
>>
>> *| *11:28:25.100 * |*
>>
>> *+---------------+*
>>
>> 10 rows selected (0.173 seconds)
>>
>>
>> explain plan for above query
>>
>>
>> *| *00-00    Screen
>>
>> 00-01      SelectionVectorRemover
>>
>> 00-02        Limit(fetch=[10])
>>
>> 00-03          Project(EXPR$0=[CAST(ITEM($0, 8)):TIME(0)])
>>
>> 00-04            Scan(groupscan=[EasyGroupScan
>> [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
>> files=[maprfs:///tmp/allData.csv]]])
>>
>>
>> {code}
>>
>>
>> Drill results below are without cast to TIME, these are same as Postgres
>> results. However, we should note that there are only two digits after SS
>> in
>> the results, and this is not as per the SQL specification for TIME
>> datatype.
>>
>>
>> {code}
>>
>> 0: jdbc:drill:schema=dfs.tmp> SELECT columns[8] FROM `allData.csv` limit
>> 10;
>>
>> *+--------------+*
>>
>> *| **   EXPR$0   ** |*
>>
>> *+--------------+*
>>
>> *| *8:16:8.58   * |*
>>
>> *| *9:11:49.17  * |*
>>
>> *| *18:44:25.43 * |*
>>
>> *| *20:52:8.56  * |*
>>
>> *| *19:46:10.42 * |*
>>
>> *| *15:21:34.39 * |*
>>
>> *| *10:31:55.24 * |*
>>
>> *| *1:37:47.31  * |*
>>
>> *| *1:50:5.11   * |*
>>
>> *| *11:28:25.1  * |*
>>
>> *+--------------+*
>>
>> 10 rows selected (0.264 seconds)
>>
>>
>> explain plan for query that does not cast to TIME
>>
>>
>> *| *00-00    Screen
>>
>>
>> 00-01      SelectionVectorRemover
>>
>> 00-02        Limit(fetch=[10])
>>
>> 00-03          Project(EXPR$0=[ITEM($0, 8)])
>>
>> 00-04            Scan(groupscan=[EasyGroupScan
>> [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
>> files=[maprfs:///tmp/allData.csv]]])
>>
>> {code}
>>
>>
>> Postgres 9.3 results are
>>
>>
>> {code}
>>
>> postgres=# SELECT col_tm FROM all_typs_tbl limit 10;
>>
>>     col_tm
>>
>> -------------
>>
>>   08:16:08.58
>>
>>   09:11:49.17
>>
>>   18:44:25.43
>>
>>   20:52:08.56
>>
>>   19:46:10.42
>>
>>   15:21:34.39
>>
>>   10:31:55.24
>>
>>   01:37:47.31
>>
>>   01:50:05.11
>>
>>   11:28:25.1
>>
>> (10 rows)
>> {code}
>>
>> Thanks,
>> Khurram
>>
>>
>
> --
> Daniel Barclay
> MapR Technologies
>
>

Re: Behavior of TIME type column in Drill

Posted by Daniel Barclay <db...@maprtech.com>.

Khurram Faraaz wrote:
> Hi All,
>
> Can someone please review this ?
>
> SQL standard (ISO/IEC 9075-1:2011(E)) specifies that a value of datatype
> TIME comprises values of DATETIME fields HOUR, MINUTE and SECONDS.
> (HH:MM:SS)
Note that the SECOND field can have a fractional seconds precision.

(Your mention of "HH:MM:SS" suggests that maybe you were thinking that seconds can't be fractional.)

Daniel

> The question here is for the same input data Drill and Postgres return
> different output/results, when we cast the column to TIME in Drill.
>
> Drill 1.0 results below are with cast to TIME. Note that there are three
> digits after the seconds. (HH:MM:SS.sss)
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> SELECT cast( columns[8] as TIME ) FROM
> `allData.csv` limit 10;
>
> *+---------------+*
>
> *| **   EXPR$0    ** |*
>
> *+---------------+*
>
> *| *08:16:08.580 * |*
>
> *| *09:11:49.170 * |*
>
> *| *18:44:25.430 * |*
>
> *| *20:52:08.560 * |*
>
> *| *19:46:10.420 * |*
>
> *| *15:21:34.390 * |*
>
> *| *10:31:55.240 * |*
>
> *| *01:37:47.310 * |*
>
> *| *01:50:05.110 * |*
>
> *| *11:28:25.100 * |*
>
> *+---------------+*
>
> 10 rows selected (0.173 seconds)
>
>
> explain plan for above query
>
>
> *| *00-00    Screen
>
> 00-01      SelectionVectorRemover
>
> 00-02        Limit(fetch=[10])
>
> 00-03          Project(EXPR$0=[CAST(ITEM($0, 8)):TIME(0)])
>
> 00-04            Scan(groupscan=[EasyGroupScan
> [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
> files=[maprfs:///tmp/allData.csv]]])
>
>
> {code}
>
>
> Drill results below are without cast to TIME, these are same as Postgres
> results. However, we should note that there are only two digits after SS in
> the results, and this is not as per the SQL specification for TIME
> datatype.
>
>
> {code}
>
> 0: jdbc:drill:schema=dfs.tmp> SELECT columns[8] FROM `allData.csv` limit 10;
>
> *+--------------+*
>
> *| **   EXPR$0   ** |*
>
> *+--------------+*
>
> *| *8:16:8.58   * |*
>
> *| *9:11:49.17  * |*
>
> *| *18:44:25.43 * |*
>
> *| *20:52:8.56  * |*
>
> *| *19:46:10.42 * |*
>
> *| *15:21:34.39 * |*
>
> *| *10:31:55.24 * |*
>
> *| *1:37:47.31  * |*
>
> *| *1:50:5.11   * |*
>
> *| *11:28:25.1  * |*
>
> *+--------------+*
>
> 10 rows selected (0.264 seconds)
>
>
> explain plan for query that does not cast to TIME
>
>
> *| *00-00    Screen
>
> 00-01      SelectionVectorRemover
>
> 00-02        Limit(fetch=[10])
>
> 00-03          Project(EXPR$0=[ITEM($0, 8)])
>
> 00-04            Scan(groupscan=[EasyGroupScan
> [selectionRoot=/tmp/allData.csv, numFiles=1, columns=[`columns`[8]],
> files=[maprfs:///tmp/allData.csv]]])
>
> {code}
>
>
> Postgres 9.3 results are
>
>
> {code}
>
> postgres=# SELECT col_tm FROM all_typs_tbl limit 10;
>
>     col_tm
>
> -------------
>
>   08:16:08.58
>
>   09:11:49.17
>
>   18:44:25.43
>
>   20:52:08.56
>
>   19:46:10.42
>
>   15:21:34.39
>
>   10:31:55.24
>
>   01:37:47.31
>
>   01:50:05.11
>
>   11:28:25.1
>
> (10 rows)
> {code}
>
> Thanks,
> Khurram
>


-- 
Daniel Barclay
MapR Technologies