You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "James Turton (Jira)" <ji...@apache.org> on 2023/04/08 13:50:00 UTC

[jira] [Commented] (DRILL-8421) Parquet TIMESTAMP_MICROS columns in WHERE clauses are not converted to milliseconds before filtering

    [ https://issues.apache.org/jira/browse/DRILL-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17709893#comment-17709893 ] 

James Turton commented on DRILL-8421:
-------------------------------------

[~handmadecode] thanks for creating this. You'll see when you open a PR to the Apache Drill repo that the template will ask you for a link back to here.

> Parquet TIMESTAMP_MICROS columns in WHERE clauses are not converted to milliseconds before filtering
> ----------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-8421
>                 URL: https://issues.apache.org/jira/browse/DRILL-8421
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.21.0
>            Reporter: Peter Franzen
>            Priority: Major
>
> When using Drill with parquet files where the timestamp columns are in microseconds, Drill converts the microsecond values to milliseconds when displayed. However, when using a timestamp column in WHERE clauses it looks like the original microsecond value is used instead of the adjusted millisecond value when filtering records.
> *To Reproduce*
> Assume a parquet file in a directory "Test" with a column _timestampCol_ having the type {{{}org.apache.parquet.schema.OriginalType.TIMESTAMP_MICROS{}}}.
> Assume there are two records with the values 1673981999806149 and 1674759597743552, respectively, in that column (i.e. the UTC dates 2023-01-17T18:59:59.806149 and 2023-01-26T18:59:57.743552)
>  # Execute the query
> {{SELECT timestampCol FROM dfs.Test;}}
> The result includes both records, as expected.
>  # Execute the query
> {{SELECT timestampCol FROM dfs.Test WHERE timestampCol < TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> This produces an empty result although both records have a value less than the argument.
>  # Execute
> {{SELECT timestampCol FROM dfs.Test WHERE timestampCol > TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> The result includes both records although neither have a value greater than the argument.
> *Expected behavior*
> The query in 2) above should produce a result with both records, and the query in 3) should produce an empty result.
> *Additional context*
> Even timestamps long into the future produce results with both records, e.g.:
> {{SELECT timestampCol FROM dfs.Test WHERE timestampCol > TO_TIMESTAMP('2502-04-04 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> Manually converting the timestamp column to milliseconds produces the expected result:
> {{SELECT timestampCol FROM dfs.Test WHERE TO_TIMESTAMP(CONVERT_FROM(CONVERT_TO(timestampCol, 'TIMESTAMP_EPOCH'), 'BIGINT')/1000) < TO_TIMESTAMP('2023-02-01 00:00:00', 'yyyy-MM-dd HH:mm:ss')}}
> produces a result with both records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)