Posted to issues@drill.apache.org by "Kunal Khatua (JIRA)" <ji...@apache.org> on 2019/01/23 05:38:00 UTC

[jira] [Commented] (DRILL-6994) TIMESTAMP type DOB column in Spark parquet is treated as VARBINARY in Drill

    [ https://issues.apache.org/jira/browse/DRILL-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749539#comment-16749539 ] 

Kunal Khatua commented on DRILL-6994:
-------------------------------------

[~khfaraaz] what does the schema look like according to the {{parquet-tools}} utility?

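For reference, one way to check would be to point the parquet-tools CLI at the file Spark wrote. This is only a sketch: the part-file name is illustrative, and the expected INT96 line assumes Spark used its default timestamp encoding.

{noformat}
# Print the Parquet schema of the file written by Spark (part-file name is an example).
parquet-tools schema /apps/infer_schema_example.parquet/part-00000-xxxx.snappy.parquet

# If Spark wrote its default INT96 timestamps, DOB is expected to appear as
# something like:
#   optional int96 DOB;
# which Drill surfaces as VARBINARY unless told to interpret INT96 as TIMESTAMP.
{noformat}
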
> TIMESTAMP type DOB column in Spark parquet is treated as VARBINARY in Drill
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-6994
>                 URL: https://issues.apache.org/jira/browse/DRILL-6994
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 1.14.0
>            Reporter: Khurram Faraaz
>            Priority: Major
>
> A timestamp type column in a parquet file created from Spark is treated as VARBINARY by Drill 1.14.0. Trying to cast the DOB column to DATE results in an exception, although the monthOfYear field is in the allowed range.
> Data used in the test
> {noformat}
> [test@md123 spark_data]# cat inferSchema_example.csv
> Name,Department,years_of_experience,DOB
> Sam,Software,5,1990-10-10
> Alex,Data Analytics,3,1992-10-10
> {noformat}
> Create the parquet file using the above CSV file
> {noformat}
> [test@md123 bin]# ./spark-shell
> 19/01/22 21:21:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Spark context Web UI available at http://md123.qa.lab:4040
> Spark context available as 'sc' (master = local[*], app id = local-1548192099796).
> Spark session available as 'spark'.
> Welcome to
>  ____ __
>  / __/__ ___ _____/ /__
>  _\ \/ _ \/ _ `/ __/ '_/
>  /___/ .__/\_,_/_/ /_/\_\ version 2.3.1-mapr-SNAPSHOT
>  /_/
> Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.apache.spark.sql.{DataFrame, SQLContext}
> import org.apache.spark.sql.{DataFrame, SQLContext}
> scala> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.{SparkConf, SparkContext}
> scala> val sqlContext: SQLContext = new SQLContext(sc)
> warning: there was one deprecation warning; re-run with -deprecation for details
> sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@2e0163cb
> scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/apps/inferSchema_example.csv")
> df: org.apache.spark.sql.DataFrame = [Name: string, Department: string ... 2 more fields]
> scala> df.printSchema
> root
>  |-- Name: string (nullable = true)
>  |-- Department: string (nullable = true)
>  |-- years_of_experience: integer (nullable = true)
>  |-- DOB: timestamp (nullable = true)
> scala> df.write.parquet("/apps/infer_schema_example.parquet")
> // Read the parquet file
> scala> val data = sqlContext.read.parquet("/apps/infer_schema_example.parquet")
> data: org.apache.spark.sql.DataFrame = [Name: string, Department: string ... 2 more fields]
> // Print the schema of the parquet file from Spark
> scala> data.printSchema
> root
>  |-- Name: string (nullable = true)
>  |-- Department: string (nullable = true)
>  |-- years_of_experience: integer (nullable = true)
>  |-- DOB: timestamp (nullable = true)
> // Display the contents of parquet file on spark-shell
> // register temp table and do a show on all records,to display.
> scala> data.registerTempTable("employee")
> warning: there was one deprecation warning; re-run with -deprecation for details
> scala> val allrecords = sqlContext.sql("SELeCT * FROM employee")
> allrecords: org.apache.spark.sql.DataFrame = [Name: string, Department: string ... 2 more fields]
> scala> allrecords.show()
> +----+--------------+-------------------+-------------------+
> |Name| Department|years_of_experience| DOB|
> +----+--------------+-------------------+-------------------+
> | Sam| Software| 5|1990-10-10 00:00:00|
> |Alex|Data Analytics| 3|1992-10-10 00:00:00|
> +----+--------------+-------------------+-------------------+
> {noformat}
> Querying the parquet file from Drill 1.14.0-mapr results in the DOB column (timestamp type in Spark) being treated as VARBINARY.
> {noformat}
> apache drill 1.14.0-mapr
> "a little sql for your nosql"
> 0: jdbc:drill:schema=dfs.tmp> select * from dfs.`/apps/infer_schema_example.parquet`;
> +-------+-----------------+----------------------+--------------+
> | Name | Department | years_of_experience | DOB |
> +-------+-----------------+----------------------+--------------+
> | Sam | Software | 5 | [B@2bef51f2 |
> | Alex | Data Analytics | 3 | [B@650eab8 |
> +-------+-----------------+----------------------+--------------+
> 2 rows selected (0.229 seconds)
> // typeof(DOB) returns VARBINARY, whereas the parquet schema in Spark shows DOB: timestamp (nullable = true)
> 0: jdbc:drill:schema=dfs.tmp> select typeof(DOB) from dfs.`/apps/infer_schema_example.parquet`;
> +------------+
> | EXPR$0 |
> +------------+
> | VARBINARY |
> | VARBINARY |
> +------------+
> 2 rows selected (0.199 seconds)
> {noformat}
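> If the physical type is indeed INT96 (Spark's long-standing default encoding for Parquet timestamps), Drill has a session option to interpret INT96 as TIMESTAMP. A minimal sketch, not verified against this environment:
> {noformat}
> -- Interpret Parquet INT96 values as TIMESTAMP (disabled by default).
> ALTER SESSION SET `store.parquet.reader.int96_as_timestamp` = true;
> -- With the option enabled, DOB should be reported as TIMESTAMP rather than VARBINARY.
> SELECT typeof(DOB) FROM dfs.`/apps/infer_schema_example.parquet` LIMIT 1;
> {noformat}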
> // CAST to DATE results in an exception, even though the monthOfYear values in the data are in the range [1,12]
> {noformat}
> 0: jdbc:drill:schema=dfs.tmp> select cast(DOB as DATE) from dfs.`/apps/infer_schema_example.parquet`;
> Error: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear must be in the range [1,12]
> Fragment 0:0
> [Error Id: 536c67d8-77c4-4b36-8aec-743344141d31 on md123.qa.lab:31010] (state=,code=0)
> {noformat}
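> An explicit conversion is another possibility, again assuming DOB holds INT96/Impala-style timestamp bytes; a sketch only:
> {noformat}
> -- Decode the 12-byte INT96 value as an Impala/Hive-style timestamp, then cast to DATE.
> SELECT CAST(CONVERT_FROM(DOB, 'TIMESTAMP_IMPALA') AS DATE) AS DOB
> FROM dfs.`/apps/infer_schema_example.parquet`;
> {noformat}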
> Stack trace from drillbit.log
> {noformat}
> 2019-01-22 22:13:27,334 [23b86a78-64fc-5873-87b5-7e95d9740e51:frag:0:0] ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear must be in the range [1,12]
> Fragment 0:0
> [Error Id: 536c67d8-77c4-4b36-8aec-743344141d31 on md123.qa.lab:31010]
> org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: IllegalFieldValueException: Value 0 for monthOfYear must be in the range [1,12]
> Fragment 0:0
> [Error Id: 536c67d8-77c4-4b36-8aec-743344141d31 on md123.qa.lab:31010]
>  at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:633) ~[drill-common-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:361) [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:216) [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:327) [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.14.0-mapr.jar:1.14.0-mapr]
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_181]
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_181]
>  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
> Caused by: org.joda.time.IllegalFieldValueException: Value 0 for monthOfYear must be in the range [1,12]
>  at org.joda.time.field.FieldUtils.verifyValueBounds(FieldUtils.java:252) ~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.joda.time.chrono.BasicChronology.getDateMidnightMillis(BasicChronology.java:612) ~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.joda.time.chrono.BasicChronology.getDateTimeMillis(BasicChronology.java:159) ~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.joda.time.chrono.AssembledChronology.getDateTimeMillis(AssembledChronology.java:120) ~[drill-hive-exec-shaded-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getDate(StringFunctionHelpers.java:210) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.test.generated.ProjectorGen977.doEval(ProjectorTemplate.java:41) ~[na:na]
>  at org.apache.drill.exec.test.generated.ProjectorGen977.projectRecords(ProjectorTemplate.java:67) ~[na:na]
>  at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork(ProjectRecordBatch.java:231) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext(AbstractUnaryRecordBatch.java:117) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:142) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:172) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:103) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:83) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:93) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:294) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:281) ~[drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_181]
>  at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_181]
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669) ~[hadoop-common-2.7.0-mapr-1808.jar:na]
>  at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:281) [drill-java-exec-1.14.0-mapr.jar:1.14.0-mapr]
>  ... 4 common frames omitted
> {noformat}
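> For completeness, Spark 2.3 and later can also be asked to write standard annotated int64 timestamps instead of INT96 at write time. Whether that sidesteps the problem for this file has not been tested here; the output path below is made up for illustration:
> {noformat}
> // Write TIMESTAMP_MICROS (annotated int64) instead of Spark's default INT96.
> scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
> scala> df.write.parquet("/apps/infer_schema_example_ts_micros.parquet")
> {noformat}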



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)