Posted to issues@spark.apache.org by "Liang-Chi Hsieh (JIRA)" <ji...@apache.org> on 2019/04/30 10:24:00 UTC
[jira] [Commented] (SPARK-27594) spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
[ https://issues.apache.org/jira/browse/SPARK-27594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830160#comment-16830160 ]
Liang-Chi Hsieh commented on SPARK-27594:
-----------------------------------------
I can't reproduce it. Is it possibly specific to your environment?
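For what it's worth, the shift in the output you posted (.123 becoming .246) is consistent with the sub-second fraction being added twice when a millisecond field and a nanosecond field are combined into microseconds, e.g. in a column vector that carries both. That is only a guess at the mechanism, not a confirmed diagnosis; a quick pure-Python sketch of the arithmetic (the millis/nanos layout is an assumption for illustration):

```python
from datetime import datetime, timedelta, timezone

# 2019-01-01 12:15:31.123 UTC expressed the way an ORC-style column vector
# might carry it (assumed layout, for illustration only):
millis = 1_546_344_931_123   # epoch milliseconds, already including the .123
nanos = 123_000_000          # the full sub-second fraction, in nanoseconds

# Correct combination into epoch microseconds: only add the part of `nanos`
# below one millisecond, since `millis` already contains the rest.
micros_ok = millis * 1000 + (nanos % 1_000_000) // 1000

# Buggy combination: adds the whole sub-second fraction again on top of
# `millis`, doubling the milliseconds.
micros_bad = millis * 1000 + nanos // 1000

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def fmt(us):
    # Render epoch microseconds as HH:MM:SS.mmm without float round-off.
    return (EPOCH + timedelta(microseconds=us)).strftime("%H:%M:%S.%f")[:-3]

print(fmt(micros_ok))   # 12:15:31.123
print(fmt(micros_bad))  # 12:15:31.246
```

The buggy path lands exactly on the 12:15:31.246 value from the report, which is why this looks like double-counting rather than, say, a timezone offset.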
> spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
> --------------------------------------------------------------------------------------------
>
> Key: SPARK-27594
> URL: https://issues.apache.org/jira/browse/SPARK-27594
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Jan-Willem van der Sijp
> Priority: Major
>
> Using {{spark.sql.orc.impl=native}} and {{spark.sql.orc.enableVectorizedReader=true}} causes TIMESTAMP columns of Hive tables stored as ORC to be read incorrectly. Specifically, the milliseconds of the timestamp are doubled.
> Input/output of a Zeppelin session to demonstrate:
> {code:python}
> %pyspark
> from pprint import pprint
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> pprint(spark.sparkContext.getConf().getAll())
> --------------------
> [('sql.stacktrace', 'false'),
> ('spark.eventLog.enabled', 'true'),
> ('spark.app.id', 'application_1556200632329_0005'),
> ('importImplicit', 'true'),
> ('printREPLOutput', 'true'),
> ('spark.history.ui.port', '18081'),
> ('spark.driver.extraLibraryPath',
> '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
> ('spark.driver.extraJavaOptions',
> ' -Dfile.encoding=UTF-8 '
> '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties '
> '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
> ('concurrentSQL', 'false'),
> ('spark.driver.port', '40195'),
> ('spark.executor.extraLibraryPath',
> '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
> ('useHiveContext', 'true'),
> ('spark.jars',
> 'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
> ('spark.history.provider',
> 'org.apache.spark.deploy.history.FsHistoryProvider'),
> ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
> ('spark.submit.deployMode', 'client'),
> ('spark.ui.filters',
> 'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
> ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
> 'sandbox-hdp.hortonworks.com'),
> ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
> ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
> ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
> ('master', 'yarn'),
> ('spark.yarn.dist.archives',
> '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
> ('spark.scheduler.mode', 'FAIR'),
> ('spark.yarn.queue', 'default'),
> ('spark.history.kerberos.keytab',
> '/etc/security/keytabs/spark.headless.keytab'),
> ('spark.executor.id', 'driver'),
> ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
> ('spark.history.kerberos.enabled', 'false'),
> ('spark.master', 'yarn'),
> ('spark.sql.catalogImplementation', 'hive'),
> ('spark.history.kerberos.principal', 'none'),
> ('spark.driver.extraClassPath',
> ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
> ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
> ('spark.repl.class.outputDir',
> '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
> ('spark.yarn.isPython', 'true'),
> ('spark.app.name', 'Zeppelin'),
> ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
> 'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
> ('maxResult', '1000'),
> ('spark.executorEnv.PYTHONPATH',
> '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.6-src.zip'),
> ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
> {code}
> {code:python}
> %pyspark
> spark.sql("""
> DROP TABLE IF EXISTS default.hivetest
> """)
> spark.sql("""
> CREATE TABLE default.hivetest (
> day DATE,
> time TIMESTAMP,
> timestring STRING
> )
> USING ORC
> """)
> {code}
> {code:python}
> %pyspark
> df1 = spark.createDataFrame(
> [
> ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
> ],
> schema=("date", "timestamp", "string")
> )
> df2 = spark.createDataFrame(
> [
> ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
> ],
> schema=("date", "timestamp", "string")
> )
> {code}
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> df1.write.insertInto("default.hivetest")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
> df1.write.insertInto("default.hivetest")
> {code}
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> spark.read.table("default.hivetest").show(2, False)
> """
> +----------+-----------------------+-----------------------+
> |day |time |timestring |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
> {code:python}
> %pyspark
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
> spark.read.table("default.hivetest").show(2, False)
> """
> +----------+-----------------------+-----------------------+
> |day |time |timestring |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
> {code:scala}
> import spark.sql
> import spark.implicits._
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> sql("SELECT * FROM default.hivetest").show(2, false)
> """
> import spark.sql
> import spark.implicits._
> +----------+-----------------------+-----------------------+
> |day |time |timestring |
> +----------+-----------------------+-----------------------+
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> |2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
> +----------+-----------------------+-----------------------+
> """
> {code}
> Querying the table with Hive also returns the correct data:
> {code:sql}
> select * from default.hivetest;
> day |time |timestring |
> ----------|-----------------------|-----------------------|
> 2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> 2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)