Posted to issues@spark.apache.org by "Jan-Willem van der Sijp (JIRA)" <ji...@apache.org> on 2019/04/29 13:46:00 UTC

[jira] [Created] (SPARK-27594) spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly

Jan-Willem van der Sijp created SPARK-27594:
-----------------------------------------------

             Summary: spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly
                 Key: SPARK-27594
                 URL: https://issues.apache.org/jira/browse/SPARK-27594
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Jan-Willem van der Sijp


Using {{spark.sql.orc.impl=native}} and {{spark.sql.orc.enableVectorizedReader=true}} causes TIMESTAMP columns of Hive tables stored as ORC to be read incorrectly. Specifically, the milliseconds of the timestamp are doubled (e.g. {{.123}} is read back as {{.246}}).

Input/output of a Zeppelin session demonstrating the issue:

{code:python}
%pyspark

from pprint import pprint

spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

pprint(spark.sparkContext.getConf().getAll())
--------------------
[('sql.stacktrace', 'false'),
 ('spark.eventLog.enabled', 'true'),
 ('spark.app.id', 'application_1556200632329_0005'),
 ('importImplicit', 'true'),
 ('printREPLOutput', 'true'),
 ('spark.history.ui.port', '18081'),
 ('spark.driver.extraLibraryPath',
  '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
 ('spark.driver.extraJavaOptions',
  ' -Dfile.encoding=UTF-8 '
  '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties '
  '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
 ('concurrentSQL', 'false'),
 ('spark.driver.port', '40195'),
 ('spark.executor.extraLibraryPath',
  '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
 ('useHiveContext', 'true'),
 ('spark.jars',
  'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
 ('spark.history.provider',
  'org.apache.spark.deploy.history.FsHistoryProvider'),
 ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
 ('spark.submit.deployMode', 'client'),
 ('spark.ui.filters',
  'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
 ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
  'sandbox-hdp.hortonworks.com'),
 ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
 ('spark.repl.class.uri', 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
 ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
 ('master', 'yarn'),
 ('spark.yarn.dist.archives',
  '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
 ('spark.scheduler.mode', 'FAIR'),
 ('spark.yarn.queue', 'default'),
 ('spark.history.kerberos.keytab',
  '/etc/security/keytabs/spark.headless.keytab'),
 ('spark.executor.id', 'driver'),
 ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
 ('spark.history.kerberos.enabled', 'false'),
 ('spark.master', 'yarn'),
 ('spark.sql.catalogImplementation', 'hive'),
 ('spark.history.kerberos.principal', 'none'),
 ('spark.driver.extraClassPath',
  ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
 ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
 ('spark.repl.class.outputDir',
  '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
 ('spark.yarn.isPython', 'true'),
 ('spark.app.name', 'Zeppelin'),
 ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
  'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
 ('maxResult', '1000'),
 ('spark.executorEnv.PYTHONPATH',
  '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.6-src.zip'),
 ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
{code}
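
Note that the ORC settings set via {{spark.conf.set}} do not appear in the SparkContext configuration dump above, since runtime SQL options are held in {{spark.conf}}. As a quick sanity check (not part of the original session), they can be read back directly:

{code:python}
%pyspark

# Runtime SQL options live in spark.conf, not in the SparkContext conf,
# so they are absent from the getAll() dump above.
print(spark.conf.get("spark.sql.orc.impl"))                    # 'native'
print(spark.conf.get("spark.sql.orc.enableVectorizedReader"))  # 'true'
{code}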

{code:python}
%pyspark

spark.sql("""
DROP TABLE IF EXISTS default.hivetest
""")

spark.sql("""
CREATE TABLE default.hivetest (
    day DATE,
    time TIMESTAMP,
    timestring STRING
)
USING ORC
""")
{code}
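
As an optional sanity check (not in the original session), the schema of the freshly created table can be confirmed:

{code:python}
%pyspark

# Confirm the column types of the table created above.
spark.read.table("default.hivetest").printSchema()
# root
#  |-- day: date (nullable = true)
#  |-- time: timestamp (nullable = true)
#  |-- timestring: string (nullable = true)
{code}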

{code:python}
%pyspark

df1 = spark.createDataFrame(
    [
        ("2019-01-01", "2019-01-01 12:15:31.123", "2019-01-01 12:15:31.123")
    ],
    schema=("date", "timestamp", "string")
)

df2 = spark.createDataFrame(
    [
        ("2019-01-02", "2019-01-02 13:15:32.234", "2019-01-02 13:15:32.234")
    ],
    schema=("date", "timestamp", "string")
)
{code}
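
Note that the {{schema}} tuples above only supply column names; all three columns are strings, and {{insertInto}} casts them by position into the table's DATE/TIMESTAMP/STRING columns. A sketch with explicit casts (equivalent in effect, assuming the table created above) would be:

{code:python}
%pyspark

from pyspark.sql.functions import col

# Hypothetical explicit-cast variant of df1; the original session relies
# on insertInto casting the string columns by position instead.
df1_typed = df1.select(
    col("date").cast("date").alias("day"),
    col("timestamp").cast("timestamp").alias("time"),
    col("string").alias("timestring"),
)
{code}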

{code:python}
%pyspark

# Insert df1 twice, toggling the vectorized reader setting between the
# two writes; df2 is created above but not used in this session.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
df1.write.insertInto("default.hivetest")

spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
df1.write.insertInto("default.hivetest")
{code}

{code:python}
%pyspark

spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.read.table("default.hivetest").show(2, False)

"""
+----------+-----------------------+-----------------------+
|day       |time                   |timestring             |
+----------+-----------------------+-----------------------+
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
+----------+-----------------------+-----------------------+
"""
{code}

{code:python}
%pyspark

spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
spark.read.table("default.hivetest").show(2, False)

"""
+----------+-----------------------+-----------------------+
|day       |time                   |timestring             |
+----------+-----------------------+-----------------------+
|2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
|2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
+----------+-----------------------+-----------------------+
"""
{code}
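
The mismatch can also be detected programmatically rather than by eye. A small check (not part of the original session, assuming the table built above):

{code:python}
%pyspark

from pyspark.sql.functions import col, date_format

# Count rows where the timestamp read back no longer matches the
# original string representation.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
mismatches = (
    spark.read.table("default.hivetest")
    .where(date_format(col("time"), "yyyy-MM-dd HH:mm:ss.SSS") != col("timestring"))
    .count()
)
print(mismatches)  # 2 with the vectorized reader enabled, 0 with it disabled
{code}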

{code:scala}
import spark.sql
import spark.implicits._

spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

sql("SELECT * FROM default.hivetest").show(2, false)

"""
import spark.sql
import spark.implicits._
+----------+-----------------------+-----------------------+
|day       |time                   |timestring             |
+----------+-----------------------+-----------------------+
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
|2019-01-01|2019-01-01 12:15:31.246|2019-01-01 12:15:31.123|
+----------+-----------------------+-----------------------+
"""
{code}

Querying the table with Hive also returns the correct data:
{code:sql}
select * from default.hivetest;

day       |time                   |timestring             |
----------|-----------------------|-----------------------|
2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
2019-01-01|2019-01-01 12:15:31.123|2019-01-01 12:15:31.123|
{code}
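
Until this is fixed, the session above suggests a workaround: disable the vectorized reader (shown to return correct data above), or presumably fall back to the {{hive}} ORC implementation:

{code:python}
%pyspark

# Workaround sketch based on the observations above. Disabling the
# vectorized reader is demonstrated to read correct milliseconds;
# switching spark.sql.orc.impl back to 'hive' is untested here.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "false")
# spark.conf.set("spark.sql.orc.impl", "hive")

spark.read.table("default.hivetest").show(2, False)
{code}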


