Posted to dev@arrow.apache.org by "Mark Litwintschik (Jira)" <ji...@apache.org> on 2019/10/08 11:04:00 UTC
[jira] [Created] (ARROW-6815) Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
Mark Litwintschik created ARROW-6815:
----------------------------------------
Summary: Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
Key: ARROW-6815
URL: https://issues.apache.org/jira/browse/ARROW-6815
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.13.0
Reporter: Mark Litwintschik
I'm unable to read timestamps saved as Parquet data via Pandas with Hive or Presto. These are the versions of the various pieces of software I'm using:
* Pandas 0.24.1
* PyArrow 0.13.0
* Hadoop 3.0.3
* Hive 2.3.3
* Presto 0.215
The Hadoop setup steps I took can be found [here|https://tech.marksblogg.com/hadoop-3-single-node-install-guide.html].
This is what I did to generate the Parquet file.
{code:python}
from datetime import datetime
from StringIO import StringIO
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
parquet_buffer = StringIO()
pa_table = pa.Table.from_pandas(df, preserve_index=False)
writer = pq.ParquetWriter(parquet_buffer, pa_table.schema)
writer.write_table(pa_table)
writer.close()
with open('example.pq', 'w+b') as f:
    f.write(parquet_buffer.getvalue())
{code}
These are the file's raw contents:
{code:bash}
$ hexdump -C example.pq
{code}
{code:none}
00000000 50 41 52 31 15 04 15 10 15 14 4c 15 02 15 04 12 |PAR1......L.....|
00000010 00 00 08 1c 04 00 00 00 54 65 73 74 15 00 15 12 |........Test....|
00000020 15 16 2c 15 02 15 04 15 06 15 06 1c 18 04 54 65 |..,...........Te|
00000030 73 74 18 04 54 65 73 74 16 00 00 00 00 09 20 02 |st..Test...... .|
00000040 00 00 00 02 01 01 02 00 26 90 01 1c 15 0c 19 35 |........&......5|
00000050 04 00 06 19 18 01 61 15 02 16 02 16 80 01 16 88 |......a.........|
00000060 01 26 38 26 08 1c 36 00 28 04 54 65 73 74 18 04 |.&8&..6.(.Test..|
00000070 54 65 73 74 00 00 00 15 04 15 10 15 14 4c 15 02 |Test.........L..|
00000080 15 04 12 00 00 08 1c 10 4f 48 96 63 94 05 00 15 |........OH.c....|
00000090 00 15 12 15 16 2c 15 02 15 04 15 06 15 06 1c 18 |.....,..........|
000000a0 08 10 4f 48 96 63 94 05 00 18 08 10 4f 48 96 63 |..OH.c......OH.c|
000000b0 94 05 00 16 00 00 00 00 09 20 02 00 00 00 02 01 |......... ......|
000000c0 01 02 00 26 86 03 1c 15 04 19 35 04 00 06 19 18 |...&......5.....|
000000d0 01 62 15 02 16 02 16 90 01 16 98 01 26 9e 02 26 |.b..........&..&|
000000e0 ee 01 1c 18 08 10 4f 48 96 63 94 05 00 18 08 10 |......OH.c......|
000000f0 4f 48 96 63 94 05 00 16 00 28 08 10 4f 48 96 63 |OH.c.....(..OH.c|
00000100 94 05 00 18 08 10 4f 48 96 63 94 05 00 00 00 00 |......OH.c......|
00000110 15 02 19 3c 35 00 18 06 73 63 68 65 6d 61 15 04 |...<5...schema..|
00000120 00 15 0c 25 02 18 01 61 00 15 04 25 02 18 01 62 |...%...a...%...b|
00000130 25 14 00 16 02 19 1c 19 2c 26 90 01 1c 15 0c 19 |%.......,&......|
00000140 35 04 00 06 19 18 01 61 15 02 16 02 16 80 01 16 |5......a........|
00000150 88 01 26 38 26 08 1c 36 00 28 04 54 65 73 74 18 |..&8&..6.(.Test.|
00000160 04 54 65 73 74 00 00 00 26 86 03 1c 15 04 19 35 |.Test...&......5|
00000170 04 00 06 19 18 01 62 15 02 16 02 16 90 01 16 98 |......b.........|
00000180 01 26 9e 02 26 ee 01 1c 18 08 10 4f 48 96 63 94 |.&..&......OH.c.|
00000190 05 00 18 08 10 4f 48 96 63 94 05 00 16 00 28 08 |.....OH.c.....(.|
000001a0 10 4f 48 96 63 94 05 00 18 08 10 4f 48 96 63 94 |.OH.c......OH.c.|
000001b0 05 00 00 00 00 16 a0 02 16 02 00 19 1c 18 06 70 |...............p|
000001c0 61 6e 64 61 73 18 db 02 7b 22 63 72 65 61 74 6f |andas...{"creato|
000001d0 72 22 3a 20 7b 22 76 65 72 73 69 6f 6e 22 3a 20 |r": {"version": |
000001e0 22 30 2e 31 33 2e 30 22 2c 20 22 6c 69 62 72 61 |"0.13.0", "libra|
000001f0 72 79 22 3a 20 22 70 79 61 72 72 6f 77 22 7d 2c |ry": "pyarrow"},|
00000200 20 22 70 61 6e 64 61 73 5f 76 65 72 73 69 6f 6e | "pandas_version|
00000210 22 3a 20 6e 75 6c 6c 2c 20 22 69 6e 64 65 78 5f |": null, "index_|
00000220 63 6f 6c 75 6d 6e 73 22 3a 20 5b 5d 2c 20 22 63 |columns": [], "c|
00000230 6f 6c 75 6d 6e 73 22 3a 20 5b 7b 22 6d 65 74 61 |olumns": [{"meta|
00000240 64 61 74 61 22 3a 20 6e 75 6c 6c 2c 20 22 66 69 |data": null, "fi|
00000250 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 61 22 2c 20 |eld_name": "a", |
00000260 22 6e 61 6d 65 22 3a 20 22 61 22 2c 20 22 6e 75 |"name": "a", "nu|
00000270 6d 70 79 5f 74 79 70 65 22 3a 20 22 6f 62 6a 65 |mpy_type": "obje|
00000280 63 74 22 2c 20 22 70 61 6e 64 61 73 5f 74 79 70 |ct", "pandas_typ|
00000290 65 22 3a 20 22 62 79 74 65 73 22 7d 2c 20 7b 22 |e": "bytes"}, {"|
000002a0 6d 65 74 61 64 61 74 61 22 3a 20 6e 75 6c 6c 2c |metadata": null,|
000002b0 20 22 66 69 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 | "field_name": "|
000002c0 62 22 2c 20 22 6e 61 6d 65 22 3a 20 22 62 22 2c |b", "name": "b",|
000002d0 20 22 6e 75 6d 70 79 5f 74 79 70 65 22 3a 20 22 | "numpy_type": "|
000002e0 64 61 74 65 74 69 6d 65 36 34 5b 6e 73 5d 22 2c |datetime64[ns]",|
000002f0 20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20 | "pandas_type": |
00000300 22 64 61 74 65 74 69 6d 65 22 7d 5d 2c 20 22 63 |"datetime"}], "c|
00000310 6f 6c 75 6d 6e 5f 69 6e 64 65 78 65 73 22 3a 20 |olumn_indexes": |
00000320 5b 5d 7d 00 18 22 70 61 72 71 75 65 74 2d 63 70 |[]}.."parquet-cp|
00000330 70 20 76 65 72 73 69 6f 6e 20 31 2e 35 2e 31 2d |p version 1.5.1-|
00000340 53 4e 41 50 53 48 4f 54 19 2c 1c 00 00 1c 00 00 |SNAPSHOT.,......|
00000350 00 41 02 00 00 50 41 52 31 |.A...PAR1|
00000359
{code}
These are the strings extracted from that file.
{code:bash}
$ strings example.pq
{code}
{code:none}
PAR1
Test
Test
Test
Test
Test
schema
Test
Test
pandas
{"creator": {"version": "0.13.0", "library": "pyarrow"}, "pandas_version": null, "index_columns": [], "columns": [{"metadata": null, "field_name": "a", "name": "a", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "b", "name": "b", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}], "column_indexes": []}
"parquet-cpp version 1.5.1-SNAPSHOT
PAR1
{code}
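As a cross-check, the logical type pyarrow records can be inspected by reading such a file back. This is a minimal, self-contained sketch (Python 3 rather than the Python 2 snippet above; `version='1.0'` reproduces the nanosecond-to-microsecond coercion that 0.13 applied by default):

{code:python}
import io
from datetime import datetime

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
table = __import__('pyarrow').Table.from_pandas(df, preserve_index=False)

buf = io.BytesIO()
# version='1.0' coerces datetime64[ns] down to microsecond precision,
# storing column b as a physical int64 with logical type TIMESTAMP_MICROS.
pq.write_table(table, buf, version='1.0')
buf.seek(0)

schema = pq.read_table(buf).schema
print(schema)  # column b reads back as timestamp[us]
{code}

The physical int64 storage is consistent with the LongWritable cast error Hive raises below.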
If I copy it onto HDFS, I'm unable to read the timestamp column. I've tried a variety of Hive data types.
{code:bash}
$ hdfs dfs -mkdir /testing/
$ hdfs dfs -copyFromLocal example.pq /testing/
$ hdfs dfs -ls /testing/example.pq
{code}
{code:none}
-rw-r--r-- 1 ubuntu supergroup 857 2019-10-08 10:26 /testing/example.pq
{code}
{code:bash}
$ hive
{code}
It can't be read as a DATETIME field:
{code:sql}
CREATE EXTERNAL TABLE testing (
    a STRING,
    b DATETIME
) STORED AS parquet
LOCATION '/testing/';
{code}
{code:none}
FAILED: SemanticException [Error 10099]: DATETIME type isn't supported yet. Please use DATE or TIMESTAMP instead
{code}
It can't be read as a TIMESTAMP:
{code:sql}
CREATE EXTERNAL TABLE testing (
    a STRING,
    b TIMESTAMP
) STORED AS parquet
LOCATION '/testing/';

SELECT * FROM testing;
{code}
{code:none}
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
{code}
I can read the column as a BIGINT, but I've been unable to find an arithmetic conversion that produces the correct time.
{code:sql}
DROP TABLE `testing`;

CREATE EXTERNAL TABLE `testing` (
    a STRING,
    b BIGINT
) STORED AS parquet
LOCATION '/testing/';
{code}
{code:none}
Test 1570530327547664
{code}
{code:sql}
SELECT CAST(b AS TIMESTAMP) FROM testing;
{code}
{code:none}
51738-02-15 08:19:077.664
{code}
{code:sql}
SELECT CAST(b/1000 AS TIMESTAMP) FROM testing;
{code}
{code:none}
51738-02-15 08:19:077.664
{code}
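For what it's worth, the BIGINT Hive returns looks like microseconds since the Unix epoch; interpreted that way it decodes to the expected time. A quick sanity check in Python (a sketch, not part of the original report):

{code:python}
from datetime import datetime, timedelta

micros = 1570530327547664  # the BIGINT value Hive returned
ts = datetime(1970, 1, 1) + timedelta(microseconds=micros)
print(ts)  # 2019-10-08 10:25:27.547664, just before the 10:26 HDFS timestamp
{code}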
I'll generate the same content using Hive and show the contents of the Parquet file it produces:
{code:sql}
CREATE TABLE `testing2` (
    a STRING,
    b TIMESTAMP
) STORED AS parquet
LOCATION '/testing2/';

INSERT INTO testing2 (a, b) VALUES ('Test', "2019-10-08 10:34:30.061");

SELECT * FROM testing2;
{code}
{code:none}
Test 2019-10-08 10:34:30.061
{code}
The timestamp is usable.
{code:sql}
SELECT YEAR(B) FROM testing2;
{code}
{code:none}
2019
{code}
{code:bash}
$ hdfs dfs -copyToLocal /testing2/000000_0 ./
$ hexdump -C 000000_0
{code}
{code:none}
00000000 50 41 52 31 15 00 15 1c 15 1c 2c 15 02 15 00 15 |PAR1......,.....|
00000010 06 15 08 1c 18 04 54 65 73 74 18 04 54 65 73 74 |......Test..Test|
00000020 16 00 00 00 00 02 00 00 00 03 01 04 00 00 00 54 |...............T|
00000030 65 73 74 15 04 15 18 15 18 4c 15 02 15 04 00 00 |est......L......|
00000040 40 65 76 e0 9f 22 00 00 8d 84 25 00 15 00 15 10 |@ev.."....%.....|
00000050 15 10 2c 15 02 15 04 15 06 15 08 1c 18 0c 40 65 |..,...........@e|
00000060 76 e0 9f 22 00 00 8d 84 25 00 18 0c 40 65 76 e0 |v.."....%...@ev.|
00000070 9f 22 00 00 8d 84 25 00 16 00 00 00 00 02 00 00 |."....%.........|
00000080 00 03 01 00 03 15 02 19 3c 48 0b 68 69 76 65 5f |........<H.hive_|
00000090 73 63 68 65 6d 61 15 04 00 15 0c 25 02 18 01 61 |schema.....%...a|
000000a0 25 00 00 15 06 25 02 18 01 62 00 16 02 19 1c 19 |%....%...b......|
000000b0 2c 26 08 1c 15 0c 19 35 08 06 00 19 18 01 61 15 |,&.....5......a.|
000000c0 00 16 02 16 5e 16 5e 26 08 3c 18 04 54 65 73 74 |....^.^&.<..Test|
000000d0 18 04 54 65 73 74 16 00 00 00 00 26 66 1c 15 06 |..Test.....&f...|
000000e0 19 35 08 04 06 19 18 01 62 15 00 16 02 16 a4 01 |.5......b.......|
000000f0 16 a4 01 26 66 3c 18 0c 40 65 76 e0 9f 22 00 00 |...&f<..@ev.."..|
00000100 8d 84 25 00 18 0c 40 65 76 e0 9f 22 00 00 8d 84 |..%...@ev.."....|
00000110 25 00 16 00 00 00 00 16 82 02 16 02 00 28 49 70 |%............(Ip|
00000120 61 72 71 75 65 74 2d 6d 72 20 76 65 72 73 69 6f |arquet-mr versio|
00000130 6e 20 31 2e 38 2e 31 20 28 62 75 69 6c 64 20 34 |n 1.8.1 (build 4|
00000140 61 62 61 34 64 61 65 37 62 62 30 64 34 65 64 62 |aba4dae7bb0d4edb|
00000150 63 66 37 39 32 33 61 65 31 33 33 39 66 32 38 66 |cf7923ae1339f28f|
00000160 64 33 66 37 66 63 66 29 00 e4 00 00 00 50 41 52 |d3f7fcf).....PAR|
00000170 31 |1|
00000171
{code}
{code:bash}
$ strings 000000_0
{code}
{code:none}
PAR1
Test
Test
Test
hive_schema
Test
Test
(Iparquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
PAR1
{code}
I'll do the same with Presto, as it has its own Parquet writer.
{code:bash}
$ presto --server localhost:8085 --schema default --catalog hive
{code}
{code:sql}
CREATE TABLE testing3 WITH (FORMAT='PARQUET') AS
SELECT 'Test' AS a,
       CAST('2019-10-08 10:34:30.061' AS TIMESTAMP) AS b;

SELECT * FROM testing3;
{code}
{code:none}
a | b
------+-------------------------
Test | 2019-10-08 10:34:30.061
{code}
{code:sql}
SELECT YEAR(b) FROM testing3;
{code}
{code:none}
_col0
-------
2019
{code}
{code:bash}
$ hdfs dfs -copyToLocal /user/hive/warehouse/testing3/20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9 ./
$ hexdump -C 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
{code}
{code:none}
00000000 50 41 52 31 15 00 15 1c 15 40 2c 15 02 15 00 15 |PAR1.....@,.....|
00000010 06 15 08 1c 18 04 54 65 73 74 18 04 54 65 73 74 |......Test..Test|
00000020 16 00 00 00 00 1f 8b 08 00 00 00 00 00 00 03 63 |...............c|
00000030 62 60 60 60 66 64 01 92 21 a9 c5 25 00 fe 12 da |b```fd..!..%....|
00000040 11 0e 00 00 00 15 04 15 18 15 40 4c 15 02 15 04 |..........@L....|
00000050 00 00 1f 8b 08 00 00 00 00 00 00 03 73 48 2d 7b |............sH-{|
00000060 30 5f 89 81 a1 b7 45 95 01 00 64 bc 2a 14 0c 00 |0_....E...d.*...|
00000070 00 00 15 00 15 10 15 38 2c 15 02 15 04 15 06 15 |.......8,.......|
00000080 08 1c 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 |....@ev.."....%.|
00000090 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 16 00 |..@ev.."....%...|
000000a0 00 00 00 1f 8b 08 00 00 00 00 00 00 03 63 62 60 |.............cb`|
000000b0 60 60 66 64 60 06 00 77 4c 79 ad 08 00 00 00 15 |``fd`..wLy......|
000000c0 02 19 3c 48 0b 68 69 76 65 5f 73 63 68 65 6d 61 |..<H.hive_schema|
000000d0 15 04 00 15 0c 25 02 18 01 61 25 00 00 15 06 25 |.....%...a%....%|
000000e0 02 18 01 62 00 16 02 19 1c 19 2c 26 08 1c 15 0c |...b......,&....|
000000f0 19 35 08 00 06 19 18 01 61 15 04 16 02 16 5e 16 |.5......a.....^.|
00000100 82 01 26 08 3c 18 04 54 65 73 74 18 04 54 65 73 |..&.<..Test..Tes|
00000110 74 16 00 00 00 00 26 8a 01 1c 15 06 19 35 04 08 |t.....&......5..|
00000120 06 19 18 01 62 15 04 16 02 16 a4 01 16 f4 01 26 |....b..........&|
00000130 8a 01 3c 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 |..<..@ev.."....%|
00000140 00 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 16 |...@ev.."....%..|
00000150 00 00 00 00 16 82 02 16 02 00 28 0a 70 61 72 71 |..........(.parq|
00000160 75 65 74 2d 6d 72 00 a8 00 00 00 50 41 52 31 |uet-mr.....PAR1|
0000016f
{code}
{code:bash}
$ strings 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
{code}
{code:none}
PAR1
Test
Test
cb```fd
sH-{0_
cb```fd`
hive_schema
Test
Test
parquet-mr
PAR1
{code}
Any idea how I can save timestamps with Pandas and have them readable by Hive and Presto?
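One possible workaround, hedged since I haven't verified it against this exact Hive/Presto setup: pyarrow can write timestamps using the deprecated INT96 physical type that parquet-mr-era readers expect, via the `use_deprecated_int96_timestamps` flag (also implied by `flavor='spark'`):

{code:python}
from datetime import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
table = pa.Table.from_pandas(df, preserve_index=False)

# Store column b as INT96, the physical type older Hive/Presto
# Parquet readers expect for timestamps.
pq.write_table(table, 'example_int96.pq',
               use_deprecated_int96_timestamps=True)
{code}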
--
This message was sent by Atlassian Jira
(v8.3.4#803005)