Posted to issues@arrow.apache.org by "Mark Litwintschik (Jira)" <ji...@apache.org> on 2019/10/08 11:36:00 UTC
[jira] [Closed] (ARROW-6815) Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
[ https://issues.apache.org/jira/browse/ARROW-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Litwintschik closed ARROW-6815.
------------------------------------
Resolution: Fixed
Using flavor='spark' when writing the Parquet file produces timestamps that Hive and Presto can read.
> Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
> ---------------------------------------------------------------------
>
> Key: ARROW-6815
> URL: https://issues.apache.org/jira/browse/ARROW-6815
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Mark Litwintschik
> Priority: Major
>
> I'm unable to read timestamps saved as Parquet data via Pandas with Hive or Presto. These are the versions of the various pieces of software I'm using:
> * Pandas 0.24.1
> * PyArrow 0.13.0
> * Hadoop 3.0.3
> * Hive 2.3.3
> * Presto 0.215
> The Hadoop setup steps I took can be found [here|https://tech.marksblogg.com/hadoop-3-single-node-install-guide.html].
> This is what I did to generate the Parquet file.
> {code:python}
> from datetime import datetime
> from io import BytesIO  # Parquet is binary, so BytesIO (StringIO is Python 2 only)
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
> parquet_buffer = BytesIO()
> pa_table = pa.Table.from_pandas(df, preserve_index=False)
> writer = pq.ParquetWriter(parquet_buffer, pa_table.schema)
> writer.write_table(pa_table)
> writer.close()
> with open('example.pq', 'w+b') as f:
>     f.write(parquet_buffer.getvalue())
> {code}
> These are the raw contents of the file:
> {code:bash}
> $ hexdump -C example.pq
> {code}
> {code:none}
> 00000000 50 41 52 31 15 04 15 10 15 14 4c 15 02 15 04 12 |PAR1......L.....|
> 00000010 00 00 08 1c 04 00 00 00 54 65 73 74 15 00 15 12 |........Test....|
> 00000020 15 16 2c 15 02 15 04 15 06 15 06 1c 18 04 54 65 |..,...........Te|
> 00000030 73 74 18 04 54 65 73 74 16 00 00 00 00 09 20 02 |st..Test...... .|
> 00000040 00 00 00 02 01 01 02 00 26 90 01 1c 15 0c 19 35 |........&......5|
> 00000050 04 00 06 19 18 01 61 15 02 16 02 16 80 01 16 88 |......a.........|
> 00000060 01 26 38 26 08 1c 36 00 28 04 54 65 73 74 18 04 |.&8&..6.(.Test..|
> 00000070 54 65 73 74 00 00 00 15 04 15 10 15 14 4c 15 02 |Test.........L..|
> 00000080 15 04 12 00 00 08 1c 10 4f 48 96 63 94 05 00 15 |........OH.c....|
> 00000090 00 15 12 15 16 2c 15 02 15 04 15 06 15 06 1c 18 |.....,..........|
> 000000a0 08 10 4f 48 96 63 94 05 00 18 08 10 4f 48 96 63 |..OH.c......OH.c|
> 000000b0 94 05 00 16 00 00 00 00 09 20 02 00 00 00 02 01 |......... ......|
> 000000c0 01 02 00 26 86 03 1c 15 04 19 35 04 00 06 19 18 |...&......5.....|
> 000000d0 01 62 15 02 16 02 16 90 01 16 98 01 26 9e 02 26 |.b..........&..&|
> 000000e0 ee 01 1c 18 08 10 4f 48 96 63 94 05 00 18 08 10 |......OH.c......|
> 000000f0 4f 48 96 63 94 05 00 16 00 28 08 10 4f 48 96 63 |OH.c.....(..OH.c|
> 00000100 94 05 00 18 08 10 4f 48 96 63 94 05 00 00 00 00 |......OH.c......|
> 00000110 15 02 19 3c 35 00 18 06 73 63 68 65 6d 61 15 04 |...<5...schema..|
> 00000120 00 15 0c 25 02 18 01 61 00 15 04 25 02 18 01 62 |...%...a...%...b|
> 00000130 25 14 00 16 02 19 1c 19 2c 26 90 01 1c 15 0c 19 |%.......,&......|
> 00000140 35 04 00 06 19 18 01 61 15 02 16 02 16 80 01 16 |5......a........|
> 00000150 88 01 26 38 26 08 1c 36 00 28 04 54 65 73 74 18 |..&8&..6.(.Test.|
> 00000160 04 54 65 73 74 00 00 00 26 86 03 1c 15 04 19 35 |.Test...&......5|
> 00000170 04 00 06 19 18 01 62 15 02 16 02 16 90 01 16 98 |......b.........|
> 00000180 01 26 9e 02 26 ee 01 1c 18 08 10 4f 48 96 63 94 |.&..&......OH.c.|
> 00000190 05 00 18 08 10 4f 48 96 63 94 05 00 16 00 28 08 |.....OH.c.....(.|
> 000001a0 10 4f 48 96 63 94 05 00 18 08 10 4f 48 96 63 94 |.OH.c......OH.c.|
> 000001b0 05 00 00 00 00 16 a0 02 16 02 00 19 1c 18 06 70 |...............p|
> 000001c0 61 6e 64 61 73 18 db 02 7b 22 63 72 65 61 74 6f |andas...{"creato|
> 000001d0 72 22 3a 20 7b 22 76 65 72 73 69 6f 6e 22 3a 20 |r": {"version": |
> 000001e0 22 30 2e 31 33 2e 30 22 2c 20 22 6c 69 62 72 61 |"0.13.0", "libra|
> 000001f0 72 79 22 3a 20 22 70 79 61 72 72 6f 77 22 7d 2c |ry": "pyarrow"},|
> 00000200 20 22 70 61 6e 64 61 73 5f 76 65 72 73 69 6f 6e | "pandas_version|
> 00000210 22 3a 20 6e 75 6c 6c 2c 20 22 69 6e 64 65 78 5f |": null, "index_|
> 00000220 63 6f 6c 75 6d 6e 73 22 3a 20 5b 5d 2c 20 22 63 |columns": [], "c|
> 00000230 6f 6c 75 6d 6e 73 22 3a 20 5b 7b 22 6d 65 74 61 |olumns": [{"meta|
> 00000240 64 61 74 61 22 3a 20 6e 75 6c 6c 2c 20 22 66 69 |data": null, "fi|
> 00000250 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 61 22 2c 20 |eld_name": "a", |
> 00000260 22 6e 61 6d 65 22 3a 20 22 61 22 2c 20 22 6e 75 |"name": "a", "nu|
> 00000270 6d 70 79 5f 74 79 70 65 22 3a 20 22 6f 62 6a 65 |mpy_type": "obje|
> 00000280 63 74 22 2c 20 22 70 61 6e 64 61 73 5f 74 79 70 |ct", "pandas_typ|
> 00000290 65 22 3a 20 22 62 79 74 65 73 22 7d 2c 20 7b 22 |e": "bytes"}, {"|
> 000002a0 6d 65 74 61 64 61 74 61 22 3a 20 6e 75 6c 6c 2c |metadata": null,|
> 000002b0 20 22 66 69 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 | "field_name": "|
> 000002c0 62 22 2c 20 22 6e 61 6d 65 22 3a 20 22 62 22 2c |b", "name": "b",|
> 000002d0 20 22 6e 75 6d 70 79 5f 74 79 70 65 22 3a 20 22 | "numpy_type": "|
> 000002e0 64 61 74 65 74 69 6d 65 36 34 5b 6e 73 5d 22 2c |datetime64[ns]",|
> 000002f0 20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20 | "pandas_type": |
> 00000300 22 64 61 74 65 74 69 6d 65 22 7d 5d 2c 20 22 63 |"datetime"}], "c|
> 00000310 6f 6c 75 6d 6e 5f 69 6e 64 65 78 65 73 22 3a 20 |olumn_indexes": |
> 00000320 5b 5d 7d 00 18 22 70 61 72 71 75 65 74 2d 63 70 |[]}.."parquet-cp|
> 00000330 70 20 76 65 72 73 69 6f 6e 20 31 2e 35 2e 31 2d |p version 1.5.1-|
> 00000340 53 4e 41 50 53 48 4f 54 19 2c 1c 00 00 1c 00 00 |SNAPSHOT.,......|
> 00000350 00 41 02 00 00 50 41 52 31 |.A...PAR1|
> 00000359
> {code}
> These are the strings extracted from that file.
> {code:bash}
> $ strings example.pq
> {code}
> {code:none}
> PAR1
> Test
> Test
> Test
> Test
> Test
> schema
> Test
> Test
> pandas
> {"creator": {"version": "0.13.0", "library": "pyarrow"}, "pandas_version": null, "index_columns": [], "columns": [{"metadata": null, "field_name": "a", "name": "a", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "b", "name": "b", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}], "column_indexes": []}
> "parquet-cpp version 1.5.1-SNAPSHOT
> PAR1
> {code}
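As a side note on what is actually stored: in the hexdump above the value of column b appears as the 8 bytes 10 4f 48 96 63 94 05 00, a little-endian INT64. Decoding it with the standard library (my annotation, not part of the original report) yields the same number Hive later reads back as a BIGINT:

```python
import struct

# The 8 timestamp bytes as they appear in the hexdump of example.pq.
raw = bytes.fromhex('104f489663940500')

# Parquet stores the value as a little-endian signed 64-bit integer.
(value,) = struct.unpack('<q', raw)
print(value)  # 1570530327547664 -- microseconds since the Unix epoch
```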
> If I copy it onto HDFS, I'm unable to read the timestamp back; I've tried a variety of column types.
> {code:bash}
> $ hdfs dfs -mkdir /testing/
> $ hdfs dfs -copyFromLocal example.pq /testing/
> $ hdfs dfs -ls /testing/example.pq
> {code}
> {code:none}
> -rw-r--r-- 1 ubuntu supergroup 857 2019-10-08 10:26 /testing/example.pq
> {code}
> {code:bash}
> $ hive
> {code}
> It can't be read as a DATETIME field:
> {code:sql}
> CREATE EXTERNAL TABLE testing (
>     a STRING,
>     b DATETIME
> ) STORED AS parquet
> LOCATION '/testing/';
> {code}
> {code:none}
> FAILED: SemanticException [Error 10099]: DATETIME type isn't supported yet. Please use DATE or TIMESTAMP instead
> {code}
> It can't be read as a TIMESTAMP:
> {code:sql}
> CREATE EXTERNAL TABLE testing (
>     a STRING,
>     b TIMESTAMP
> ) STORED AS parquet
> LOCATION '/testing/';
> SELECT * FROM testing;
> {code}
> {code:none}
> Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
> {code}
> I can extract it as a BIGINT, but I'm unable to find an arithmetic conversion that yields the correct time.
> {code:sql}
> DROP TABLE `testing`;
> CREATE EXTERNAL TABLE `testing` (
>     a STRING,
>     b BIGINT
> ) STORED AS parquet
> LOCATION '/testing/';
> {code}
> {code:none}
> Test 1570530327547664
> {code}
> {code:sql}
> SELECT CAST(b AS TIMESTAMP) FROM testing;
> {code}
> {code:none}
> 51738-02-15 08:19:077.664
> {code}
> {code:sql}
> SELECT CAST(b/1000 AS TIMESTAMP) FROM testing;
> {code}
> {code:none}
> 51738-02-15 08:19:077.664
> {code}
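The stored BIGINT is a count of microseconds since the Unix epoch (PyArrow coerces pandas' datetime64[ns] values to microsecond precision when writing Parquet). In Python the conversion is straightforward:

```python
from datetime import datetime, timedelta

# The BIGINT Hive reads back from the PyArrow-written file:
# microseconds since the Unix epoch.
raw_value = 1570530327547664

ts = datetime(1970, 1, 1) + timedelta(microseconds=raw_value)
print(ts)  # 2019-10-08 10:25:27.547664
```

In Hive, I believe CAST(b / 1000000.0 AS TIMESTAMP) would give the same result, since Hive treats a floating-point value as seconds since the epoch, but I have not confirmed that against this table.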
> I'll generate the same content using Hive and show the contents of the Parquet file it produces:
> {code:sql}
> CREATE TABLE `testing2` (
>     a STRING,
>     b TIMESTAMP
> ) STORED AS parquet
> LOCATION '/testing2/';
> INSERT INTO testing2 (a, b) VALUES ('Test', "2019-10-08 10:34:30.061");
> SELECT * FROM testing2;
> {code}
> {code:none}
> Test 2019-10-08 10:34:30.061
> {code}
> The timestamp is usable.
> {code:sql}
> SELECT YEAR(B) FROM testing2;
> {code}
> {code:none}
> 2019
> {code}
> {code:bash}
> $ hdfs dfs -copyToLocal /testing2/000000_0 ./
> $ hexdump -C 000000_0
> {code}
> {code:none}
> 00000000 50 41 52 31 15 00 15 1c 15 1c 2c 15 02 15 00 15 |PAR1......,.....|
> 00000010 06 15 08 1c 18 04 54 65 73 74 18 04 54 65 73 74 |......Test..Test|
> 00000020 16 00 00 00 00 02 00 00 00 03 01 04 00 00 00 54 |...............T|
> 00000030 65 73 74 15 04 15 18 15 18 4c 15 02 15 04 00 00 |est......L......|
> 00000040 40 65 76 e0 9f 22 00 00 8d 84 25 00 15 00 15 10 |@ev.."....%.....|
> 00000050 15 10 2c 15 02 15 04 15 06 15 08 1c 18 0c 40 65 |..,...........@e|
> 00000060 76 e0 9f 22 00 00 8d 84 25 00 18 0c 40 65 76 e0 |v.."....%...@ev.|
> 00000070 9f 22 00 00 8d 84 25 00 16 00 00 00 00 02 00 00 |."....%.........|
> 00000080 00 03 01 00 03 15 02 19 3c 48 0b 68 69 76 65 5f |........<H.hive_|
> 00000090 73 63 68 65 6d 61 15 04 00 15 0c 25 02 18 01 61 |schema.....%...a|
> 000000a0 25 00 00 15 06 25 02 18 01 62 00 16 02 19 1c 19 |%....%...b......|
> 000000b0 2c 26 08 1c 15 0c 19 35 08 06 00 19 18 01 61 15 |,&.....5......a.|
> 000000c0 00 16 02 16 5e 16 5e 26 08 3c 18 04 54 65 73 74 |....^.^&.<..Test|
> 000000d0 18 04 54 65 73 74 16 00 00 00 00 26 66 1c 15 06 |..Test.....&f...|
> 000000e0 19 35 08 04 06 19 18 01 62 15 00 16 02 16 a4 01 |.5......b.......|
> 000000f0 16 a4 01 26 66 3c 18 0c 40 65 76 e0 9f 22 00 00 |...&f<..@ev.."..|
> 00000100 8d 84 25 00 18 0c 40 65 76 e0 9f 22 00 00 8d 84 |..%...@ev.."....|
> 00000110 25 00 16 00 00 00 00 16 82 02 16 02 00 28 49 70 |%............(Ip|
> 00000120 61 72 71 75 65 74 2d 6d 72 20 76 65 72 73 69 6f |arquet-mr versio|
> 00000130 6e 20 31 2e 38 2e 31 20 28 62 75 69 6c 64 20 34 |n 1.8.1 (build 4|
> 00000140 61 62 61 34 64 61 65 37 62 62 30 64 34 65 64 62 |aba4dae7bb0d4edb|
> 00000150 63 66 37 39 32 33 61 65 31 33 33 39 66 32 38 66 |cf7923ae1339f28f|
> 00000160 64 33 66 37 66 63 66 29 00 e4 00 00 00 50 41 52 |d3f7fcf).....PAR|
> 00000170 31 |1|
> 00000171
> {code}
> {code:bash}
> $ strings 000000_0
> {code}
> {code:none}
> PAR1
> Test
> Test
> Test
> hive_schema
> Test
> Test
> (Iparquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> PAR1
> {code}
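The reason Hive can read its own file is visible in the hexdump: the timestamp is stored as a 12-byte INT96 value (the bytes 40 65 76 e0 9f 22 00 00 8d 84 25 00), i.e. nanoseconds within the day plus a Julian day number, rather than the INT64 microseconds PyArrow wrote. A quick decode with the standard library (my annotation, not part of the original report):

```python
import struct
from datetime import datetime, timedelta

# The 12 timestamp bytes from the hexdump of Hive's file.
int96 = bytes.fromhex('406576e09f2200008d842500')

# INT96 timestamps are little-endian: 8 bytes of nanoseconds within
# the day, followed by 4 bytes holding the Julian day number.
nanos_of_day, julian_day = struct.unpack('<QI', int96)

# Julian day 2440588 corresponds to 1970-01-01.
ts = (datetime(1970, 1, 1)
      + timedelta(days=julian_day - 2440588,
                  microseconds=nanos_of_day // 1000))
print(ts)  # 2019-10-08 10:34:30.061000
```

This matches the value inserted into testing2 above, and it is this INT96 representation that flavor='spark' makes PyArrow emit.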
> I'll do the same for Presto as it has its own writer as well.
> {code:bash}
> $ presto --server localhost:8085 --schema default --catalog hive
> {code}
> {code:sql}
> CREATE TABLE testing3 WITH (FORMAT='PARQUET') AS
>     SELECT 'Test' AS a,
>            CAST('2019-10-08 10:34:30.061' AS TIMESTAMP) AS b;
> SELECT * FROM testing3;
> {code}
> {code:none}
> a | b
> ------+-------------------------
> Test | 2019-10-08 10:34:30.061
> {code}
> {code:sql}
> SELECT YEAR(b) FROM testing3;
> {code}
> {code:none}
> _col0
> -------
> 2019
> {code}
> {code:bash}
> $ hdfs dfs -copyToLocal /user/hive/warehouse/testing3/20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9 ./
> $ hexdump -C 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
> {code}
> {code:none}
> 00000000 50 41 52 31 15 00 15 1c 15 40 2c 15 02 15 00 15 |PAR1.....@,.....|
> 00000010 06 15 08 1c 18 04 54 65 73 74 18 04 54 65 73 74 |......Test..Test|
> 00000020 16 00 00 00 00 1f 8b 08 00 00 00 00 00 00 03 63 |...............c|
> 00000030 62 60 60 60 66 64 01 92 21 a9 c5 25 00 fe 12 da |b```fd..!..%....|
> 00000040 11 0e 00 00 00 15 04 15 18 15 40 4c 15 02 15 04 |..........@L....|
> 00000050 00 00 1f 8b 08 00 00 00 00 00 00 03 73 48 2d 7b |............sH-{|
> 00000060 30 5f 89 81 a1 b7 45 95 01 00 64 bc 2a 14 0c 00 |0_....E...d.*...|
> 00000070 00 00 15 00 15 10 15 38 2c 15 02 15 04 15 06 15 |.......8,.......|
> 00000080 08 1c 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 |....@ev.."....%.|
> 00000090 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 16 00 |..@ev.."....%...|
> 000000a0 00 00 00 1f 8b 08 00 00 00 00 00 00 03 63 62 60 |.............cb`|
> 000000b0 60 60 66 64 60 06 00 77 4c 79 ad 08 00 00 00 15 |``fd`..wLy......|
> 000000c0 02 19 3c 48 0b 68 69 76 65 5f 73 63 68 65 6d 61 |..<H.hive_schema|
> 000000d0 15 04 00 15 0c 25 02 18 01 61 25 00 00 15 06 25 |.....%...a%....%|
> 000000e0 02 18 01 62 00 16 02 19 1c 19 2c 26 08 1c 15 0c |...b......,&....|
> 000000f0 19 35 08 00 06 19 18 01 61 15 04 16 02 16 5e 16 |.5......a.....^.|
> 00000100 82 01 26 08 3c 18 04 54 65 73 74 18 04 54 65 73 |..&.<..Test..Tes|
> 00000110 74 16 00 00 00 00 26 8a 01 1c 15 06 19 35 04 08 |t.....&......5..|
> 00000120 06 19 18 01 62 15 04 16 02 16 a4 01 16 f4 01 26 |....b..........&|
> 00000130 8a 01 3c 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 |..<..@ev.."....%|
> 00000140 00 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 16 |...@ev.."....%..|
> 00000150 00 00 00 00 16 82 02 16 02 00 28 0a 70 61 72 71 |..........(.parq|
> 00000160 75 65 74 2d 6d 72 00 a8 00 00 00 50 41 52 31 |uet-mr.....PAR1|
> 0000016f
> {code}
> {code:bash}
> $ strings 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
> {code}
> {code:none}
> PAR1
> Test
> Test
> cb```fd
> sH-{0_
> cb```fd`
> hive_schema
> Test
> Test
> parquet-mr
> PAR1
> {code}
> Any idea how I can save timestamps with Pandas and have them readable by Hive and Presto?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)