Posted to issues@arrow.apache.org by "Mark Litwintschik (Jira)" <ji...@apache.org> on 2019/10/08 11:36:00 UTC
[jira] [Closed] (ARROW-6815) Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
[ https://issues.apache.org/jira/browse/ARROW-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Litwintschik closed ARROW-6815.
------------------------------------
Resolution: Fixed
Using flavor='spark' when writing the Parquet file produces timestamps that Hive and Presto can read.
> Timestamps saved via Pandas and PyArrow unreadable in Hive and Presto
> ---------------------------------------------------------------------
>
> Key: ARROW-6815
> URL: https://issues.apache.org/jira/browse/ARROW-6815
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Mark Litwintschik
> Priority: Major
>
> I'm unable to read timestamps saved as Parquet data via Pandas with Hive or Presto. These are the versions of the various pieces of software I'm using:
> * Pandas 0.24.1
> * PyArrow 0.13.0
> * Hadoop 3.0.3
> * Hive 2.3.3
> * Presto 0.215
> The Hadoop setup steps I took can be found [here|https://tech.marksblogg.com/hadoop-3-single-node-install-guide.html].
> This is what I did to generate the Parquet file.
> {code:python}
> from datetime import datetime
> from io import BytesIO  # Parquet is binary, so BytesIO (StringIO is Python 2 only)
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame([{'a': 'Test', 'b': datetime.utcnow()}])
> parquet_buffer = BytesIO()
> pa_table = pa.Table.from_pandas(df, preserve_index=False)
> writer = pq.ParquetWriter(parquet_buffer, pa_table.schema)
> writer.write_table(pa_table)
> writer.close()
> with open('example.pq', 'w+b') as f:
>     f.write(parquet_buffer.getvalue())
> {code}
> These are the raw contents of the file:
> {code:bash}
> $ hexdump -C example.pq
> {code}
> {code:none}
> 00000000 50 41 52 31 15 04 15 10 15 14 4c 15 02 15 04 12 |PAR1......L.....|
> 00000010 00 00 08 1c 04 00 00 00 54 65 73 74 15 00 15 12 |........Test....|
> 00000020 15 16 2c 15 02 15 04 15 06 15 06 1c 18 04 54 65 |..,...........Te|
> 00000030 73 74 18 04 54 65 73 74 16 00 00 00 00 09 20 02 |st..Test...... .|
> 00000040 00 00 00 02 01 01 02 00 26 90 01 1c 15 0c 19 35 |........&......5|
> 00000050 04 00 06 19 18 01 61 15 02 16 02 16 80 01 16 88 |......a.........|
> 00000060 01 26 38 26 08 1c 36 00 28 04 54 65 73 74 18 04 |.&8&..6.(.Test..|
> 00000070 54 65 73 74 00 00 00 15 04 15 10 15 14 4c 15 02 |Test.........L..|
> 00000080 15 04 12 00 00 08 1c 10 4f 48 96 63 94 05 00 15 |........OH.c....|
> 00000090 00 15 12 15 16 2c 15 02 15 04 15 06 15 06 1c 18 |.....,..........|
> 000000a0 08 10 4f 48 96 63 94 05 00 18 08 10 4f 48 96 63 |..OH.c......OH.c|
> 000000b0 94 05 00 16 00 00 00 00 09 20 02 00 00 00 02 01 |......... ......|
> 000000c0 01 02 00 26 86 03 1c 15 04 19 35 04 00 06 19 18 |...&......5.....|
> 000000d0 01 62 15 02 16 02 16 90 01 16 98 01 26 9e 02 26 |.b..........&..&|
> 000000e0 ee 01 1c 18 08 10 4f 48 96 63 94 05 00 18 08 10 |......OH.c......|
> 000000f0 4f 48 96 63 94 05 00 16 00 28 08 10 4f 48 96 63 |OH.c.....(..OH.c|
> 00000100 94 05 00 18 08 10 4f 48 96 63 94 05 00 00 00 00 |......OH.c......|
> 00000110 15 02 19 3c 35 00 18 06 73 63 68 65 6d 61 15 04 |...<5...schema..|
> 00000120 00 15 0c 25 02 18 01 61 00 15 04 25 02 18 01 62 |...%...a...%...b|
> 00000130 25 14 00 16 02 19 1c 19 2c 26 90 01 1c 15 0c 19 |%.......,&......|
> 00000140 35 04 00 06 19 18 01 61 15 02 16 02 16 80 01 16 |5......a........|
> 00000150 88 01 26 38 26 08 1c 36 00 28 04 54 65 73 74 18 |..&8&..6.(.Test.|
> 00000160 04 54 65 73 74 00 00 00 26 86 03 1c 15 04 19 35 |.Test...&......5|
> 00000170 04 00 06 19 18 01 62 15 02 16 02 16 90 01 16 98 |......b.........|
> 00000180 01 26 9e 02 26 ee 01 1c 18 08 10 4f 48 96 63 94 |.&..&......OH.c.|
> 00000190 05 00 18 08 10 4f 48 96 63 94 05 00 16 00 28 08 |.....OH.c.....(.|
> 000001a0 10 4f 48 96 63 94 05 00 18 08 10 4f 48 96 63 94 |.OH.c......OH.c.|
> 000001b0 05 00 00 00 00 16 a0 02 16 02 00 19 1c 18 06 70 |...............p|
> 000001c0 61 6e 64 61 73 18 db 02 7b 22 63 72 65 61 74 6f |andas...{"creato|
> 000001d0 72 22 3a 20 7b 22 76 65 72 73 69 6f 6e 22 3a 20 |r": {"version": |
> 000001e0 22 30 2e 31 33 2e 30 22 2c 20 22 6c 69 62 72 61 |"0.13.0", "libra|
> 000001f0 72 79 22 3a 20 22 70 79 61 72 72 6f 77 22 7d 2c |ry": "pyarrow"},|
> 00000200 20 22 70 61 6e 64 61 73 5f 76 65 72 73 69 6f 6e | "pandas_version|
> 00000210 22 3a 20 6e 75 6c 6c 2c 20 22 69 6e 64 65 78 5f |": null, "index_|
> 00000220 63 6f 6c 75 6d 6e 73 22 3a 20 5b 5d 2c 20 22 63 |columns": [], "c|
> 00000230 6f 6c 75 6d 6e 73 22 3a 20 5b 7b 22 6d 65 74 61 |olumns": [{"meta|
> 00000240 64 61 74 61 22 3a 20 6e 75 6c 6c 2c 20 22 66 69 |data": null, "fi|
> 00000250 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 61 22 2c 20 |eld_name": "a", |
> 00000260 22 6e 61 6d 65 22 3a 20 22 61 22 2c 20 22 6e 75 |"name": "a", "nu|
> 00000270 6d 70 79 5f 74 79 70 65 22 3a 20 22 6f 62 6a 65 |mpy_type": "obje|
> 00000280 63 74 22 2c 20 22 70 61 6e 64 61 73 5f 74 79 70 |ct", "pandas_typ|
> 00000290 65 22 3a 20 22 62 79 74 65 73 22 7d 2c 20 7b 22 |e": "bytes"}, {"|
> 000002a0 6d 65 74 61 64 61 74 61 22 3a 20 6e 75 6c 6c 2c |metadata": null,|
> 000002b0 20 22 66 69 65 6c 64 5f 6e 61 6d 65 22 3a 20 22 | "field_name": "|
> 000002c0 62 22 2c 20 22 6e 61 6d 65 22 3a 20 22 62 22 2c |b", "name": "b",|
> 000002d0 20 22 6e 75 6d 70 79 5f 74 79 70 65 22 3a 20 22 | "numpy_type": "|
> 000002e0 64 61 74 65 74 69 6d 65 36 34 5b 6e 73 5d 22 2c |datetime64[ns]",|
> 000002f0 20 22 70 61 6e 64 61 73 5f 74 79 70 65 22 3a 20 | "pandas_type": |
> 00000300 22 64 61 74 65 74 69 6d 65 22 7d 5d 2c 20 22 63 |"datetime"}], "c|
> 00000310 6f 6c 75 6d 6e 5f 69 6e 64 65 78 65 73 22 3a 20 |olumn_indexes": |
> 00000320 5b 5d 7d 00 18 22 70 61 72 71 75 65 74 2d 63 70 |[]}.."parquet-cp|
> 00000330 70 20 76 65 72 73 69 6f 6e 20 31 2e 35 2e 31 2d |p version 1.5.1-|
> 00000340 53 4e 41 50 53 48 4f 54 19 2c 1c 00 00 1c 00 00 |SNAPSHOT.,......|
> 00000350 00 41 02 00 00 50 41 52 31 |.A...PAR1|
> 00000359
> {code}
> These are the strings extracted from that file.
> {code:bash}
> $ strings example.pq
> {code}
> {code:none}
> PAR1
> Test
> Test
> Test
> Test
> Test
> schema
> Test
> Test
> pandas
> {"creator": {"version": "0.13.0", "library": "pyarrow"}, "pandas_version": null, "index_columns": [], "columns": [{"metadata": null, "field_name": "a", "name": "a", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "b", "name": "b", "numpy_type": "datetime64[ns]", "pandas_type": "datetime"}], "column_indexes": []}
> "parquet-cpp version 1.5.1-SNAPSHOT
> PAR1
> {code}
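As a side note on what is actually stored: in the hexdump above the value of column b appears as the 8 bytes 10 4f 48 96 63 94 05 00, a little-endian INT64. Decoding it with the standard library (my annotation, not part of the original report) yields the same number Hive later reads back as a BIGINT:

```python
import struct

# The 8 timestamp bytes as they appear in the hexdump of example.pq.
raw = bytes.fromhex('104f489663940500')

# Parquet stores the value as a little-endian signed 64-bit integer.
(value,) = struct.unpack('<q', raw)
print(value)  # 1570530327547664 -- microseconds since the Unix epoch
```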
> If I copy it onto HDFS, I'm unable to read the timestamp back; I've tried a variety of column types.
> {code:bash}
> $ hdfs dfs -mkdir /testing/
> $ hdfs dfs -copyFromLocal example.pq /testing/
> $ hdfs dfs -ls /testing/example.pq
> {code}
> {code:none}
> -rw-r--r-- 1 ubuntu supergroup 857 2019-10-08 10:26 /testing/example.pq
> {code}
> {code:bash}
> $ hive
> {code}
> It can't be read as a DATETIME field:
> {code:sql}
> CREATE EXTERNAL TABLE testing (
>     a STRING,
>     b DATETIME
> ) STORED AS parquet
> LOCATION '/testing/';
> {code}
> {code:none}
> FAILED: SemanticException [Error 10099]: DATETIME type isn't supported yet. Please use DATE or TIMESTAMP instead
> {code}
> It can't be read as a TIMESTAMP:
> {code:sql}
> CREATE EXTERNAL TABLE testing (
>     a STRING,
>     b TIMESTAMP
> ) STORED AS parquet
> LOCATION '/testing/';
> SELECT * FROM testing;
> {code}
> {code:none}
> Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable
> {code}
> I can extract it as a BIGINT, but I'm unable to find an arithmetic conversion that yields the correct time.
> {code:sql}
> DROP TABLE `testing`;
> CREATE EXTERNAL TABLE `testing` (
>     a STRING,
>     b BIGINT
> ) STORED AS parquet
> LOCATION '/testing/';
> {code}
> {code:none}
> Test 1570530327547664
> {code}
> {code:sql}
> SELECT CAST(b AS TIMESTAMP) FROM testing;
> {code}
> {code:none}
> 51738-02-15 08:19:077.664
> {code}
> {code:sql}
> SELECT CAST(b/1000 AS TIMESTAMP) FROM testing;
> {code}
> {code:none}
> 51738-02-15 08:19:077.664
> {code}
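The stored BIGINT is a count of microseconds since the Unix epoch (PyArrow coerces pandas' datetime64[ns] values to microsecond precision when writing Parquet). In Python the conversion is straightforward:

```python
from datetime import datetime, timedelta

# The BIGINT Hive reads back from the PyArrow-written file:
# microseconds since the Unix epoch.
raw_value = 1570530327547664

ts = datetime(1970, 1, 1) + timedelta(microseconds=raw_value)
print(ts)  # 2019-10-08 10:25:27.547664
```

In Hive, I believe CAST(b / 1000000.0 AS TIMESTAMP) would give the same result, since Hive treats a floating-point value as seconds since the epoch, but I have not confirmed that against this table.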
> I'll generate the same content using Hive and show the contents of the Parquet file it produces:
> {code:sql}
> CREATE TABLE `testing2` (
>     a STRING,
>     b TIMESTAMP
> ) STORED AS parquet
> LOCATION '/testing2/';
> INSERT INTO testing2 (a, b) VALUES ('Test', "2019-10-08 10:34:30.061");
> SELECT * FROM testing2;
> {code}
> {code:none}
> Test 2019-10-08 10:34:30.061
> {code}
> The timestamp is usable.
> {code:sql}
> SELECT YEAR(B) FROM testing2;
> {code}
> {code:none}
> 2019
> {code}
> {code:bash}
> $ hdfs dfs -copyToLocal /testing2/000000_0 ./
> $ hexdump -C 000000_0
> {code}
> {code:none}
> 00000000 50 41 52 31 15 00 15 1c 15 1c 2c 15 02 15 00 15 |PAR1......,.....|
> 00000010 06 15 08 1c 18 04 54 65 73 74 18 04 54 65 73 74 |......Test..Test|
> 00000020 16 00 00 00 00 02 00 00 00 03 01 04 00 00 00 54 |...............T|
> 00000030 65 73 74 15 04 15 18 15 18 4c 15 02 15 04 00 00 |est......L......|
> 00000040 40 65 76 e0 9f 22 00 00 8d 84 25 00 15 00 15 10 |@ev.."....%.....|
> 00000050 15 10 2c 15 02 15 04 15 06 15 08 1c 18 0c 40 65 |..,...........@e|
> 00000060 76 e0 9f 22 00 00 8d 84 25 00 18 0c 40 65 76 e0 |v.."....%...@ev.|
> 00000070 9f 22 00 00 8d 84 25 00 16 00 00 00 00 02 00 00 |."....%.........|
> 00000080 00 03 01 00 03 15 02 19 3c 48 0b 68 69 76 65 5f |........<H.hive_|
> 00000090 73 63 68 65 6d 61 15 04 00 15 0c 25 02 18 01 61 |schema.....%...a|
> 000000a0 25 00 00 15 06 25 02 18 01 62 00 16 02 19 1c 19 |%....%...b......|
> 000000b0 2c 26 08 1c 15 0c 19 35 08 06 00 19 18 01 61 15 |,&.....5......a.|
> 000000c0 00 16 02 16 5e 16 5e 26 08 3c 18 04 54 65 73 74 |....^.^&.<..Test|
> 000000d0 18 04 54 65 73 74 16 00 00 00 00 26 66 1c 15 06 |..Test.....&f...|
> 000000e0 19 35 08 04 06 19 18 01 62 15 00 16 02 16 a4 01 |.5......b.......|
> 000000f0 16 a4 01 26 66 3c 18 0c 40 65 76 e0 9f 22 00 00 |...&f<..@ev.."..|
> 00000100 8d 84 25 00 18 0c 40 65 76 e0 9f 22 00 00 8d 84 |..%...@ev.."....|
> 00000110 25 00 16 00 00 00 00 16 82 02 16 02 00 28 49 70 |%............(Ip|
> 00000120 61 72 71 75 65 74 2d 6d 72 20 76 65 72 73 69 6f |arquet-mr versio|
> 00000130 6e 20 31 2e 38 2e 31 20 28 62 75 69 6c 64 20 34 |n 1.8.1 (build 4|
> 00000140 61 62 61 34 64 61 65 37 62 62 30 64 34 65 64 62 |aba4dae7bb0d4edb|
> 00000150 63 66 37 39 32 33 61 65 31 33 33 39 66 32 38 66 |cf7923ae1339f28f|
> 00000160 64 33 66 37 66 63 66 29 00 e4 00 00 00 50 41 52 |d3f7fcf).....PAR|
> 00000170 31 |1|
> 00000171
> {code}
> {code:bash}
> $ strings 000000_0
> {code}
> {code:none}
> PAR1
> Test
> Test
> Test
> hive_schema
> Test
> Test
> (Iparquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> PAR1
> {code}
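The reason Hive can read its own file is visible in the hexdump: the timestamp is stored as a 12-byte INT96 value (the bytes 40 65 76 e0 9f 22 00 00 8d 84 25 00), i.e. nanoseconds within the day plus a Julian day number, rather than the INT64 microseconds PyArrow wrote. A quick decode with the standard library (my annotation, not part of the original report):

```python
import struct
from datetime import datetime, timedelta

# The 12 timestamp bytes from the hexdump of Hive's file.
int96 = bytes.fromhex('406576e09f2200008d842500')

# INT96 timestamps are little-endian: 8 bytes of nanoseconds within
# the day, followed by 4 bytes holding the Julian day number.
nanos_of_day, julian_day = struct.unpack('<QI', int96)

# Julian day 2440588 corresponds to 1970-01-01.
ts = (datetime(1970, 1, 1)
      + timedelta(days=julian_day - 2440588,
                  microseconds=nanos_of_day // 1000))
print(ts)  # 2019-10-08 10:34:30.061000
```

This matches the value inserted into testing2 above, and it is this INT96 representation that flavor='spark' makes PyArrow emit.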
> I'll do the same for Presto as it has its own writer as well.
> {code:bash}
> $ presto --server localhost:8085 --schema default --catalog hive
> {code}
> {code:sql}
> CREATE TABLE testing3 WITH (FORMAT='PARQUET') AS
>     SELECT 'Test' AS a,
>            CAST('2019-10-08 10:34:30.061' AS TIMESTAMP) AS b;
> SELECT * FROM testing3;
> {code}
> {code:none}
> a | b
> ------+-------------------------
> Test | 2019-10-08 10:34:30.061
> {code}
> {code:sql}
> SELECT YEAR(b) FROM testing3;
> {code}
> {code:none}
> _col0
> -------
> 2019
> {code}
> {code:bash}
> $ hdfs dfs -copyToLocal /user/hive/warehouse/testing3/20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9 ./
> $ hexdump -C 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
> {code}
> {code:none}
> 00000000 50 41 52 31 15 00 15 1c 15 40 2c 15 02 15 00 15 |PAR1.....@,.....|
> 00000010 06 15 08 1c 18 04 54 65 73 74 18 04 54 65 73 74 |......Test..Test|
> 00000020 16 00 00 00 00 1f 8b 08 00 00 00 00 00 00 03 63 |...............c|
> 00000030 62 60 60 60 66 64 01 92 21 a9 c5 25 00 fe 12 da |b```fd..!..%....|
> 00000040 11 0e 00 00 00 15 04 15 18 15 40 4c 15 02 15 04 |..........@L....|
> 00000050 00 00 1f 8b 08 00 00 00 00 00 00 03 73 48 2d 7b |............sH-{|
> 00000060 30 5f 89 81 a1 b7 45 95 01 00 64 bc 2a 14 0c 00 |0_....E...d.*...|
> 00000070 00 00 15 00 15 10 15 38 2c 15 02 15 04 15 06 15 |.......8,.......|
> 00000080 08 1c 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 |....@ev.."....%.|
> 00000090 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 16 00 |..@ev.."....%...|
> 000000a0 00 00 00 1f 8b 08 00 00 00 00 00 00 03 63 62 60 |.............cb`|
> 000000b0 60 60 66 64 60 06 00 77 4c 79 ad 08 00 00 00 15 |``fd`..wLy......|
> 000000c0 02 19 3c 48 0b 68 69 76 65 5f 73 63 68 65 6d 61 |..<H.hive_schema|
> 000000d0 15 04 00 15 0c 25 02 18 01 61 25 00 00 15 06 25 |.....%...a%....%|
> 000000e0 02 18 01 62 00 16 02 19 1c 19 2c 26 08 1c 15 0c |...b......,&....|
> 000000f0 19 35 08 00 06 19 18 01 61 15 04 16 02 16 5e 16 |.5......a.....^.|
> 00000100 82 01 26 08 3c 18 04 54 65 73 74 18 04 54 65 73 |..&.<..Test..Tes|
> 00000110 74 16 00 00 00 00 26 8a 01 1c 15 06 19 35 04 08 |t.....&......5..|
> 00000120 06 19 18 01 62 15 04 16 02 16 a4 01 16 f4 01 26 |....b..........&|
> 00000130 8a 01 3c 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 |..<..@ev.."....%|
> 00000140 00 18 0c 40 65 76 e0 9f 22 00 00 8d 84 25 00 16 |...@ev.."....%..|
> 00000150 00 00 00 00 16 82 02 16 02 00 28 0a 70 61 72 71 |..........(.parq|
> 00000160 75 65 74 2d 6d 72 00 a8 00 00 00 50 41 52 31 |uet-mr.....PAR1|
> 0000016f
> {code}
> {code:bash}
> $ strings 20191008_105125_00077_trjf2_61453828-1d8f-45a7-88de-3876b84d0cb9
> {code}
> {code:none}
> PAR1
> Test
> Test
> cb```fd
> sH-{0_
> cb```fd`
> hive_schema
> Test
> Test
> parquet-mr
> PAR1
> {code}
> Any idea how I can save timestamps with Pandas and have them readable by Hive and Presto?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)