Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/27 03:45:56 UTC
[GitHub] [hudi] creactiviti opened a new issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
creactiviti opened a new issue #1670:
URL: https://github.com/apache/hudi/issues/1670
I'm attempting to execute the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR (5.30.0) and running into an issue when attempting to query the table using Presto.
1. Have DMS generate the raw `.parquet` files in S3.
2. Use `HoodieDeltaStreamer` to process the raw `.parquet` files:
```
spark-submit --jars /usr/lib/spark/external/lib/spark-avro.jar \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
--master yarn \
--deploy-mode client /usr/lib/hudi/hudi-utilities-bundle.jar \
--table-type COPY_ON_WRITE \
--source-ordering-field updated_at \
--source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
--target-base-path s3://my-test-bucket/hudi_orders \
--target-table hudi_orders \
--transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
--payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
--enable-hive-sync \
--hoodie-conf hoodie.datasource.write.recordkey.field=order_id,hoodie.datasource.write.partitionpath.field=customer_name,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders,hoodie.datasource.hive_sync.table=orders,hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor,hoodie.datasource.hive_sync.partition_fields=customer_name
```
* Hudi version : 0.5.2 (incubating)
* Spark version : 2.4.5
* Hive version : 2.3.6
* Presto version: 0.232
* Hadoop version : Amazon 2.8.5
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No
**Querying using Hive**
Running on Hive:
```
hive> select count(*) from orders;
Query ID = root_20200526144157_e4b7cb38-be47-44e0-8317-8aa87c419995
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1590502613834_0007)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 10.32 s
----------------------------------------------------------------------------------------------
OK
7
Time taken: 12.039 seconds, Fetched: 1 row(s)
```
**Querying using Presto**
```
presto:default> select count(*) from orders;
Query 20200526_144243_00006_f8j6h, FAILED, 2 nodes
Splits: 24 total, 0 done (0.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20200526_144243_00006_f8j6h failed: Error opening Hive split s3://my-test-bucket/hudi_orders/nathan/b8fd6f7b-0bf5-458b-8cbb-f11e0ede995e-0_1-23-12020_20200526143655.parquet (offset=0, length=435285): Unknown converted type TIMESTAMP_MICROS
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] creactiviti commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
creactiviti commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-634199560
I was able to verify that Hudi writes the timestamp columns as `TIMESTAMP_MICROS` (which Presto does not seem to support), even though the source file uses `TIMESTAMP_MILLIS`:
```
$ java -jar parquet-tools-1.8.2.jar schema source.parquet
message schema {
optional int32 order_id;
optional int32 order_qty;
optional binary customer_name (UTF8);
optional int64 updated_at (TIMESTAMP_MILLIS);
optional int64 created_at (TIMESTAMP_MILLIS);
}
$ java -jar parquet-tools-1.8.2.jar schema 68aa1859-0a69-4483-ac3a-ee0f5fb79972-0_2-22-12020_20200526173709.parquet
message hoodie.orders.orders_record {
optional binary _hoodie_commit_time (UTF8);
optional binary _hoodie_commit_seqno (UTF8);
optional binary _hoodie_record_key (UTF8);
optional binary _hoodie_partition_path (UTF8);
optional binary _hoodie_file_name (UTF8);
optional int32 order_id;
optional int32 order_qty;
optional binary customer_name (UTF8);
optional int64 updated_at (TIMESTAMP_MICROS);
optional int64 created_at (TIMESTAMP_MICROS);
}
```
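The comparison above can also be done mechanically. The following is a minimal sketch, assuming flat schemas with simple annotations as printed by `parquet-tools schema`; the helper names `parse_schema` and `annotation_changes` are hypothetical and the sample text is abbreviated from the output above.

```python
import re

# Matches lines such as "optional int64 updated_at (TIMESTAMP_MICROS);".
# Assumption: flat schema, annotation is a single word (no DECIMAL(p,s) etc.).
FIELD_RE = re.compile(
    r"^\s*(optional|required|repeated)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$"
)


def parse_schema(text):
    """Return {column: (physical_type, annotation_or_None)}."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            _, physical, name, annotation = match.groups()
            fields[name] = (physical, annotation)
    return fields


def annotation_changes(source_text, written_text):
    """Return {column: (source_annotation, written_annotation)} where they differ."""
    src = parse_schema(source_text)
    out = parse_schema(written_text)
    return {
        name: (src[name][1], out[name][1])
        for name in src
        if name in out and src[name][1] != out[name][1]
    }


# Abbreviated copies of the two schemas shown in this comment.
SOURCE = """message schema {
optional int32 order_id;
optional int64 updated_at (TIMESTAMP_MILLIS);
optional int64 created_at (TIMESTAMP_MILLIS);
}"""

WRITTEN = """message hoodie.orders.orders_record {
optional binary _hoodie_commit_time (UTF8);
optional int32 order_id;
optional int64 updated_at (TIMESTAMP_MICROS);
optional int64 created_at (TIMESTAMP_MICROS);
}"""

for column, (before, after) in annotation_changes(SOURCE, WRITTEN).items():
    print(f"{column}: {before} -> {after}")
```

Only columns present in both files are compared, so the `_hoodie_*` metadata columns are ignored.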
[GitHub] [hudi] bvaradar closed issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1670:
URL: https://github.com/apache/hudi/issues/1670
[GitHub] [hudi] vinothchandar commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-637256739
IIUC, this fix is pending on a Hive bug fix. If you can work around it by using a different data type, please do so.
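One possible reading of "a different data type" is to carry the timestamp as epoch milliseconds in a plain `bigint`, or as an ISO-8601 string. That reading is an assumption, not something the thread specifies. A minimal sketch of the conversions, with hypothetical helper names:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_millis(micros: int) -> int:
    """Drop sub-millisecond precision from an epoch-microseconds value."""
    return micros // 1000


def micros_to_iso(micros: int) -> str:
    """Render an epoch-microseconds value as an ISO-8601 string in UTC."""
    return (EPOCH + timedelta(microseconds=micros)).isoformat()
```

Converting to milliseconds loses the last three digits of precision, which does not matter if the source data was `TIMESTAMP_MILLIS` to begin with.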
[GitHub] [hudi] creactiviti commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
creactiviti commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-635482937
Thanks @bvaradar! This is interesting. Here's what I got:
```
$ java -jar ~/parquet/parquet-tools-1.8.2.jar schema /tmp/parq/out.parquet/
message spark_schema {
optional int32 order_id;
optional int32 order_qty;
optional binary customer_name (UTF8);
optional int96 updated_at;
optional int96 created_at;
}
```
[GitHub] [hudi] anismiles commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
anismiles commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-637723431
Is there a link to the hive bug mentioned above?
[GitHub] [hudi] creactiviti edited a comment on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
creactiviti edited a comment on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-635504343
As for querying the plain parquet table using Presto, I first created the table in Hive like so:
```
create external table orders_parquet (
order_id int,
order_qty int,
updated_at bigint,
created_at bigint,
op string,
customer_name string
)
stored as parquet location 's3://my-test-bucket/output.parquet/';
```
And when I tried to query with Presto I got:
```
presto:default> select * from orders_parquet;
Query 20200528_175631_00007_uptfd, FAILED, 2 nodes
Splits: 17 total, 0 done (0.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
Query 20200528_175631_00007_uptfd failed: The column updated_at is declared as type bigint, but the Parquet file declares the column as type INT96
```
So I guess I just need to avoid the TIMESTAMP type then?
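For context on why `bigint` does not line up with `INT96`: an INT96 timestamp is a 12-byte value, conventionally 8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian. A minimal decoding sketch, assuming that layout and treating the result as UTC (the function name is hypothetical):

```python
import struct
from datetime import datetime, timedelta, timezone

# Julian day number of 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 timestamp into a datetime.

    Assumed layout: int64 nanoseconds-of-day, then int32 Julian day,
    little-endian. Nanoseconds are truncated to microseconds because
    datetime cannot hold finer precision.
    """
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1000)
```

A 12-byte value like this cannot be read as a 64-bit integer, which is consistent with the type mismatch Presto reports for a column declared as `bigint`.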
[GitHub] [hudi] bvaradar commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-635007877
@creactiviti: I think this is coming from Spark. For `ParquetDFSSource`, Hudi uses `spark.read().parquet()` to get the schema.
Can you rewrite the same data as a plain parquet dataset through Spark (e.g. `spark.read.parquet(...).write.format("parquet").save(...)`) and then query it using Presto?
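The suggested round trip could look roughly like this in PySpark. This is a sketch, not a verified reproduction: the function names are hypothetical, and the paths are the ones quoted elsewhere in this thread. `pyspark` is imported lazily so the helper can be read and exercised without a cluster.

```python
def rewrite_as_plain_parquet(spark, source_path, target_path):
    """Read a parquet dataset and write it back as plain parquet.

    `spark` is expected to be a SparkSession. No Hudi classes are
    involved, so the output schema is whatever Spark's own Parquet
    writer produces.
    """
    df = spark.read.parquet(source_path)
    df.write.format("parquet").save(target_path)
    return df


def main():
    # Requires pyspark; run through spark-submit or a pyspark shell.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("plain-parquet-rewrite").getOrCreate()
    rewrite_as_plain_parquet(
        spark,
        "s3://my-test-bucket/hudi_dms/orders",   # DMS output path from the issue
        "s3://my-test-bucket/output.parquet/",   # location used for orders_parquet
    )
    spark.stop()
```

Call `main()` from a Spark environment to run it. Spark also exposes `spark.sql.parquet.outputTimestampType` (`INT96`, `TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS`) for its own Parquet writer; whether that setting affects the files Hudi writes is not established in this thread.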
[GitHub] [hudi] vinothchandar commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS
Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-638521271
https://issues.apache.org/jira/browse/HUDI-83 should have all the context.