Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/05/27 03:45:56 UTC

[GitHub] [hudi] creactiviti opened a new issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

creactiviti opened a new issue #1670:
URL: https://github.com/apache/hudi/issues/1670


   I'm attempting to execute the CDC example scenario (http://hudi.apache.org/blog/change-capture-using-aws/) on Amazon EMR (5.30.0) and running into an issue when attempting to query the table using Presto.
   
   1. Have DMS generate the raw `.parquet` files in S3.
   2. Use `HoodieDeltaStreamer` to process the raw `.parquet` files:
   
   ```
   spark-submit --jars /usr/lib/spark/external/lib/spark-avro.jar  \
                          --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
                          --master yarn \
                          --deploy-mode client /usr/lib/hudi/hudi-utilities-bundle.jar \
                          --table-type COPY_ON_WRITE   \
                          --source-ordering-field updated_at   \
                          --source-class org.apache.hudi.utilities.sources.ParquetDFSSource   \
                          --target-base-path s3://my-test-bucket/hudi_orders \
                          --target-table hudi_orders   \
                          --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer   \
                          --payload-class org.apache.hudi.payload.AWSDmsAvroPayload  \
                          --enable-hive-sync \
                          --hoodie-conf hoodie.datasource.write.recordkey.field=order_id,hoodie.datasource.write.partitionpath.field=customer_name,hoodie.deltastreamer.source.dfs.root=s3://my-test-bucket/hudi_dms/orders,hoodie.datasource.hive_sync.table=orders,hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor,hoodie.datasource.hive_sync.partition_fields=customer_name
   ```
   
   * Hudi version : 0.5.2 (incubating)
   
   * Spark version : 2.4.5
   
   * Hive version : 2.3.6
   
   * Presto version: 0.232
   
   * Hadoop version : Amazon 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   **Querying using Hive**
   
   Running on Hive:
   
   ```
   hive> select count(*) from orders;
   Query ID = root_20200526144157_e4b7cb38-be47-44e0-8317-8aa87c419995
   Total jobs = 1
   Launching Job 1 out of 1
   Status: Running (Executing on YARN cluster with App id application_1590502613834_0007)
   
   ----------------------------------------------------------------------------------------------
           VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
   ----------------------------------------------------------------------------------------------
   Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0  
   Reducer 2 ...... container     SUCCEEDED      1          1        0        0       0       0  
   ----------------------------------------------------------------------------------------------
   VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 10.32 s    
   ----------------------------------------------------------------------------------------------
   OK
   7
   Time taken: 12.039 seconds, Fetched: 1 row(s)
   ```
   
   **Querying using Presto**
   
   ```
   presto:default> select count(*) from orders;
   
   Query 20200526_144243_00006_f8j6h, FAILED, 2 nodes
   Splits: 24 total, 0 done (0.00%)
   0:01 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20200526_144243_00006_f8j6h failed: Error opening Hive split s3://my-test-bucket/hudi_orders/nathan/b8fd6f7b-0bf5-458b-8cbb-f11e0ede995e-0_1-23-12020_20200526143655.parquet (offset=0, length=435285): Unknown converted type TIMESTAMP_MICROS
   ```
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] creactiviti commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

creactiviti commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-634199560


   I was able to verify that Hudi writes these columns as `TIMESTAMP_MICROS` (which Presto does not seem to support), even though the source file uses `TIMESTAMP_MILLIS`:
   
   ```
   $ java -jar parquet-tools-1.8.2.jar schema source.parquet 
   
   message schema {
     optional int32 order_id;
     optional int32 order_qty;
     optional binary customer_name (UTF8);
     optional int64 updated_at (TIMESTAMP_MILLIS);
     optional int64 created_at (TIMESTAMP_MILLIS);
   }
   
   $ java -jar parquet-tools-1.8.2.jar schema 68aa1859-0a69-4483-ac3a-ee0f5fb79972-0_2-22-12020_20200526173709.parquet
   
   message hoodie.orders.orders_record {
     optional binary _hoodie_commit_time (UTF8);
     optional binary _hoodie_commit_seqno (UTF8);
     optional binary _hoodie_record_key (UTF8);
     optional binary _hoodie_partition_path (UTF8);
     optional binary _hoodie_file_name (UTF8);
     optional int32 order_id;
     optional int32 order_qty;
     optional binary customer_name (UTF8);
     optional int64 updated_at (TIMESTAMP_MICROS);
     optional int64 created_at (TIMESTAMP_MICROS);
   }
   ```
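
The schema comparison above can be automated when there are many files to check. The following is a minimal sketch, not part of Hudi or parquet-tools: it assumes the `parquet-tools schema` output has been captured as text, and the names `FIELD_RE`, `columns_with_annotation` and `HUDI_SCHEMA` are hypothetical helpers introduced here. It only understands simple one-word annotations such as `(UTF8)` or `(TIMESTAMP_MICROS)`.

```python
import re

# Matches primitive field lines printed by `parquet-tools schema`, e.g.:
#   optional int64 updated_at (TIMESTAMP_MICROS);
# Group 1 is the physical type, group 2 the column name, and group 3 the
# optional one-word annotation. Lines with other shapes are ignored.
FIELD_RE = re.compile(
    r"^\s*(?:optional|required|repeated)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$"
)


def columns_with_annotation(schema_text, annotation):
    """Return the names of fields that carry the given annotation."""
    names = []
    for line in schema_text.splitlines():
        match = FIELD_RE.match(line)
        if match and match.group(3) == annotation:
            names.append(match.group(2))
    return names


# Abbreviated copy of the Hudi-written schema shown in this thread.
HUDI_SCHEMA = """
message hoodie.orders.orders_record {
  optional binary _hoodie_commit_time (UTF8);
  optional int32 order_id;
  optional int64 updated_at (TIMESTAMP_MICROS);
  optional int64 created_at (TIMESTAMP_MICROS);
}
"""

print(columns_with_annotation(HUDI_SCHEMA, "TIMESTAMP_MICROS"))
# prints ['updated_at', 'created_at']
```

Running the same helper with `"TIMESTAMP_MILLIS"` against the source file's schema would list the columns as DMS wrote them, which makes the change in annotation easy to spot.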



[GitHub] [hudi] bvaradar closed issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

bvaradar closed issue #1670:
URL: https://github.com/apache/hudi/issues/1670


   



[GitHub] [hudi] vinothchandar commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-637256739


   IIUC, this fix is pending on a Hive bug fix. If you can work around it by using a different data type, please do so.
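
One possible reading of "a different data type" is to turn the timestamp columns into strings before the data reaches Hudi. The sketch below assumes a separate PySpark job preprocesses the DMS output; the function name `cast_timestamps_to_string` is hypothetical and is not part of Hudi. It relies only on the standard `DataFrame.withColumn` and `Column.cast` calls.

```python
def cast_timestamps_to_string(df, columns):
    """Return a DataFrame in which each named column is cast to string.

    `df` is expected to be a pyspark.sql.DataFrame. Columns that are not
    listed in `columns` are left untouched.
    """
    for name in columns:
        # Column.cast("string") renders a timestamp as text, so the Parquet
        # file no longer carries a timestamp annotation for this column.
        df = df.withColumn(name, df[name].cast("string"))
    return df


# Example call, assuming `orders_df` was loaded with spark.read.parquet(...):
# orders_df = cast_timestamps_to_string(orders_df, ["updated_at", "created_at"])
```

Note that this changes the type of `updated_at`, which the command in this thread uses as `--source-ordering-field`, so ordering would then compare strings rather than timestamps. Whether that is acceptable depends on the data.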



[GitHub] [hudi] creactiviti commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

creactiviti commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-635482937


   Thanks @bvaradar! This is interesting. Here's what I got:
   
   ```
   $ java -jar ~/parquet/parquet-tools-1.8.2.jar schema /tmp/parq/out.parquet/
   
   message spark_schema {
     optional int32 order_id;
     optional int32 order_qty;
     optional binary customer_name (UTF8);
     optional int96 updated_at;
     optional int96 created_at;
   }
   ```



[GitHub] [hudi] anismiles commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

anismiles commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-637723431


   Is there a link to the hive bug mentioned above? 



[GitHub] [hudi] creactiviti edited a comment on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

creactiviti edited a comment on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-635504343


   As for querying the plain Parquet table with Presto, I first created the table in Hive like so:
   
   ```
   create external table orders_parquet (
     order_id                int,
     order_qty               int,
     updated_at              bigint,
     created_at              bigint,
     op                      string,
     customer_name           string
   )
   stored as parquet location 's3://my-test-bucket/output.parquet/';
   ```
   
   And when I tried to query with Presto I got:
   
   ```
   presto:default> select * from orders_parquet;
   
   Query 20200528_175631_00007_uptfd, FAILED, 2 nodes
   Splits: 17 total, 0 done (0.00%)
   0:00 [0 rows, 0B] [0 rows/s, 0B/s]
   
   Query 20200528_175631_00007_uptfd failed: The column updated_at is declared as type bigint, but the Parquet file declares the column as type INT96
   ```
   
   So I guess I just need to avoid the TIMESTAMP type then?



[GitHub] [hudi] bvaradar commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

bvaradar commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-635007877


   @creactiviti: I think this is coming from Spark. For `ParquetDFSSource`, Hudi uses `spark.read().parquet()` to get the schema.
   
   Can you rewrite the same data as a plain Parquet dataset through Spark (e.g. `spark.read.parquet(...).write.format("parquet").save(...)`) and then query it using Presto?
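
The experiment suggested here could look roughly like the following PySpark sketch. The function name `rewrite_as_plain_parquet` and its `timestamp_type` parameter are hypothetical. `spark.sql.parquet.outputTimestampType` is a Spark SQL option that controls how Spark's own Parquet writer encodes timestamps (`INT96` by default, or `TIMESTAMP_MICROS` / `TIMESTAMP_MILLIS`). It is an assumption that this setting is relevant to the plain-Parquet test only; nothing in this thread confirms that it changes what Hudi writes.

```python
def rewrite_as_plain_parquet(spark, source_path, target_path, timestamp_type=None):
    """Read a Parquet dataset with Spark and write it back as plain Parquet.

    `spark` is expected to be a pyspark.sql.SparkSession. When
    `timestamp_type` is given, it is applied to Spark's Parquet writer
    before the write; otherwise Spark's current setting is used.
    """
    if timestamp_type is not None:
        spark.conf.set("spark.sql.parquet.outputTimestampType", timestamp_type)
    df = spark.read.parquet(source_path)
    # Mirrors the suggestion in the thread: no Hudi involved, plain Parquet only.
    df.write.format("parquet").save(target_path)
    return df


# Example call from a PySpark shell or job, with the paths used in this thread
# as placeholders:
# rewrite_as_plain_parquet(
#     spark,
#     "s3://my-test-bucket/hudi_dms/orders",
#     "s3://my-test-bucket/output.parquet/",
# )
```

Inspecting the result with `parquet-tools schema` then shows which timestamp encoding Spark chose, independently of Hudi.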
   
   



[GitHub] [hudi] vinothchandar commented on issue #1670: Error opening Hive split: Unknown converted type TIMESTAMP_MICROS

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #1670:
URL: https://github.com/apache/hudi/issues/1670#issuecomment-638521271


   https://issues.apache.org/jira/browse/HUDI-83 should have all the context.

