Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/21 03:39:23 UTC
[GitHub] [hudi] qingyuan18 opened a new issue #1854: query MOR table using spark sql error
qingyuan18 opened a new issue #1854:
URL: https://github.com/apache/hudi/issues/1854
Versions used:
JDK: 1.8.0_242
Scala: 2.11.12
Spark: 2.4.0
Hudi Spark bundle: 0.5.2-incubating
Steps to reproduce the behavior:
1. Create a managed Hive table.
2. Use the Spark datasource to upsert records into it:
def upsert(albumDf: DataFrame, tableName: String, key: String, combineKey: String, tablePath: String): Unit = {
  albumDf.write
    .format("hudi")
    .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .mode(SaveMode.Append)
    .save(tablePath)
}
3. Use Spark SQL to read the result:
val spark: SparkSession = SparkSession.builder()
  .appName("hudi-test")
  .master("yarn")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.hive.convertMetastoreParquet", "false") // use the Hive SerDe; this is mandatory for MoR tables
  .getOrCreate()

spark.sql("select * from xxxx.xxxx_acidtest2").show()
Submit command:
spark-submit --master yarn --conf spark.sql.hive.convertMetastoreParquet=false HudiTechSpike-jar-with-dependencies.jar
Error:
java.io.IOException: Not a file: hdfs://nameservice1/data/operations/racoe/epi/hive/raw/xxxx_acidtest2/default
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
... (the RDD.partitions / MapPartitionsRDD.getPartitions / Option.getOrElse frames repeat several more times)
It seems Spark does not recognize the Hudi data format/path structure.
* Running on Docker?: No
**Additional context**: spark-shell produces the same error:
spark-shell --master yarn --conf spark.sql.hive.convertMetastoreParquet=false --jars hudi-spark-bundle_2.11-0.5.3.jar
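(A quick cross-check that bypasses the Hive metastore entirely is to point the Hudi datasource at the table path directly; in Hudi 0.5.x this reads the read-optimized view of an MOR table. A sketch only — the glob depth depends on the table's partition layout and is an assumption here:)

```scala
// Sketch: read the Hudi table path directly, bypassing the metastore.
// In Hudi 0.5.x the datasource read returns the read-optimized view of an MOR table.
// The trailing "/*" glob is an assumption — it must match the partition depth.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-read-check")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read
  .format("org.apache.hudi")
  .load("hdfs://nameservice1/data/operations/hive/raw/xxxx_acidtest2/*")

df.show()
```

If this read succeeds, the data on HDFS is fine and the problem is confined to how the table is registered in the metastore.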
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-661695358
Is the table (xxxx.xxxx_acidtest2) registered as a Hive table? If so, can you provide the complete table description (`desc formatted <table>`) from the Hive metastore?
[GitHub] [hudi] bvaradar closed issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1854:
URL: https://github.com/apache/hudi/issues/1854
[GitHub] [hudi] qingyuan18 commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
qingyuan18 commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-662761051
> Is the table (xxxx.xxxx_acidtest2) registered as a Hive table? If so, can you provide the complete table description (`desc formatted <table>`) from the Hive metastore?

Hi Balaji, thanks for your support!
Yes, it's a Hive managed table; please see the table details below:
hive -e "desc formatted xxxx.xxxx_acidtest2"
# col_name data_type comment
case_no string
case_id string
evnt_nm string
evnt_crt_loc_dt string
evnt_crt_loc_ts string
ordr_ver_no string
evnt_stat_desc string
prcs_tp_nm string
lanid string
team string
cntry_nm string
# Detailed Table Information
Database: xxxx
OwnerType: USER
CreateTime: Mon Jul 20 17:29:11 AEST 2020
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://nameservice1/data/operations/hive/raw/xxxx_acidtest2
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1595230151
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I have added spark.sql.hive.convertMetastoreParquet=false to try to stop Spark SQL from using its native Parquet reader.
Thanks again.
[GitHub] [hudi] bvaradar commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-675438304
Closing this ticket as it is not a Hudi issue.
[GitHub] [hudi] bhasudha commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-663083174
@qingyuan18 The InputFormat and OutputFormat for this table `xxxx.xxxx_acidtest2` do not appear to be Hudi-related. The storage formats for MOR tables look like:
For the read-optimized MOR table:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hudi.hadoop.HoodieParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
and for the _rt-suffixed (real-time) MOR table:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
Can you share more context on how this table was loaded initially? Also add your write configs.
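(If the table was created by hand as a plain Parquet table, which the `desc formatted` output suggests, one way to get the Hudi input formats registered is to enable Hive sync on the Hudi write so the writer creates/updates the metastore entry itself. A sketch only — the JDBC URL and the non-partitioned extractor below are assumptions for illustration, not taken from this thread:)

```scala
// Sketch: enable Hive sync so the metastore entry gets the Hudi input formats
// (HoodieParquetInputFormat, and HoodieParquetRealtimeInputFormat for the _rt table).
// The Hive JDBC URL and the NonPartitionedExtractor are assumptions.
albumDf.write
  .format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "xxxx")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:10000") // assumed
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,
          classOf[org.apache.hudi.hive.NonPartitionedExtractor].getName)           // assumed non-partitioned
  .mode(SaveMode.Append)
  .save(tablePath)
```

After a synced write, `desc formatted` on the synced table should report a Hudi InputFormat rather than MapredParquetInputFormat.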