Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/21 03:39:23 UTC
[GitHub] [hudi] qingyuan18 opened a new issue #1854: query MOR table using spark sql error
qingyuan18 opened a new issue #1854:
URL: https://github.com/apache/hudi/issues/1854
Versions used:
JDK: 1.8.0_242
Scala: 2.11.12
Spark: 2.4.0
Hudi Spark bundle: 0.5.2-incubating
Steps to reproduce the behavior:
1. Create a managed Hive table.
2. Use the Spark datasource to upsert records into it:
def upsert(albumDf: DataFrame, tableName: String, key: String, combineKey: String, tablePath: String): Unit = {
  albumDf.write
    .format("hudi")
    .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .mode(SaveMode.Append)
    .save(tablePath)
}
3. Use Spark SQL to read the result:
val spark: SparkSession = SparkSession.builder()
  .appName("hudi-test")
  .master("yarn")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.hive.convertMetastoreParquet", "false") // use the Hive SerDe; this is mandatory for MoR tables
  .getOrCreate()

spark.sql("select * from xxxx.xxxx_acidtest2").show()
Submit command:
spark-submit --master yarn --conf spark.sql.hive.convertMetastoreParquet=false HudiTechSpike-jar-with-dependencies.jar
Error:
java.io.IOException: Not a file: hdfs://nameservice1/data/operations/racoe/epi/hive/raw/xxxx_acidtest2/default
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
... (the RDD.partitions / MapPartitionsRDD.getPartitions / Option.getOrElse frames repeat several more times)
It seems Spark does not recognize the Hudi data format/path structure.
* Running on Docker?: No
**Additional context**: spark-shell produces the same error:
spark-shell --master yarn --conf spark.sql.hive.convertMetastoreParquet=false --jars hudi-spark-bundle_2.11-0.5.3.jar
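(A quick cross-check that bypasses the Hive metastore entirely is to point the Hudi datasource at the table path directly; in Hudi 0.5.x this reads the read-optimized view of an MOR table. A sketch only — the glob depth depends on the table's partition layout and is an assumption here:)

```scala
// Sketch: read the Hudi table path directly, bypassing the metastore.
// In Hudi 0.5.x the datasource read returns the read-optimized view of an MOR table.
// The trailing "/*" glob is an assumption — it must match the partition depth.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hudi-read-check")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

val df = spark.read
  .format("org.apache.hudi")
  .load("hdfs://nameservice1/data/operations/hive/raw/xxxx_acidtest2/*")

df.show()
```

If this read succeeds, the data on HDFS is fine and the problem is confined to how the table is registered in the metastore.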
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-661695358
Is the table (xxxx.xxxx_acidtest2) registered as a Hive table? If so, can you provide the complete table description (`desc formatted <table>`) from the Hive metastore?
[GitHub] [hudi] bvaradar closed issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1854:
URL: https://github.com/apache/hudi/issues/1854
[GitHub] [hudi] qingyuan18 commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
qingyuan18 commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-662761051
> Is the table (xxxx.xxxx_acidtest2) registered as a Hive table? If so, can you provide the complete table description (`desc formatted <table>`) from the Hive metastore?

Hi Balaji, thanks for your support!
Yes, it's a Hive managed table; please see the table details below:
hive -e "desc formatted xxxx.xxxx_acidtest2"
# col_name data_type comment
case_no string
case_id string
evnt_nm string
evnt_crt_loc_dt string
evnt_crt_loc_ts string
ordr_ver_no string
evnt_stat_desc string
prcs_tp_nm string
lanid string
team string
cntry_nm string
# Detailed Table Information
Database: xxxx
OwnerType: USER
CreateTime: Mon Jul 20 17:29:11 AEST 2020
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://nameservice1/data/operations/hive/raw/xxxx_acidtest2
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1595230151
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I have added spark.sql.hive.convertMetastoreParquet=false to try to stop Spark SQL from using its native Parquet reader.
Thanks again.
[GitHub] [hudi] bvaradar commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-675438304
Closing this ticket as it is not a Hudi issue.
[GitHub] [hudi] bhasudha commented on issue #1854: query MOR table using spark sql error
Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1854:
URL: https://github.com/apache/hudi/issues/1854#issuecomment-663083174
@qingyuan18 The InputFormat and OutputFormat for this table `xxxx.xxxx_acidtest2` do not appear to be Hudi-related. The storage formats for MOR tables look like:
For the read-optimized MOR table:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hudi.hadoop.HoodieParquetInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
and for the _rt-suffixed (real-time) MOR table:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | NULL |
| InputFormat: | org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat | NULL |
Can you share more context on how this table was loaded initially? Also add your write configs.
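(If the table was created by hand as a plain Parquet table, which the `desc formatted` output suggests, one way to get the Hudi input formats registered is to enable Hive sync on the Hudi write so the writer creates/updates the metastore entry itself. A sketch only — the JDBC URL and the non-partitioned extractor below are assumptions for illustration, not taken from this thread:)

```scala
// Sketch: enable Hive sync so the metastore entry gets the Hudi input formats
// (HoodieParquetInputFormat, and HoodieParquetRealtimeInputFormat for the _rt table).
// The Hive JDBC URL and the NonPartitionedExtractor are assumptions.
albumDf.write
  .format("hudi")
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "xxxx")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_URL_OPT_KEY, "jdbc:hive2://hiveserver:10000") // assumed
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,
          classOf[org.apache.hudi.hive.NonPartitionedExtractor].getName)           // assumed non-partitioned
  .mode(SaveMode.Append)
  .save(tablePath)
```

After a synced write, `desc formatted` on the synced table should report a Hudi InputFormat rather than MapredParquetInputFormat.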