Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/05 01:53:06 UTC

[GitHub] [spark] attilapiros opened a new pull request, #38516: Initial version

attilapiros opened a new pull request, #38516:
URL: https://github.com/apache/spark/pull/38516

   ### What changes were proposed in this pull request?
   
   This is an update of https://github.com/apache/spark/pull/29178, which was closed because the root cause of the error was only vaguely identified there; here I give an explanation of why `HiveHBaseTableInputFormat` does not work well with `NewHadoopRDD` (see the next section).
   
   The PR modifies `TableReader.scala` to create the old-API `HadoopRDD` when the input format is 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'.
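   For illustration only, here is a minimal sketch of that decision; the object and method names below are assumptions made for this write-up, not the actual code in `TableReader.scala`:
   
   ```scala
   // Sketch only: the helper name and its placement are illustrative assumptions,
   // not Spark's actual TableReader code.
   object HiveHBaseCompat {
     private val HiveHBaseInputFormat =
       "org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat"
   
     // HiveHBaseTableInputFormat only initialises the underlying HBase table in the
     // old mapred getSplits(JobConf, int), so for this format the old-API HadoopRDD
     // has to be used instead of the mapreduce-based NewHadoopRDD.
     def mustUseOldHadoopRDD(inputFormatClassName: String): Boolean =
       inputFormatClassName == HiveHBaseInputFormat
   }
   ```
   
   Every other input format keeps going through `NewHadoopRDD` as before.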
   
   Environment (Cloudera distribution 7.1.7.SP1):
   - Hadoop 3.1.1
   - Hive 3.1.300
   - Spark 3.2.1
   - HBase 2.2.3
   
   ### Why are the changes needed?
   
   With the `NewHadoopRDD` the following exception is raised:
   
   ```
   java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
     at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253)
     at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
     at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
     at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
     at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
     at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
     at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
     at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:446)
     at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429)
     at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
     at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
     at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
     at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
     at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
     at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
     at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
     ... 47 elided
   Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
     at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:557)
     at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:248)
     ... 86 more
   ```
   
   
   ### Short summary of the reason
   
   There are two interfaces:
   
   - the new `org.apache.hadoop.mapreduce.InputFormat`: provides the one-arg method `getSplits(JobContext context)` (returning `List<InputSplit>`)
   - the old `org.apache.hadoop.mapred.InputFormat`: provides the two-arg method `getSplits(JobConf job, int numSplits)` (returning `InputSplit[]`)
   
   In Hive both are implemented by `HiveHBaseTableInputFormat`, but only the old method performs the required initialisation, and this is why `NewHadoopRDD` fails here.
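   To make the difference between the two entry points concrete, here is a hedged sketch; the classes are the real Hadoop/Hive ones, but the bare `JobConf` is only a placeholder, so the calls would only succeed against an actual Hive/HBase setup:
   
   ```scala
   import org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat
   import org.apache.hadoop.mapred.JobConf
   import org.apache.hadoop.mapreduce.Job
   
   // HiveHBaseTableInputFormat implements both Hadoop input format APIs.
   val format = new HiveHBaseTableInputFormat()
   val jobConf = new JobConf() // placeholder: a real job carries the Hive/HBase table properties
   
   // Old mapred API (the path HadoopRDD takes): goes through Hive's getSplitsInternal,
   // which initialises the HBase table before computing the splits.
   val oldSplits = format.getSplits(jobConf, 1)
   
   // New mapreduce API (the path NewHadoopRDD takes): inherited from HBase's
   // TableInputFormatBase, which expects initialize(JobContext) to have set up the
   // table; Hive never overrides that hook, so this call ends in the exception above.
   val newSplits = format.getSplits(Job.getInstance(jobConf))
   ```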
   
   ### Details
   
   Here all the links refer to specific commits on the components' master branches (the latest at the time of writing), so the line numbers stay correct in the future too:
   
   Spark's `NewHadoopRDD` uses the new interface, which provides the one-arg method:
   https://github.com/apache/spark/blob/5556cfc59aa97a3ad4ea0baacebe19859ec0bcb7/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L136
   
   Hive, on the other hand, binds the initialisation to the two-arg method coming from the old interface.
   See
   https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L268
   which calls `getSplitsInternal`, where the initialisation is done:
   https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L299
   
   Interestingly, Hive also uses the one-arg method internally within `getSplitsInternal`:
   https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L356
   but that does not help us, as the initialisation is done earlier in the `getSplitsInternal` method.
   
   When only the new interface method is called (which is what `NewHadoopRDD` does), the call goes to `org.apache.hadoop.hbase.mapreduce.TableInputFormatBase`:
   https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L230
   
   There a `JobContext`-based initialisation would take place:
   https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L234-L237
   
   But that hook is not implemented by Hive (and it would not be easy to do so, as it is based on a `JobContext`):
   https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L628-L640
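   
   As a self-contained model of this contract (deliberately not HBase's or Hive's real code), the failure mode can be reproduced like this:
   
   ```scala
   import java.io.IOException
   
   // Models TableInputFormatBase: the new-API getSplits relies on an initialize() hook
   // that subclasses are supposed to override; the default is a no-op.
   abstract class TableFormatModel {
     protected var table: Option[String] = None   // stands in for the connected HBase Table
   
     protected def initialize(): Unit = ()        // no-op hook, meant to be overridden
   
     protected def getTable: String = table.getOrElse(
       throw new IllegalStateException(
         "The input format instance has not been properly initialized."))
   
     def getSplits(): Seq[String] = {
       if (table.isEmpty) initialize()            // the only chance to set up the table
       try {
         getTable                                 // throws if initialize() did nothing
       } catch {
         case e: IllegalStateException =>
           // wrapped just like the IOException at the top of the stack trace above
           throw new IOException("Cannot create a record reader because of a previous error.", e)
       }
       Seq("split-0")                             // the real code computes region splits here
     }
   }
   
   // HiveHBaseTableInputFormat behaves like this subclass: it never overrides initialize(),
   // because its initialisation only happens in the old-API getSplits(JobConf, int).
   object HiveLikeFormat extends TableFormatModel
   
   HiveLikeFormat.getSplits()                     // => IOException caused by IllegalStateException
   ```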
    
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   1) Create the HBase table:
   
   ```
    hbase(main):001:0> create 'hbase_test', 'cf1'
    hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:v1', '123'
   ```
   
   
   2) Create the Hive external table mapped to the HBase table (in the `hive>` shell):
   
   ```
   CREATE EXTERNAL TABLE `hivetest.hbase_test`(
     `key` string COMMENT '', 
     `value` string COMMENT '')
   ROW FORMAT SERDE 
     'org.apache.hadoop.hive.hbase.HBaseSerDe' 
   STORED BY 
     'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
   WITH SERDEPROPERTIES ( 
     'hbase.columns.mapping'=':key,cf1:v1', 
     'serialization.format'='1')
   TBLPROPERTIES (
     'hbase.table.name'='hbase_test');
   ```
    
   
   3) Query the Hive table from spark-shell while the data is in HBase:
   
   ```
   scala> spark.sql("select * from hivetest.hbase_test").show()
   22/11/05 01:14:16 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
   22/11/05 01:14:16 WARN client.HiveClientImpl: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
   Hive Session ID = f05b6866-86df-4d88-9eea-f1c45043bb5f
   +---+-----+
   |key|value|
   +---+-----+
   | r1|  123|
   +---+-----+
   ```




[GitHub] [spark] HyukjinKwon commented on pull request #38516: [SPARK-32380][SQL] Fixing access of HBase table via Hive from Spark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38516:
URL: https://github.com/apache/spark/pull/38516#issuecomment-1304537957

   Also merged to branch-3.3 and branch-3.2.




[GitHub] [spark] attilapiros commented on pull request #38516: [SPARK-32380][SQL] Fixing access of HBase table via Hive

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #38516:
URL: https://github.com/apache/spark/pull/38516#issuecomment-1304373701

   cc @dongjoon-hyun, @HyukjinKwon 




[GitHub] [spark] HyukjinKwon commented on pull request #38516: [SPARK-32380][SQL] Fixing access of HBase table via Hive from Spark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38516:
URL: https://github.com/apache/spark/pull/38516#issuecomment-1304537551

   Merged to master.




[GitHub] [spark] HyukjinKwon closed pull request #38516: [SPARK-32380][SQL] Fixing access of HBase table via Hive from Spark

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38516: [SPARK-32380][SQL] Fixing access of HBase table via Hive from Spark
URL: https://github.com/apache/spark/pull/38516

