Posted to issues@hive.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2016/11/14 05:43:58 UTC

[jira] [Commented] (HIVE-15189) No base file for ACID table

    [ https://issues.apache.org/jira/browse/HIVE-15189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15662845#comment-15662845 ] 

Alan Gates commented on HIVE-15189:
-----------------------------------

This is by design.  Work needs to be done in Spark so that it understands the delta files (including which ones to read).  Note also that an ACID table can have multiple base files simultaneously (since ACID works by keeping multiple versions of the data and using the metadata to determine which version a given reader should see), and I assume this would confuse Spark readers as well.
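For illustration, a freshly created ACID table that has only received INSERT INTO statements typically contains nothing but delta directories on HDFS; the layout below is a hypothetical sketch (the warehouse path and transaction ids will differ in your environment):

{noformat}
/apps/hive/warehouse/mydb.db/mytable/
    delta_0000001_0000001/bucket_00000
    delta_0000002_0000002/bucket_00000
{noformat}

A base_N directory only appears after a major compaction, which is why a reader that insists on a base_ prefix fails with the error shown below.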

You can work around this by forcing a major compaction before you read the data, using "alter table compact" (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionCompact).
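For example (the table and partition names below are placeholders), a major compaction can be requested and monitored like this:

{noformat}
-- rewrite the accumulated delta files into a new base file
ALTER TABLE mytable PARTITION (ds='2016-11-14') COMPACT 'major';

-- watch the compaction queue until the request shows up as finished
SHOW COMPACTIONS;
{noformat}

Note that compactions run asynchronously, so the data only becomes readable by Spark once the requested compaction has actually completed.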

> No base file for ACID table
> ---------------------------
>
>                 Key: HIVE-15189
>                 URL: https://issues.apache.org/jira/browse/HIVE-15189
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 1.2.1
>         Environment: HDP 2.4, HDP2.5
>            Reporter: Benjamin BONNET
>
> Hi,
> When one creates a new ACID table and inserts data into it using INSERT INTO, Hive does not write a 'base' file: it only creates a delta. That may lead to at least two issues:
> # when you try to read it, you might get a 'serious problem' error like this:
> {noformat}
> java.lang.RuntimeException: serious problem
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
>         at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
>         at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>         at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
>         at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:166)
>         at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
>         at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:635)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:64)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311)
>         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226)
>         at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: delta_0000000_0000000 does not start with base_
>         at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>         at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>         at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
>         ... 44 more
> {noformat}
> I do not always get this error, but when I get it, I need to drop and recreate the table.
> # Spark SQL does not see any data as long as there is no base file. So until compaction has occurred, no data can be read using Spark SQL. See https://issues.apache.org/jira/browse/SPARK-16996
> I know Hive always creates a base file when you use INSERT OVERWRITE, but you cannot always use OVERWRITE instead of INTO: in my use case, I use a statement that writes data into partitions that already exist and partitions that do not exist yet.
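> For reference, that statement is a dynamic-partition INSERT INTO of roughly this shape (table and column names are simplified here):
> {noformat}
> SET hive.exec.dynamic.partition.mode=nonstrict;
>
> -- appends to partitions that already exist and creates the ones that do not;
> -- INSERT OVERWRITE would replace existing rows instead of appending
> INSERT INTO TABLE target_table PARTITION (ds)
> SELECT col1, col2, ds FROM staging_table;
> {noformat}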



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)