Posted to user@hive.apache.org by Jim Krehl <ji...@rd.io> on 2012/10/16 01:29:29 UTC

Lost data during INSERT query

I'm trying to load a table using an INSERT query [1].  Not all of the
data is making it from the original table into the new table.  The
Hadoop task tracker logs show that the query runs without error until
the last second of the job.  The job typically takes about 45 minutes,
but in its final second a number of IOExceptions arise [2].  The
exceptions are caused by temporary Hive files disappearing during a
map task.  The INSERT query actually spawns two Hadoop jobs: one that
takes the aforementioned ~45 minutes, and a second that takes
approximately 10 seconds.  Both jobs have the same mapred.job.name and
hive.query.string in their job configs.  Judging from the task tracker
logs, the second job is simply renaming the very temporary files that
the first job errors on.  According to the Hadoop job tracker the two
jobs don't overlap, that is, the second job starts immediately after
the first completes, but something's amiss.
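
For reference, the query is of roughly this shape (a hypothetical
sketch only; the table and column names are made up, but the
month=2012-01 directory in the error below suggests the target table
is partitioned by month and loaded via dynamic partitioning):

```sql
-- Hypothetical sketch of the failing statement; events_by_month,
-- raw_events, and the column names are assumptions, not the real schema.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE events_by_month PARTITION (month)
SELECT event_id, payload, month
FROM raw_events;
```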

What's the purpose of the second job?  How can I fix this?

Thanks,
Jim Krehl

[1] https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-InsertingdataintoHiveTablesfromqueries

[2] ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close file
/tmp/hive-hive/hive_2012-10-15_13-45-21_245_1936216192130095423/_task_tmp.-ext-10002/month=2012-01/_tmp.000000_1
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
on /tmp/hive-hive/hive_2012-10-15_13-45-21_245_1936216192130095423/_task_tmp.-ext-10002/month=2012-01/_tmp.000000_1
File does not exist. Holder DFSClient_NONMAPREDUCE_-672101740_1 does
not have any open files.