Posted to dev@hive.apache.org by "Aaron.Dossett" <Aa...@target.com> on 2015/09/25 23:27:16 UTC

Hive queries fail on an external avro table with empty files

Situation: I have an external Avro table in Hive.  Under certain circumstances, zero-length files can end up in the top-level directory housing the external data.  This causes all Hive queries on the table to fail.  This is with Hive 0.14, but looking at the current code base I think the same problem would still occur.  (A stack trace is below.)

The issue is that org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader creates a new org.apache.avro.file.DataFileReader, and DataFileReader throws an exception when trying to read an empty file (because the empty file lacks the magic number marking it as Avro).  It seems like it would be straightforward to modify AvroGenericRecordReader to detect an empty file and then behave sensibly.  For example, next() would always return false, getPos() would return zero, etc.
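
To make the idea concrete, here is a rough sketch of what the guard could look like inside AvroGenericRecordReader.  The isEmptyInput field is just a placeholder name, and I've left out the schema/projection setup the real constructor does, so please treat this as illustrative rather than an actual patch:

public AvroGenericRecordReader(JobConf job, FileSplit split, Reporter reporter)
    throws IOException {
  this.stop = split.getStart() + split.getLength();
  if (split.getLength() == 0) {
    // Zero-length file: never construct a DataFileReader, so its
    // magic-number check can't throw "Not a data file."
    this.isEmptyInput = true;
    this.reader = null;
  } else {
    this.isEmptyInput = false;
    this.reader = new DataFileReader<GenericRecord>(
        new FsInput(split.getPath(), job),
        new GenericDatumReader<GenericRecord>());
    this.reader.sync(split.getStart());
  }
}

@Override
public boolean next(NullWritable key, AvroGenericRecordWritable value) throws IOException {
  // An empty input never yields records.
  if (isEmptyInput || !reader.hasNext() || reader.pastSync(stop)) {
    return false;
  }
  value.setRecord(reader.next());
  return true;
}

@Override
public long getPos() throws IOException {
  return isEmptyInput ? 0L : reader.tell();
}

@Override
public void close() throws IOException {
  if (!isEmptyInput) {
    reader.close();
  }
}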

If that approach sounds sensible, I will open a JIRA and take a stab at a patch.  Thank you in advance for any feedback!

-Aaron

Caused by: java.io.IOException: Not a data file.
at org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:102)
at org.apache.avro.file.DataFileReader.<init>(DataFileReader.java:97)
at org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.<init>(AvroGenericRecordReader.java:81)
at org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat.getRecordReader(AvroContainerInputFormat.java:51)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:246)
... 25 more