Posted to user@hive.apache.org by Igor Kuzmenko <f1...@gmail.com> on 2016/08/03 13:09:55 UTC

Malformed orc file

Hello, I've got a malformed ORC file in my Hive table. The file was created by
the Hive Streaming API and I have no idea under what circumstances it
became corrupted.

File on google drive: link
<https://drive.google.com/file/d/0ByB92PAoAkrKeFFZRUN4WWVQY1U/view?usp=sharing>

Exception message when trying to perform a select from the table:

ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1468498236400_1106_6_00,
diagnostics=[Task failed, taskId=task_1468498236400_1106_6_00_000000,
diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running
task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException:
org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
Invalid postscript length 0
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.io.IOException:
org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
Invalid postscript length 0
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:196)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:142)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:113)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:61)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:326)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:150)
    ... 14 more
Caused by: java.io.IOException:
org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
Invalid postscript length 0
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:251)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:193)
    ... 19 more
Caused by: org.apache.hadoop.hive.ql.io.FileFormatException: Malformed ORC file
hdfs://sorm-master01.msk.mts.ru:8020/apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255/bucket_00000.
Invalid postscript length 0
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.ensureOrcFooter(ReaderImpl.java:236)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:376)
    at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:317)
    at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:238)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1259)
    at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1151)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:249)
    ... 20 more

Has anyone encountered such a situation?

Re: Malformed orc file

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
If you are using one of the latest Hive releases, orcfiledump has an option for recovering such files. It backtracks through the file looking for intermediate footers.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
    [--backup-path <new-path>] <location-of-orc-file-or-directory>
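
For example, pointing it at the delta directory from the error above might look like this (just a sketch assembled from the usage line, not a command I have run; per the wiki, --recover backtracks to the intermediate footers and --backup-path controls where files are backed up during recovery, so double-check the options before running it):

hive --orcfiledump --recover --skip-dump --backup-path /tmp/orc-backup \
    /apps/hive/warehouse/pstn_connections/dt=20160711/directory_number_last_digit=5/delta_71700156_71700255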

Thanks
Prasanth

Re: Malformed orc file

Posted by Owen O'Malley <om...@apache.org>.
The file has trailing data. If you want to recover the data, you can use:

% strings -3 -t d ~/Downloads/bucket_00000 | grep ORC

which will print the offsets where ORC occurs within the file:

0 ORC
4559 ORC

That means that there is one intermediate footer within the file. If you
slice the file at the right point (the offset of that second ORC magic plus 4
bytes, i.e. 4563 here), you can get the data back:

% dd bs=1 count=4563 < ~/Downloads/bucket_00000 > recover.orc

and

% orc-metadata recover.orc

{ "name": "recover.orc",
  "type":
"struct<operation:int,originalTransaction:bigint,bucket:int,rowId:bigint,currentTransaction:bigint,row:struct<data_type:string,source_file_name:string,telco_id:int,begin_connection_time:bigint,duration:int,call_type_id:int,supplement_service_id:int,in_abonent_type:int,out_abonent_type:int,switch_id:string,inbound_bunch:bigint,outbound_bunch:bigint,term_cause:int,phone_card_number:string,in_info_directory_number:string,in_info_internal_number:string,dialed_digits:string,out_info_directory_number:string,out_info_internal_number:string,forwarding_identifier:string,border_switch_id:string>>",
  "rows": 115,
  "stripe count": 1,
  "format": "0.12", "writer version": "HIVE-8732",
  "compression": "zlib", "compression block": 16384,
  "file length": 4563,
  "content": 3454, "stripe stats": 339, "footer": 744, "postscript": 25,
  "row index stride": 10000,
  "user metadata": {
    "hive.acid.key.index": "71698156,0,114;",
    "hive.acid.stats": "115,0,0"
  },
  "stripes": [
    { "stripe": 0, "rows": 115,
      "offset": 3, "length": 3451,
      "index": 825, "data": 2353, "footer": 273
    }
  ]
}
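
If this comes up again, the same slicing can be scripted. A minimal sketch, assuming GNU strings/awk/dd are available and reusing the bucket_00000 / recover.orc names from the commands above:

# Offset of the last standalone "ORC" magic that strings reports in the file.
offset=$(strings -3 -t d bucket_00000 | awk '$2 == "ORC" { last = $1 } END { print last }')
# Keep everything up to the 3-byte magic plus the 1-byte postscript length (offset + 4).
count=$((offset + 4))
# Copy that prefix into a new file that ends on a complete footer.
dd if=bucket_00000 of=recover.orc bs=1 count="$count"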

.. Owen

Re: Malformed orc file

Posted by Igor Kuzmenko <f1...@gmail.com>.
Unfortunately, I can't provide more information; I got this file from our
tester and he has already dropped the table.

Re: Malformed orc file

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.
Hi

In the case of streaming, while a transaction is open the ORC file is not closed and hence may not be flushed completely. Did the transaction commit successfully? Or was there an exception thrown during writes/commit?

Thanks
Prasanth
