Posted to user@orc.apache.org by Yonatan Augarten <yo...@intango.com> on 2017/10/01 09:08:59 UTC
Re: Google Protobuf Version

Indeed the problem was a corner case in which our app would crash and not
close the file.

Thank you for your help
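
For anyone who finds this thread with the same symptom, here is a rough Java
sketch of the check Owen describes below: scan the file for the three bytes
0x4f, 0x52, 0x43 ("ORC") and, for each hit, compute the length the file would
have if a postscript ended there (offset + 3 magic bytes + 1 length byte).
The class name OrcMagicScan and its two helpers are made up for illustration
and are not part of ORC. It reads the whole file into memory, so it only suits
files under 2 GB, and unlike strings it reports the offset of the magic itself
rather than the start of the surrounding printable run.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the ORC library.
public class OrcMagicScan {
    // ASCII "ORC": 0x4f, 0x52, 0x43
    private static final byte[] MAGIC = {0x4f, 0x52, 0x43};

    /** Returns every byte offset at which the "ORC" magic occurs in data. */
    public static List<Long> findMagicOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i + MAGIC.length <= data.length; i++) {
            if (data[i] == MAGIC[0] && data[i + 1] == MAGIC[1] && data[i + 2] == MAGIC[2]) {
                offsets.add((long) i);
            }
        }
        return offsets;
    }

    /**
     * Length the file would have if a postscript ended at this magic:
     * the offset, plus the 3 magic bytes, plus the 1-byte postscript length.
     */
    public static long candidateLength(long magicOffset) {
        return magicOffset + MAGIC.length + 1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: OrcMagicScan <file>");
            return;
        }
        // Assumption: the file fits in a single byte array (under 2 GB).
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("file size: " + data.length);
        for (long offset : findMagicOffsets(data)) {
            System.out.println("ORC at " + offset
                + ", candidate length " + candidateLength(offset));
        }
    }
}
```

A candidate length well short of the real file size, like 33162188 against
39845888 here, is the value to try as ReaderOptions.maxLength().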

On Wed, Sep 27, 2017 at 1:02 AM, Owen O'Malley <ow...@gmail.com>
wrote:

> The extra characters after the instances of ORC are because the following
> characters look like valid characters and the strings command is a generic
> tool. Of course you could accidentally get 0x4f, 0x52, 0x43 "ORC" in the
> file, but that is relatively unlikely.
>
> Your output implies that you used Writer.writeIntermediateFooter to
> put intermediate footers into the file. Since there is a large gap
> from the last offset to the length of the file, I would guess that your
> application didn't close the writer to get the final footer at the end of
> the file. Try passing 33162188 as the ReaderOptions.maxLength().
> You should get a valid reader then and be able to read the data before that
> footer (ignoring the last 6 MB of data in the file).
>
> .. Owen
>
>
> On Tue, Sep 26, 2017 at 12:09 PM, Yonatan Augarten <yo...@intango.com>
> wrote:
>
>> Thank you for the detailed explanation!
>>
>> Interesting. I'm getting the following (very strange) output (including
>> the spaces before the 0):
>>
>>>       0 ORC&
>>> 10288812     ORC
>>> 14991902 ORC
>>> 33162184 ORC_R
>>>
>>
>> The file size is 39845888 bytes.
>>
>> On Tue, Sep 26, 2017 at 11:49 AM, Owen O'Malley <ow...@gmail.com>
>> wrote:
>>
>>> Ok, it was reading the postscript (via OrcProto$PostScript.parseFrom),
>>> which is the very first thing it does.
>>>
>>> The first thing to try is to see if you have a proper postscript
>>> somewhere in the file. If you are on Mac or Linux,
>>> try:
>>>
>>> % strings -n 3 -t d example/decimal.orc | grep ORC
>>>
>>> Replacing example/decimal.orc with your ORC file. You'll get an output
>>> like:
>>>
>>> 0 ORC
>>> 16333 ORC
>>>
>>> which are the offsets where "ORC" is located. The ORC format puts it
>>> once at the front of the file (so that the "file" command can detect the
>>> format) and once at the end of the postscript. (There is always one byte
>>> after the last ORC, which is the length of the postscript, so the total
>>> length of the file should be the final offset + 4.)
>>>
>>> .. Owen
>>>
>>> On Tue, Sep 26, 2017 at 1:36 AM, Yonatan Augarten <yo...@intango.com>
>>> wrote:
>>>
>>>> No, the file is invalid. The problem is that our code sometimes
>>>> generates invalid ORC files.
>>>> The code is always called from a single thread, and it performs a
>>>> series of "addRowBatch" actions on a writer.
>>>> The file is then closed and loaded to a hive table.
>>>> This works 99% of the time, but in some cases the resulting file is
>>>> somehow corrupt.
>>>> See below the stack trace of an attempt to run orcfiledump on this file.
>>>>
>>>> Thanks for your help,
>>>> Yoni.
>>>>
>>>> Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
>>>>     at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)
>>>>     at com.google.protobuf.UnknownFieldSet$Builder.mergeFieldFrom(UnknownFieldSet.java:498)
>>>>     at com.google.protobuf.GeneratedMessage.parseUnknownField(GeneratedMessage.java:193)
>>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16466)
>>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.<init>(OrcProto.java:16424)
>>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16562)
>>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript$1.parsePartialFrom(OrcProto.java:16557)
>>>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
>>>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
>>>>     at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>>>>     at org.apache.hadoop.hive.ql.io.orc.OrcProto$PostScript.parseFrom(OrcProto.java:16910)
>>>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.extractMetaInfoFromFooter(ReaderImpl.java:374)
>>>>     at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:311)
>>>>     at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
>>>>     at org.apache.hadoop.hive.ql.io.orc.FileDump.printMetaData(FileDump.java:96)
>>>>     at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:81)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>     at java.lang.reflect.Method.invoke(Method.java:497)
>>>>     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>>>>
>>>>
>>>>
>>>> On Tue, Sep 26, 2017 at 12:11 AM, Owen O'Malley <owen.omalley@gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Sep 25, 2017 at 12:47 PM, Yonatan Augarten <yo...@intango.com>
>>>>> wrote:
>>>>>
>>>>>> Would you say that it's likely that this error (*Protocol message
>>>>>> contained an invalid tag (zero)*) is caused by the wrong version?
>>>>>>
>>>>>
>>>>>  No, it is likely something else. However, I haven't seen that error
>>>>> coming out of the ORC reader before. Can you give me the whole stack trace?
>>>>> Are you sure that it is a valid ORC file?
>>>>>
>>>>> Thanks,
>>>>>    Owen
>>>>>
>>>>
>>>>
>>>
>>
>