Posted to dev@orc.apache.org by Xiening Dai <xn...@live.com> on 2019/06/14 16:52:23 UTC

The Orc magic string

Hi all,

In the ORC append scenario, the append operation (writing the additional data and the new footer) needs to be atomic. Otherwise, if it fails partway through, the file tail becomes unrecognizable. Unfortunately, not all file systems can guarantee atomic writes. When a failure does happen, recovering the data from before the append requires locating the previous file footer by searching backward, and the only way to search for the footer is to look for the “ORC” magic string. But the current magic string has only three characters, so the same string can easily appear in user data. That would lead to parsing the wrong footer, and the behavior is undefined.
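
To illustrate the concern, here is a minimal Python sketch. It is not ORC code: the byte layout and the helper name `find_magic_candidates` are made up for illustration. It searches a buffer backward for the three-byte string and shows ordinary user data producing an extra hit after the real tail.

```python
from typing import List

MAGIC = b"ORC"

def find_magic_candidates(data: bytes, magic: bytes = MAGIC) -> List[int]:
    """Return the offset of every occurrence of magic, last occurrence first."""
    hits = []
    end = len(data)
    while True:
        pos = data.rfind(magic, 0, end)
        if pos < 0:
            return hits
        hits.append(pos)
        # Exclude only the last byte of this hit so overlapping matches are kept.
        end = pos + len(magic) - 1

# Hypothetical layout: committed data and tail, then a torn append whose
# user data happens to contain the letters "ORC" (inside "FORCE").
old_file = b"rows..." + b"footer" + MAGIC
torn_append = b"name=FORCE;partial"
hits = find_magic_candidates(old_file + torn_append)
```

The first hit a backward search sees is the one inside the user data, not the real tail, so every candidate would still have to be validated by trying to parse a footer in front of it.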

So I am thinking we could change the magic string to a 16-byte UUID. That way we can safely use it to locate the footer. The idea is very similar to the sync marker in Avro.
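
A rough Python sketch of the proposal, under stated assumptions: the marker value, the helper names, and the byte layout are all hypothetical, and the estimate treats file contents as uniformly random bytes, which real data is not. It compares how often a 3-byte and a 16-byte pattern would show up by chance, then recovers the committed prefix of a torn file by finding the last marker.

```python
import uuid

# Hypothetical marker value, chosen only for this sketch.
SYNC_MARKER = uuid.UUID("0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0").bytes  # 16 bytes

def expected_random_matches(n_bytes: int, pattern_len: int) -> float:
    """Expected chance occurrences of one fixed pattern in uniformly random bytes."""
    positions = max(n_bytes - pattern_len + 1, 0)
    return positions / 256 ** pattern_len

def end_of_last_marker(data: bytes, marker: bytes = SYNC_MARKER) -> int:
    """Offset just past the last marker, or -1 if the marker never appears."""
    pos = data.rfind(marker)
    return -1 if pos < 0 else pos + len(marker)

# Hypothetical layout: each committed append ends with the marker.
committed = b"stripe-1" + b"footer-1" + SYNC_MARKER
torn = committed + b"stripe-2 with ORC inside, no footer yet"
recovered = torn[:end_of_last_marker(torn)]
```

Under the random-bytes assumption, about 1 TiB of data is expected to contain a given 3-byte pattern roughly 65,536 times by chance, while a given 16-byte pattern is expected essentially never. A caller should check for the -1 result before slicing.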

Thanks.

Re: The Orc magic string

Posted by Owen O'Malley <ow...@gmail.com>.
It is expected, but like most of Hive's ACID layout, it is badly documented.
The code is in OrcAcidUtils:
<https://github.com/apache/orc/blob/1c5a020382059b9fea3344ffe428b1f8986b0a12/java/core/src/java/org/apache/orc/impl/OrcAcidUtils.java#L42>

.. Owen


On Sat, Jun 15, 2019 at 12:25 PM Dain Sundstrom <da...@iq80.com> wrote:

> Is this expected behavior of ORC acid writers?  If so, is it documented
> somewhere?
>
> -dain

Re: The Orc magic string

Posted by Dain Sundstrom <da...@iq80.com>.
Is this expected behavior of ORC acid writers?  If so, is it documented somewhere?

-dain

----
Dain Sundstrom
Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io)

> On Jun 14, 2019, at 6:17 PM, Owen O'Malley <ow...@gmail.com> wrote:
> 
> The Hive ACID format uses a side file that holds a sequence of 8-byte file offsets, one for each completed file footer. If the side file is there, the last offset is passed to the reader, which treats it as the end of the file.
> 
> In the case where you don't have that, searching for the string “\003ORC” works really well for finding the tails. In the corrupted files I've seen I've never needed more than that. 
> 
> .. Owen


Re: The Orc magic string

Posted by Owen O'Malley <ow...@gmail.com>.
The Hive ACID format uses a side file that holds a sequence of 8-byte file offsets, one for each completed file footer. If the side file is there, the last offset is passed to the reader, which treats it as the end of the file.
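
As a rough illustration of the side-file idea, here is a Python sketch that picks the last complete 8-byte offset out of a side file's contents. The function name is hypothetical, and the encoding (big-endian signed 64-bit values, as Java's `DataOutputStream.writeLong` would write them) is an assumption; check OrcAcidUtils for the actual behavior.

```python
import struct
from typing import Optional

def last_committed_length(side_file_bytes: bytes) -> Optional[int]:
    """Return the last complete 8-byte offset, or None if there is none.

    Assumes big-endian signed 64-bit entries; a trailing partial entry
    (for example from a torn write to the side file) is ignored.
    """
    entry = struct.Struct(">q")
    complete = len(side_file_bytes) // entry.size
    if complete == 0:
        return None
    (offset,) = entry.unpack_from(side_file_bytes, (complete - 1) * entry.size)
    return offset

# Two committed footers at offsets 1000 and 2500, then 3 stray bytes.
side = struct.pack(">qq", 1000, 2500) + b"\x00\x00\x00"
usable_length = last_committed_length(side)
```

A reader would then treat `usable_length` as the end of the data file and ignore anything written after it.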

In the case where you don't have that, searching for the string “\003ORC” works really well for finding the tails. In the corrupted files I've seen, I've never needed more than that.
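
The search can be sketched in Python as follows. This is not the code used for those recoveries: the function name and sample bytes are hypothetical, and the `+ 1` assumes the magic string is the last thing in the postscript and that a single postscript-length byte closes the file, which should be verified against the ORC specification. The leading `\003` is the length prefix of the three-character string.

```python
from typing import List

TAIL_PATTERN = b"\x03ORC"

def candidate_file_ends(data: bytes) -> List[int]:
    """Return candidate end-of-file offsets, latest first."""
    ends = []
    limit = len(data)
    while True:
        pos = data.rfind(TAIL_PATTERN, 0, limit)
        if pos < 0:
            return ends
        # +1 for the assumed single postscript-length byte after the magic.
        end = pos + len(TAIL_PATTERN) + 1
        if end <= len(data):
            ends.append(end)
        # Continue the search strictly before this hit.
        limit = pos + len(TAIL_PATTERN) - 1

# Hypothetical bytes: header, stripe, 2-byte stand-in postscript body,
# the pattern, a length byte, then a torn append.
data = b"ORC" + b"stripe" + b"ps" + TAIL_PATTERN + bytes([6]) + b"torn append"
ends = candidate_file_ends(data)
```

Each candidate is only a guess until a reader has parsed a valid postscript and footer ending at that offset. The header's bare "ORC" is skipped because it lacks the `\003` prefix.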

.. Owen
