Posted to dev@orc.apache.org by Dain Sundstrom <da...@iq80.com> on 2017/12/14 19:00:13 UTC

ORC magic

Does the ORC spec require that a file start with “ORC”?

-dain

Re: ORC magic

Posted by Xiening Dai <xn...@live.com>.
Hi Deepak,

The ORC C++ writer does write the “ORC” magic at the beginning of the file, but the reader does not verify it when opening the file (the same is true of the Java reader, as far as I can tell). There is probably a reason for that: since the reader already verifies the postscript at the file tail, checking the header again is unnecessary and would require an additional I/O.


> On Dec 15, 2017, at 8:55 AM, Owen O'Malley <ow...@gmail.com> wrote:
> 
> On a side note, I put a patch in to the Linux 'file' command that makes it
> recognize ORC files. If you've got file 5.31 or later, you'll get:
> 
> owen@laptop> file examples/*.orc
> examples/TestOrcFile.columnProjection.orc:              Apache ORC
> examples/TestOrcFile.emptyFile.orc:                     Apache ORC
> examples/TestOrcFile.metaData.orc:                      Apache ORC
> examples/TestOrcFile.test1.orc:                         Apache ORC
> examples/TestOrcFile.testDate1900.orc:                  Apache ORC
> examples/TestOrcFile.testDate2038.orc:                  Apache ORC
> examples/TestOrcFile.testMemoryManagementV11.orc:       Apache ORC
> examples/TestOrcFile.testMemoryManagementV12.orc:       Apache ORC
> examples/TestOrcFile.testPredicatePushdown.orc:         Apache ORC
> examples/TestOrcFile.testSeek.orc:                      Apache ORC
> examples/TestOrcFile.testSnappy.orc:                    Apache ORC
> examples/TestOrcFile.testStringAndBinaryStatistics.orc: Apache ORC
> examples/TestOrcFile.testStripeLevelStats.orc:          Apache ORC
> examples/TestOrcFile.testTimestamp.orc:                 Apache ORC
> examples/TestOrcFile.testUnionAndTimestamp.orc:         Apache ORC
> examples/TestOrcFile.testWithoutIndex.orc:              Apache ORC
> examples/TestVectorOrcFile.testLz4.orc:                 Apache ORC
> examples/TestVectorOrcFile.testLzo.orc:                 Apache ORC
> examples/decimal.orc:                                   Apache ORC
> examples/demo-11-none.orc:                              Apache ORC
> examples/demo-11-zlib.orc:                              Apache ORC
> examples/demo-12-zlib.orc:                              Apache ORC
> examples/nulls-at-end-snappy.orc:                       Apache ORC
> examples/orc-file-11-format.orc:                        Apache ORC
> examples/orc_index_int_string.orc:                      Apache ORC
> examples/orc_split_elim.orc:                            Apache ORC
> examples/orc_split_elim_new.orc:                        Apache ORC
> examples/over1k_bloom.orc:                              Apache ORC
> examples/version1999.orc:                               Apache ORC
> examples/zero.orc:                                      empty
> 
> It looks like the file command has finally added negative offsets from the
> end of the file, so we could extend it with more information.
> 
> .. Owen


Re: ORC magic

Posted by Owen O'Malley <ow...@gmail.com>.
On a side note, I put a patch in to the Linux 'file' command that makes it
recognize ORC files. If you've got file 5.31 or later, you'll get:

owen@laptop> file examples/*.orc
examples/TestOrcFile.columnProjection.orc:              Apache ORC
examples/TestOrcFile.emptyFile.orc:                     Apache ORC
examples/TestOrcFile.metaData.orc:                      Apache ORC
examples/TestOrcFile.test1.orc:                         Apache ORC
examples/TestOrcFile.testDate1900.orc:                  Apache ORC
examples/TestOrcFile.testDate2038.orc:                  Apache ORC
examples/TestOrcFile.testMemoryManagementV11.orc:       Apache ORC
examples/TestOrcFile.testMemoryManagementV12.orc:       Apache ORC
examples/TestOrcFile.testPredicatePushdown.orc:         Apache ORC
examples/TestOrcFile.testSeek.orc:                      Apache ORC
examples/TestOrcFile.testSnappy.orc:                    Apache ORC
examples/TestOrcFile.testStringAndBinaryStatistics.orc: Apache ORC
examples/TestOrcFile.testStripeLevelStats.orc:          Apache ORC
examples/TestOrcFile.testTimestamp.orc:                 Apache ORC
examples/TestOrcFile.testUnionAndTimestamp.orc:         Apache ORC
examples/TestOrcFile.testWithoutIndex.orc:              Apache ORC
examples/TestVectorOrcFile.testLz4.orc:                 Apache ORC
examples/TestVectorOrcFile.testLzo.orc:                 Apache ORC
examples/decimal.orc:                                   Apache ORC
examples/demo-11-none.orc:                              Apache ORC
examples/demo-11-zlib.orc:                              Apache ORC
examples/demo-12-zlib.orc:                              Apache ORC
examples/nulls-at-end-snappy.orc:                       Apache ORC
examples/orc-file-11-format.orc:                        Apache ORC
examples/orc_index_int_string.orc:                      Apache ORC
examples/orc_split_elim.orc:                            Apache ORC
examples/orc_split_elim_new.orc:                        Apache ORC
examples/over1k_bloom.orc:                              Apache ORC
examples/version1999.orc:                               Apache ORC
examples/zero.orc:                                      empty

It looks like the file command has finally added negative offsets from the
end of the file, so we could extend it with more information.

.. Owen
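
A hedged illustration of the negative-offset idea, in Python rather than in 'file' magic syntax: the ORC spec stores the PostScript length in the final byte of the file, so whatever a tool wants from the tail can be located by seeking backwards from the end. The function name read_postscript_bytes is made up for this sketch and is not part of any ORC library; decoding the returned Protocol Buffers bytes is left out.

```python
import os


def read_postscript_bytes(f):
    """Return the raw PostScript bytes of an ORC file, or None if too short.

    `f` must be a seekable binary file object. The bytes are returned
    undecoded; parsing them as a Protocol Buffers message is not shown.
    """
    size = f.seek(0, os.SEEK_END)
    if size < 1:
        return None  # empty file: there is no length byte to read

    # The final byte of the file holds the PostScript length.
    f.seek(-1, os.SEEK_END)
    ps_len = f.read(1)[0]
    if ps_len + 1 > size:
        return None  # the length byte points before the start of the file

    # Seek backwards from the end to the start of the PostScript.
    f.seek(-(1 + ps_len), os.SEEK_END)
    return f.read(ps_len)
```

With a real file this would be called as read_postscript_bytes(open(path, "rb")). A zero-length file such as examples/zero.orc yields None.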


On Fri, Dec 15, 2017 at 7:00 AM, Deepak Majeti <ma...@gmail.com>
wrote:

> Hi Xiening,
>
> The readers (both Java and C++) use only the "magic" bytes in the Tail
> to verify ORC files. However, the spec requires the "ORC" bytes to be
> present in the Header as well, to support tools that scan from the front.
> You can verify this from ORC files written by the Java writer.
> I only noticed this requirement today myself. We should support this in
> the C++ writer too, if we don't already.

Re: ORC magic

Posted by Deepak Majeti <ma...@gmail.com>.
Hi Xiening,

The readers (both Java and C++) use only the "magic" bytes in the Tail
to verify ORC files. However, the spec requires the "ORC" bytes to be
present in the Header as well, to support tools that scan from the front.
You can verify this from ORC files written by the Java writer.
I only noticed this requirement today myself. We should support this in
the C++ writer too, if we don't already.
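
As a rough sketch of the two checks being discussed (not the actual Java or C++ reader code), the Python below tests for "ORC" both at the front of the file and in the tail. It assumes the magic string is the last field serialized in the PostScript, so that it sits directly before the final length byte; the helper name check_orc_magic is hypothetical.

```python
import os

ORC_MAGIC = b"ORC"


def check_orc_magic(f):
    """Return (header_ok, tail_ok) for a seekable binary file object."""
    size = f.seek(0, os.SEEK_END)
    # Need room for at least the magic plus the PostScript length byte.
    if size < len(ORC_MAGIC) + 1:
        return (False, False)

    # Header check: the first three bytes of the file.
    f.seek(0)
    header_ok = f.read(len(ORC_MAGIC)) == ORC_MAGIC

    # Tail check. Assumption: the magic is the last thing written in the
    # PostScript, immediately before the one-byte PostScript length.
    f.seek(size - 1 - len(ORC_MAGIC))
    tail_ok = f.read(len(ORC_MAGIC)) == ORC_MAGIC

    return (header_ok, tail_ok)
```

A file that passes only the tail check is one a reader would accept but a front-scanning tool would not recognize, which is the situation a writer that omits the header would create.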


On Fri, Dec 15, 2017 at 2:45 AM, Dain Sundstrom <da...@iq80.com> wrote:

> Thanks Deepak. I was searching for “magic” and missed this part.
>
> -dain


-- 
regards,
Deepak Majeti

Re: ORC magic

Posted by Dain Sundstrom <da...@iq80.com>.
Thanks Deepak. I was searching for “magic” and missed this part.

-dain

> On Dec 14, 2017, at 7:16 PM, Deepak Majeti <ma...@gmail.com> wrote:
> 
> Hi Dain,
> 
> The ORC spec requires that a file start with "ORC".
> 
> From https://orc.apache.org/docs/file-tail.html
> 
> "The file is broken in to three parts- Header, Body, and Tail. The Header
> consists of the bytes “ORC” to support tools that want to scan the front
> of the file to determine the type of the file."


Re: ORC magic

Posted by Xiening Dai <xn...@live.com>.
It looks like our reader implementations (both Java and C++) don’t verify that the file begins with “ORC”.

> On Dec 14, 2017, at 7:16 PM, Deepak Majeti <ma...@gmail.com> wrote:
> 
> Hi Dain,
> 
> The ORC spec requires that a file start with "ORC".
> 
> From https://orc.apache.org/docs/file-tail.html
> 
> "The file is broken in to three parts- Header, Body, and Tail. The Header
> consists of the bytes “ORC” to support tools that want to scan the front
> of the file to determine the type of the file."


Re: ORC magic

Posted by Deepak Majeti <ma...@gmail.com>.
Hi Dain,

The ORC spec requires that a file start with "ORC".

From https://orc.apache.org/docs/file-tail.html

"The file is broken in to three parts- Header, Body, and Tail. The Header
consists of the bytes “ORC” to support tools that want to scan the front
of the file to determine the type of the file."
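
For illustration, a front-of-file check of the kind the spec has in mind could look like the Python sketch below. The function name has_orc_header is an assumption made for this example, not an ORC API.

```python
ORC_MAGIC = b"ORC"


def has_orc_header(f):
    """Return True if the binary file object `f` starts with b"ORC"."""
    f.seek(0)
    # A file shorter than three bytes yields a short read and fails the test.
    return f.read(len(ORC_MAGIC)) == ORC_MAGIC
```

This needs only a single three-byte read from the start of the file, which is why it suits tools that never look at the tail.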

On Thu, Dec 14, 2017 at 2:00 PM, Dain Sundstrom <da...@iq80.com> wrote:

> Does the ORC spec require that a file start with “ORC”?
>
> -dain




-- 
regards,
Deepak Majeti