Posted to user@orc.apache.org by Zhiyuan Dong <zh...@gmail.com> on 2019/01/20 15:16:31 UTC

access entire column in ORC files

Hi,

I work in marketing research, and at times I need to extract the contents
of ORC files into analytical packages like R, Julia, etc., without going
through tools like JDBC (which can also access ORC files).

I have been using C++ to access ORC file contents, following the examples
provided in the ORC C++ distribution (e.g. meta info, contents). My
datasets are basic 2D tables with rows and columns, and each column has a
very basic data type: int64 or double. I have found the ORC C++ access
APIs very helpful and handy!

Since R and Julia store matrices in column-major format, I would like to
extract the contents of ORC files column by column. In the file-contents
example available on the official ORC C++ website, the code reads the
entire ORC file in batches and, within each batch, reads the contents row
by row, creating a JSON-like string version of the data.

My question (since I don't know the details of the ORC file structure):
can the user read ORC file contents column by column using the C++ APIs
you have published? Is there a speed advantage to doing this, as opposed
to reading in batches and parsing each batch row by row?

If so, is there an example that I can follow to read contents column by
column?

Is it possible for the example C++ code to give the user a (char*) pointer
each time it reads a row element within a column, so that users can read
it into the desired data type (e.g. int64, double) directly, without
building the JSON-like text rows? Or is there already a way to read an ORC
file column directly into an in-memory T* that stores the data with the
corresponding data type (e.g. int64, double)?

Many many thanks!

Best,

Zhiyuan

Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

Many thanks, Gang, for your prompt reply. Yes, your answers make sense to me!

Best,

Zhiyuan

On Fri, Jan 25, 2019 at 11:53 PM Gang Wu <us...@gmail.com> wrote:

> Unfortunately we don't have an API to return a row of data. You have to
> extract each column from the batches.
>
> For seekToRow(uint64_t rowNumber), you can  jump to the row specified by
> rowNumber and then use rowReader->next() to get the batch. It is pretty
> straightforward.
>
> You can actually create two rowReaders. The 1st rowReader only include the
> 1st column you need via rowReaderOptions and try to gather the columns you
> want. Then you create the 2nd rowReader which only include those columns
> you want  via rowReaderOptions. Does that make sense?
>
> Let me know if you have any questions.
>
> Gang

-- 
Zhiyuan Dong, Ph.D.

Re: access entire column in ORC files

Posted by Gang Wu <us...@gmail.com>.

Unfortunately we don't have an API that returns a single row of data. You
have to extract each column from the batches.

With seekToRow(uint64_t rowNumber), you can jump to the row specified by
rowNumber and then use rowReader->next() to get the batch. It is pretty
straightforward.

You can actually create two rowReaders. The first rowReader includes only
the first column, selected via rowReaderOptions, so you can work out what
you need to gather. Then you create the second rowReader, which includes
only the columns you want, again via rowReaderOptions. Does that make
sense?

Let me know if you have any questions.

Gang
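
A minimal sketch of how the row numbers found with the first reader might
be turned into a seek plan for the second one. It is not from the thread:
RowRun and coalesceRuns are hypothetical names, and the code is plain C++
that does not link against ORC. It groups ascending row numbers into
contiguous runs; each run could then be served by one
seekToRow(run.start) followed by next() on a batch, keeping the first
run.length rows (a run longer than the batch capacity would need several
next() calls). Whether seeking beats a plain scan depends on how sparse the
selection is, so that is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A contiguous range of selected rows: [start, start + length).
// Hypothetical helper type, not part of the ORC API.
struct RowRun {
  std::uint64_t start;
  std::uint64_t length;
};

// Groups strictly ascending row numbers into contiguous runs.
std::vector<RowRun> coalesceRuns(const std::vector<std::uint64_t>& rows) {
  std::vector<RowRun> runs;
  for (std::uint64_t row : rows) {
    if (!runs.empty()) {
      const std::uint64_t next = runs.back().start + runs.back().length;
      assert(row >= next);  // precondition: ascending, no duplicates
      if (row == next) {
        ++runs.back().length;  // row extends the current run
        continue;
      }
    }
    runs.push_back(RowRun{row, 1});  // row starts a new run
  }
  return runs;
}
```

For the rows {3, 4, 5, 9, 10, 20} this yields three runs: (start 3,
length 3), (start 9, length 2) and (start 20, length 1).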

On Fri, Jan 25, 2019 at 7:48 PM Zhiyuan Dong <zh...@gmail.com> wrote:

> in the   RowReader class, there is a function seekToRow(uint64_t
> rowNumber), I am wondering there are code example showing how to use this
> function to read columns in a row.
>
> Many thanks
>
> Best,
>
> Zhiyuan

Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

In the RowReader class there is a function seekToRow(uint64_t rowNumber).
I am wondering whether there is a code example showing how to use this
function to read the columns of a row.

Many thanks

Best,

Zhiyuan

On Fri, Jan 25, 2019 at 8:10 PM Zhiyuan Dong <zh...@gmail.com> wrote:

> Let us add some context which may help explain my question better a little
> bit.
>
> suppose I have an orc files having many columns, e.g. 5000+ columns, the
> first column of each row stores some information I can use to decide if I
> need to extract a row or not.
>
> in the first pass, I read the first column from start to end to find out
> which are the subset of the rows that I need to extract, and allocate right
> amount of memory ready to store the rows identified, containing all the
> rest of columns.
>
> now, when I do a 2nd pass, for the rest of  5000+ columns, is there any
> ORC C++ API that I can use to only extract those row positions identified
> by the 1st pass ?
>
> what I am doing now is to extract the rest of columns, batch by batch,
>
> within each batch, all columns are populated to vectors its correct
> subtype, e.g. double, , and I pre-decide a set of read/skip steps within
> the rows of each batch, so that I can extract certain row
> positions.identified by the first pass, but not sure if this is an
> efficient way in given that there maybe  ORC C++. API there already built
> to handle situations like this.
>
> Many many thanks!
>
> Best,
>
> Zhiyuan


-- 
Zhiyuan Dong, Ph.D.

Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

Let me add some context which may help explain my question a little
better.

Suppose I have an ORC file with many columns, e.g. 5000+ columns, and the
first column of each row stores some information I can use to decide
whether I need to extract that row.

In the first pass, I read the first column from start to end to find the
subset of rows that I need to extract, and allocate the right amount of
memory to store the identified rows, containing all the remaining columns.

Now, when I do a second pass over the remaining 5000+ columns, is there
any ORC C++ API that I can use to extract only the row positions
identified by the first pass?

What I am doing now is extracting the remaining columns batch by batch.
Within each batch, every column is populated into a vector of its correct
subtype (e.g. double), and I pre-compute a set of read/skip steps over the
rows of each batch, so that I extract only the row positions identified by
the first pass. I am not sure this is an efficient way, given that there
may already be an ORC C++ API built to handle situations like this.

Many many thanks!

Best,

Zhiyuan
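
A minimal sketch of the per-batch read/skip step described above, for one
double column. It is an illustration only: gatherFromBatch and its
parameters are hypothetical names, and the code is plain C++ with no ORC
dependency. It assumes the first pass produced ascending global row
numbers and that the caller tracks the global row number of each batch's
first row (for example with a running counter). With the ORC C++ reader,
batchData and batchRows would presumably come from the column batch's data
buffer and numElements. Nulls are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies the selected rows that fall inside one batch into `out`.
//   batchData  : values of one column for this batch
//   batchStart : global row number of the batch's first row
//   batchRows  : number of rows in this batch
//   selected   : ascending global row numbers from the first pass
//   cursor     : index of the next unconsumed entry in `selected`,
//                advanced in place so it carries over to the next batch
void gatherFromBatch(const double* batchData, std::uint64_t batchStart,
                     std::uint64_t batchRows,
                     const std::vector<std::uint64_t>& selected,
                     std::size_t& cursor, std::vector<double>& out) {
  const std::uint64_t batchEnd = batchStart + batchRows;  // one past the end
  while (cursor < selected.size() && selected[cursor] < batchEnd) {
    // Rows before this batch must have been consumed by earlier calls.
    assert(selected[cursor] >= batchStart);
    out.push_back(batchData[selected[cursor] - batchStart]);
    ++cursor;
  }
}
```

Calling this once per batch, in file order and with the same cursor, visits
each selected row exactly once and skips everything else without scanning
the unselected rows of the batch.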

On Fri, Jan 25, 2019 at 7:35 PM Zhiyuan Dong <zh...@gmail.com> wrote:

> Thanks Xiening!!
>
> A follow-up  question :
>
> suppose I have an orc files having many columns,
>
> in the first pass, I read the first column from start to end to find out
> which are the subset of the rows that I need to extract.
>
> now, when I do a 2nd pass, for the rest of columns, is there any efficient
> way that I can only extract the row positions that I identified in the
> first pass ?
>
> what I am doing now is to extract the rest of columns, batch by batch, and
> only extract those rows identified by the first pass, but not sure if this
> is an efficient way.
>
> Many thanks!!
>
> Best,
>
> Zhiyuan
>

-- 
Zhiyuan Dong, Ph.D.

Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

Thanks Xiening!!

A follow-up question:

Suppose I have an ORC file with many columns.

In the first pass, I read the first column from start to end to find the
subset of rows that I need to extract.

Now, when I do a second pass over the rest of the columns, is there an
efficient way to extract only the row positions that I identified in the
first pass?

What I am doing now is extracting the rest of the columns batch by batch
and keeping only the rows identified by the first pass, but I am not sure
this is an efficient way.

Many thanks!!

Best,

Zhiyuan

Re: access entire column in ORC files

Posted by Xiening Dai <xn...@live.com>.

They can be different types for sure.

On Jan 24, 2019, at 11:21 AM, Zhiyuan Dong <zh...@gmail.com> wrote:

the fields, e.g. fields[0], fields[1], etc, in StructVectorBatch needs to be of the same subtype ? Or they can have different subtype ?

Many thanks!

Best,

Zhiyuan


Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

Do the fields (e.g. fields[0], fields[1], etc.) in a StructVectorBatch need
to be of the same subtype, or can they have different subtypes?

Many thanks!

Best,

Zhiyuan



On Sun, Jan 20, 2019 at 11:53 AM Gang Wu <us...@gmail.com> wrote:

> To read the desired type of each column, you just need to cast the base
> orc::ColumnVectorBatch, which you get from rowReader->next(), to its
> desired type. You can dynamic_cast to orc::LongVectorBatch for int64 and
> orc::StringVectorBatch for char *, check the API here:
> https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41
>
> Gang

-- 
Zhiyuan Dong, Ph.D.

Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

Thanks for pointing this out!!

Sent from my iPhone

> On Jan 22, 2019, at 11:39 AM, Owen O'Malley <ow...@gmail.com> wrote:
> 
> It is important to use the RowReaderOptions::include method since that is what controls whether the bytes are read and decompressed or not.
> 
> .. Owen
> 

Re: access entire column in ORC files

Posted by Owen O'Malley <ow...@gmail.com>.

It is important to use the RowReaderOptions::include method since that is
what controls whether the bytes are read and decompressed or not.

.. Owen

On Jan 20, 2019, at 9:52 AM, Gang Wu <us...@gmail.com> wrote:

To read the desired type of each column, you just need to cast the base
orc::ColumnVectorBatch, which you get from rowReader->next(), to its
desired type. You can dynamic_cast to orc::LongVectorBatch for int64 and
orc::StringVectorBatch for char *, check the API here:
https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41

Gang


Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

Many Thanks Gang!!

Sent from my iPhone

> On Jan 20, 2019, at 11:52 AM, Gang Wu <us...@gmail.com> wrote:
> 
> To read the desired type of each column, you just need to cast the base orc::ColumnVectorBatch, which you get from rowReader->next(), to its desired type. You can dynamic_cast to orc::LongVectorBatch for int64 and orc::StringVectorBatch for char *, check the API here:  https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41
> 
> Gang
> 

Re: access entire column in ORC files

Posted by Gang Wu <us...@gmail.com>.

To read the desired type of each column, you just need to cast the base
orc::ColumnVectorBatch, which you get from rowReader->next(), to its
desired type. You can dynamic_cast to orc::LongVectorBatch for int64 and
orc::StringVectorBatch for char *. Check the API here:
https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41

Gang
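
A minimal sketch of the cast described above. To stay compilable without
the ORC library it uses simplified stand-in types in a hypothetical
namespace, sketch; they only mirror the general shape of the batch classes
in orc/Vector.hh (the real ones hold an orc::DataBuffer<T>, not a
std::vector<T>). appendColumn is likewise a hypothetical helper. With the
real library the cast targets would be orc::LongVectorBatch for int64 and
orc::DoubleVectorBatch for double. Null values are assumed absent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace sketch {
// Stand-ins for the ORC batch classes (assumption: simplified, not ORC's).
struct ColumnVectorBatch {
  std::uint64_t numElements = 0;  // rows filled in by the last read
  virtual ~ColumnVectorBatch() = default;
};
struct LongVectorBatch : ColumnVectorBatch {
  std::vector<std::int64_t> data;
};
struct DoubleVectorBatch : ColumnVectorBatch {
  std::vector<double> data;
};
struct StructVectorBatch : ColumnVectorBatch {
  std::vector<ColumnVectorBatch*> fields;  // one entry per selected column
};
}  // namespace sketch

// Appends one column of a batch to `out`, converting int64 values to double.
void appendColumn(const sketch::ColumnVectorBatch& col,
                  std::vector<double>& out) {
  const auto n = static_cast<std::size_t>(col.numElements);
  if (const auto* longs =
          dynamic_cast<const sketch::LongVectorBatch*>(&col)) {
    assert(longs->data.size() >= n);
    for (std::size_t i = 0; i < n; ++i) {
      out.push_back(static_cast<double>(longs->data[i]));
    }
  } else if (const auto* doubles =
                 dynamic_cast<const sketch::DoubleVectorBatch*>(&col)) {
    assert(doubles->data.size() >= n);
    out.insert(out.end(), doubles->data.begin(),
               doubles->data.begin() + static_cast<std::ptrdiff_t>(n));
  } else {
    // dynamic_cast returned nullptr for every type this sketch knows about.
    throw std::invalid_argument("appendColumn: unsupported column type");
  }
}
```

Calling appendColumn once per entry of a struct batch's fields, with one
output vector per column, builds contiguous per-column arrays across
batches, which is the layout R and Julia expect. Fields of different
subtypes go through the same call.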


Re: access entire column in ORC files

Posted by Zhiyuan Dong <zh...@gmail.com>.

Hi Owen,

Let me follow the github example link you provided.

Appreciate the prompt response. Many thanks!

Best,

Zhiyuan


-- 
Zhiyuan Dong, Ph.D.

Re: access entire column in ORC files

Posted by Owen O'Malley <ow...@gmail.com>.

Yes, ORC files are set up so that reading individual columns is much faster
(and reads less data) than reading the entire row.

You need to call RowReaderOptions::include or includeTypes, depending on
whether you want to select columns by name or by id.

See the source of the file contents tool for how to do this.

https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77

.. Owen
