Posted to user@orc.apache.org by Hinko Kocevar <Hi...@ess.eu> on 2023/01/10 16:22:55 UTC

ORC list column size

I would like to use an ORC file to hold several columns of data. One of the columns will be a list (array) of floats that could span 10 000 - 50 000 elements in length. Other columns will not be lists, but of different data types.

Is having such long lists in any way an issue in terms of performance or otherwise for the ORC file?

Thank you in advance!

//Hinko

Re: ORC list column size

Posted by Hinko Kocevar <Hi...@ess.eu>.

On 10 Jan 2023, at 18:36, Dongjoon Hyun <do...@gmail.com> wrote:


It sounds interesting. Are you writing and reading ORC files programmatically via the ORC library? Or, do you use Spark/Flink/PyArrow/Dask?

Writing will be done using the ORC C++ library.
Reading will be done using Spark with Python.

//hinko


Dongjoon

On Tue, Jan 10, 2023 at 8:23 AM Hinko Kocevar <Hi...@ess.eu> wrote:
I would like to use an ORC file to hold several columns of data. One of the columns will be a list (array) of floats that could span 10 000 - 50 000 elements in length. Other columns will not be lists, but of different data types.

Is having such long lists in any way an issue in terms of performance or otherwise for the ORC file?

Thank you in advance!

//Hinko

Re: ORC list column size

Posted by Dongjoon Hyun <do...@gmail.com>.
It sounds interesting. Are you writing and reading ORC files programmatically
via the ORC library? Or, do you use Spark/Flink/PyArrow/Dask?

Dongjoon

On Tue, Jan 10, 2023 at 8:23 AM Hinko Kocevar <Hi...@ess.eu> wrote:

> I would like to use an ORC file to hold several columns of data. One of the
> columns will be a list (array) of floats that could span 10 000 - 50 000
> elements in length. Other columns will not be lists, but of different data
> types.
>
> Is having such long lists in any way an issue in terms of performance or
> otherwise for the ORC file?
>
> Thank you in advance!
>
> //Hinko

Re: ORC list column size

Posted by Dain Sundstrom <da...@iq80.com>.
> On Jan 10, 2023, at 3:21 PM, Hinko Kocevar <Hi...@ess.eu> wrote:
> 
> 
>> On 10 Jan 2023, at 19:16, Dain Sundstrom <da...@iq80.com> wrote:
>> 
>> 50,000 * 4 bytes ~= 200 kB, so this shouldn’t be a problem.  Generally, large values can be a problem for some compute engines, but 200 kB isn’t that large.
> 
> Thanks for the input! What counts as “large” in your opinion?

Megabytes for a single row.
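
To put "megabytes for a single row" next to the list sizes discussed in this thread, here is a minimal Python sketch of the inverse calculation: how many 4-byte floats fit in a given per-row budget. The helper name `elements_per_budget` and the choice of a decimal megabyte (1,000,000 bytes) are assumptions made for illustration; the sketch counts raw values only and ignores ORC encoding and compression.

```python
FLOAT_BYTES = 4      # 32-bit float, as in the estimate quoted above
ONE_MB = 1_000_000   # assumption: decimal megabyte


def elements_per_budget(budget_bytes: int, bytes_per_element: int = FLOAT_BYTES) -> int:
    """Number of whole elements that fit in budget_bytes of raw values."""
    return budget_bytes // bytes_per_element


# Hypothetical helper; counts raw list values only, not ORC overhead.
print(elements_per_budget(ONE_MB))
```

Under these assumptions, a single row would need a list of roughly 250,000 floats to reach one megabyte, which is several times the 50,000-element upper bound from the original question.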

> I plan on using Spark + python on the read / process side. 

I doubt it will be a problem.

-dain

Re: ORC list column size

Posted by Hinko Kocevar <Hi...@ess.eu>.
> On 10 Jan 2023, at 19:16, Dain Sundstrom <da...@iq80.com> wrote:
> 
> 50,000 * 4 bytes ~= 200 kB, so this shouldn’t be a problem.  Generally, large values can be a problem for some compute engines, but 200 kB isn’t that large.

Thanks for the input! What counts as “large” in your opinion?

I plan on using Spark + python on the read / process side. 

//hinko

> 
>> On Jan 10, 2023, at 8:22 AM, Hinko Kocevar <Hi...@ess.eu> wrote:
>> 
>> I would like to use an ORC file to hold several columns of data. One of the columns will be a list (array) of floats that could span 10 000 - 50 000 elements in length. Other columns will not be lists, but of different data types.
>> 
>> Is having such long lists in any way an issue in terms of performance or otherwise for the ORC file?
>> 
>> Thank you in advance!
>> 
>> //Hinko
> 

Re: ORC list column size

Posted by Dain Sundstrom <da...@iq80.com>.
50,000 * 4 bytes ~= 200 kB, so this shouldn’t be a problem.  Generally, large values can be a problem for some compute engines, but 200 kB isn’t that large.
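
The estimate above can be reproduced with a short Python sketch. It assumes 32-bit floats (4 bytes each, as in the message) and counts only the raw list values for one row; ORC encoding, compression, and the list-length stream are ignored. The helper name `list_payload_bytes` is made up for illustration.

```python
import struct

# Size of one 32-bit float in bytes (standard size, no padding).
FLOAT_BYTES = struct.calcsize("<f")


def list_payload_bytes(num_elements: int, bytes_per_element: int = FLOAT_BYTES) -> int:
    """Uncompressed size of one row's list values, ignoring ORC overhead."""
    return num_elements * bytes_per_element


# The two ends of the range from the original question.
for n in (10_000, 50_000):
    print(f"{n} floats -> {list_payload_bytes(n) / 1000:.0f} kB per row")
```

For the range in the question this gives 40 kB to 200 kB of raw values per row.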

> On Jan 10, 2023, at 8:22 AM, Hinko Kocevar <Hi...@ess.eu> wrote:
> 
> I would like to use an ORC file to hold several columns of data. One of the columns will be a list (array) of floats that could span 10 000 - 50 000 elements in length. Other columns will not be lists, but of different data types.
> 
> Is having such long lists in any way an issue in terms of performance or otherwise for the ORC file?
> 
> Thank you in advance!
> 
> //Hinko